CN102496366A - Speaker identification method irrelevant with text - Google Patents
- Publication number: CN102496366A (application CN201110428379A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention relates to a text-independent speaker recognition method. The method mainly comprises the following steps: (1) acquiring a speaker's voice signal and preprocessing it to obtain a preprocessed voice signal; (2) performing feature extraction on the preprocessed voice signal to obtain a characteristic parameter of the speaker in the recognition system; (3) repeating the above two steps several times to obtain a characteristic parameter sequence for each registered speaker and establish a characteristic parameter reference library of all registered speakers; (4) acquiring the characteristic parameter sequence of the speaker to be identified and calculating a weighted grey relational degree between the speaker to be identified and each registered speaker; (5) extracting the maximum of the weighted grey relational degrees and comparing it with a recognition threshold to obtain the identification result. The invention belongs to the field of biometric identification technology, in particular to speaker recognition. Existing text-independent speaker recognition techniques suffer from high error rates; the method of the invention addresses this problem and has wide application prospects.
Description
Technical field
The present invention relates to biometric identification technology, and in particular to a text-independent speaker recognition method based on third-octave spectral analysis and grey relational analysis.
Technical background
With the development of computer technology and the growing informatization of society, identification or verification based on human biometric features (such as fingerprints, voiceprints and images) has become a very important advanced technology in the information industry. Speaker recognition uses a person's speech to identify or verify the speaker's identity, and can be widely applied in fields such as police and judicial work, business transactions, banking and finance, protection of personal secrets, and security inspection.
Research in speaker recognition focuses on the extraction of characteristic parameters and the construction of recognition algorithms. Feature extraction means extracting from the speaker's voice signal characteristic parameters that express the voice comprehensively and accurately. At present, the characteristic parameters used in speech recognition are the LPCC (Linear Prediction Cepstrum Coefficient) parameters based on the vocal tract model, the MFCC (Mel Frequency Cepstrum Coefficient) parameters based on auditory mechanisms, and their improvements and combinations, but the amount of voice information these parameters capture is insufficient. The present invention therefore proposes to extract characteristic parameters from the voice signal with the third-octave spectral analysis method. This method divides the entire audible range of 20 Hz-20 kHz into 30 bands of constant relative bandwidth and performs spectral analysis on the sound signal falling in each band; it expresses the information contained in the speaker's voice signal more accurately and thereby strengthens the robustness of the speaker characteristic parameters.
In speech technology research and applications, there are three kinds of recognition algorithms: methods based on the vocal tract model and phonetic knowledge, template matching, and artificial neural networks. Although research on vocal-tract and phonetic-knowledge methods started early, they are too complex and have not yet achieved good practical results. Template matching methods include dynamic time warping (DTW), hidden Markov model (HMM) theory and vector quantization (VQ); these algorithms have poor anti-interference ability in noisy environments and cannot reach a good recognition effect. Artificial neural networks offer adaptivity, parallelism, robustness, fault tolerance and learning ability, and their powerful classification and input-output mapping abilities are attractive in speech recognition, but their overly long training and recognition times prevent good practical results. The present invention proposes to perform speaker recognition with a method based on the grey relational degree, which considers both the information contained in the speaker's voice signal and the effect of its changes, and significantly improves the recognition rate of the voice signal.
Speaker recognition can be divided into text-dependent and text-independent; both identify the speaker from the characteristic information contained in the voice signal. Text-dependent recognition restricts the spoken text and examines only one or a few characteristic parameters of the speaker's voice, so it is easier to impersonate and the confidentiality of the recognition system is low. Text-independent recognition allows arbitrary spoken text and gives the recognition system good flexibility; however, because of the richness of the characteristic information in the voice signal and the complexity of noise in real environments, the steps of traditional speaker recognition methods are cumbersome.
Summary of the invention
To remedy the defects of the above techniques and improve the text-independent speaker recognition rate, the present invention provides a text-independent speaker recognition method based on third-octave spectral analysis and grey relational analysis. The method extracts features from the speaker's voice signal with the third-octave spectral analysis method and performs speaker recognition with a grey relational degree algorithm; it is a reliable and effective text-independent speaker recognition method with good robustness.
To reach the above goal of the invention, the method comprises the following steps:
One: establish a speech feature reference library for N speakers, where N is an integer greater than or equal to 1:
A. Acquire the 1st voice segment of the 1st speaker and successively apply sampling and quantization, zero-drift removal, pre-emphasis and windowing, obtaining the windowed audio frames F_m'(n);
B. Apply the third-octave spectral analysis method to the frames F_m'(n) to obtain characteristic parameter 1-1, where a characteristic parameter is the sequence of power spectrum values of the bands around the centre frequencies, and "1-1" denotes the 1st voice segment of the 1st speaker;
C. Each of the N speakers performs steps A and B M times, successively yielding N × M characteristic parameters, which form the characteristic parameter reference library;
Two: obtain N grey relational degrees:
I. Acquire the characteristic parameter X of the speaker to be identified through steps A and B;
II. Add the sequence of characteristic parameter X to each of the sequences in the reference library; according to the time invariance of the frequency-domain signal, give the N characteristic parameter sequences identical weight coefficients; recombine them into N weighted-average characteristic parameter sequences and obtain N grey relational degree values;
Three: recognition matching: extract the maximum R_max of the N grey relational degree values and compare it with the recognition threshold R_θ; if R_max ≥ R_θ, the match succeeds, otherwise there is no match.
According to one embodiment of the text-independent speaker recognition method of the present invention, the feature extraction in step B comprises:
(A) Time-frequency conversion: convert the time-domain speech signal into a frequency-domain signal with the radix-2 FFT algorithm and compute the power spectrum of the speaker's voice signal;
(B) Determine the centre frequencies f_c of the third-octave spectral analysis method;
(C) Compute the upper and lower limit frequencies: the upper and lower limit frequencies of a third-octave band relate to its centre frequency by f_u = 2^(1/6) · f_c and f_d = 2^(-1/6) · f_c;
(D) Convert to sound pressure level, i.e. L_p = 20 lg(P / P_0), where P_0 is the reference sound pressure, whose value is 2 × 10^-5 Pa;
(E) Compute the mean power spectrum value of the band around each centre frequency f_c: partition the frequencies of the power spectrum into bands according to the third-octave upper and lower limit and centre frequencies, and within each band superpose all power amplitudes logarithmically, obtaining the third-octave spectrum, whose amplitudes are the characteristic parameters.
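Steps (A)-(E) above can be sketched end to end in Python. This is an illustrative simplification under stated assumptions: the function name, the frame length, and the plain power averaging (which skips the patent's logarithmic sound-pressure-level superposition) are assumptions, not the patent's exact implementation.

```python
import numpy as np

def third_octave_features(frame, fs):
    """Sketch of the feature extraction: FFT power spectrum, third-octave
    band partition with edges 2^(±1/6)*fc, and per-band averaging.
    Simplified: averages raw power rather than sound pressure levels."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n      # power spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # 30 nominal third-octave centre frequencies, 20 Hz - 16 kHz
    fc = np.array([20, 25, 31.5, 40, 50, 63, 80, 100, 125, 160,
                   200, 250, 315, 400, 500, 630, 800, 1000, 1250, 1600,
                   2000, 2500, 3150, 4000, 5000, 6300, 8000, 10000,
                   12500, 16000], dtype=float)
    features = []
    for f in fc:
        lo, hi = f * 2 ** (-1 / 6), f * 2 ** (1 / 6)  # band edges f_d, f_u
        band = power[(freqs >= lo) & (freqs < hi)]
        features.append(band.mean() if band.size else 0.0)
    return np.array(features)                          # 30 parameters
```

For a 1 kHz tone sampled at 51200 Hz, the largest feature value falls in the band whose centre frequency is 1000 Hz, as expected.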
According to one embodiment of the text-independent speaker recognition method of the present invention, the grey relational degree calculation in step II comprises:
(F) Extract the characteristic parameter sequences: obtain the sequence X0 of the characteristic parameter X of the speaker to be identified, and extract the characteristic parameter sequences from every registered speaker's reference library, i.e. sequences A1, A2, …, AN of registered speaker A, sequences B1, B2, …, BN of registered speaker B, and so on;
(G) Construct the weighted-average characteristic parameter sequences: add the sequence of the speaker to be identified to every registered speaker's reference library in the recognition system, and, according to the time invariance of the frequency-domain signal, give these characteristic parameter sequences identical weight coefficients, so that the speaker to be identified recombines with each registered speaker into a weighted-average characteristic parameter sequence. That is, registered speaker A and speaker X to be identified form the sequences ω11·A1, ω12·A2, …, ω1n·AN, ω1x·X0, where ω11 = ω12 = … = ω1n = ω1x and ω11 + ω12 + … + ω1n + ω1x = 1; registered speaker B and speaker X form ω21·B1, ω22·B2, …, ω2n·BN, ω2x·X0, where ω21 = ω22 = … = ω2n = ω2x and ω21 + ω22 + … + ω2n + ω2x = 1; and so on;
(H) Superpose into weighted-average grey relational characteristic parameter sequences: according to the superposition principle, compute the weighted-average grey relational characteristic parameter sequence of the speaker to be identified with every registered speaker in the recognition system, i.e. registered speaker A and speaker X form the new sequence AY = ω11·A1 + ω12·A2 + … + ω1n·AN + ω1x·X0, registered speaker B and speaker X form BY = ω21·B1 + ω22·B2 + … + ω2n·BN + ω2x·X0, and so on;
(I) Compute the grey relational degrees: compute the grey relational degree between the speaker to be identified and each registered speaker by the grey relational degree algorithm, i.e. RA for registered speaker A, RB for registered speaker B, and so on, obtaining N grey relational degree values R.
According to one embodiment of the text-independent speaker recognition method of the present invention, the centre frequencies of the third-octave spectral analysis method are determined as follows:
The centre frequencies of the third octave are f_c = 1000 × 10^(3n/30) Hz (n = 0, ±1, ±2, …);
The nominal (rounded) centre frequencies are chosen, namely: 20 Hz, 25 Hz, 31.5 Hz, 40 Hz, 50 Hz, 63 Hz, 80 Hz, 100 Hz, 125 Hz, 160 Hz, 200 Hz, 250 Hz, 315 Hz, 400 Hz, 500 Hz, 630 Hz, 800 Hz, 1000 Hz, 1250 Hz, 1600 Hz, 2000 Hz, 2500 Hz, 3150 Hz, 4000 Hz, 5000 Hz, 6300 Hz, 8000 Hz, 10000 Hz, 12500 Hz, 16000 Hz.
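The centre-frequency formula can be checked numerically. The index range -17…12 below is an assumption chosen so that the 30 exact values round to the nominal list above:

```python
def third_octave_centres(n_low=-17, n_high=12):
    """Exact third-octave centre frequencies f_c = 1000 * 10^(3n/30) Hz;
    the nominal values in the text are their rounded forms."""
    return [1000 * 10 ** (3 * n / 30) for n in range(n_low, n_high + 1)]
```

For example, n = -17 gives about 19.95 Hz (nominal 20 Hz) and n = 12 gives about 15849 Hz (nominal 16000 Hz).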
According to one embodiment of the text-independent speaker recognition method of the present invention, the grey relational degree algorithm is as follows:
Let X = {x_σ(t) | σ = 0, 1, 2, …, m} be the set of relational factor sequences, i.e. the reference library; x_0 is the reference function (parent factor), i.e. one registered speaker; x_i is the comparison function (child factor), i.e. the characteristic factor X of the speaker to be identified; x_σ(k) is the value of x_σ at point k, where i = 1, 2, …, m and k = 1, 2, …, n.
For x_0 and x_i, the grey relational coefficients are formed, and from them the grey relational degree of x_i with respect to x_0 is obtained, where 0 < ξ < 1, λ_1, λ_2 ≥ 0 and λ_1 + λ_2 = 1; the constant ξ is the resolution coefficient, and λ_1 and λ_2 are the displacement and rate-of-change weighting coefficients respectively. In practical applications, ξ, λ_1 and λ_2 can be chosen appropriately for the circumstances.
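For reference, Deng's classical grey relational degree, the base form that the λ_1/λ_2-weighted variant above extends, can be sketched as follows. The patent's exact weighted formula appears only as a figure in the original, so this is an illustrative stand-in, not the claimed algorithm:

```python
def grey_relational_degree(x0, xi, resolution=0.9):
    """Deng's grey relational degree of comparison sequence xi with
    respect to reference sequence x0, with resolution coefficient ξ."""
    deltas = [abs(a - b) for a, b in zip(x0, xi)]
    dmax = max(deltas)
    if dmax == 0:
        return 1.0                       # identical sequences
    dmin = min(deltas)
    coeffs = [(dmin + resolution * dmax) / (d + resolution * dmax)
              for d in deltas]
    return sum(coeffs) / len(coeffs)     # mean of relational coefficients
```

The degree equals 1 for identical sequences and decreases as the sequences diverge, which is the property the recognition step relies on.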
The beneficial effects of the present invention are as follows. The invention extracts characteristic parameters from the speaker's voice signal with the third-octave spectral analysis method, which extracts the information contained in the voice signal over the entire audible range of 20 Hz-20 kHz more fully and reduces the adverse effect of incomplete characteristic information during speaker recognition. The invention performs speaker recognition with a grey relational degree algorithm, which considers both the information contained in the speaker's voice signal and the effect of its changes, and reduces the error rate of speaker recognition. This text-independent speaker recognition method based on third-octave spectral analysis and grey relational analysis achieves robust text-independent speaker recognition, significantly improves the recognition rate of text-independent speech, and has wide application prospects.
Description of drawings
Fig. 1 is the flow chart of the method provided by the invention;
Fig. 2 is the flow chart of the third-octave feature extraction of the invention;
Fig. 3 is the FFT butterfly operation diagram of the invention;
Fig. 4 is the flow chart of the grey relational degree algorithm of the invention;
Fig. 5 is the flow chart of recognition matching and decision of the invention;
Fig. 6 is a voice signal of speaker A of the invention;
Fig. 7 is a frame of speaker A's voice after preprocessing;
Fig. 8 is a third-octave spectrogram of speaker A of the invention.
Embodiment
The technical scheme of the present invention is described in further detail below through the accompanying drawings and an embodiment. The method of the invention is divided into five steps, as shown in Fig. 1.
Step 1: voice signal preprocessing
1. Sampling and quantization
a) Filter the voice signal with an FIR band-pass filter so that the Nyquist frequency F_N is 20 kHz;
b) Set the speech sampling frequency F ≥ 2·F_N; in the embodiment of the invention F = 51200 Hz;
c) Sample the voice signal s_a(t) periodically to obtain the voice signal amplitude sequence s(n), where t indicates that the voice signal is continuous in time and n indexes the discrete signal sequence, taking consecutive natural numbers;
d) Quantize and encode the amplitude sequence s(n) of the digital voice signal, representing the quantized amplitude sequence s'(n) in pulse code modulation (PCM).
2. Zero-drift removal
a) Compute the mean value of the quantized amplitude sequence s'(n);
b) Subtract the mean value from each amplitude in the sequence, obtaining the zero-mean amplitude sequence s''(n).
3. Pre-emphasis
a) Set the pre-emphasis factor a in the digital filter transfer function H(z) = 1 - a·z^-1; a should be slightly smaller than 1, and in the present embodiment a = 0.96;
b) Pass s''(n) through the digital filter, obtaining the sequence s'''(n) in which the high-, middle- and low-frequency amplitudes of the voice signal are balanced.
4. Windowing
a) Compute the frame length N of a voice frame from the speech sampling rate F, where F is in Hz;
b) With N as the frame length and N/2 as the frame shift, divide s'''(n) into a series of speech frames F_m, each containing N voice signal samples;
c) Compute the Hamming window function ω(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, where N is the frame length of each audio frame F_m;
d) Apply the Hamming window to each speech frame F_m using F_m'(n) = ω(n) × F_m(n), obtaining the windowed audio frames F_m'(n).
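The preprocessing chain above can be sketched as follows. The frame length of 1024 samples is an assumed value, since the patent derives N from the sampling rate:

```python
import numpy as np

def preprocess(signal, a=0.96, frame_len=1024):
    """Sketch of the preprocessing chain: zero-drift removal, pre-emphasis
    H(z) = 1 - a*z^-1, framing with a half-frame shift, and a Hamming
    window applied to every frame."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()                              # zero-drift removal
    s = np.append(s[0], s[1:] - a * s[:-1])       # pre-emphasis filter
    hop = frame_len // 2                          # frame shift N/2
    n_frames = 1 + (len(s) - frame_len) // hop
    window = np.hamming(frame_len)                # 0.54 - 0.46 cos(2πn/(N-1))
    return np.array([s[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```

A 4096-sample signal with frame length 1024 and shift 512 yields 7 half-overlapping windowed frames.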
Step 2: characteristic parameter extraction
The present invention extracts the characteristic parameters of the preprocessed speaker voice signal based on the third octave. The algorithm flow is shown in Fig. 2 and detailed as follows:
1. Compute the power spectrum with the fast Fourier transform (FFT)
The invention converts the time-domain signal of the speaker's voice into a frequency-domain signal with the radix-2 FFT algorithm and computes the power spectrum sequence of the voice signal.
a) Perform radix-2 decimation in time on the voice signal sequence x(n), obtaining the time-decimated subsequences
x1(r) = x(2r), r = 0, 1, 2, …, N/2 - 1
x2(r) = x(2r + 1), r = 0, 1, 2, …, N/2 - 1
where N is the length of the voice signal sequence.
b) Perform the discrete Fourier transform (DFT) on x(n) to obtain the frequency-domain signal of the speaker's voice:
X(k) = X1(k) + W_N^k · X2(k), k = 0, 1, …, N/2 - 1
where X1(k) and X2(k) are the N/2-point DFTs of x1(r) and x2(r) respectively, and W_N = e^(-j2π/N) is the twiddle factor.
c) Using the periodicity of X1(k) and X2(k) (period N/2) and the symmetry W_N^(k+N/2) = -W_N^k, obtain the remaining half of the FFT spectrum sequence:
X(k + N/2) = X1(k) - W_N^k · X2(k), k = 0, 1, …, N/2 - 1
This butterfly operation is shown in Fig. 3; from it the FFT frequency-domain power spectrum of the preprocessed voice signal can be obtained.
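The decimation-in-time recursion can be written out directly. This recursive sketch trades the iterative butterfly structure of Fig. 3 for clarity:

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT implementing the butterfly
    relations X(k) = X1(k) + W_N^k X2(k) and
    X(k + N/2) = X1(k) - W_N^k X2(k); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    x1 = fft_radix2(x[0::2])          # even-indexed subsequence x(2r)
    x2 = fft_radix2(x[1::2])          # odd-indexed subsequence x(2r+1)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * x2[k]   # W_N^k * X2(k)
        out[k] = x1[k] + t
        out[k + n // 2] = x1[k] - t
    return out
```

The power spectrum then follows by taking the squared magnitude of each output value.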
2. Determine the centre frequencies
The centre frequencies f_c of the third octave are
f_c = 1000 × 10^(3n/30) Hz (n = 0, ±1, ±2, …)
The invention adopts the nominal (rounded) values, namely: 20 Hz, 25 Hz, 31.5 Hz, 40 Hz, 50 Hz, 63 Hz, 80 Hz, 100 Hz, 125 Hz, 160 Hz, 200 Hz, 250 Hz, 315 Hz, 400 Hz, 500 Hz, 630 Hz, 800 Hz, 1000 Hz, 1250 Hz, 1600 Hz, 2000 Hz, 2500 Hz, 3150 Hz, 4000 Hz, 5000 Hz, 6300 Hz, 8000 Hz, 10000 Hz, 12500 Hz, 16000 Hz.
3. Compute the limit frequencies
The band around each third-octave centre frequency f_c extends from the lower limit frequency f_d to the upper limit frequency f_u. The relation between f_u, f_d and f_c is:
f_u = 2^(1/6) · f_c, f_d = 2^(-1/6) · f_c
The bandwidth of the band around each centre frequency f_c is:
Δf = f_u - f_d = (2^(1/6) - 2^(-1/6)) · f_c
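These relations are easy to check numerically; the helper name `band_edges` is illustrative:

```python
def band_edges(fc):
    """Lower/upper limit frequencies and bandwidth of the third-octave band
    around centre frequency fc: f_d = 2^(-1/6) fc, f_u = 2^(1/6) fc."""
    fd = fc * 2 ** (-1 / 6)
    fu = fc * 2 ** (1 / 6)
    return fd, fu, fu - fd
```

For f_c = 1000 Hz this gives roughly f_d ≈ 890.9 Hz, f_u ≈ 1122.5 Hz and Δf ≈ 231.6 Hz.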
4. Sound pressure level conversion
The third-octave spectral analysis divides the entire audible range of 20 Hz-20 kHz into 30 bands of constant relative bandwidth and computes the sound pressure level of the sound signal falling in each band.
The sound pressure level is obtained from the sound pressure of the signal by the conversion relation
L_p = 20 lg(P / P_0)
where P_0 is the reference sound pressure, whose value is 2 × 10^-5 Pa.
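The conversion above is the standard decibel sound-pressure-level formula and can be sketched as:

```python
import math

def sound_pressure_level(p, p0=2e-5):
    """Sound pressure level in dB for pressure p (Pa) relative to the
    reference pressure p0 = 2e-5 Pa: L_p = 20 lg(p / p0)."""
    return 20 * math.log10(p / p0)
```

The reference pressure itself maps to 0 dB, and 0.2 Pa maps to 80 dB.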
5. Compute the spectrum value of the band around each centre frequency f_c
Partition the frequencies of the power spectrum into bands according to the upper and lower limit frequencies and the centre frequencies, and synthesize the power spectrum into the third-octave power spectrum of constant relative bandwidth. Within one third-octave band, the synthesized power spectrum S_x(f_n) is obtained from S_x(f), the discrete power spectrum inside that band.
For the discrete power spectrum, the power spectrum of the nth band is formed from S_x,n(f_i), the power spectrum amplitudes of the discrete frequencies in that band.
The mean value of the band power spectrum is the amplitude A_n of that band.
The amplitudes corresponding to the 30 bands of constant relative bandwidth in the spectrum are the speaker's characteristic parameters, and these 30 characteristic parameters constitute the speaker's characteristic parameter sequence.
Step 3: establish the speaker reference library
Repeat steps 1 and 2 several times to establish the characteristic parameter reference library of every registered speaker in the speaker recognition system; that is, the characteristic parameter sequences A1, A2, …, AN of registered speaker A constitute A's reference library, the sequences B1, B2, …, BN of registered speaker B constitute B's reference library, and so on for all registered speakers. In the present embodiment there are 14 registered speakers, with 5 characteristic parameter sequences in each speaker's reference library.
Step 4: compute the grey relational degrees
The flow of the grey relational degree algorithm of the invention is shown in Fig. 4 and detailed as follows:
1. Construct the characteristic parameter relational groups
a) Obtain the characteristic parameter sequence X0 of speaker X to be identified, and extract the characteristic parameter sequences from every registered speaker's reference library, i.e. sequences A1, A2, …, AN of registered speaker A, sequences B1, B2, …, BN of registered speaker B, and so on.
b) Add the sequence of the speaker to be identified to every registered speaker's reference library in the recognition system, and, according to the time invariance of the frequency-domain signal, give these characteristic parameter sequences identical weight coefficients, so that the speaker to be identified recombines with each registered speaker into a weighted-average characteristic parameter sequence. That is, registered speaker A and speaker X form the sequences ω11·A1, ω12·A2, …, ω1n·AN, ω1x·X0, where ω11 = ω12 = … = ω1n = ω1x and ω11 + ω12 + … + ω1n + ω1x = 1; registered speaker B and speaker X form ω21·B1, ω22·B2, …, ω2n·BN, ω2x·X0, where ω21 = ω22 = … = ω2n = ω2x and ω21 + ω22 + … + ω2n + ω2x = 1; and so on.
c) According to the superposition principle, compute the grey relational weighted-average characteristic parameter sequence of the speaker to be identified with each registered speaker in the recognition system, i.e. registered speaker A and speaker X form the new sequence AY = ω11·A1 + ω12·A2 + … + ω1n·AN + ω1x·X0, registered speaker B and speaker X form BY = ω21·B1 + ω22·B2 + … + ω2n·BN + ω2x·X0, and so on.
d) Let X = {x_σ(t) | σ = 0, 1, 2, …, m} be the set of relational factor sequences, x_0 the reference function (parent factor), x_i the comparison function (child factor), and x_σ(k) the value of x_σ at point k, where i = 1, 2, …, m and k = 1, 2, …, n.
For x_0 and x_i, form the grey relational coefficients and from them obtain the grey relational degree of x_i with respect to x_0, where 0 < ξ < 1, λ_1, λ_2 ≥ 0 and λ_1 + λ_2 = 1; the constant ξ is the resolution coefficient, and λ_1 and λ_2 are the displacement and rate-of-change weighting coefficients respectively; in practical applications, ξ, λ_1 and λ_2 can be chosen appropriately for the circumstances.
In the present embodiment, the resolution coefficient is ξ = 0.9, the displacement weighting coefficient λ_1 = 0.95 and the rate-of-change weighting coefficient λ_2 = 0.05. Compute the grey relational degree value of the speaker to be identified with every registered speaker according to the above steps, i.e. the value RA for registered speaker A, the value RB for registered speaker B, and so on.
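The equal-weighting and superposition sub-steps above can be sketched as follows; the helper name and the plain-list representation of the characteristic parameter sequences are assumptions for illustration:

```python
def weighted_average_sequence(ref_seqs, x0):
    """Joins the test sequence X0 to a registered speaker's N reference
    sequences with identical weights ω = 1/(N+1) that sum to 1, then
    superposes them into one weighted-average characteristic sequence."""
    seqs = [list(s) for s in ref_seqs] + [list(x0)]
    w = 1.0 / len(seqs)                  # ω11 = ω12 = ... = ω1x, Σω = 1
    length = len(seqs[0])
    return [w * sum(s[k] for s in seqs) for k in range(length)]
```

With two reference sequences and one test sequence, each weight is 1/3, so the result is the element-wise mean of the three sequences.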
Step 5: recognition matching and decision
The speaker recognition matching and decision process of the invention is shown in Fig. 5 and is as follows:
1. Obtain the maximum grey relational degree
From the grey relational degree values of the speaker to be identified with all registered speakers, extract the maximum value R_max = max{RA, RB, …}, where RA is the grey relational degree value of speaker X to be identified with registered speaker A, RB that with registered speaker B, and so on.
2. Speaker recognition matching and decision
Compare the extracted maximum grey relational degree R_max with the grey relational degree recognition threshold R_θ. If R_max ≥ R_θ, the match succeeds, i.e. the speaker to be identified is the registered speaker with the maximum weighted grey relational degree value in the recognition system; otherwise the match fails, i.e. the speaker to be identified is not a registered speaker in the recognition system. The recognition threshold R_θ is obtained from the statistical analysis of a large number of experiments.
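The step-5 decision rule can be sketched directly; the dictionary representation of the per-speaker scores is an assumption:

```python
def decide(scores, threshold=0.9):
    """Returns the registered speaker with the maximum grey relational
    degree if it reaches the threshold R_θ, else None (unregistered).
    threshold=0.9 is the value chosen in the embodiment below."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

For example, with speaker A's scores from Table 3 (RA = 0.9528 being the maximum), R_max ≥ 0.9 and the decision is speaker A.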
The present embodiment collects voice signals from 14 speakers (7 male, 7 female); each speaker records 10 segments with different text content, each 28 seconds long, and the text content also differs between speakers. To reduce the voice differences caused by irregular speech at the beginning and end of each recording, the first and last 3 seconds of every segment are clipped, leaving segments of 22 seconds. On this basis, 5 voice segments are selected at random from each speaker and, following the embodiment above, preprocessed and feature-extracted to establish the registered speakers' characteristic parameter reference libraries; then one of the remaining segments is taken at random, preprocessed and feature-extracted in the same way to obtain the characteristic parameter sequence of the speaker to be identified, and the grey relational degrees are computed; finally the maximum weighted grey relational degree value is extracted and compared with the recognition threshold to give the speaker recognition result. The speakers are denoted A, B, C, D, E, F, G, H, I, J, K, L, M and N; the concrete implementation steps of the present embodiment are detailed below.
Extract one voice segment of the collected speaker A; its time-domain signal is shown in Fig. 6. Following the embodiment above, successively apply sampling and quantization, zero-drift removal, pre-emphasis and windowing to obtain the preprocessed voice signal; one frame of the voice is shown in Fig. 7. Then extract the features of the preprocessed voice signal with the third-octave spectral analysis method, obtaining the third-octave spectrum shown in Fig. 8 and the characteristic parameter sequence shown in Table 1.
Table 1: characteristic parameter sequence of registered speaker A
Following the above steps, extract the features of the other four voice segments of speaker A to obtain their characteristic parameter sequences, then combine all characteristic parameter sequences of speaker A to establish the characteristic parameter reference library of speaker A, as shown in Table 2. Following the steps used to establish speaker A's reference library, successively establish the characteristic parameter reference libraries of speakers B, C, D, E, F, G, H, I, J, K, L, M and N.
Table 2: characteristic parameter reference library of registered speaker A
Take at random one of the remaining voice segments of speaker A and, following the implementation steps above, successively apply preprocessing and feature extraction to obtain the characteristic parameter sequence of the person to be identified. Using the grey relational degree algorithm provided by the present invention, compute the grey relational degrees of speaker A to be identified with registered speakers A, B, C, D, E, F, G, H, I, J, K, L, M and N; the results are shown in Table 3.
Table 3: grey relational degrees between speaker A to be identified and all registered speakers
| A | B | C | D | E | F | G |
A | 0.9528 | 0.8006 | 0.7440 | 0.8039 | 0.7995 | 0.8598 | 0.8016 |
| H | I | J | K | L | M | N |
A | 0.7903 | 0.8267 | 0.7804 | 0.8741 | 0.8057 | 0.8887 | 0.7945 |
Successively take at random one remaining voice segment of each of the other speakers and, following the procedure used for speaker A, compute the grey relational degrees of speakers B, C, D, E, F, G, H, I, J, K, L, M and N to be identified with all registered speakers; the results are shown in Table 4, where the horizontal letters denote the registered speakers and the vertical letters the speakers to be identified.
Table 4: grey relational degrees between all speakers to be identified and all registered speakers
A | B | C | D | E | F | G | |
A | 0.9528 | 0.8006 | 0.7440 | 0.8039 | 0.7995 | 0.8598 | 0.8016 |
B | 0.8295 | 0.9050 | 0.8281 | 0.8699 | 0.8693 | 0.8387 | 0.8967 |
C | 0.7306 | 0.8556 | 0.9628 | 0.8324 | 0.7968 | 0.7509 | 0.8407 |
D | 0.7935 | 0.8371 | 0.7769 | 0.8762 | 0.8421 | 0.8335 | 0.8324 |
E | 0.8214 | 0.8601 | 0.8119 | 0.8426 | 0.9645 | 0.8501 | 0.8921 |
F | 0.8659 | 0.8292 | 0.7851 | 0.8391 | 0.8647 | 0.9489 | 0.8447 |
G | 0.7940 | 0.9030 | 0.8868 | 0.8750 | 0.8899 | 0.8159 | 0.9324 |
H | 0.7799 | 0.7990 | 0.8216 | 0.7979 | 0.7488 | 0.7641 | 0.7857 |
I | 0.7949 | 0.8201 | 0.7710 | 0.8335 | 0.8437 | 0.8091 | 0.8178 |
J | 0.8086 | 0.7748 | 0.8327 | 0.8450 | 0.8106 | 0.8024 | 0.8251 |
K | 0.8710 | 0.7829 | 0.7517 | 0.8055 | 0.7924 | 0.8763 | 0.8041 |
L | 0.8142 | 0.8276 | 0.8629 | 0.8865 | 0.9038 | 0.8343 | 0.9274 |
M | 0.8958 | 0.8350 | 0.7777 | 0.8239 | 0.8207 | 0.8965 | 0.8273 |
N | 0.8103 | 0.8896 | 0.8593 | 0.8784 | 0.8838 | 0.8242 | 0.9081 |
| | H | I | J | K | L | M | N |
|---|---|---|---|---|---|---|---|
| A | 0.7903 | 0.8267 | 0.7804 | 0.8741 | 0.8057 | 0.8887 | 0.7945 |
| B | 0.7761 | 0.8681 | 0.7816 | 0.8188 | 0.8749 | 0.8415 | 0.8675 |
| C | 0.798 | 0.8425 | 0.8151 | 0.7278 | 0.8138 | 0.7466 | 0.8425 |
| D | 0.7182 | 0.8530 | 0.7202 | 0.7804 | 0.8238 | 0.7953 | 0.8465 |
| E | 0.7697 | 0.8717 | 0.7671 | 0.8049 | 0.9012 | 0.8349 | 0.8842 |
| F | 0.7909 | 0.8717 | 0.7925 | 0.8900 | 0.8479 | 0.9072 | 0.8325 |
| G | 0.8190 | 0.8892 | 0.8326 | 0.7916 | 0.9209 | 0.8058 | 0.9047 |
| H | **0.9432** | 0.8127 | 0.8982 | 0.8106 | 0.7702 | 0.8063 | 0.7913 |
| I | 0.7299 | **0.9198** | 0.7214 | 0.7715 | 0.8157 | 0.7775 | 0.8432 |
| J | 0.8935 | 0.7634 | **0.9605** | 0.8445 | 0.8095 | 0.8514 | 0.8099 |
| K | 0.8380 | 0.8286 | 0.8370 | **0.9502** | 0.8075 | 0.9011 | 0.7990 |
| L | 0.8127 | 0.8667 | 0.8234 | 0.8117 | **0.9435** | 0.8227 | 0.9051 |
| M | 0.8359 | 0.8318 | 0.8401 | 0.9094 | 0.8235 | **0.9565** | 0.815 |
| N | 0.8053 | 0.8598 | 0.8058 | 0.805 | 0.8984 | 0.8158 | **0.9310** |
From the embodiment above, extract the maximum grey relational degree between each speaker to be identified and all registered speakers; these maxima are the bold values in Table 4. Based on analysis of a large number of experimental results, this embodiment sets the grey relational degree recognition threshold for speaker identification to 0.9. Comparing each maximum with this threshold yields the speaker identification results shown in Table 5.

Table 5: Speaker identification results
| Total number of speakers to be identified | 14 |
|---|---|
| Number whose maximum grey relational degree exceeds the recognition threshold | 13 |
| Accuracy of text-independent speaker identification | 92.86% |
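As an illustration, the threshold decision behind Tables 4 and 5 can be sketched as follows. The function name and dictionary layout are illustrative, not from the patent; the degree values are rows "A" and "D" of Table 4.

```python
# Sketch of the decision step: take the maximum grey relational degree
# over all registered speakers and compare it with the recognition
# threshold 0.9 chosen in this embodiment.

THRESHOLD = 0.9

def identify(degrees, threshold=THRESHOLD):
    """Return (matched_speaker, R_max); matched_speaker is None when
    R_max falls below the threshold."""
    best = max(degrees, key=degrees.get)
    r_max = degrees[best]
    return (best if r_max >= threshold else None, r_max)

# Rows "A" and "D" of Table 4 (degrees against registered speakers A..N).
row_a = {"A": 0.9528, "B": 0.8006, "C": 0.7440, "D": 0.8039, "E": 0.7995,
         "F": 0.8598, "G": 0.8016, "H": 0.7903, "I": 0.8267, "J": 0.7804,
         "K": 0.8741, "L": 0.8057, "M": 0.8887, "N": 0.7945}
row_d = {"A": 0.7935, "B": 0.8371, "C": 0.7769, "D": 0.8762, "E": 0.8421,
         "F": 0.8335, "G": 0.8324, "H": 0.7182, "I": 0.8530, "J": 0.7202,
         "K": 0.7804, "L": 0.8238, "M": 0.7953, "N": 0.8465}

print(identify(row_a))  # ('A', 0.9528): correctly matched
print(identify(row_d))  # (None, 0.8762): the single miss behind the 13/14 result
```

Note that speaker D's maximum (0.8762, on the diagonal) falls below 0.9, which accounts for the one unrecognized speaker in Table 5.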
The recognition results in Table 5 show that the text-independent speaker identification method provided by the present invention, based on third-octave analysis and grey relational analysis, extracts characteristic parameters from the speaker's voice signal using the third-octave spectrum analysis method and performs speaker identification using the grey relational degree algorithm. It improves the accuracy of text-independent speaker identification, achieves robust text-independent speaker identification, and has broad application prospects.
The text-independent speaker identification method based on third-octave analysis and grey relational analysis provided by the present invention has been described above in detail, and its principle and implementation have been further elaborated through specific embodiments. The description of the above embodiments is intended only to aid understanding of the method of the present invention and its core idea, not to limit the invention; any modification or variation of the present invention that falls within the spirit of the invention and the protection scope of the claims falls within the protection scope of the present invention.
Claims (5)
1. A text-independent speaker identification method, characterized by comprising the following steps:

One: establish a voice-feature reference library for N speakers and set the grey relational degree recognition threshold R_θ, where N is an integer greater than or equal to 1, as follows:

A. acquire the first voice-signal segment of the first speaker and apply, in turn, sampling and quantization, zero-drift removal, pre-emphasis, and windowing, obtaining the windowed audio frame F_m'(n);

B. apply the third-octave spectrum analysis method to the audio frame F_m'(n) to obtain the first characteristic parameter, the characteristic parameter being the sequence of power-spectrum values over the band in which each center frequency lies;

C. perform steps A and B M times for each of the N speakers, obtaining N × M characteristic parameters in turn; the N characteristic-parameter sequences form the voice-feature reference library;

Two: obtain N grey relational degrees, as follows:

I. acquire the characteristic parameter X of the speaker to be identified through steps A and B;

II. add the sequence of the characteristic parameter X to each sequence in the reference library; in accordance with the time-invariance of the frequency-domain signal, uniformly assign identical weight coefficients to the N characteristic-parameter sequences; recombine them to form N weighted-average characteristic-parameter sequences and obtain N grey relational degree values;

Three: identification and matching: extract the maximum value R_max of the N grey relational degree values and compare it with R_θ; if R_max ≥ R_θ, the speaker is matched; otherwise the speaker is not matched.
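As a rough illustration of step A's pre-processing chain (assuming the signal has already been sampled and quantized), the sketch below applies zero-drift removal, pre-emphasis, and a Hamming window. The pre-emphasis coefficient 0.97 is a common speech-processing choice assumed here, not specified by the claim.

```python
import numpy as np

def preprocess_frame(frame, alpha=0.97):
    """Step A sketch: remove zero drift (DC offset), pre-emphasize, and
    apply a Hamming window. `alpha` is an assumed typical pre-emphasis
    coefficient, not taken from the patent."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                             # zero-drift removal
    x = np.append(x[0], x[1:] - alpha * x[:-1])  # pre-emphasis high-pass filter
    return x * np.hamming(len(x))                # windowed frame F_m'(n)

rng = np.random.default_rng(0)
frame = preprocess_frame(rng.standard_normal(256))
print(frame.shape)  # (256,)
```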
2. The text-independent speaker identification method according to claim 1, characterized in that the feature extraction described in step B comprises:

(A) signal time-frequency conversion: convert the time-domain speech signal into a frequency-domain signal with a radix-2 FFT and compute the power spectrum of the speaker's voice signal;

(B) determine the center frequencies f_c of the third-octave spectrum analysis method;

(C) compute the upper and lower limit frequencies: the relation between the third-octave upper and lower limit frequencies and the center frequency is f_lower = 2^(-1/6) · f_c and f_upper = 2^(1/6) · f_c;

(D) sound-pressure-level conversion, namely L_p = 20 lg(P/P_0), where P_0 is the reference sound pressure, with value 2 × 10^(-5) Pa;

(E) compute the mean value of the power spectrum over the band in which each center frequency f_c lies: partition the frequencies of the power spectrum into bands according to the third-octave upper and lower limit frequencies and center frequencies; within each band, sum all power amplitudes logarithmically to obtain the third-octave spectrum, whose amplitudes are the characteristic parameter.
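A minimal sketch of steps (A) through (E), under assumptions the claim leaves open (FFT length, exact level scaling): compute the radix-2 FFT power spectrum, split it into 1/3-octave bands with edges f_c · 2^(±1/6), sum the power in each band, and express it as a level relative to P_0 = 2 × 10^(-5). All parameter names are illustrative.

```python
import numpy as np

def third_octave_features(signal, fs, centers):
    """Steps (A)-(E) sketch: power spectrum via FFT, then one level
    value per 1/3-octave band around each center frequency. The exact
    level scaling is an assumption, not the patent's formula."""
    n = len(signal)                              # should be a power of two (radix-2 FFT)
    power = np.abs(np.fft.rfft(signal)) ** 2 / n # power spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    p0 = 2e-5                                    # reference sound pressure (Pa)
    feats = []
    for fc in centers:
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)       # band edges
        band_power = power[(freqs >= lo) & (freqs < hi)].sum()
        feats.append(10 * np.log10(band_power / p0 ** 2) if band_power > 0 else 0.0)
    return np.array(feats)

fs, n = 8000, 1024
t = np.arange(n) / fs
sig = np.sin(2 * np.pi * 1000 * t)               # pure 1 kHz tone
centers = [500, 630, 800, 1000, 1250, 1600, 2000]
feats = third_octave_features(sig, fs, centers)
print(centers[int(np.argmax(feats))])            # 1000: the band holding the tone
```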
3. The text-independent speaker identification method according to claim 1, characterized in that the grey relational degree calculation described in step II comprises:

(F) extract the characteristic-parameter sequences: obtain the sequence X0 of the characteristic parameter X of the speaker to be identified, and extract every characteristic-parameter sequence from each registered speaker's reference library, namely the sequences A1, A2, …, AN of registered speaker A, the sequences B1, B2, …, BN of registered speaker B, and so on;

(G) construct the weighted-average characteristic-parameter sequences: add the characteristic-parameter sequence of the speaker to be identified to each registered speaker's reference library in the recognition system, and, in accordance with the time-invariance of the frequency-domain signal, uniformly assign identical weight coefficients to these characteristic-parameter sequences, so that the speaker to be identified is recombined with each registered speaker into a weighted-average characteristic-parameter sequence. That is, registered speaker A and speaker X to be identified form the sequence ω_11·A1, ω_12·A2, …, ω_1n·AN, ω_1x·X0, where ω_11 = ω_12 = … = ω_1n = ω_1x and ω_11 + ω_12 + … + ω_1n + ω_1x = 1; registered speaker B and speaker X to be identified form the sequence ω_21·B1, ω_22·B2, …, ω_2n·BN, ω_2x·X0, where ω_21 = ω_22 = … = ω_2n = ω_2x and ω_21 + ω_22 + … + ω_2n + ω_2x = 1; and so on;

(H) superpose to generate the weighted-average grey relational characteristic-parameter sequences: according to the superposition principle, obtain the weighted-average grey relational characteristic-parameter sequence of the speaker to be identified with each registered speaker in the recognition system, namely registered speaker A and speaker X to be identified form the new characteristic-parameter sequence AY = ω_11·A1 + ω_12·A2 + … + ω_1n·AN + ω_1x·X0; registered speaker B and speaker X to be identified form the new characteristic-parameter sequence BY = ω_21·B1 + ω_22·B2 + … + ω_2n·BN + ω_2x·X0; and so on;

(I) compute the grey relational degrees: compute the grey relational degree between the speaker to be identified and each registered speaker by the grey relational degree algorithm, namely the grey relational degree RA of registered speaker A and speaker X to be identified, the grey relational degree RB of registered speaker B and speaker X to be identified, and so on, obtaining N grey relational degrees R.
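Steps (G) and (H) above can be sketched as follows; the sequence values are toy numbers, and the equal weights ω sum to 1 as the claim requires.

```python
import numpy as np

def weighted_average_sequence(ref_seqs, x0):
    """Steps (G)-(H) sketch: append the test sequence X0 to a registered
    speaker's sequences A1..AN, weight all of them identically so the
    weights sum to 1, then superpose into the new sequence AY."""
    seqs = [np.asarray(s, dtype=float) for s in ref_seqs]
    seqs.append(np.asarray(x0, dtype=float))
    w = 1.0 / len(seqs)                 # omega_11 = ... = omega_1x, sum = 1
    return sum(w * s for s in seqs)     # AY = w*A1 + w*A2 + ... + w*X0

a1, a2 = [1.0, 2.0, 3.0], [1.2, 2.2, 3.2]   # toy reference sequences
x0 = [1.1, 2.1, 3.1]                         # toy test sequence
ay = weighted_average_sequence([a1, a2], x0)
print(ay)  # [1.1 2.1 3.1]
```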
4. The text-independent speaker identification method according to claim 2, characterized in that the center frequencies of the third-octave spectrum analysis method are determined as follows:

The exact center frequencies of the third octave are f_c = 1000 × 10^(3n/30) Hz (n = 0, ±1, ±2, …);

The approximate (nominal) center frequencies are then chosen, namely: 20 Hz, 25 Hz, 31.5 Hz, 40 Hz, 50 Hz, 63 Hz, 80 Hz, 100 Hz, 125 Hz, 160 Hz, 200 Hz, 250 Hz, 315 Hz, 400 Hz, 500 Hz, 630 Hz, 800 Hz, 1000 Hz, 1250 Hz, 1600 Hz, 2000 Hz, 2500 Hz, 3150 Hz, 4000 Hz, 5000 Hz, 6300 Hz, 8000 Hz, 10000 Hz, 12500 Hz, 16000 Hz.
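The center-frequency formula can be checked directly: rounding the exact values f_c = 1000 × 10^(3n/30) Hz yields the nominal frequencies listed in the claim (for example, 10^0.1 ≈ 1.2589, so n = 1 gives about 1259 Hz, nominal 1250 Hz).

```python
# Claim 4 sketch: exact 1/3-octave center frequencies, which round to
# the nominal values listed in the claim.
def exact_center(n):
    return 1000.0 * 10 ** (3 * n / 30)

print(round(exact_center(0)))    # 1000
print(round(exact_center(1)))    # 1259 (nominal 1250 Hz)
print(round(exact_center(-17)))  # 20   (lowest band in the claim)
print(round(exact_center(12)))   # 15849 (nominal 16000 Hz)
```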
5. The text-independent speaker identification method according to claim 3, characterized in that the grey relational degree algorithm is:

Let X = {x_σ(t) | σ = 0, 1, 2, …, m} be the relational factor set, i.e., the reference library; let x_0 be the reference function (mother factor), i.e., one of the registered speakers; let x_i be the comparison function (child factor), i.e., the characteristic factor X of the speaker to be identified; and let x_σ(k) be the value of x_σ at point k, where i = 1, 2, …, m and k = 1, 2, …, n.

For x_0 and x_i, define the relational coefficient (formula given only as an image in the original); the grey relational degree of x_i with respect to x_0 is then (formula given only as an image in the original),

where 0 < ξ < 1, λ_1, λ_2 ≥ 0, and λ_1 + λ_2 = 1; the constant ξ is the resolution coefficient, and λ_1 and λ_2 are the weighting coefficients for displacement and rate of change, respectively. In practical applications, ξ, λ_1, and λ_2 may be chosen appropriately according to the specific circumstances.
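Since the claim's formulas appear only as images in this text, the sketch below uses the classical Deng grey relational degree as a stand-in: the resolution coefficient ξ (0 < ξ < 1) plays the role of the claim's resolution ratio, while the λ_1/λ_2 displacement and rate-of-change weighting of the patent's exact formula is not reproduced.

```python
import numpy as np

def deng_grey_degree(x0, xi_seq, xi=0.5):
    """Classical Deng grey relational degree between reference function
    x0 and comparison function xi_seq; `xi` is the resolution
    coefficient. A stand-in, not the patent's exact formula. Assumes
    the two sequences are not identical (d_max > 0)."""
    x0 = np.asarray(x0, dtype=float)
    xi_seq = np.asarray(xi_seq, dtype=float)
    delta = np.abs(x0 - xi_seq)            # pointwise displacement
    d_min, d_max = delta.min(), delta.max()
    coef = (d_min + xi * d_max) / (delta + xi * d_max)
    return float(coef.mean())

x0 = [0.95, 0.80, 0.74]   # reference function (mother factor)
x1 = [0.93, 0.81, 0.75]   # close comparison function
x2 = [0.60, 0.95, 0.40]   # distant comparison function
print(deng_grey_degree(x0, x1) > deng_grey_degree(x0, x2))  # True
```

Closer sequences yield a higher degree, which is the property the identification step in claim 1 relies on.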
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110428379.2A CN102496366B (en) | 2011-12-20 | 2011-12-20 | Speaker identification method irrelevant with text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102496366A true CN102496366A (en) | 2012-06-13 |
CN102496366B CN102496366B (en) | 2014-04-09 |
Family
ID=46188183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110428379.2A Expired - Fee Related CN102496366B (en) | 2011-12-20 | 2011-12-20 | Speaker identification method irrelevant with text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102496366B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104167208A (en) * | 2014-08-08 | 2014-11-26 | 中国科学院深圳先进技术研究院 | Speaker recognition method and device |
CN105244031A (en) * | 2015-10-26 | 2016-01-13 | 北京锐安科技有限公司 | Speaker identification method and device |
CN106328168A (en) * | 2016-08-30 | 2017-01-11 | 成都普创通信技术股份有限公司 | Voice signal similarity detection method |
CN108154189A (en) * | 2018-01-10 | 2018-06-12 | 重庆邮电大学 | Grey relational cluster method based on LDTW distances |
CN109065026A (en) * | 2018-09-14 | 2018-12-21 | 海信集团有限公司 | A kind of recording control method and device |
CN112885355A (en) * | 2021-01-25 | 2021-06-01 | 上海头趣科技有限公司 | Speech recognition method based on multiple features |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1986005618A1 (en) * | 1985-03-21 | 1986-09-25 | American Telephone & Telegraph Company | Individual recognition by voice analysis |
US5548647A (en) * | 1987-04-03 | 1996-08-20 | Texas Instruments Incorporated | Fixed text speaker verification method and apparatus |
US5950157A (en) * | 1997-02-28 | 1999-09-07 | Sri International | Method for establishing handset-dependent normalizing models for speaker recognition |
CN1941080A (en) * | 2005-09-26 | 2007-04-04 | 吴田平 | Soundwave discriminating unlocking module and unlocking method for interactive device at gate of building |
CN101266792A (en) * | 2007-03-16 | 2008-09-17 | 富士通株式会社 | Speech recognition system and method for speech recognition |
CN101405739A (en) * | 2002-12-26 | 2009-04-08 | 摩托罗拉公司(在特拉华州注册的公司) | Identification apparatus and method |
Non-Patent Citations (3)
Title |
---|
Wang Hong et al., "Text-independent speaker recognition based on long-term average spectrum", Technical Acoustics, vol. 21, no. 2, 31 December 2002, pages 59-62 *
Zeng Yumin et al., "Anti-noise speaker recognition based on weighted reconstruction of voiced-speech harmonic-spectrum sub-bands", Journal of Southeast University (Natural Science Edition), vol. 38, no. 06, 30 November 2008, pages 925-941 *
Also Published As
Publication number | Publication date |
---|---|
CN102496366B (en) | 2014-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102893326B (en) | Chinese voice emotion extraction and modeling method combining emotion points | |
CN101178897B (en) | Speaking man recognizing method using base frequency envelope to eliminate emotion voice | |
CN102496366A (en) | Speaker identification method irrelevant with text | |
CN102509547A (en) | Method and system for voiceprint recognition based on vector quantization based | |
CN101226743A (en) | Method for recognizing speaker based on conversion of neutral and affection sound-groove model | |
CN106024010B (en) | A kind of voice signal dynamic feature extraction method based on formant curve | |
CN102968990A (en) | Speaker identifying method and system | |
CN104887263A (en) | Identity recognition algorithm based on heart sound multi-dimension feature extraction and system thereof | |
CN106531174A (en) | Animal sound recognition method based on wavelet packet decomposition and spectrogram features | |
Waghmare et al. | Emotion recognition system from artificial marathi speech using MFCC and LDA techniques | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
CN109272986A (en) | A kind of dog sound sensibility classification method based on artificial neural network | |
Abdallah et al. | Text-independent speaker identification using hidden Markov model | |
Linh et al. | MFCC-DTW algorithm for speech recognition in an intelligent wheelchair | |
Fagerlund et al. | New parametric representations of bird sounds for automatic classification | |
Kumar et al. | Hybrid of wavelet and MFCC features for speaker verification | |
Martin et al. | Cepstral modulation ratio regression (CMRARE) parameters for audio signal analysis and classification | |
Kumar et al. | Text dependent speaker identification in noisy environment | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Aggarwal et al. | Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC | |
Islam et al. | A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network | |
Bansod et al. | Speaker Recognition using Marathi (Varhadi) Language | |
GS et al. | Synthetic speech classification using bidirectional LSTM Networks | |
Abdulwahid et al. | Arabic Speaker Identification System for Forensic Authentication Using K-NN Algorithm | |
Dua et al. | Speaker recognition using noise robust features and LSTM-RNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140409; Termination date: 20161220 | |