CN100593194C - Speaker recognizing device and speaker recognizing method - Google Patents


Info

Publication number: CN100593194C
Application number: CN200580003955A
Authority: CN (China)
Prior art keywords: characteristic parameter, time series, speech characteristic, distance, speaker
Legal status: Expired - Fee Related
Other languages: Chinese (zh)
Other versions: CN1914667A
Inventors: 柿野友成, 伊久美智则
Current Assignee: Toshiba TEC Corp
Original Assignee: Toshiba TEC Corp
Application filed by Toshiba TEC Corp; published as CN1914667A; granted and published as CN100593194C

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates

Abstract

To realize high-accuracy speaker recognition, a DP matching section (11) determines an optimum matching sequence (F) that minimizes the sum of phonological distances, using the pitch time series of two characteristic parameter time series (A, B); a speaker-to-speaker distance calculating section then determines the sum of the individual (speaker) distances using this optimum matching sequence and the cepstrum coefficient time series of the two characteristic parameter time series (A, B); and an identifying section identifies the speaker on the basis of that sum. Phonological resolution and speaker resolution are thus made compatible and stable recognition performance is ensured, realizing high-accuracy speaker recognition.

Description

Speaker identification device and speaker identification method
Technical field
The present invention relates to a speaker identification device, a program, and a speaker identification method that identify a speaker using the individual information contained in a speech waveform.
Background technology
As text-dependent speaker identification devices, devices have been proposed that identify (verify) a speaker from an utterance of predetermined content, and in particular devices that identify the speaker by comparing characteristic parameter time series extracted from the speech.
In such a speaker identification device, the speech waveform used for recognition is generally divided into frames of several milliseconds each, various acoustic parameters (for example, cepstrum coefficients) are computed for each frame as characteristic parameters (speech characteristic parameters), and speaker identification is performed using these parameters as time-series data over the whole speech interval.
A characteristic parameter generally carries phonological information in the first instance and individual (speaker-specific) information in the second. When speaker identification, which depends on the individual information, uses such characteristic parameters, stable recognition performance cannot be ensured unless the phonological information is eliminated from them.
Therefore, in existing text-dependent speaker identification devices, to eliminate the phonological information, the distance between identical phonemes is computed by a time-normalization method (DP matching) that nonlinearly stretches and shrinks the time axes of the characteristic parameter time series being compared (see non-patent literature 1). As shown in Fig. 6, a DP matching unit 200 that performs the DP matching finds a matching pattern (DP path) such that the distance between the two characteristic parameter time series A and B being compared becomes minimum. In the DP matching algorithm, the DP path is obtained and the minimized distance is computed at the same time. An identification unit 201 identifies the speaker on the basis of this minimized distance.
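The conventional scheme described above can be sketched in a few lines. This is an illustrative reimplementation, not code from the patent; it shows how, in ordinary DP matching, the optimal path and the minimized distance fall out of one and the same computation, which is exactly the coupling the invention later breaks apart:

```python
import numpy as np

def dtw_min_distance(a, b):
    """Classic DP (DTW) matching as in the conventional recognizer of
    Fig. 6: the optimal path and the minimized distance are obtained in a
    single pass.  `a` and `b` are 1-D feature sequences."""
    I, J = len(a), len(b)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # symmetric local path constraint (an assumed, standard choice)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[I, J] / (I + J)   # length-normalized minimized distance
```

Because the same distance is both minimized and then used for the speaker decision, the warping that minimizes it also erases speaker-specific timing, which is the problem the Summary section addresses.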
Non-patent literature 1: Sadaoki Furui, "Speech and Speech Information Processing", Morikita Publishing Co., Ltd., first edition, pp. 91-93.
Summary of the invention
However, because conventional DP matching minimizes the distance between the two characteristic parameter time series being compared, it is ill-suited as a method for speaker identification, whose purpose is not to decide whether the utterances are the same words. That is, excessive time warping destroys the temporal structure peculiar to each speaker's words, and as a result the differences between speakers cannot be fully reflected in the distance. To address this, a method that restricts the time warping (a matching window) has also been used, but it conversely risks matching different phonemes to each other even for the same speaker. These problems arise because the distance used to optimize the DP path and the distance used to discriminate the speaker are computed by the same method, making high-accuracy speaker identification difficult.
The object of the present invention is to realize high-accuracy speaker identification.
The present invention is a speaker identification device that identifies a speaker on the basis of the distance between a first speech characteristic parameter time series and a second speech characteristic parameter time series, characterized in that the device comprises: a unit that, for a matching sequence associating the individual speech characteristic parameters of the first speech characteristic parameter time series with those of the second, computes a first distance between each pair of associated parameters using their respective first speech characteristic parameter groups, and computes the sum of these first distances; a unit that finds the best matching sequence, that is, the matching sequence that minimizes the sum of the first distances; a unit that, following the best matching sequence, computes a second distance between each pair of parameters using the respective second speech characteristic parameter groups of the first and second speech characteristic parameter time series, and computes the sum of these second distances; and a unit that identifies the speaker on the basis of the computed sum of the second distances.
From another point of view, the present invention is a computer-readable program implementing a speaker identification function that identifies a speaker on the basis of the distance between a first speech characteristic parameter time series and a second speech characteristic parameter time series, characterized in that the program causes the computer to execute: a function that, for a matching sequence associating the individual speech characteristic parameters of the two time series, computes a first distance between each pair of associated parameters using their respective first speech characteristic parameter groups, and computes the sum of these first distances; a function that finds the best matching sequence minimizing the sum of the first distances; a function that, following the best matching sequence, computes a second distance between each pair of parameters using the respective second speech characteristic parameter groups, and computes the sum of these second distances; and a function that identifies the speaker on the basis of the computed sum of the second distances.
From yet another point of view, the present invention is a speaker identification method that identifies a speaker on the basis of the distance between a first speech characteristic parameter time series and a second speech characteristic parameter time series, characterized in that the method comprises: a step of, for a matching sequence associating the individual speech characteristic parameters of the two time series, computing a first distance between each pair of associated parameters using their respective first speech characteristic parameter groups, and computing the sum of these first distances; a step of finding the best matching sequence minimizing the sum of the first distances; a step of, following the best matching sequence, computing a second distance between each pair of parameters using the respective second speech characteristic parameter groups, and computing the sum of these second distances; and a step of identifying the speaker on the basis of the computed sum of the second distances.
Description of drawings
Fig. 1 is a block diagram showing the structure of the speaker identification device of the first embodiment of the present invention.
Fig. 2 is a block diagram showing the structure of the speaker identification unit of the speaker identification device of the first embodiment of the present invention.
Fig. 3 is a block diagram showing the structure of the speaker identification unit of the speaker identification device of the second embodiment of the present invention.
Fig. 4 is a schematic diagram showing the structure of a characteristic parameter.
Fig. 5 is a block diagram showing an example structure of the speaker identification device when the present invention is implemented in software.
Fig. 6 is a block diagram showing the structure of part of an existing speaker identification device.
Embodiment
The first embodiment of the present invention is described with reference to Figs. 1 and 2. Fig. 1 is a block diagram showing the structure of the speaker identification device of the present embodiment, and Fig. 2 is a block diagram showing the structure of its speaker identification unit. The speaker identification device of the present embodiment is an example of a text-dependent speaker identification device.
As shown in Fig. 1, the speaker identification device 100 comprises a microphone 1, a low-pass filter 2, an A/D conversion unit 3, a characteristic parameter generation unit 4, a speaker identification unit 5, a speaker model generation unit 6, and a storage unit 7.
The microphone 1 converts the input speech into an electrical analog signal. The low-pass filter 2 removes frequency components above a specified cut-off frequency from the input analog signal and outputs the result. The A/D conversion unit 3 converts the input analog signal into a digital signal at a specified sampling frequency and quantization bit depth. The microphone 1, low-pass filter 2, and A/D conversion unit 3 together constitute the speech input section.
The characteristic parameter generation unit 4 successively extracts characteristic parameters containing individual information from the input digital signal and outputs a characteristic parameter time series (a sequence of feature vectors). In the present embodiment, the characteristic parameter generation unit 4 performs frame analysis on the speech waveform within the voiced interval to compute the Δ-pitch and 16 cepstrum coefficients, generating a characteristic parameter time series consisting of a Δ-pitch time series and 16 cepstrum coefficient time series. The order of the cepstrum coefficient time series is not limited to 16.
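As an illustration of this kind of frame analysis, the sketch below computes per-frame cepstrum coefficients only; the Δ-pitch extraction is omitted, and the frame length, hop, window, and use of the real cepstrum are all assumptions, since the text does not fix the analysis conditions:

```python
import numpy as np

def frame_cepstra(signal, sr=16000, frame_ms=25, hop_ms=10, n_ceps=16):
    """Illustrative frame analysis (not the patented implementation):
    split a waveform into overlapping frames and take the first 16
    real-cepstrum coefficients of each frame."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    ceps = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * np.hamming(frame)
        spec = np.abs(np.fft.rfft(x)) + 1e-10        # avoid log(0)
        c = np.fft.irfft(np.log(spec))               # real cepstrum
        ceps.append(c[1:n_ceps + 1])                 # quefrencies 1..16
    return np.array(ceps)                            # shape (frames, 16)
```

The resulting array plays the role of one of the cepstrum coefficient time series described in the text.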
The speaker model generation unit 6 generates a speaker model from the characteristic parameter time series generated by the characteristic parameter generation unit 4 and the enrolled speaker's ID. The storage unit 7 stores (registers) the speaker models generated by the speaker model generation unit 6. In the present embodiment, the speaker models are registered in the storage unit 7 in advance.
The speaker identification unit 5 computes the distance between the characteristic parameter time series generated by the characteristic parameter generation unit 4 and a speaker model registered in advance in the storage unit 7, identifies the speaker on the basis of this distance, and outputs the result as the speaker identification result.
As shown in Fig. 2, the speaker identification unit 5 comprises a DP matching unit 11, an inter-speaker distance calculation unit 12, and an identification unit 13. These units implement the various means (or steps) described above.
The characteristic parameter time series A and B are each input to the DP matching unit 11 and the inter-speaker distance calculation unit 12. Both include a Δ-pitch time series. In the present embodiment, the characteristic parameter time series A is generated from the speech waveform input through the microphone 1, and the characteristic parameter time series B is the feature data of a speaker model registered in the storage unit 7. Here, A is the first speech characteristic parameter time series and B is the second speech characteristic parameter time series. They are written as follows.
Characteristic parameter time series:

$$A = \alpha_1, \alpha_2, \dots, \alpha_i, \dots, \alpha_I$$
$$B = \beta_1, \beta_2, \dots, \beta_j, \dots, \beta_J$$

Characteristic parameters:

$$\alpha_i = (p_i,\ \alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ik}, \dots, \alpha_{i16})$$
$$\beta_j = (q_j,\ \beta_{j1}, \beta_{j2}, \dots, \beta_{jk}, \dots, \beta_{j16})$$

Each parameter α_i, β_j consists of the Δ-pitch (p_i, q_j) and the 16 cepstrum coefficients (α_i1 to α_i16, β_j1 to β_j16) obtained by frame analysis of the speech waveform in the voiced interval. The characteristic parameter time series A and B therefore each consist of a Δ-pitch time series and 16 cepstrum coefficient time series. Comparatively, the Δ-pitch carries more phonological information, while the cepstrum coefficients carry more individual information.
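A minimal sketch of this data layout; the array shapes and the random values are illustrative assumptions, but the split into the two feature groups mirrors the definition above:

```python
import numpy as np

# Hypothetical layout of the feature vectors alpha_i = (p_i, alpha_i1..alpha_i16):
# column 0 holds the Δ-pitch used by the DP matching unit 11, and columns
# 1..16 hold the cepstrum coefficients used by the inter-speaker distance
# calculation unit 12.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 17))       # utterance A: I = 30 frames
delta_pitch = A[:, 0]               # first speech characteristic parameter group
cepstra = A[:, 1:]                  # second speech characteristic parameter group
assert delta_pitch.shape == (30,) and cepstra.shape == (30, 16)
```

The point of the layout is that the two downstream units read disjoint slices of the same time series.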
The DP matching unit 11 performs DP matching so that the phonemes of the two characteristic parameter time series A and B correspond. In doing so, the DP matching algorithm optimizes the matching so that the sum D(F) of the phonological distances d(i, j), which serve as the first distances, becomes minimum, and thereby finds the best matching sequence F.
Here, the best matching sequence F is defined as a sequence of time-correspondence factors c_n as in formula (1), the phonological distance d(i, j) between characteristic parameters is defined using the Δ-pitch as in formula (2), and the sum D(F) is defined as in formula (3). That is, F, d(i, j), and D(F) are obtained by formulas (1), (2), and (3), respectively.
[Formula 1]

$$F = c_1, c_2, \dots, c_n, \dots, c_N, \qquad c_n = (i_n, j_n) \qquad (1)$$

[Formula 2]

$$d(i, j) = |p_i - q_j| \qquad (2)$$

[Formula 3]

$$D(F) = \frac{1}{I+J} \sum_{n=1}^{N} d(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} d(i_n, j_n) \qquad (3)$$
More specifically, the DP matching unit 11 uses the respective Δ-pitch time series of the two characteristic parameter time series A and B to compute the phonological distance d(i, j) by formula (2) and its sum D(F) by formula (3). It then optimizes over formulas (3) and (1) so that D(F) becomes minimum, thereby obtaining the best matching sequence F. Here, the Δ-pitch time series is the first speech characteristic parameter group.
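The optimization of formulas (1) to (3) is ordinary DP matching restricted to the Δ-pitch component, except that the path F itself is returned for later use. A sketch, with symmetric local path constraints assumed (the patent does not fix them):

```python
import numpy as np

def optimum_matching_sequence(p, q):
    """Formulas (1)-(3): DP matching on the Δ-pitch sequences p (length I)
    and q (length J) with d(i,j) = |p_i - q_j|.  Returns the best matching
    sequence F = [(i_1,j_1), ..., (i_N,j_N)] and D(F) = sum(d) / (I+J)."""
    I, J = len(p), len(q)
    D = np.full((I, J), np.inf)
    back = {}
    D[0, 0] = abs(p[0] - q[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = [(D[i - 1, j], (i - 1, j)) if i > 0 else (np.inf, None),
                     (D[i, j - 1], (i, j - 1)) if j > 0 else (np.inf, None),
                     (D[i - 1, j - 1], (i - 1, j - 1)) if i > 0 and j > 0 else (np.inf, None)]
            best, prev = min(cands, key=lambda t: t[0])
            D[i, j] = best + abs(p[i] - q[j])
            back[(i, j)] = prev
    # trace the optimal path back from (I-1, J-1)
    F, node = [], (I - 1, J - 1)
    while node is not None:
        F.append(node)
        node = back.get(node)
    F.reverse()
    return F, D[I - 1, J - 1] / (I + J)
```

Unlike the conventional scheme, the returned D(F) is discarded for the final decision; only F is handed to the inter-speaker distance calculation.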
The inter-speaker distance calculation unit 12 uses the best matching sequence F obtained by the DP matching unit 11 to compute the sum E(F) of the individual distances e(i, j), which serve as the second distances. Here, the individual distance e(i, j) is defined as in formula (4), and the sum E(F) as in formula (5). That is, e(i, j) and E(F) are obtained by formulas (4) and (5), respectively.
[Formula 4]

$$e(i, j) = \left[ \sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2 \right]^{1/2} \qquad (4)$$

[Formula 5]

$$E(F) = \frac{1}{I+J} \sum_{n=1}^{N} e(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} e(i_n, j_n) \qquad (5)$$
More specifically, the inter-speaker distance calculation unit 12 uses the respective cepstrum coefficient time series of the two characteristic parameter time series A and B to compute the individual distance e(i, j) by formula (4) and, following the best matching sequence F, its sum E(F) by formula (5). In the present embodiment, the 1st- to 16th-order cepstrum coefficient time series are used. The cepstrum coefficient time series is the second speech characteristic parameter group.
The identification unit 13 identifies the speaker on the basis of the sum E(F) of the individual distances obtained by the inter-speaker distance calculation unit 12, and outputs the result as the speaker identification result. For example, the sum E(F) is compared with a threshold to make the speaker identification decision (speaker verification).
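Given the best matching sequence F, formulas (4) and (5) and the threshold test reduce to a few lines. This is an illustrative sketch; the threshold value is an assumption, not taken from the patent:

```python
import numpy as np

def speaker_distance(cep_a, cep_b, F):
    """Formulas (4)-(5): sum of the Euclidean distances between the
    16-dimensional cepstrum vectors of the frame pairs in the best
    matching sequence F, normalized by I + J."""
    I, J = len(cep_a), len(cep_b)
    total = sum(float(np.linalg.norm(cep_a[i] - cep_b[j])) for i, j in F)
    return total / (I + J)

def identify(cep_a, cep_b, F, threshold=1.0):
    """Threshold comparison of the identification unit 13; the value 1.0
    is an illustrative assumption."""
    return speaker_distance(cep_a, cep_b, F) <= threshold
```

Note that F comes from the Δ-pitch matching while the distance is taken over the cepstra, so the path optimization and the speaker decision use different distances, as the text emphasizes.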
Thus, according to the present embodiment, the respective Δ-pitch time series of the two characteristic parameter time series A and B are used to find the best matching sequence F that minimizes the sum D(F) of the phonological distances, and this best matching sequence together with the respective cepstrum coefficient time series of A and B is used to compute the sum E(F) of the individual distances, on the basis of which the speaker is identified. In this way, phonological resolution when matching the speech characteristic parameter time series A and B coexists with speaker resolution when computing the distance between them, so stable recognition performance can be ensured and high-accuracy speaker identification realized. Moreover, because the distance used to optimize the DP path and the distance used to discriminate the speaker are obtained by different methods, the differences between speakers can be fully reflected in the distance, and correspondences between different phonemes for the same speaker can be suppressed, which also contributes to high-accuracy speaker identification.
Here, when the characteristic parameters used for the phonological distance and for the individual distance are independent of each other, matching slips (time offsets) are likely to occur at positions where the characteristic parameters vary greatly. In such cases, modifying the individual distance e(i, j) as in formula (6) applies a slight "averaging" effect and can relieve the matching slip.
[Formula 6]

$$e(i, j) = \min\left\{ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j-1)k})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j+1)k})^2\right]^{1/2} \right\} \qquad (6)$$
Furthermore, by applying this "averaging" effect symmetrically on both time axes, a still more stable distance can be obtained. In this case, the individual distance e(i, j) is modified as in formula (7): the averaged distance is defined as the mean of the two one-sided minima.
[Formula 7]

$$e(i, j) = \frac{1}{2}\left[ \min\left\{ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j-1)k})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j+1)k})^2\right]^{1/2} \right\} + \min\left\{ \left[\sum_{k=1}^{16} (\alpha_{(i-1)k} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{(i+1)k} - \beta_{jk})^2\right]^{1/2} \right\} \right] \qquad (7)$$
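Formulas (6) and (7) can be sketched as below; the clipping behaviour at the sequence ends is an assumption, since the text does not specify it:

```python
import numpy as np

def dist(x, y):
    """Euclidean distance between two cepstrum vectors."""
    return float(np.linalg.norm(x - y))

def e_min(cep_a, cep_b, i, j):
    """Formula (6): minimum distance from frame i of A to the neighbours
    j-1, j, j+1 of frame j in B, tolerating a one-frame matching slip."""
    js = [jj for jj in (j - 1, j, j + 1) if 0 <= jj < len(cep_b)]
    return min(dist(cep_a[i], cep_b[jj]) for jj in js)

def e_sym(cep_a, cep_b, i, j):
    """Formula (7): mean of the two one-sided minima, applying the
    'averaging' effect symmetrically on both time axes."""
    is_ = [ii for ii in (i - 1, i, i + 1) if 0 <= ii < len(cep_a)]
    i_min = min(dist(cep_a[ii], cep_b[j]) for ii in is_)
    return 0.5 * (e_min(cep_a, cep_b, i, j) + i_min)
```

Either variant can replace the plain Euclidean e(i, j) inside the E(F) summation without changing anything else.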
In the present embodiment, the characteristic parameter time series A (the first speech characteristic parameter time series) and the characteristic parameter time series B (the second speech characteristic parameter time series) comprise a fundamental-frequency information time series obtained from the fundamental frequency of the speech and a resonance information time series obtained from the resonance information of the vocal tract; the first speech characteristic parameter group is the fundamental-frequency information time series and the second speech characteristic parameter group is the resonance information time series, so high-accuracy speaker identification can be realized reliably.
In the present embodiment, the characteristic parameter time series A and the characteristic parameter time series B comprise the Δ-pitch time series obtained from the intonation information of the speech and the cepstrum coefficient time series obtained from the resonance information of the vocal tract, and the phonological distance d (the first distance) and the individual distance e (the second distance) are obtained by

[Formula 8]

$$d = |p_k - q_k|$$

$$e = \left[ \sum_{k=k_0}^{K} (a_k - b_k)^2 \right]^{1/2}, \qquad k_0 \ge 1$$

where d and e are the first and second distances, p and q are the Δ-pitches of the first and second speech characteristic parameter time series, a_k and b_k are the cepstrum coefficients of the first and second speech characteristic parameter time series, and K is the cepstrum order. High-accuracy speaker identification can therefore be realized all the more reliably.
In the present embodiment, the individual distance e(i, j) between the i-th parameter α_i of the characteristic parameter time series A and the j-th parameter β_j of the characteristic parameter time series B is obtained by

[Formula 9]

$$e(i, j) = \min\{\, \mathrm{dist}(i, j-L),\ \mathrm{dist}(i, j-L+1),\ \dots,\ \mathrm{dist}(i, j),\ \dots,\ \mathrm{dist}(i, j+L-1),\ \mathrm{dist}(i, j+L) \,\}$$

where dist(X, Y) is the distance between speech characteristic parameters X and Y and L (> 0) is the averaging width. The matching slip can thereby be relieved.
In addition, when the individual distance e(i, j) between the i-th parameter α_i of the characteristic parameter time series A and the j-th parameter β_j of the characteristic parameter time series B is obtained by

[Formula 10]

$$e(i, j) = \frac{1}{2}\left[ \min\{\, \mathrm{dist}(i-L, j),\ \dots,\ \mathrm{dist}(i, j),\ \dots,\ \mathrm{dist}(i+L, j) \,\} + \min\{\, \mathrm{dist}(i, j-L),\ \dots,\ \mathrm{dist}(i, j),\ \dots,\ \mathrm{dist}(i, j+L) \,\} \right]$$

where dist(X, Y) is the distance between speech characteristic parameters X and Y and L (> 0) is the averaging width, a still more stable distance can be obtained.
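A sketch of the generalized form with averaging width L: the function below implements formula (10), and dropping the i-axis term gives formula (9). Clipping the windows at the sequence ends is an assumption:

```python
import numpy as np

def e_window(cep_a, cep_b, i, j, L=1):
    """Formulas (9)-(10): minimum of dist over an averaging width L along
    the j axis, averaged with the same minimum taken along the i axis.
    With L = 1 this reduces to the three-term minima of formulas (6)-(7)."""
    def dist(x, y):
        return float(np.linalg.norm(x - y))
    j_min = min(dist(cep_a[i], cep_b[jj])
                for jj in range(max(0, j - L), min(len(cep_b), j + L + 1)))
    i_min = min(dist(cep_a[ii], cep_b[j])
                for ii in range(max(0, i - L), min(len(cep_a), i + L + 1)))
    return 0.5 * (j_min + i_min)
```

Larger L tolerates larger matching slips at the cost of blurring genuinely different frames together, so L trades robustness against discrimination.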
The second embodiment of the present invention is described with reference to Figs. 3 and 4. Fig. 3 is a block diagram showing the structure of the speaker identification unit of the speaker identification device of the present embodiment, and Fig. 4 is a schematic diagram showing the structure of a characteristic parameter.
The present embodiment is a variation of the speaker identification unit 5 described in the first embodiment. Parts identical to those of the first embodiment are denoted by the same reference symbols, and description of everything other than the speaker identification unit 5 is omitted. In the present embodiment, the characteristic parameter generation unit 4 performs frame analysis on the speech waveform in the voiced interval to compute 16 cepstrum coefficients, generating a characteristic parameter time series consisting of 16 cepstrum coefficient time series. The order of the cepstrum coefficient time series is not limited to 16.
As shown in Fig. 3, the speaker identification unit 5, essentially as in the first embodiment, comprises a DP matching unit 11, an inter-speaker distance calculation unit 12, and an identification unit 13. These units implement the various means (or steps).
The characteristic parameter time series A and B are each input to the DP matching unit 11 and the inter-speaker distance calculation unit 12. In the present embodiment, the characteristic parameter time series A is generated from the speech waveform input through the microphone 1, and the characteristic parameter time series B is the feature data of a speaker model registered in the storage unit 7. Here, A is the first speech characteristic parameter time series and B is the second speech characteristic parameter time series. They are written as follows.
Characteristic parameter time series:

$$A = \alpha_1, \alpha_2, \dots, \alpha_i, \dots, \alpha_I$$
$$B = \beta_1, \beta_2, \dots, \beta_j, \dots, \beta_J$$

Characteristic parameters:

$$\alpha_i = (\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ik}, \dots, \alpha_{i16})$$
$$\beta_j = (\beta_{j1}, \beta_{j2}, \dots, \beta_{jk}, \dots, \beta_{j16})$$

Each parameter α_i, β_j consists of the 16 cepstrum coefficients (α_i1 to α_i16, β_j1 to β_j16) obtained by frame analysis of the speech waveform in the voiced interval. The characteristic parameter time series A and B are therefore time series of 16 cepstrum coefficients. Here, the 1st- to 8th-order cepstrum coefficient time series are the low-order cepstrum coefficient time series, and the m-th- to 16th-order (m > 8) cepstrum coefficient time series are the high-order cepstrum coefficient time series.
The DP matching unit 11 performs DP matching so that the phonemes of the two characteristic parameter time series A and B correspond. In doing so, the DP matching algorithm optimizes the matching so that the sum D(F) of the phonological distances d(i, j), which serve as the first distances, becomes minimum, and thereby finds the best matching sequence F.
Here, the best matching sequence F is defined as a sequence of time-correspondence factors c_n as in formula (1), the phonological distance d(i, j) between characteristic parameters is defined using the low-order cepstrum coefficients as in formula (8), and the sum D(F) is defined as in formula (3). That is, F, d(i, j), and D(F) are obtained by formulas (1), (8), and (3), respectively.
[Formula 11]

$$F = c_1, c_2, \dots, c_n, \dots, c_N, \qquad c_n = (i_n, j_n) \qquad (1)$$

[Formula 12]

$$d(i, j) = \left[ \sum_{k=1}^{8} (\alpha_{ik} - \beta_{jk})^2 \right]^{1/2} \qquad (8)$$

[Formula 13]

$$D(F) = \frac{1}{I+J} \sum_{n=1}^{N} d(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} d(i_n, j_n) \qquad (3)$$
More specifically, the DP matching unit 11 uses the respective low-order cepstrum coefficient time series (the 1st- to 8th-order cepstrum coefficient time series) of the two characteristic parameter time series A and B to compute the phonological distance d(i, j) by formula (8) and its sum D(F) by formula (3). It then optimizes over formulas (3) and (1) so that D(F) becomes minimum, thereby obtaining the best matching sequence F. Here, the low-order cepstrum coefficient time series is the first speech characteristic parameter group.
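The low-order phonological distance of formula (8) can be sketched as below; the cut-off of 8 follows the embodiment, and exposing it as a parameter is our generalization:

```python
import numpy as np

def d_low_order(alpha_i, beta_j, n_low=8):
    """Formula (8) of the second embodiment: phonological distance as the
    Euclidean distance over the low-order (1st to 8th) cepstrum
    coefficients only."""
    a = np.asarray(alpha_i, dtype=float)
    b = np.asarray(beta_j, dtype=float)
    return float(np.linalg.norm(a[:n_low] - b[:n_low]))
```

Two vectors that differ only in their high-order coefficients are identical under this distance, which is what lets the speaker-specific information survive the matching stage untouched.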
The inter-speaker distance calculation unit 12 uses the best matching sequence F obtained by the DP matching unit 11 to compute the sum E(F) of the individual distances e(i, j). Here, the individual distance e(i, j) is defined as in formula (4), and the sum E(F) as in formula (5). That is, e(i, j) and E(F) are obtained by formulas (4) and (5), respectively.
[Formula 14]

$$e(i, j) = \left[ \sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2 \right]^{1/2} \qquad (4)$$

[Formula 15]

$$E(F) = \frac{1}{I+J} \sum_{n=1}^{N} e(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} e(i_n, j_n) \qquad (5)$$
More specifically, the inter-speaker distance calculation unit 12 uses cepstrum coefficient time series that include the respective high-order cepstrum coefficient time series (the m-th- to 16th-order, m > 8) of the two characteristic parameter time series A and B, computing the individual distance e(i, j) by formula (4) and, following the best matching sequence F, its sum E(F) by formula (5). In the present embodiment, the 1st- to 16th-order cepstrum coefficient time series are used. The high-order cepstrum coefficients generally carry more individual information than the low-order ones. The cepstrum coefficient time series is the second speech characteristic parameter group.
Here, as shown in Figure 4, in having 1~N time the characteristic parameter of cepstrum coefficient, under with 1~n time the situation of cepstrum coefficient as the cepstrum coefficient (Fig. 4 (a) bend part) of low order, the cepstrum coefficient of high order is m~N (cepstrum coefficient that m>n) is inferior.The cepstrum coefficient of this high order is the cepstrum coefficient time series of high order by the sequence of time seriesization.Thereby, the cepstrum coefficient seasonal effect in time series cepstrum coefficient time series that comprises high order also can be only by m~N (time series that the inferior cepstrum coefficient (netting twine part among Fig. 4 (b)) of m>n) constitutes, perhaps also can be by m~N (time series that the part of the cepstrum coefficient that m>n) is inferior and the cepstrum coefficient of low order (netting twine part among Fig. 4 (c)) constitutes, and then also can be time series by 1~N time cepstrum coefficient (netting twine part among Fig. 4 (d)) formation.In addition, in the present embodiment, be set at N=16 and n=8, but be not limited thereto.
The identification unit 13 performs speaker identification based on the summation E(F) of the individual distances obtained by the distance calculation unit 12, and outputs the result as the speaker recognition result. For example, E(F) is compared with a threshold to decide the speaker verification outcome.
Thus, according to the present embodiment, the low-order cepstrum coefficient time series of the two feature parameter time series A and B are used to find the best matching sequence F that minimizes the summation D(F) of the phoneme distance, and this best matching sequence, together with the cepstrum coefficient time series containing the high-order coefficients of A and B, is then used to compute the summation E(F) of the individual distance, on which speaker identification is based. In this way, phoneme-discriminating performance during the alignment of the speech feature parameter time series A and B coexists with speaker-discriminating performance when computing the distance between them, so stable recognition performance is ensured and highly accurate speaker recognition can be realized. Moreover, because the distance used to optimize the DP path and the distance used to discriminate speakers are obtained by different methods, inter-speaker differences are fully reflected in the final distance while spurious correspondences between different phonemes of the same speaker are suppressed, again enabling highly accurate speaker recognition.
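The overall two-stage procedure can be sketched as follows. This is an illustrative Python reconstruction under simplifying assumptions, not the patent's implementation: it uses a plain symmetric DP step pattern with no slope or window constraints, plain Euclidean distances, and 16-order cepstral frames split at order 8:

```python
import numpy as np

def two_stage_dtw(A, B, n_low=8):
    """Stage 1: find the best DP path F using only low-order cepstra
    (phoneme distance). Stage 2: re-score that same path with
    high-order cepstra (individual distance) to get E(F).
    A, B: (I, 16) and (J, 16) arrays of cepstral frames."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    I, J = len(A), len(B)
    dlo = lambda i, j: float(np.linalg.norm(A[i, :n_low] - B[j, :n_low]))
    ehi = lambda i, j: float(np.linalg.norm(A[i, n_low:] - B[j, n_low:]))

    # Stage 1: dynamic programming on the low-order (phoneme) distance.
    D = np.full((I, J), np.inf)
    back = {}
    D[0, 0] = dlo(0, 0)
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            cands = [(p, q) for p, q in cands if p >= 0 and q >= 0]
            p, q = min(cands, key=lambda c: D[c[0], c[1]])
            D[i, j] = D[p, q] + dlo(i, j)
            back[(i, j)] = (p, q)

    # Recover the best matching sequence F by backtracking.
    F, cell = [], (I - 1, J - 1)
    while cell != (0, 0):
        F.append(cell)
        cell = back[cell]
    F.append((0, 0))
    F.reverse()

    # Stage 2: sum the individual (high-order) distance along F.
    E = sum(ehi(i, j) for i, j in F) / (I + J)
    return F, E
```

For identical inputs the path is the diagonal and E(F) is zero; for different speakers uttering the same text, the low-order stage still aligns matching phonemes while the high-order re-scoring exposes the speaker difference.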
In the present embodiment, the feature parameter time series A serving as the first speech feature parameter time series and the feature parameter time series B serving as the second are cepstrum coefficient time series obtained from the resonance information of the vocal tract; the first speech feature parameter group is the low-order part of the cepstrum coefficient time series and the second speech feature parameter group is the part containing the high-order coefficients, so highly accurate speaker recognition can be achieved reliably.
In the present embodiment, the feature parameter time series A serving as the first speech feature parameter time series and the feature parameter time series B serving as the second are cepstrum coefficient time series obtained from the resonance information of the vocal tract, and the phoneme distance d serving as the first distance and the individual distance e serving as the second distance are obtained by
[formula 16]
d = [ Σ_{k=1}^{N} (a_k − b_k)² ]^{1/2}
e = [ Σ_{k=k0}^{M} (a_k − b_k)² ]^{1/2}
N < M, k0 ≥ 1
d, e: first distance, second distance
a_k: cepstrum coefficient of the first speech feature parameter time series
b_k: cepstrum coefficient of the second speech feature parameter time series
k: cepstral order
so that highly accurate speaker recognition can be realized reliably.
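The two Euclidean distances of formula 16 differ only in which cepstral orders they sum over. A minimal Python sketch (illustrative; the values N = 8, k0 = 9, M = 16 are example choices consistent with the embodiment, not mandated by the formula):

```python
import numpy as np

def phoneme_distance(a, b, N=8):
    """First distance d: Euclidean distance over low orders 1..N.
    a, b: length-M vectors holding cepstral orders 1..M at [0..M-1]."""
    return float(np.sqrt(np.sum((a[:N] - b[:N]) ** 2)))

def individual_distance(a, b, k0=9, M=16):
    """Second distance e: Euclidean distance over orders k0..M,
    with k0 >= 1 and N < M."""
    return float(np.sqrt(np.sum((a[k0 - 1:M] - b[k0 - 1:M]) ** 2)))
```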
The present invention is not limited to the specific hardware configuration of the embodiment described above; it can also be realized in software. That is, the function of the speaker identification unit 5 (the speaker recognition function) can be implemented in software. Fig. 5 is a block diagram showing an example configuration of a speaker recognition device 100 when the present invention is realized by software.
As shown in Fig. 5, the speaker recognition device 100 includes a CPU 101 that centrally controls each part of the device; the CPU 101 is connected via a bus to a memory 102 consisting of a ROM storing the BIOS etc. and a rewritable RAM storing various data, together forming a microcomputer. The CPU 101 is further connected, via an I/O bus not shown, to an HDD (Hard Disk Drive) 103, a CD-ROM drive 105 that reads a CD-ROM (Compact Disc ROM) 104 as a computer-readable storage medium, a communication device 106 handling communication between the speaker recognition device 100 and the Internet etc., a keyboard 107, a display device 108 such as a CRT or LCD, and the microphone 1.
A program realizing the speaker recognition function of the present invention is stored on a computer-readable storage medium such as the CD-ROM 104; by installing this program in the speaker recognition device 100, the CPU 101 can be made to execute the speaker recognition function of the present invention. Speech input from the microphone 1 is temporarily stored in the HDD 103 etc. When the program is started, the temporarily saved speech data is read from the HDD 103 etc. and the speaker recognition processing is carried out. This processing realizes the same functions as the feature parameter generation unit 4, the speaker identification unit 5, and the other parts, so the same effects as those of the embodiment described above can be obtained.
As the storage medium, not only the CD-ROM 104 but media of various other types may be used: optical discs such as DVDs, magneto-optical discs, magnetic disks such as floppy disks, semiconductor memories, and so on. The program may also be downloaded from a network such as the Internet and installed in the HDD 103; in that case, the storage device of the server on the transmitting side that holds the program also constitutes a storage medium of the present invention. The program may run on a prescribed OS (Operating System), handing over part of the processing described later to the OS, or may be included as part of a set of program files constituting a prescribed application such as a word processor, or constituting the OS itself.

Claims (8)

1. A speaker recognition device that performs speaker identification based on distances between a first speech feature parameter time series and a second speech feature parameter time series, characterized in that the speaker recognition device comprises:
means for setting a matching sequence that associates each speech feature parameter of the first speech feature parameter time series with each speech feature parameter of the second speech feature parameter time series, obtaining, using the respective first speech feature parameter groups and according to the matching sequence, a first distance between the speech feature parameters, and obtaining the summation of the first distance;
means for finding the best matching sequence such that the summation of the first distance becomes minimum;
means for obtaining, using the respective second speech feature parameter groups of the first speech feature parameter time series and the second speech feature parameter time series and according to the best matching sequence, a second distance between the speech feature parameters, and obtaining the summation of the second distance; and
means for performing speaker identification based on the obtained summation of the second distance.
2. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series each include a fundamental frequency information time series obtained from the fundamental frequency of speech and a resonance information time series obtained from the resonance information of the vocal tract,
the first speech feature parameter group is the fundamental frequency information time series, and
the second speech feature parameter group is the resonance information time series.
3. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series are cepstrum coefficient time series obtained from the resonance information of the vocal tract,
the first speech feature parameter group is the low-order part of the cepstrum coefficient time series, and
the second speech feature parameter group is the part of the cepstrum coefficient time series containing the high-order cepstrum coefficients.
4. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series each include a Δ-pitch time series obtained from the intonation information of speech and a cepstrum coefficient time series obtained from the resonance information of the vocal tract, and
the first distance d and the second distance e are obtained by
[formula 1]
d = |p_k − q_k|
e = [ Σ_{k=k0}^{K} (a_k − b_k)² ]^{1/2}
k0 ≥ 1
d, e: first distance, second distance
p_k: Δ pitch of the first speech feature parameter time series
q_k: Δ pitch of the second speech feature parameter time series
a_k: cepstrum coefficient of the first speech feature parameter time series
b_k: cepstrum coefficient of the second speech feature parameter time series
k: cepstral order.
5. The speaker recognition device according to claim 1, characterized in that
the second distance e(i, j) between the i-th speech feature parameter of the first speech feature parameter time series and the j-th speech feature parameter of the second speech feature parameter time series is obtained by
[formula 2]
e(i, j) = min{ dist(i, j−L), dist(i, j−L+1), …, dist(i, j), …, dist(i, j+L−1), dist(i, j+L) }
dist(X, Y): distance between speech feature parameters X and Y
L: matching width, L > 0.
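The windowed minimum of formula 2 can be sketched as below. This is an illustrative Python fragment with assumed names; boundary clamping at the ends of the sequence is my own assumption, since the claim does not specify the behavior when j ± L falls outside the sequence:

```python
import numpy as np

def windowed_e(A_high, B_high, i, j, L=1):
    """Second distance e(i, j): minimum high-order distance between
    frame i of A and frames j-L .. j+L of B, which absorbs small
    alignment errors of the low-order DP path."""
    J = len(B_high)
    js = range(max(0, j - L), min(J, j + L + 1))  # clamp to valid frames
    return min(float(np.linalg.norm(A_high[i] - B_high[q])) for q in js)
```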
6. The speaker recognition device according to claim 1, characterized in that
the second distance e(i, j) between the i-th speech feature parameter of the first speech feature parameter time series and the j-th speech feature parameter of the second speech feature parameter time series is obtained by
[formula 3]
e(i, j) = (1/2) [ min{ dist(i−L, j), dist(i−L+1, j), …, dist(i, j), …, dist(i+L−1, j), dist(i+L, j) } + min{ dist(i, j−L), dist(i, j−L+1), …, dist(i, j), …, dist(i, j+L−1), dist(i, j+L) } ]
dist(X, Y): distance between speech feature parameters X and Y
L: matching width, L > 0.
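The symmetric variant of formula 3 averages the windowed minimum taken along each axis, so the frame-shift tolerance is applied to both sequences. An illustrative Python sketch (names and boundary clamping assumed, as before):

```python
import numpy as np

def symmetric_windowed_e(A_high, B_high, i, j, L=1):
    """Average of two windowed minima: over frames i-L..i+L of A
    against frame j of B, and over frame i of A against frames
    j-L..j+L of B."""
    I, J = len(A_high), len(B_high)
    row = min(float(np.linalg.norm(A_high[p] - B_high[j]))
              for p in range(max(0, i - L), min(I, i + L + 1)))
    col = min(float(np.linalg.norm(A_high[i] - B_high[q]))
              for q in range(max(0, j - L), min(J, j + L + 1)))
    return 0.5 * (row + col)
```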
7. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series are cepstrum coefficient time series obtained from the resonance information of the vocal tract, and
the first distance d and the second distance e are obtained by
[formula 4]
d = [ Σ_{k=1}^{N} (a_k − b_k)² ]^{1/2}
e = [ Σ_{k=k0}^{M} (a_k − b_k)² ]^{1/2}
N < M, k0 ≥ 1
d, e: first distance, second distance
a_k: cepstrum coefficient of the first speech feature parameter time series
b_k: cepstrum coefficient of the second speech feature parameter time series
k: cepstral order.
8. A speaker recognition method that performs speaker identification based on distances between a first speech feature parameter time series and a second speech feature parameter time series, characterized in that the speaker recognition method comprises:
the step of setting a matching sequence that associates each speech feature parameter of the first speech feature parameter time series with each speech feature parameter of the second speech feature parameter time series, obtaining, using the respective first speech feature parameter groups and according to the matching sequence, a first distance between the speech feature parameters, and obtaining the summation of the first distance;
the step of finding the best matching sequence such that the summation of the first distance becomes minimum;
the step of obtaining, using the respective second speech feature parameter groups of the first speech feature parameter time series and the second speech feature parameter time series and according to the best matching sequence, a second distance between the speech feature parameters, and obtaining the summation of the second distance; and
the step of performing speaker identification based on the obtained summation of the second distance.
CN200580003955A 2004-06-01 2005-05-31 Speaker recognizing device and speaker recognizing method Expired - Fee Related CN100593194C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP163071/2004 2004-06-01
JP2004163071A JP3927559B2 (en) 2004-06-01 2004-06-01 Speaker recognition device, program, and speaker recognition method

Publications (2)

Publication Number Publication Date
CN1914667A CN1914667A (en) 2007-02-14
CN100593194C true CN100593194C (en) 2010-03-03

Family

ID=35463096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200580003955A Expired - Fee Related CN100593194C (en) 2004-06-01 2005-05-31 Speaker recognizing device and speaker recognizing method

Country Status (3)

Country Link
JP (1) JP3927559B2 (en)
CN (1) CN100593194C (en)
WO (1) WO2005119654A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102354496B (en) * 2011-07-01 2013-08-21 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN103730121B (en) * 2013-12-24 2016-08-24 中山大学 A kind of recognition methods pretending sound and device
JP6946499B2 (en) * 2020-03-06 2021-10-06 株式会社日立製作所 Speech support device, speech support method, and speech support program

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JPH0792678B2 (en) * 1985-12-17 1995-10-09 株式会社東芝 Voice pattern matching method
JP2543528B2 (en) * 1987-06-29 1996-10-16 沖電気工業株式会社 Voice recognition device
US5522012A (en) * 1994-02-28 1996-05-28 Rutgers University Speaker identification and verification system
JPH0786759B2 (en) * 1994-03-14 1995-09-20 株式会社東芝 Dictionary learning method for voice recognition
JPH1020883A (en) * 1996-07-02 1998-01-23 Fujitsu Ltd User authentication device
JPH1097274A (en) * 1996-09-24 1998-04-14 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for recognizing speaker
JP2001034294A (en) * 1999-07-21 2001-02-09 Matsushita Electric Ind Co Ltd Speaker verification device

Also Published As

Publication number Publication date
JP3927559B2 (en) 2007-06-13
CN1914667A (en) 2007-02-14
WO2005119654A1 (en) 2005-12-15
JP2005345598A (en) 2005-12-15

Similar Documents

Publication Publication Date Title
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
US6535852B2 (en) Training of text-to-speech systems
AU2002311452B2 (en) Speaker recognition system
Chan Using a text-to-speech synthesizer to generate a reverse Turing test
CN100593194C (en) Speaker recognizing device and speaker recognizing method
JP3130524B2 (en) Speech signal recognition method and apparatus for implementing the method
JP2009086581A (en) Apparatus and program for creating speaker model of speech recognition
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Abdullaeva et al. Formant set as a main parameter for recognizing vowels of the Uzbek language
Büyük et al. An Investigation of Multi-Language Age Classification from Voice.
CN110767238B (en) Blacklist identification method, device, equipment and storage medium based on address information
JP5749186B2 (en) Acoustic model adaptation device, speech recognition device, method and program thereof
Baby Investigating modulation spectrogram features for deep neural network-based automatic speech recognition
Singh et al. A novel algorithm using MFCC and ERB gammatone filters in speech recognition
Jiaqi et al. Research on intelligent voice interaction application system based on NAO robot
JP2561553B2 (en) Standard speaker selection device
Paulose et al. A comparative study of text-independent speaker recognition systems using Gaussian mixture modeling and i-vector methods
Fathoni et al. Optimization of Feature Extraction in Indonesian Speech Recognition Using PCA and SVM Classification
Anderson Auditory models with Kohonen SOFM and LVQ for speaker independent phoneme recognition
Saeta et al. New speaker-dependent threshold estimation method in speaker verification based on weighting scores
Swain et al. Supervised and Unsupervised Data Mining Techniques for Speaker Verification Using Prosodic+ Spectral Features
JP2005345683A (en) Speaker-recognizing device, program, and speaker-recognizing method
Patil et al. Teager energy mel cepstrum for identification of twins in Marathi
CN114203198A (en) Bi-quad type sound detection system
Blomberg et al. Investigating explicit model transformations for speaker normalization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100303

Termination date: 20130531