CN100593194C - Speaker recognizing device and speaker recognizing method - Google Patents


Info

Publication number: CN100593194C
Application number: CN200580003955A
Authority: CN (China)
Prior art keywords: characteristic parameter, time series, speech characteristic, distance, speaker
Legal status: Expired - Fee Related
Other languages: Chinese (zh)
Other versions: CN1914667A
Inventors: 柿野友成, 伊久美智则
Current Assignee: Toshiba TEC Corp
Original Assignee: Toshiba TEC Corp
Application filed by Toshiba TEC Corp; published as CN1914667A; granted and published as CN100593194C

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates

Abstract

To realize high-accuracy speaker recognition, a DP matching section (11) determines an optimum matching sequence (F) that minimizes the sum of phonological distances, using the pitch time series of two characteristic parameter time series (A, B); a speaker-to-speaker distance calculating section then determines the sum of the individual (speaker) distances using this optimum matching sequence and the cepstrum coefficient time series of the two characteristic parameter time series (A, B); and an identifying section identifies the speaker on the basis of that sum. Phonological resolution and speaker resolution are thus made compatible and stable recognition performance is ensured, realizing high-accuracy speaker recognition.

Description

Speaker identification device and speaker identification method
Technical field
The present invention relates to a speaker identification device, a program, and a speaker identification method that identify a speaker using the individual information contained in a speech waveform.
Background technology
As text-dependent speaker identification devices, devices have been proposed that identify (verify) a speaker from an utterance of predetermined content, and in particular devices that identify the speaker by comparing characteristic parameter time series extracted from the speech.
In such a speaker identification device, the speech waveform used for recognition is generally divided into frames of several milliseconds each, various acoustic parameters (for example, cepstrum coefficients) are computed for each frame as characteristic parameters (speech characteristic parameters), and speaker identification is performed using these parameters as time-series data over the whole speech interval.
A characteristic parameter generally carries phonological information in the first instance and individual (speaker-specific) information in the second. When speaker identification, which depends on the individual information, uses such characteristic parameters, stable recognition performance cannot be ensured unless the phonological information is eliminated from them.
Therefore, in existing text-dependent speaker identification devices, to eliminate the phonological information, the distance between identical phonemes is computed by a time-normalization method (DP matching) that nonlinearly stretches and shrinks the time axes of the characteristic parameter time series being compared (see non-patent literature 1). As shown in Fig. 6, a DP matching unit 200 that performs the DP matching finds a matching pattern (DP path) such that the distance between the two characteristic parameter time series A and B being compared becomes minimum. In the DP matching algorithm, the DP path is obtained and the minimized distance is computed at the same time. An identification unit 201 identifies the speaker on the basis of this minimized distance.
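The conventional scheme described above can be sketched in a few lines. This is an illustrative reimplementation, not code from the patent; it shows how, in ordinary DP matching, the optimal path and the minimized distance fall out of one and the same computation, which is exactly the coupling the invention later breaks apart:

```python
import numpy as np

def dtw_min_distance(a, b):
    """Classic DP (DTW) matching as in the conventional recognizer of
    Fig. 6: the optimal path and the minimized distance are obtained in a
    single pass.  `a` and `b` are 1-D feature sequences."""
    I, J = len(a), len(b)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # symmetric local path constraint (an assumed, standard choice)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[I, J] / (I + J)   # length-normalized minimized distance
```

Because the same distance is both minimized and then used for the speaker decision, the warping that minimizes it also erases speaker-specific timing, which is the problem the Summary section addresses.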
Non-patent literature 1: Sadaoki Furui, "Speech and Speech Information Processing", Morikita Publishing Co., Ltd., first edition, pp. 91-93.
Summary of the invention
However, because conventional DP matching minimizes the distance between the two characteristic parameter time series being compared, it is ill-suited as a method for speaker identification, whose purpose is not to decide whether the utterances are the same words. That is, excessive time warping destroys the temporal structure peculiar to each speaker's words, and as a result the differences between speakers cannot be fully reflected in the distance. To address this, a method that restricts the time warping (a matching window) has also been used, but it conversely risks matching different phonemes to each other even for the same speaker. These problems arise because the distance used to optimize the DP path and the distance used to discriminate the speaker are computed by the same method, making high-accuracy speaker identification difficult.
The object of the present invention is to realize high-accuracy speaker identification.
The present invention is a speaker identification device that identifies a speaker on the basis of the distance between a first speech characteristic parameter time series and a second speech characteristic parameter time series, characterized in that the device comprises: a unit that, for a matching sequence associating the individual speech characteristic parameters of the first speech characteristic parameter time series with those of the second, computes a first distance between each pair of associated parameters using their respective first speech characteristic parameter groups, and computes the sum of these first distances; a unit that finds the best matching sequence, that is, the matching sequence that minimizes the sum of the first distances; a unit that, following the best matching sequence, computes a second distance between each pair of parameters using the respective second speech characteristic parameter groups of the first and second speech characteristic parameter time series, and computes the sum of these second distances; and a unit that identifies the speaker on the basis of the computed sum of the second distances.
From another point of view, the present invention is a computer-readable program implementing a speaker identification function that identifies a speaker on the basis of the distance between a first speech characteristic parameter time series and a second speech characteristic parameter time series, characterized in that the program causes the computer to execute: a function that, for a matching sequence associating the individual speech characteristic parameters of the two time series, computes a first distance between each pair of associated parameters using their respective first speech characteristic parameter groups, and computes the sum of these first distances; a function that finds the best matching sequence minimizing the sum of the first distances; a function that, following the best matching sequence, computes a second distance between each pair of parameters using the respective second speech characteristic parameter groups, and computes the sum of these second distances; and a function that identifies the speaker on the basis of the computed sum of the second distances.
From yet another point of view, the present invention is a speaker identification method that identifies a speaker on the basis of the distance between a first speech characteristic parameter time series and a second speech characteristic parameter time series, characterized in that the method comprises: a step of, for a matching sequence associating the individual speech characteristic parameters of the two time series, computing a first distance between each pair of associated parameters using their respective first speech characteristic parameter groups, and computing the sum of these first distances; a step of finding the best matching sequence minimizing the sum of the first distances; a step of, following the best matching sequence, computing a second distance between each pair of parameters using the respective second speech characteristic parameter groups, and computing the sum of these second distances; and a step of identifying the speaker on the basis of the computed sum of the second distances.
Description of drawings
Fig. 1 is a block diagram showing the structure of the speaker identification device of the first embodiment of the present invention.
Fig. 2 is a block diagram showing the structure of the speaker identification unit of the speaker identification device of the first embodiment of the present invention.
Fig. 3 is a block diagram showing the structure of the speaker identification unit of the speaker identification device of the second embodiment of the present invention.
Fig. 4 is a schematic diagram showing the structure of a characteristic parameter.
Fig. 5 is a block diagram showing an example structure of the speaker identification device when the present invention is implemented in software.
Fig. 6 is a block diagram showing the structure of part of an existing speaker identification device.
Embodiment
The first embodiment of the present invention is described with reference to Figs. 1 and 2. Fig. 1 is a block diagram showing the structure of the speaker identification device of the present embodiment, and Fig. 2 is a block diagram showing the structure of its speaker identification unit. The speaker identification device of the present embodiment is an example of a text-dependent speaker identification device.
As shown in Fig. 1, the speaker identification device 100 comprises a microphone 1, a low-pass filter 2, an A/D conversion unit 3, a characteristic parameter generation unit 4, a speaker identification unit 5, a speaker model generation unit 6, and a storage unit 7.
The microphone 1 converts the input speech into an electrical analog signal. The low-pass filter 2 removes frequency components above a specified cut-off frequency from the input analog signal and outputs the result. The A/D conversion unit 3 converts the input analog signal into a digital signal at a specified sampling frequency and quantization bit depth. The microphone 1, low-pass filter 2, and A/D conversion unit 3 together constitute the speech input section.
The characteristic parameter generation unit 4 successively extracts characteristic parameters containing individual information from the input digital signal and outputs a characteristic parameter time series (a sequence of feature vectors). In the present embodiment, the characteristic parameter generation unit 4 performs frame analysis on the speech waveform within the voiced interval to compute the Δ-pitch and 16 cepstrum coefficients, generating a characteristic parameter time series consisting of a Δ-pitch time series and 16 cepstrum coefficient time series. The order of the cepstrum coefficient time series is not limited to 16.
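As an illustration of this kind of frame analysis, the sketch below computes per-frame cepstrum coefficients only; the Δ-pitch extraction is omitted, and the frame length, hop, window, and use of the real cepstrum are all assumptions, since the text does not fix the analysis conditions:

```python
import numpy as np

def frame_cepstra(signal, sr=16000, frame_ms=25, hop_ms=10, n_ceps=16):
    """Illustrative frame analysis (not the patented implementation):
    split a waveform into overlapping frames and take the first 16
    real-cepstrum coefficients of each frame."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    ceps = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * np.hamming(frame)
        spec = np.abs(np.fft.rfft(x)) + 1e-10        # avoid log(0)
        c = np.fft.irfft(np.log(spec))               # real cepstrum
        ceps.append(c[1:n_ceps + 1])                 # quefrencies 1..16
    return np.array(ceps)                            # shape (frames, 16)
```

The resulting array plays the role of one of the cepstrum coefficient time series described in the text.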
The speaker model generation unit 6 generates a speaker model from the characteristic parameter time series generated by the characteristic parameter generation unit 4 and the enrolled speaker's ID. The storage unit 7 stores (registers) the speaker models generated by the speaker model generation unit 6. In the present embodiment, the speaker models are registered in the storage unit 7 in advance.
The speaker identification unit 5 computes the distance between the characteristic parameter time series generated by the characteristic parameter generation unit 4 and a speaker model registered in advance in the storage unit 7, identifies the speaker on the basis of this distance, and outputs the result as the speaker identification result.
As shown in Fig. 2, the speaker identification unit 5 comprises a DP matching unit 11, an inter-speaker distance calculation unit 12, and an identification unit 13. These units implement the various means (or steps) described above.
The characteristic parameter time series A and B are each input to the DP matching unit 11 and the inter-speaker distance calculation unit 12. Both include a Δ-pitch time series. In the present embodiment, the characteristic parameter time series A is generated from the speech waveform input through the microphone 1, and the characteristic parameter time series B is the feature data of a speaker model registered in the storage unit 7. Here, A is the first speech characteristic parameter time series and B is the second speech characteristic parameter time series. They are written as follows.
Characteristic parameter time series:

$$A = \alpha_1, \alpha_2, \dots, \alpha_i, \dots, \alpha_I$$
$$B = \beta_1, \beta_2, \dots, \beta_j, \dots, \beta_J$$

Characteristic parameters:

$$\alpha_i = (p_i,\ \alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ik}, \dots, \alpha_{i16})$$
$$\beta_j = (q_j,\ \beta_{j1}, \beta_{j2}, \dots, \beta_{jk}, \dots, \beta_{j16})$$

Each parameter α_i, β_j consists of the Δ-pitch (p_i, q_j) and the 16 cepstrum coefficients (α_i1 to α_i16, β_j1 to β_j16) obtained by frame analysis of the speech waveform in the voiced interval. The characteristic parameter time series A and B therefore each consist of a Δ-pitch time series and 16 cepstrum coefficient time series. Comparatively, the Δ-pitch carries more phonological information, while the cepstrum coefficients carry more individual information.
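A minimal sketch of this data layout; the array shapes and the random values are illustrative assumptions, but the split into the two feature groups mirrors the definition above:

```python
import numpy as np

# Hypothetical layout of the feature vectors alpha_i = (p_i, alpha_i1..alpha_i16):
# column 0 holds the Δ-pitch used by the DP matching unit 11, and columns
# 1..16 hold the cepstrum coefficients used by the inter-speaker distance
# calculation unit 12.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 17))       # utterance A: I = 30 frames
delta_pitch = A[:, 0]               # first speech characteristic parameter group
cepstra = A[:, 1:]                  # second speech characteristic parameter group
assert delta_pitch.shape == (30,) and cepstra.shape == (30, 16)
```

The point of the layout is that the two downstream units read disjoint slices of the same time series.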
The DP matching unit 11 performs DP matching so that the phonemes of the two characteristic parameter time series A and B correspond. In doing so, the DP matching algorithm optimizes the matching so that the sum D(F) of the phonological distances d(i, j), which serve as the first distances, becomes minimum, and thereby finds the best matching sequence F.
Here, the best matching sequence F is defined as a sequence of time-correspondence factors c_n as in formula (1), the phonological distance d(i, j) between characteristic parameters is defined using the Δ-pitch as in formula (2), and the sum D(F) is defined as in formula (3). That is, F, d(i, j), and D(F) are obtained by formulas (1), (2), and (3), respectively.
[Formula 1]

$$F = c_1, c_2, \dots, c_n, \dots, c_N, \qquad c_n = (i_n, j_n) \qquad (1)$$

[Formula 2]

$$d(i, j) = |p_i - q_j| \qquad (2)$$

[Formula 3]

$$D(F) = \frac{1}{I+J} \sum_{n=1}^{N} d(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} d(i_n, j_n) \qquad (3)$$
More specifically, the DP matching unit 11 uses the respective Δ-pitch time series of the two characteristic parameter time series A and B to compute the phonological distance d(i, j) by formula (2) and its sum D(F) by formula (3). It then optimizes over formulas (3) and (1) so that D(F) becomes minimum, thereby obtaining the best matching sequence F. Here, the Δ-pitch time series is the first speech characteristic parameter group.
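The optimization of formulas (1) to (3) is ordinary DP matching restricted to the Δ-pitch component, except that the path F itself is returned for later use. A sketch, with symmetric local path constraints assumed (the patent does not fix them):

```python
import numpy as np

def optimum_matching_sequence(p, q):
    """Formulas (1)-(3): DP matching on the Δ-pitch sequences p (length I)
    and q (length J) with d(i,j) = |p_i - q_j|.  Returns the best matching
    sequence F = [(i_1,j_1), ..., (i_N,j_N)] and D(F) = sum(d) / (I+J)."""
    I, J = len(p), len(q)
    D = np.full((I, J), np.inf)
    back = {}
    D[0, 0] = abs(p[0] - q[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = [(D[i - 1, j], (i - 1, j)) if i > 0 else (np.inf, None),
                     (D[i, j - 1], (i, j - 1)) if j > 0 else (np.inf, None),
                     (D[i - 1, j - 1], (i - 1, j - 1)) if i > 0 and j > 0 else (np.inf, None)]
            best, prev = min(cands, key=lambda t: t[0])
            D[i, j] = best + abs(p[i] - q[j])
            back[(i, j)] = prev
    # trace the optimal path back from (I-1, J-1)
    F, node = [], (I - 1, J - 1)
    while node is not None:
        F.append(node)
        node = back.get(node)
    F.reverse()
    return F, D[I - 1, J - 1] / (I + J)
```

Unlike the conventional scheme, the returned D(F) is discarded for the final decision; only F is handed to the inter-speaker distance calculation.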
The inter-speaker distance calculation unit 12 uses the best matching sequence F obtained by the DP matching unit 11 to compute the sum E(F) of the individual distances e(i, j), which serve as the second distances. Here, the individual distance e(i, j) is defined as in formula (4), and the sum E(F) as in formula (5). That is, e(i, j) and E(F) are obtained by formulas (4) and (5), respectively.
[Formula 4]

$$e(i, j) = \left[ \sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2 \right]^{1/2} \qquad (4)$$

[Formula 5]

$$E(F) = \frac{1}{I+J} \sum_{n=1}^{N} e(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} e(i_n, j_n) \qquad (5)$$
More specifically, the inter-speaker distance calculation unit 12 uses the respective cepstrum coefficient time series of the two characteristic parameter time series A and B to compute the individual distance e(i, j) by formula (4) and, following the best matching sequence F, its sum E(F) by formula (5). In the present embodiment, the 1st- to 16th-order cepstrum coefficient time series are used. The cepstrum coefficient time series is the second speech characteristic parameter group.
The identification unit 13 identifies the speaker on the basis of the sum E(F) of the individual distances obtained by the inter-speaker distance calculation unit 12, and outputs the result as the speaker identification result. For example, the sum E(F) is compared with a threshold to make the speaker identification decision (speaker verification).
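Given the best matching sequence F, formulas (4) and (5) and the threshold test reduce to a few lines. This is an illustrative sketch; the threshold value is an assumption, not taken from the patent:

```python
import numpy as np

def speaker_distance(cep_a, cep_b, F):
    """Formulas (4)-(5): sum of the Euclidean distances between the
    16-dimensional cepstrum vectors of the frame pairs in the best
    matching sequence F, normalized by I + J."""
    I, J = len(cep_a), len(cep_b)
    total = sum(float(np.linalg.norm(cep_a[i] - cep_b[j])) for i, j in F)
    return total / (I + J)

def identify(cep_a, cep_b, F, threshold=1.0):
    """Threshold comparison of the identification unit 13; the value 1.0
    is an illustrative assumption."""
    return speaker_distance(cep_a, cep_b, F) <= threshold
```

Note that F comes from the Δ-pitch matching while the distance is taken over the cepstra, so the path optimization and the speaker decision use different distances, as the text emphasizes.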
Thus, according to the present embodiment, the respective Δ-pitch time series of the two characteristic parameter time series A and B are used to find the best matching sequence F that minimizes the sum D(F) of the phonological distances, and this best matching sequence together with the respective cepstrum coefficient time series of A and B is used to compute the sum E(F) of the individual distances, on the basis of which the speaker is identified. In this way, phonological resolution when matching the speech characteristic parameter time series A and B coexists with speaker resolution when computing the distance between them, so stable recognition performance can be ensured and high-accuracy speaker identification realized. Moreover, because the distance used to optimize the DP path and the distance used to discriminate the speaker are obtained by different methods, the differences between speakers can be fully reflected in the distance, and correspondences between different phonemes for the same speaker can be suppressed, which also contributes to high-accuracy speaker identification.
Here, when the characteristic parameters used for the phonological distance and for the individual distance are independent of each other, matching slips (time offsets) are likely to occur at positions where the characteristic parameters vary greatly. In such cases, modifying the individual distance e(i, j) as in formula (6) applies a slight "averaging" effect and can relieve the matching slip.
[Formula 6]

$$e(i, j) = \min\left\{ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j-1)k})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j+1)k})^2\right]^{1/2} \right\} \qquad (6)$$
Furthermore, by applying this "averaging" effect symmetrically on both time axes, a still more stable distance can be obtained. In this case, the individual distance e(i, j) is modified as in formula (7): the averaged distance is defined as the mean of the two one-sided minima.
[Formula 7]

$$e(i, j) = \frac{1}{2}\left[ \min\left\{ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j-1)k})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{(j+1)k})^2\right]^{1/2} \right\} + \min\left\{ \left[\sum_{k=1}^{16} (\alpha_{(i-1)k} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2\right]^{1/2},\ \left[\sum_{k=1}^{16} (\alpha_{(i+1)k} - \beta_{jk})^2\right]^{1/2} \right\} \right] \qquad (7)$$
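Formulas (6) and (7) can be sketched as below; the clipping behaviour at the sequence ends is an assumption, since the text does not specify it:

```python
import numpy as np

def dist(x, y):
    """Euclidean distance between two cepstrum vectors."""
    return float(np.linalg.norm(x - y))

def e_min(cep_a, cep_b, i, j):
    """Formula (6): minimum distance from frame i of A to the neighbours
    j-1, j, j+1 of frame j in B, tolerating a one-frame matching slip."""
    js = [jj for jj in (j - 1, j, j + 1) if 0 <= jj < len(cep_b)]
    return min(dist(cep_a[i], cep_b[jj]) for jj in js)

def e_sym(cep_a, cep_b, i, j):
    """Formula (7): mean of the two one-sided minima, applying the
    'averaging' effect symmetrically on both time axes."""
    is_ = [ii for ii in (i - 1, i, i + 1) if 0 <= ii < len(cep_a)]
    i_min = min(dist(cep_a[ii], cep_b[j]) for ii in is_)
    return 0.5 * (e_min(cep_a, cep_b, i, j) + i_min)
```

Either variant can replace the plain Euclidean e(i, j) inside the E(F) summation without changing anything else.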
In the present embodiment, the characteristic parameter time series A (the first speech characteristic parameter time series) and the characteristic parameter time series B (the second speech characteristic parameter time series) comprise a fundamental-frequency information time series obtained from the fundamental frequency of the speech and a resonance information time series obtained from the resonance information of the vocal tract; the first speech characteristic parameter group is the fundamental-frequency information time series and the second speech characteristic parameter group is the resonance information time series, so high-accuracy speaker identification can be realized reliably.
In the present embodiment, the characteristic parameter time series A and the characteristic parameter time series B comprise the Δ-pitch time series obtained from the intonation information of the speech and the cepstrum coefficient time series obtained from the resonance information of the vocal tract, and the phonological distance d (the first distance) and the individual distance e (the second distance) are obtained by

[Formula 8]

$$d = |p_k - q_k|$$

$$e = \left[ \sum_{k=k_0}^{K} (a_k - b_k)^2 \right]^{1/2}, \qquad k_0 \ge 1$$

where d and e are the first and second distances, p and q are the Δ-pitches of the first and second speech characteristic parameter time series, a_k and b_k are the cepstrum coefficients of the first and second speech characteristic parameter time series, and K is the cepstrum order. High-accuracy speaker identification can therefore be realized all the more reliably.
In the present embodiment, the individual distance e(i, j) between the i-th parameter α_i of the characteristic parameter time series A and the j-th parameter β_j of the characteristic parameter time series B is obtained by

[Formula 9]

$$e(i, j) = \min\{\, \mathrm{dist}(i, j-L),\ \mathrm{dist}(i, j-L+1),\ \dots,\ \mathrm{dist}(i, j),\ \dots,\ \mathrm{dist}(i, j+L-1),\ \mathrm{dist}(i, j+L) \,\}$$

where dist(X, Y) is the distance between speech characteristic parameters X and Y and L (> 0) is the averaging width. The matching slip can thereby be relieved.
In addition, when the individual distance e(i, j) between the i-th parameter α_i of the characteristic parameter time series A and the j-th parameter β_j of the characteristic parameter time series B is obtained by

[Formula 10]

$$e(i, j) = \frac{1}{2}\left[ \min\{\, \mathrm{dist}(i-L, j),\ \dots,\ \mathrm{dist}(i, j),\ \dots,\ \mathrm{dist}(i+L, j) \,\} + \min\{\, \mathrm{dist}(i, j-L),\ \dots,\ \mathrm{dist}(i, j),\ \dots,\ \mathrm{dist}(i, j+L) \,\} \right]$$

where dist(X, Y) is the distance between speech characteristic parameters X and Y and L (> 0) is the averaging width, a still more stable distance can be obtained.
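A sketch of the generalized form with averaging width L: the function below implements formula (10), and dropping the i-axis term gives formula (9). Clipping the windows at the sequence ends is an assumption:

```python
import numpy as np

def e_window(cep_a, cep_b, i, j, L=1):
    """Formulas (9)-(10): minimum of dist over an averaging width L along
    the j axis, averaged with the same minimum taken along the i axis.
    With L = 1 this reduces to the three-term minima of formulas (6)-(7)."""
    def dist(x, y):
        return float(np.linalg.norm(x - y))
    j_min = min(dist(cep_a[i], cep_b[jj])
                for jj in range(max(0, j - L), min(len(cep_b), j + L + 1)))
    i_min = min(dist(cep_a[ii], cep_b[j])
                for ii in range(max(0, i - L), min(len(cep_a), i + L + 1)))
    return 0.5 * (j_min + i_min)
```

Larger L tolerates larger matching slips at the cost of blurring genuinely different frames together, so L trades robustness against discrimination.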
The second embodiment of the present invention is described with reference to Figs. 3 and 4. Fig. 3 is a block diagram showing the structure of the speaker identification unit of the speaker identification device of the present embodiment, and Fig. 4 is a schematic diagram showing the structure of a characteristic parameter.
The present embodiment is a variation of the speaker identification unit 5 described in the first embodiment. Parts identical to those of the first embodiment are denoted by the same reference symbols, and description of everything other than the speaker identification unit 5 is omitted. In the present embodiment, the characteristic parameter generation unit 4 performs frame analysis on the speech waveform in the voiced interval to compute 16 cepstrum coefficients, generating a characteristic parameter time series consisting of 16 cepstrum coefficient time series. The order of the cepstrum coefficient time series is not limited to 16.
As shown in Fig. 3, the speaker identification unit 5, essentially as in the first embodiment, comprises a DP matching unit 11, an inter-speaker distance calculation unit 12, and an identification unit 13. These units implement the various means (or steps).
The characteristic parameter time series A and B are each input to the DP matching unit 11 and the inter-speaker distance calculation unit 12. In the present embodiment, the characteristic parameter time series A is generated from the speech waveform input through the microphone 1, and the characteristic parameter time series B is the feature data of a speaker model registered in the storage unit 7. Here, A is the first speech characteristic parameter time series and B is the second speech characteristic parameter time series. They are written as follows.
Characteristic parameter time series:

$$A = \alpha_1, \alpha_2, \dots, \alpha_i, \dots, \alpha_I$$
$$B = \beta_1, \beta_2, \dots, \beta_j, \dots, \beta_J$$

Characteristic parameters:

$$\alpha_i = (\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ik}, \dots, \alpha_{i16})$$
$$\beta_j = (\beta_{j1}, \beta_{j2}, \dots, \beta_{jk}, \dots, \beta_{j16})$$

Each parameter α_i, β_j consists of the 16 cepstrum coefficients (α_i1 to α_i16, β_j1 to β_j16) obtained by frame analysis of the speech waveform in the voiced interval. The characteristic parameter time series A and B are therefore time series of 16 cepstrum coefficients. Here, the 1st- to 8th-order cepstrum coefficient time series are the low-order cepstrum coefficient time series, and the m-th- to 16th-order (m > 8) cepstrum coefficient time series are the high-order cepstrum coefficient time series.
The DP matching unit 11 performs DP matching so that the phonemes of the two characteristic parameter time series A and B correspond. In doing so, the DP matching algorithm optimizes the matching so that the sum D(F) of the phonological distances d(i, j), which serve as the first distances, becomes minimum, and thereby finds the best matching sequence F.
Here, the best matching sequence F is defined as a sequence of time-correspondence factors c_n as in formula (1), the phonological distance d(i, j) between characteristic parameters is defined using the low-order cepstrum coefficients as in formula (8), and the sum D(F) is defined as in formula (3). That is, F, d(i, j), and D(F) are obtained by formulas (1), (8), and (3), respectively.
[Formula 11]

$$F = c_1, c_2, \dots, c_n, \dots, c_N, \qquad c_n = (i_n, j_n) \qquad (1)$$

[Formula 12]

$$d(i, j) = \left[ \sum_{k=1}^{8} (\alpha_{ik} - \beta_{jk})^2 \right]^{1/2} \qquad (8)$$

[Formula 13]

$$D(F) = \frac{1}{I+J} \sum_{n=1}^{N} d(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} d(i_n, j_n) \qquad (3)$$
More specifically, the DP matching unit 11 uses the respective low-order cepstrum coefficient time series (the 1st- to 8th-order cepstrum coefficient time series) of the two characteristic parameter time series A and B to compute the phonological distance d(i, j) by formula (8) and its sum D(F) by formula (3). It then optimizes over formulas (3) and (1) so that D(F) becomes minimum, thereby obtaining the best matching sequence F. Here, the low-order cepstrum coefficient time series is the first speech characteristic parameter group.
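The low-order phonological distance of formula (8) can be sketched as below; the cut-off of 8 follows the embodiment, and exposing it as a parameter is our generalization:

```python
import numpy as np

def d_low_order(alpha_i, beta_j, n_low=8):
    """Formula (8) of the second embodiment: phonological distance as the
    Euclidean distance over the low-order (1st to 8th) cepstrum
    coefficients only."""
    a = np.asarray(alpha_i, dtype=float)
    b = np.asarray(beta_j, dtype=float)
    return float(np.linalg.norm(a[:n_low] - b[:n_low]))
```

Two vectors that differ only in their high-order coefficients are identical under this distance, which is what lets the speaker-specific information survive the matching stage untouched.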
The inter-speaker distance calculation unit 12 uses the best matching sequence F obtained by the DP matching unit 11 to compute the sum E(F) of the individual distances e(i, j). Here, the individual distance e(i, j) is defined as in formula (4), and the sum E(F) as in formula (5). That is, e(i, j) and E(F) are obtained by formulas (4) and (5), respectively.
[Formula 14]

$$e(i, j) = \left[ \sum_{k=1}^{16} (\alpha_{ik} - \beta_{jk})^2 \right]^{1/2} \qquad (4)$$

[Formula 15]

$$E(F) = \frac{1}{I+J} \sum_{n=1}^{N} e(c_n) = \frac{1}{I+J} \sum_{n=1}^{N} e(i_n, j_n) \qquad (5)$$
More specifically, the inter-speaker distance calculation unit 12 uses cepstrum coefficient time series that include the respective high-order cepstrum coefficient time series (the m-th- to 16th-order, m > 8) of the two characteristic parameter time series A and B, computing the individual distance e(i, j) by formula (4) and, following the best matching sequence F, its sum E(F) by formula (5). In the present embodiment, the 1st- to 16th-order cepstrum coefficient time series are used. The high-order cepstrum coefficients generally carry more individual information than the low-order ones. The cepstrum coefficient time series is the second speech characteristic parameter group.
Here, as shown in Figure 4, in having 1~N time the characteristic parameter of cepstrum coefficient, under with 1~n time the situation of cepstrum coefficient as the cepstrum coefficient (Fig. 4 (a) bend part) of low order, the cepstrum coefficient of high order is m~N (cepstrum coefficient that m>n) is inferior.The cepstrum coefficient of this high order is the cepstrum coefficient time series of high order by the sequence of time seriesization.Thereby, the cepstrum coefficient seasonal effect in time series cepstrum coefficient time series that comprises high order also can be only by m~N (time series that the inferior cepstrum coefficient (netting twine part among Fig. 4 (b)) of m>n) constitutes, perhaps also can be by m~N (time series that the part of the cepstrum coefficient that m>n) is inferior and the cepstrum coefficient of low order (netting twine part among Fig. 4 (c)) constitutes, and then also can be time series by 1~N time cepstrum coefficient (netting twine part among Fig. 4 (d)) formation.In addition, in the present embodiment, be set at N=16 and n=8, but be not limited thereto.
The identification unit 13 performs speaker identification based on the summation E(F) of the individual distances obtained by the distance calculation unit 12, and outputs the result as the speaker recognition result. For example, E(F) is compared with a threshold to decide the speaker verification outcome.
Thus, according to the present embodiment, the low-order cepstrum coefficient time series of the two feature parameter time series A and B are used to find the best matching sequence F that minimizes the summation D(F) of the phoneme distance, and this best matching sequence, together with the cepstrum coefficient time series containing the high-order coefficients of A and B, is then used to compute the summation E(F) of the individual distance, on which speaker identification is based. In this way, phoneme-discriminating performance during the alignment of the speech feature parameter time series A and B coexists with speaker-discriminating performance when computing the distance between them, so stable recognition performance is ensured and highly accurate speaker recognition can be realized. Moreover, because the distance used to optimize the DP path and the distance used to discriminate speakers are obtained by different methods, inter-speaker differences are fully reflected in the final distance while spurious correspondences between different phonemes of the same speaker are suppressed, again enabling highly accurate speaker recognition.
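The overall two-stage procedure can be sketched as follows. This is an illustrative Python reconstruction under simplifying assumptions, not the patent's implementation: it uses a plain symmetric DP step pattern with no slope or window constraints, plain Euclidean distances, and 16-order cepstral frames split at order 8:

```python
import numpy as np

def two_stage_dtw(A, B, n_low=8):
    """Stage 1: find the best DP path F using only low-order cepstra
    (phoneme distance). Stage 2: re-score that same path with
    high-order cepstra (individual distance) to get E(F).
    A, B: (I, 16) and (J, 16) arrays of cepstral frames."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    I, J = len(A), len(B)
    dlo = lambda i, j: float(np.linalg.norm(A[i, :n_low] - B[j, :n_low]))
    ehi = lambda i, j: float(np.linalg.norm(A[i, n_low:] - B[j, n_low:]))

    # Stage 1: dynamic programming on the low-order (phoneme) distance.
    D = np.full((I, J), np.inf)
    back = {}
    D[0, 0] = dlo(0, 0)
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            cands = [(p, q) for p, q in cands if p >= 0 and q >= 0]
            p, q = min(cands, key=lambda c: D[c[0], c[1]])
            D[i, j] = D[p, q] + dlo(i, j)
            back[(i, j)] = (p, q)

    # Recover the best matching sequence F by backtracking.
    F, cell = [], (I - 1, J - 1)
    while cell != (0, 0):
        F.append(cell)
        cell = back[cell]
    F.append((0, 0))
    F.reverse()

    # Stage 2: sum the individual (high-order) distance along F.
    E = sum(ehi(i, j) for i, j in F) / (I + J)
    return F, E
```

For identical inputs the path is the diagonal and E(F) is zero; for different speakers uttering the same text, the low-order stage still aligns matching phonemes while the high-order re-scoring exposes the speaker difference.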
In the present embodiment, the feature parameter time series A serving as the first speech feature parameter time series and the feature parameter time series B serving as the second are cepstrum coefficient time series obtained from the resonance information of the vocal tract; the first speech feature parameter group is the low-order part of the cepstrum coefficient time series and the second speech feature parameter group is the part containing the high-order coefficients, so highly accurate speaker recognition can be achieved reliably.
In the present embodiment, the feature parameter time series A serving as the first speech feature parameter time series and the feature parameter time series B serving as the second are cepstrum coefficient time series obtained from the resonance information of the vocal tract, and the phoneme distance d serving as the first distance and the individual distance e serving as the second distance are obtained by
[formula 16]
d = [ Σ_{k=1}^{N} (a_k − b_k)² ]^{1/2}
e = [ Σ_{k=k0}^{M} (a_k − b_k)² ]^{1/2}
N < M, k0 ≥ 1
d, e: first distance, second distance
a_k: cepstrum coefficient of the first speech feature parameter time series
b_k: cepstrum coefficient of the second speech feature parameter time series
k: cepstral order
so that highly accurate speaker recognition can be realized reliably.
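The two Euclidean distances of formula 16 differ only in which cepstral orders they sum over. A minimal Python sketch (illustrative; the values N = 8, k0 = 9, M = 16 are example choices consistent with the embodiment, not mandated by the formula):

```python
import numpy as np

def phoneme_distance(a, b, N=8):
    """First distance d: Euclidean distance over low orders 1..N.
    a, b: length-M vectors holding cepstral orders 1..M at [0..M-1]."""
    return float(np.sqrt(np.sum((a[:N] - b[:N]) ** 2)))

def individual_distance(a, b, k0=9, M=16):
    """Second distance e: Euclidean distance over orders k0..M,
    with k0 >= 1 and N < M."""
    return float(np.sqrt(np.sum((a[k0 - 1:M] - b[k0 - 1:M]) ** 2)))
```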
The present invention is not limited to the specific hardware configuration of the embodiment described above; it can also be realized in software. That is, the function of the speaker identification unit 5 (the speaker recognition function) can be implemented in software. Fig. 5 is a block diagram showing an example configuration of a speaker recognition device 100 when the present invention is realized by software.
As shown in Fig. 5, the speaker recognition device 100 includes a CPU 101 that centrally controls each part of the device; the CPU 101 is connected via a bus to a memory 102 consisting of a ROM storing the BIOS etc. and a rewritable RAM storing various data, together forming a microcomputer. The CPU 101 is further connected, via an I/O bus not shown, to an HDD (Hard Disk Drive) 103, a CD-ROM drive 105 that reads a CD-ROM (Compact Disc ROM) 104 as a computer-readable storage medium, a communication device 106 handling communication between the speaker recognition device 100 and the Internet etc., a keyboard 107, a display device 108 such as a CRT or LCD, and the microphone 1.
A program realizing the speaker recognition function of the present invention is stored on a computer-readable storage medium such as the CD-ROM 104; by installing this program in the speaker recognition device 100, the CPU 101 can be made to execute the speaker recognition function of the present invention. Speech input from the microphone 1 is temporarily stored in the HDD 103 etc. When the program is started, the temporarily saved speech data is read from the HDD 103 etc. and the speaker recognition processing is carried out. This processing realizes the same functions as the feature parameter generation unit 4, the speaker identification unit 5, and the other parts, so the same effects as those of the embodiment described above can be obtained.
As the storage medium, not only the CD-ROM 104 but media of various other types may be used: optical discs such as DVDs, magneto-optical discs, magnetic disks such as floppy disks, semiconductor memories, and so on. The program may also be downloaded from a network such as the Internet and installed in the HDD 103; in that case, the storage device of the server on the transmitting side that holds the program also constitutes a storage medium of the present invention. The program may run on a prescribed OS (Operating System), handing over part of the processing described later to the OS, or may be included as part of a set of program files constituting a prescribed application such as a word processor, or constituting the OS itself.

Claims (8)

1. A speaker recognition device that performs speaker identification based on distances between a first speech feature parameter time series and a second speech feature parameter time series, characterized in that the speaker recognition device comprises:
means for setting a matching sequence that associates each speech feature parameter of the first speech feature parameter time series with each speech feature parameter of the second speech feature parameter time series, obtaining, using the respective first speech feature parameter groups and according to the matching sequence, a first distance between the speech feature parameters, and obtaining the summation of the first distance;
means for finding the best matching sequence such that the summation of the first distance becomes minimum;
means for obtaining, using the respective second speech feature parameter groups of the first speech feature parameter time series and the second speech feature parameter time series and according to the best matching sequence, a second distance between the speech feature parameters, and obtaining the summation of the second distance; and
means for performing speaker identification based on the obtained summation of the second distance.
2. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series each include a fundamental frequency information time series obtained from the fundamental frequency of speech and a resonance information time series obtained from the resonance information of the vocal tract,
the first speech feature parameter group is the fundamental frequency information time series, and
the second speech feature parameter group is the resonance information time series.
3. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series are cepstrum coefficient time series obtained from the resonance information of the vocal tract,
the first speech feature parameter group is the low-order part of the cepstrum coefficient time series, and
the second speech feature parameter group is the part of the cepstrum coefficient time series containing the high-order cepstrum coefficients.
4. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series each include a Δ-pitch time series obtained from the intonation information of speech and a cepstrum coefficient time series obtained from the resonance information of the vocal tract, and
the first distance d and the second distance e are obtained by
[formula 1]
d = |p_k − q_k|
e = [ Σ_{k=k0}^{K} (a_k − b_k)² ]^{1/2}
k0 ≥ 1
d, e: first distance, second distance
p_k: Δ pitch of the first speech feature parameter time series
q_k: Δ pitch of the second speech feature parameter time series
a_k: cepstrum coefficient of the first speech feature parameter time series
b_k: cepstrum coefficient of the second speech feature parameter time series
k: cepstral order.
5. The speaker recognition device according to claim 1, characterized in that
the second distance e(i, j) between the i-th speech feature parameter of the first speech feature parameter time series and the j-th speech feature parameter of the second speech feature parameter time series is obtained by
[formula 2]
e(i, j) = min{ dist(i, j−L), dist(i, j−L+1), …, dist(i, j), …, dist(i, j+L−1), dist(i, j+L) }
dist(X, Y): distance between speech feature parameters X and Y
L: matching width, L > 0.
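The windowed minimum of formula 2 can be sketched as below. This is an illustrative Python fragment with assumed names; boundary clamping at the ends of the sequence is my own assumption, since the claim does not specify the behavior when j ± L falls outside the sequence:

```python
import numpy as np

def windowed_e(A_high, B_high, i, j, L=1):
    """Second distance e(i, j): minimum high-order distance between
    frame i of A and frames j-L .. j+L of B, which absorbs small
    alignment errors of the low-order DP path."""
    J = len(B_high)
    js = range(max(0, j - L), min(J, j + L + 1))  # clamp to valid frames
    return min(float(np.linalg.norm(A_high[i] - B_high[q])) for q in js)
```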
6. The speaker recognition device according to claim 1, characterized in that
the second distance e(i, j) between the i-th speech feature parameter of the first speech feature parameter time series and the j-th speech feature parameter of the second speech feature parameter time series is obtained by
[formula 3]
e(i, j) = (1/2) [ min{ dist(i−L, j), dist(i−L+1, j), …, dist(i, j), …, dist(i+L−1, j), dist(i+L, j) } + min{ dist(i, j−L), dist(i, j−L+1), …, dist(i, j), …, dist(i, j+L−1), dist(i, j+L) } ]
dist(X, Y): distance between speech feature parameters X and Y
L: matching width, L > 0.
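The symmetric variant of formula 3 averages the windowed minimum taken along each axis, so the frame-shift tolerance is applied to both sequences. An illustrative Python sketch (names and boundary clamping assumed, as before):

```python
import numpy as np

def symmetric_windowed_e(A_high, B_high, i, j, L=1):
    """Average of two windowed minima: over frames i-L..i+L of A
    against frame j of B, and over frame i of A against frames
    j-L..j+L of B."""
    I, J = len(A_high), len(B_high)
    row = min(float(np.linalg.norm(A_high[p] - B_high[j]))
              for p in range(max(0, i - L), min(I, i + L + 1)))
    col = min(float(np.linalg.norm(A_high[i] - B_high[q]))
              for q in range(max(0, j - L), min(J, j + L + 1)))
    return 0.5 * (row + col)
```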
7. The speaker recognition device according to claim 1, characterized in that
the first speech feature parameter time series and the second speech feature parameter time series are cepstrum coefficient time series obtained from the resonance information of the vocal tract, and
the first distance d and the second distance e are obtained by
[formula 4]
d = [ Σ_{k=1}^{N} (a_k − b_k)² ]^{1/2}
e = [ Σ_{k=k0}^{M} (a_k − b_k)² ]^{1/2}
N < M, k0 ≥ 1
d, e: first distance, second distance
a_k: cepstrum coefficient of the first speech feature parameter time series
b_k: cepstrum coefficient of the second speech feature parameter time series
k: cepstral order.
8. A speaker recognition method that performs speaker identification based on distances between a first speech feature parameter time series and a second speech feature parameter time series, characterized in that the speaker recognition method comprises:
the step of setting a matching sequence that associates each speech feature parameter of the first speech feature parameter time series with each speech feature parameter of the second speech feature parameter time series, obtaining, using the respective first speech feature parameter groups and according to the matching sequence, a first distance between the speech feature parameters, and obtaining the summation of the first distance;
the step of finding the best matching sequence such that the summation of the first distance becomes minimum;
the step of obtaining, using the respective second speech feature parameter groups of the first speech feature parameter time series and the second speech feature parameter time series and according to the best matching sequence, a second distance between the speech feature parameters, and obtaining the summation of the second distance; and
the step of performing speaker identification based on the obtained summation of the second distance.
CN200580003955A 2004-06-01 2005-05-31 Speaker recognizing device and speaker recognizing method Expired - Fee Related CN100593194C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP163071/2004 2004-06-01
JP2004163071A JP3927559B2 (en) 2004-06-01 2004-06-01 Speaker recognition device, program, and speaker recognition method

Publications (2)

Publication Number Publication Date
CN1914667A CN1914667A (en) 2007-02-14
CN100593194C true CN100593194C (en) 2010-03-03

Family

ID=35463096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200580003955A Expired - Fee Related CN100593194C (en) 2004-06-01 2005-05-31 Speaker recognizing device and speaker recognizing method

Country Status (3)

Country Link
JP (1) JP3927559B2 (en)
CN (1) CN100593194C (en)
WO (1) WO2005119654A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102354496B (en) * 2011-07-01 2013-08-21 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN103730121B (en) * 2013-12-24 2016-08-24 中山大学 A kind of recognition methods pretending sound and device
JP6946499B2 (en) * 2020-03-06 2021-10-06 株式会社日立製作所 Speech support device, speech support method, and speech support program

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JPH0792678B2 (en) * 1985-12-17 1995-10-09 株式会社東芝 Voice pattern matching method
JP2543528B2 (en) * 1987-06-29 1996-10-16 沖電気工業株式会社 Voice recognition device
US5522012A (en) * 1994-02-28 1996-05-28 Rutgers University Speaker identification and verification system
JPH0786759B2 (en) * 1994-03-14 1995-09-20 株式会社東芝 Dictionary learning method for voice recognition
JPH1020883A (en) * 1996-07-02 1998-01-23 Fujitsu Ltd User authentication device
JPH1097274A (en) * 1996-09-24 1998-04-14 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for recognizing speaker
JP2001034294A (en) * 1999-07-21 2001-02-09 Matsushita Electric Ind Co Ltd Speaker verification device

Also Published As

Publication number Publication date
JP3927559B2 (en) 2007-06-13
CN1914667A (en) 2007-02-14
WO2005119654A1 (en) 2005-12-15
JP2005345598A (en) 2005-12-15

Similar Documents

Publication Publication Date Title
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
US6535852B2 (en) Training of text-to-speech systems
AU2002311452B2 (en) Speaker recognition system
Chan Using a text-to-speech synthesizer to generate a reverse Turing test
CN100593194C (en) Speaker recognizing device and speaker recognizing method
JP3130524B2 (en) Speech signal recognition method and apparatus for implementing the method
JP2009086581A (en) Apparatus and program for creating speaker model of speech recognition
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Abdullaeva et al. Formant set as a main parameter for recognizing vowels of the Uzbek language
Büyük et al. An Investigation of Multi-Language Age Classification from Voice.
CN110767238B (en) Blacklist identification method, device, equipment and storage medium based on address information
JP5749186B2 (en) Acoustic model adaptation device, speech recognition device, method and program thereof
Baby Investigating modulation spectrogram features for deep neural network-based automatic speech recognition
Singh et al. A novel algorithm using MFCC and ERB gammatone filters in speech recognition
Jiaqi et al. Research on intelligent voice interaction application system based on NAO robot
JP2561553B2 (en) Standard speaker selection device
Paulose et al. A comparative study of text-independent speaker recognition systems using Gaussian mixture modeling and i-vector methods
Fathoni et al. Optimization of Feature Extraction in Indonesian Speech Recognition Using PCA and SVM Classification
Anderson Auditory models with Kohonen SOFM and LVQ for speaker independent phoneme recognition
Saeta et al. New speaker-dependent threshold estimation method in speaker verification based on weighting scores
Swain et al. Supervised and Unsupervised Data Mining Techniques for Speaker Verification Using Prosodic+ Spectral Features
JP2005345683A (en) Speaker-recognizing device, program, and speaker-recognizing method
Patil et al. Teager energy mel cepstrum for identification of twins in Marathi
CN114203198A (en) Bi-quad type sound detection system
Blomberg et al. Investigating explicit model transformations for speaker normalization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100303

Termination date: 20130531