CN104167206A - Acoustic model combination method and device, and voice identification method and system - Google Patents


Info

Publication number: CN104167206A (application CN201310182399.5A); granted as CN104167206B
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 刘贺飞, 郭莉莉
Applicant / Current assignee: Canon Inc
Legal events: application CN201310182399.5A filed by Canon Inc; publication of CN104167206A; application granted and published as CN104167206B; legal status Granted / Active
Abstract

The invention relates to an acoustic model merging method and apparatus, and a speech recognition method and system. The acoustic model merging method, which merges a plurality of acoustic models including a first acoustic model and a second acoustic model, comprises the following steps: a distribution information obtaining step of obtaining distribution information of the modeling units of at least the first acoustic model and/or the second acoustic model, the distribution information being capable of reflecting the degree of importance of a modeling unit in the language to be recognized; a distance calculating step of respectively calculating the distances of model-composing-element pairs, each pair being formed by model-composing elements of the same class from the first acoustic model and the second acoustic model; a weighting step of weighting the distance of each such pair using the distribution information; a sorting step of sorting the pairs according to the weighted distances; and a merging step of merging the first acoustic model and the second acoustic model according to the sorting result, so as to obtain a merged acoustic model.

Description

Acoustic model merging method and apparatus, and speech recognition method and system
Technical field
The present invention relates generally to a method and apparatus for merging acoustic models for automatic speech recognition (ASR), and to a speech recognition method and speech recognition system, and relates particularly to a method and apparatus for merging a plurality of acoustic models and to a speech recognition method and system that use the merged acoustic model.
Background art
An acoustic model is one of the most important components of a speech recognition system. In a speech recognition system, in order to guarantee recognition accuracy, it is conventionally necessary to use a plurality of acoustic models (AMs): for example, different AMs for different modeling units (such as phonemes, characters, words, initials, finals, etc.), different AMs for different languages, and different AMs for different environments (for example, an AM obtained in a quiet environment, an AM obtained in a noisy environment, etc.).
How to reduce the size of the acoustic model is an important problem in speech recognition technology.
In order to reduce the size of the acoustic model, the approach conventionally adopted is to share the parameters of different acoustic models according to different criteria (for example, data-driven criteria or rule-based criteria), so as to achieve the goal of merging these acoustic models.
The parameters of the modeling units that form an acoustic model include means, variances, Gaussian mixtures, states, hidden Markov models (HMMs), and so on. Among these parameters, a Gaussian mixture comprises means and variances, a state comprises one or more Gaussian mixtures, and a hidden Markov model comprises one or more states. Furthermore, a phoneme model (which may cover, for example, monophones, diphones, triphones, etc.) can be represented by a hidden Markov model (for example, each phoneme can be represented by a three-state HMM), so the actual pronunciation of a word in a language can be represented as a sequence of hidden Markov models. These parameters belong to different classes of acoustic model parameters, and each of these classes can serve as a sharable parameter: for example, a shared variance, a shared Gaussian mixture, a shared state, a shared hidden Markov model, or a shared phoneme.
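The parameter hierarchy described above (an HMM comprises states, a state comprises Gaussian mixture components, and each Gaussian has a mean and a variance) can be sketched as a set of data structures. This is a minimal illustration only; the class names and the assumption of diagonal covariances are ours, not the patent's:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gaussian:
    mean: List[float]
    variance: List[float]   # diagonal covariance, as is common in ASR

@dataclass
class State:
    gaussians: List[Gaussian]   # the Gaussian mixture of this state
    weights: List[float]        # mixture weights

@dataclass
class HMM:
    phoneme: str
    states: List[State]         # e.g. three emitting states per phoneme

# The pronunciation of a word is then a sequence of phoneme HMMs:
word_model = [HMM("h", states=[]), HMM("i", states=[])]
```

Any level of this hierarchy (variance, Gaussian, state, HMM, phoneme) can in principle be shared between models, which is what the sharing-based merging methods exploit.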
Several known methods of reducing the size of an acoustic model will be described below as examples.
One method is the distance-based acoustic model merging method. In this method, two Gaussian mixtures or states in different acoustic models whose distance is small are merged, thereby reducing the size of the acoustic model.
A decision-tree-based acoustic model merging method has also been proposed. This is a data-driven method that is well known and commonly used, and can be called the tied Gaussian mixture or tied state method. The decision tree represents equivalence relations among HMM parameters in different model states or phoneme contexts. The performance of the decision tree depends heavily on the amount of data and its distribution. In particular, the amounts of data of the acoustic models to be merged need to be suitable; that is, a balance must be kept between the data volumes of these acoustic models. For example, for a number of similar phoneme contexts, insufficient training data may make it impossible to train each of these phoneme contexts accurately; for modeling purposes, they can therefore be grouped or tied together. After such grouping or tying, the number of independent parameters in the model decreases.
Although these methods can reduce the size of the acoustic model, they have, for example, the following problems.
Regarding the above distance-based acoustic model merging method: the method does not consider the importance of different phonemes in different languages. For example, within a language, a phoneme that occurs with high probability is more important in that language, and a phoneme with many pronunciation variants is likewise more important. The degree of importance of the same phoneme also differs across languages. For example, "l" in Japanese corresponds to both "l" and "r" in Chinese; comparatively speaking, therefore, "l" in Japanese is more important than "l" in Chinese. As another example, "i" in Chinese has three allophones, so "i" in Chinese may be more important than "i" in other languages. As yet another example, for dialects of certain regions of China in which "z", "c", "s" are difficult to distinguish from "zh", "ch", "sh", the phoneme "h" and its position may be more important than "h" and its position in other languages.
Regarding the above decision-tree-based acoustic model merging method: the Gaussian mixtures or state components selected for grouping or tying may come from two different phonetic classes. The performance of the decision tree depends heavily on the amount of training data and its distribution. Moreover, this method likewise does not consider the importance of different phonemes in different languages.
In general, the conventional methods described above mainly have the following two problems:
1) the first problem is that it is difficult to control the amount of training data and the distribution of the training data;
2) the second problem is that, because the importance of phonemes is not considered, the performance of the acoustic model degrades when an important state is replaced by another state (here, the merging of acoustic models is realized through state sharing).
The second problem in particular is embodied in the method described in U.S. Patent Publication No. US2010/0131262A1, entitled "Speech recognition based on a multilingual acoustic model". That publication discloses a distance-based acoustic model merging method which mainly comprises: first, based on a criterion set, replacing each of the probability distribution functions of at least one second acoustic model with a probability distribution function of a main acoustic model, or replacing each of the states of the probabilistic state sequence model of at least one second acoustic model with a state of the probabilistic state sequence model of the main acoustic model, to obtain at least one modified second acoustic model, wherein the criterion set can be a distance measure; and then combining the main acoustic model and the at least one modified second acoustic model to obtain a multilingual acoustic model.
However, the method disclosed in that U.S. patent application does not consider the importance of the different states in the second acoustic model. Obviously, as described above, when an important state in the second acoustic model is replaced by a less important state in the main acoustic model, the performance of the acoustic model will degrade.
Summary of the invention
In summary, there is a need for a method and apparatus that can merge acoustic models effectively, appropriately reducing the size of the acoustic model without significantly degrading its performance, so that the resulting merged acoustic model can be used to perform speech recognition accurately and efficiently.
The present invention is intended to solve the problems described above. An object of the present invention is to provide an acoustic model merging method and apparatus, and a speech recognition method and system, that overcome any of the above problems.
In particular, the present invention provides an acoustic model merging method and apparatus and a speech recognition method and system that can merge acoustic models effectively, appropriately reducing the size of the acoustic model without significantly degrading its performance, so that the resulting merged acoustic model can be used to perform speech recognition accurately and efficiently. Specifically, the present invention achieves this by selecting and replacing only the less important model parameters in an acoustic model, based on consideration of the importance of the modeling units in the language to be recognized.
According to one aspect of the present disclosure, an acoustic model merging method is provided for merging a plurality of acoustic models including a first acoustic model and a second acoustic model, comprising: a distribution information obtaining step of obtaining distribution information of the modeling units of at least the first and/or second acoustic model, wherein the distribution information can reflect the degree of importance of the modeling units in the language to be recognized; a distance calculating step of respectively calculating the distances of pairs of model-composing elements of a given class, each pair being formed by model-composing elements of the same class from the first acoustic model and the second acoustic model; a weighting step of weighting the distance of each corresponding pair using the distribution information; a sorting step of sorting the pairs according to the weighted distances; and a merging step of merging the first acoustic model and the second acoustic model according to the sorting result, to obtain a merged acoustic model.
According to another aspect of the present disclosure, an acoustic model merging apparatus is provided for merging a plurality of acoustic models including a first acoustic model and a second acoustic model, comprising: a distribution information obtaining unit configured to obtain distribution information of the modeling units of at least the first and/or second acoustic model, wherein the distribution information can reflect the degree of importance of the modeling units in the language to be recognized; a distance calculating unit configured to respectively calculate the distances of pairs of model-composing elements of a given class, each pair being formed by model-composing elements of the same class from the first acoustic model and the second acoustic model; a weighting unit configured to weight the distance of each corresponding pair using the distribution information; a sorting unit configured to sort the pairs according to the weighted distances; and a merging unit configured to merge the first acoustic model and the second acoustic model according to the sorting result, to obtain a merged acoustic model.
According to another aspect of the present disclosure, a speech recognition method is provided, comprising performing speech recognition using an acoustic model obtained by the above acoustic model merging method.
According to another aspect of the present disclosure, a speech recognition system is provided that comprises the above acoustic model merging apparatus.
The model-composing element is at least one of a mean, a variance, a Gaussian mixture, a state, and a hidden Markov model.
The distribution information of a modeling unit is the frequency or the duration with which the modeling unit occurs in the training corpus of the acoustic model corresponding to it.
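As an informal illustration (not part of the claims), the five claimed steps can be sketched as one procedure. All names here are hypothetical placeholders, and the distance and merge policies are supplied as callbacks precisely because the disclosure leaves them open:

```python
def merge_acoustic_models(elements_a, elements_b, distribution_info,
                          distance_fn, merge_fn):
    # Distance calculating step: pair same-class elements across the models.
    distances = {(a, b): distance_fn(a, b)
                 for a in elements_a for b in elements_b}
    # Weighting step: scale each pair's distance by the distribution
    # information of the second-model element in the pair.
    weighted = {pair: d * distribution_info[pair[1]]
                for pair, d in distances.items()}
    # Sorting step: order pairs by weighted distance, closest first.
    ranked = sorted(weighted, key=weighted.get)
    # Merging step: hand the ranked pairs to a merge policy.
    return merge_fn(ranked)

# Toy run with numeric stand-ins for model-composing elements:
result = merge_acoustic_models(
    [1.0, 2.0], [1.5, 3.0],
    distribution_info={1.5: 1.0, 3.0: 2.0},
    distance_fn=lambda a, b: abs(a - b),
    merge_fn=lambda ranked: ranked[0],
)
```

The distribution information obtaining step is represented here simply by the `distribution_info` mapping passed in; the embodiments below discuss where such information can come from.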
Further features and advantages of the present invention will become apparent from the following description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In these drawings, like reference numerals denote like elements.
Fig. 1 is a block diagram showing an exemplary hardware configuration of a computer system in which embodiments of the invention can be implemented.
Fig. 2 is a flowchart exemplarily illustrating the acoustic model merging method according to the first embodiment of the present invention.
Fig. 3 is a flowchart exemplarily illustrating the acoustic model merging method according to a second embodiment of the present invention.
Fig. 4 is a flowchart exemplarily illustrating the acoustic model merging method according to a third embodiment of the present invention.
Fig. 5 is a schematic block diagram showing an exemplary configuration of an acoustic model merging apparatus according to an embodiment of the invention.
Fig. 6 is a schematic block diagram showing an exemplary configuration of the merging unit in the acoustic model merging apparatus according to an embodiment of the invention.
Fig. 7 is a schematic block diagram showing another exemplary configuration of the merging unit in the acoustic model merging apparatus according to an embodiment of the invention.
Fig. 8 is a schematic block diagram showing the configuration of a speech recognition system according to an exemplary embodiment of the present invention.
Embodiment
It should be noted that the following exemplary embodiments are not intended to limit the scope of the claims, and not all combinations of the features described in the exemplary embodiments are necessarily essential to the solution of the invention. Each of the exemplary embodiments of the invention described below can be implemented individually, or, where necessary or where combining elements or features from individual embodiments in a single embodiment is beneficial, implemented as a combination of a plurality of embodiments or of their features.
Since like reference numerals denote like elements throughout the drawings, the description of such like elements will not be repeated in the specification; those of ordinary skill in the art will understand that like reference numerals denote like elements.
In the present disclosure, the terms "first", "second", etc. are used only to distinguish between elements and are not intended to indicate temporal order, priority, or importance.
Furthermore, in the present disclosure, the steps need not be executed in the order shown in the flowcharts and mentioned in the embodiments; the order can be adjusted flexibly according to actual conditions. That is, the present invention should not be limited by the execution order of the steps shown in the flowcharts.
Furthermore, in the present disclosure, the modeling units used to form an acoustic model can be, for example, characters, words, initials/finals of Chinese syllables, phonemes, etc., without being limited to these. The modeling units may differ for different languages.
In addition, the "importance of a modeling unit" or "degree of importance of a modeling unit" mentioned in this disclosure covers at least the following cases, taking a phoneme as an example: a phoneme with a high frequency of occurrence in daily life is important, a phoneme whose position plays a decisive role in pronunciation is important, or a phoneme with a completely distinct pronunciation is important. The term is not limited to the cases listed above, however, and the important modeling units (e.g., phonemes) may differ from language to language.
Thus, a characteristic feature of the present invention is that the distribution information of a modeling unit is used to reflect the importance, or degree of importance, of that modeling unit in the language to be recognized. Such distribution information, capable of reflecting the "importance of a modeling unit" or "degree of importance of a modeling unit", can be obtained empirically or by computing statistics over a training corpus.
In addition, in the present invention, the parameters used to form a modeling unit (such as means, variances, Gaussian mixtures, states, hidden Markov models, etc.) are collectively referred to as model-composing elements. Unless otherwise specified, a model-composing element mentioned in this specification refers to all of the parameters used to form a modeling unit, or to at least one of these parameters.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a block diagram showing the hardware configuration of a computer system 1 in which embodiments of the invention can be implemented.
As shown in Fig. 1, the computer system 1 comprises a computer 1110. The computer 1110 comprises a processing unit 1120, a system memory 1130, a fixed non-volatile memory interface 1140, a removable non-volatile memory interface 1150, a user input interface 1160, a network interface 1170, a video interface 1190, and an output peripheral interface 1195, which are connected via a system bus 1121.
The system memory 1130 comprises a ROM (read-only memory) 1131 and a RAM (random access memory) 1132. A BIOS (basic input/output system) 1133 resides in the ROM 1131. An operating system 1134, application programs 1135, other program modules 1136, and some program data 1137 reside in the RAM 1132.
A fixed non-volatile memory 1141 such as a hard disk is connected to the fixed non-volatile memory interface 1140. The fixed non-volatile memory 1141 can store, for example, an operating system 1144, application programs 1145, other program modules 1146, and some program data 1147.
Removable non-volatile memories such as a floppy drive 1151 and a CD-ROM drive 1155 are connected to the removable non-volatile memory interface 1150. For example, a floppy disk 1152 can be inserted into the floppy drive 1151, and a CD (compact disc) 1156 can be inserted into the CD-ROM drive 1155.
Input devices such as a microphone 1161 and a keyboard 1162 are connected to the user input interface 1160.
The computer 1110 can be connected to a remote computer 1180 via the network interface 1170. For example, the network interface 1170 can be connected to the remote computer 1180 via a local area network 1171. Alternatively, the network interface 1170 can be connected to a modem 1172, and the modem 1172 is connected to the remote computer 1180 via a wide area network 1173.
The remote computer 1180 may comprise a memory 1181 such as a hard disk, which stores remote application programs 1185.
The video interface 1190 is connected to a monitor 1191.
The output peripheral interface 1195 is connected to a printer 1196 and a loudspeaker 1197.
The computer system shown in Fig. 1 can be incorporated in any embodiment; it can be used as a stand-alone computer or as a processing system in a device, one or more unnecessary components can be removed from it, and one or more additional components can be added to it.
A user can employ the computer system shown in Fig. 1 in any manner; the invention places no restriction on the manner in which a user uses the computer system.
Obviously, the computer system shown in Fig. 1 is exemplary and is of course not intended to limit the invention, its application, or its uses.
[First embodiment]
Hereinafter, the first embodiment of the present invention is described in detail with reference to Fig. 2.
Fig. 2 is a flowchart exemplarily illustrating the acoustic model merging method according to an embodiment of the present invention.
In the present embodiment, a merging operation is performed on a first acoustic model and a second acoustic model. The first acoustic model and the second acoustic model are each trained for a different language, based on speech data in a training corpus, using a training method such as maximum likelihood (ML) training or discriminative training (DT). Here, the speech data in the training corpus is usually provided by one or more native speakers. For example, the first acoustic model, which may serve as the main acoustic model (and may also be called a universal acoustic model, UAM), can be configured to recognize speech input in a plurality of languages (e.g., English, Chinese, etc.), while the second acoustic model, which may serve as the auxiliary acoustic model, can be configured to recognize speech input in, for example, a less common language or group of less common languages (e.g., Dutch, Norwegian, Swedish, etc.). Of course, the second acoustic model may equally serve as the main acoustic model, with the first acoustic model serving as the auxiliary acoustic model.
The first embodiment will be illustrated below with the following arrangement: the model-composing elements of the modeling units in the first acoustic model, serving as the main acoustic model, are used to replace the corresponding model-composing elements in the second acoustic model, and the first acoustic model and the second acoustic model are then merged. The first acoustic model is thus not modified, so that its performance as the main acoustic model is at least preserved.
Specifically, in the present embodiment, the distances of the model-composing-element pairs are weighted by the distribution information of the modeling units, and the acoustic models are then merged according to the weighted result.
The distribution information of a modeling unit can reflect the degree of importance of that modeling unit in the language to be recognized. For example, the distribution information of a modeling unit can represent the frequency or the duration with which that modeling unit occurs in the training corpus of the acoustic model corresponding to it.
As mentioned above, the distribution information of a modeling unit can be obtained by computing statistics over the training corpus, and the model-composing elements used to form a modeling unit include means, variances, Gaussian mixtures, states, hidden Markov models, and so on. Therefore, for example, the state occupation probability (or state occupation count) can be used as the distribution information of a modeling unit, where the state occupation probability (or state occupation count) can indicate how many modeling units (e.g., phonemes) in the training database use a given state.
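A minimal sketch of accumulating state occupation counts follows. The per-frame alignments and state ids here are invented for illustration; in a real trainer such counts would come from forced alignment or from the forward-backward occupancies accumulated during training:

```python
from collections import Counter

def state_occupation_counts(alignments):
    # Accumulate how often each state id appears across per-frame
    # alignments of the training utterances.
    counts = Counter()
    for utterance in alignments:
        counts.update(utterance)
    return counts

# Toy example: two utterances, aligned frame by frame to state ids.
counts = state_occupation_counts([
    ["i_314", "i_314", "i_320"],
    ["i_320", "i_314"],
])
```

The resulting counts can then serve directly as the (unnormalised) distribution information of the states.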
Accordingly, each exemplary step of the acoustic model merging method according to the first embodiment of the present invention will now be described in detail, taking the state occupation probability as an example.
First, at step S201, the distribution information of the modeling units is obtained. Specifically, as described above, the distribution information of the modeling units can be obtained empirically or by computing statistics over the training corpus. For example, such distribution information can be counted/accumulated while training the acoustic model. More specifically, the state occupation probability, for example, is obtained as the distribution information of the modeling units.
It should be emphasized here that the distribution information of a modeling unit can be obtained not only statistically from the training corpus but also from the data of the modeling unit itself. For example, the distribution information of a modeling unit can be represented by the phoneme alignment accuracy. That is, phoneme alignment is performed on the training database, and the recognition accuracy of each phoneme is then used as the distribution information of the modeling unit (phoneme) to reflect the importance, or degree of importance, of the phoneme; in other words, a phoneme with higher recognition accuracy can be made more important, so that merging of phonemes with high recognition accuracy is avoided as far as possible.
In short, the present invention imposes no restriction on the manner in which the distribution information of a modeling unit is obtained, as long as it can reflect the degree of importance of that modeling unit in the language to be recognized.
Then, at step S202, the distances of the model-composing-element pairs are calculated; that is, the distances of the pairs of model-composing elements of a given class, each pair formed by same-class model-composing elements of the first acoustic model and the second acoustic model, are calculated respectively. Specifically, the distances of the pairs of a certain class of model-composing elements of the first and second acoustic models can be calculated. More specifically, for example, the distances of state pairs can be calculated, i.e., the distances between the states of the first acoustic model serving as the main acoustic model and the states of the second acoustic model serving as the auxiliary acoustic model. Here, it is preferable to form a state pair from each state of the first acoustic model and each state of the second acoustic model, and then to calculate the distance of each state pair.
Distance calculation methods that can be used to calculate the distance of the above model-composing-element pairs include, for example, the Euclidean distance (see http://en.wikipedia.org/wiki/Euclidean_distance), the K-L distance, i.e., K-L divergence (see http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), the Mahalanobis distance (see http://en.wikipedia.org/wiki/Mahalanobis_distance), and the Bhattacharyya distance (see http://en.wikipedia.org/wiki/Bhattacharyya_distance). In fact, any distance measurement (calculation) method can be used in the present invention; that is to say, the present invention imposes no restriction on the distance calculation method used to calculate the distance of the model-composing-element pairs.
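Of the distance measures listed, the K-L divergence has a closed form for Gaussians, which is the case relevant to comparing Gaussian mixture components. A sketch for the univariate case (for the diagonal-covariance mixtures common in ASR, the per-dimension terms are simply summed; the function name is ours):

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    # Closed-form KL divergence KL(p || q) between two univariate
    # Gaussians p = N(mu_p, var_p) and q = N(mu_q, var_q).
    # Note that K-L divergence is asymmetric: KL(p||q) != KL(q||p).
    return (0.5 * math.log(var_q / var_p)
            + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
            - 0.5)
```

Because it is asymmetric, implementations often use a symmetrised variant such as KL(p||q) + KL(q||p); the patent leaves the choice of measure open.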
At step S203, the distance of each model-composing-element pair obtained at step S202 is weighted using the distribution information of the modeling units obtained at step S201, to obtain a weighted distance. Specifically, the distribution information of the modeling units obtained at step S201 is used as a weight for weighting the distance of each model-composing-element pair obtained at step S202. More specifically, for example, the distribution information of the modeling unit, used as the weight, can be multiplied by the distance of the model-composing-element pair to obtain the weighted distance.
For example, suppose the K-L distance of a state pair (i_320, i_314) of the first and second acoustic models obtained at step S202 is 52249.47, where i_320 is a state in the first acoustic model and i_314 is a state in the second acoustic model, and the state occupation probability of the state i_314 in the second acoustic model obtained at step S201 is 45614.92. The state occupation probability of the state i_314 is multiplied by the distance of the state pair (i_320, i_314), yielding the weighted distance 45614.92 × 52249.47.
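The worked example above amounts to a single multiplication per pair. A small sketch (the helper name is ours; the ids and figures are those of the text's example):

```python
def weight_distances(distances, occupation):
    # Step S203: multiply each state pair's distance by the occupation
    # statistic of the second-model state in the pair.
    return {(s1, s2): d * occupation[s2]
            for (s1, s2), d in distances.items()}

weighted = weight_distances({("i_320", "i_314"): 52249.47},
                            {"i_314": 45614.92})
```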
As described above, in the present embodiment, the distances of the model-composing-element pairs are weighted with the distribution information of the modeling units of the second acoustic model.
In another embodiment, the distances of the model-composing-element pairs can be weighted with the distribution information of the modeling units of the first acoustic model.
In other embodiments, the distances of the model-composing-element pairs can be weighted with the distribution information of the modeling units of both the first acoustic model and the second acoustic model.
For example, the distribution information of the modeling units of the first acoustic model and that of the second acoustic model can be averaged, and the result (also referred to as average distribution information) can be used as the weight for weighting the distances of the model-composing-element pairs.
As another example, the distribution information of the modeling units of the first acoustic model and that of the second acoustic model can be weighted with different weights and then summed, and the resulting distribution information can be used as the weight for weighting the distances. Specifically, for example, a weight of 0.6 is assigned to the first acoustic model and a weight of 0.4 to the second acoustic model; the distribution information of the modeling units of the two acoustic models is weighted with these two weights and then summed, and the distribution information obtained after the weighted summation is used as the weight for weighting the distances of the model-composing-element pairs. Here, the weights assigned to the first and second acoustic models are merely examples given to explain the present invention; in fact, the present invention imposes no restriction on the manner in which weights are assigned to the first and second acoustic models.
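The weighted-summation variant can be sketched as follows; the function name is ours, and 0.6/0.4 are only the illustrative weights from the text (an equal 0.5/0.5 split recovers the averaging variant):

```python
def combined_weight(info_first, info_second, w_first=0.6, w_second=0.4):
    # Weighted sum of the two models' distribution information for one
    # modeling unit; the result is then used to weight pair distances.
    return w_first * info_first + w_second * info_second
```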
Thus, in the present invention, the mode being weighted for the right distance of model-composing key element and the distributed intelligence of modeling unit is without any restriction.
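The weighting variants above can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: it assumes states are represented by mean vectors, uses the Euclidean distance between state means, takes state occupancy probabilities as the distribution information, and reuses the 0.6/0.4 split only as the example weight assignment from the text.

```python
def euclidean(a, b):
    """Euclidean distance between two mean vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def weighted_pair_distances(states1, states2, occ1, occ2, w1=0.6, w2=0.4):
    """Weight every cross-model state-pair distance by a weighted sum of the
    two models' distribution information (here: occupancy probabilities).
    All names and the mean-vector state representation are illustrative
    assumptions, not taken from the patent."""
    weighted = {}
    for s1, m1 in states1.items():
        for s2, m2 in states2.items():
            d = euclidean(m1, m2)
            # combined distribution info: w1 * occ(model 1) + w2 * occ(model 2)
            weighted[(s1, s2)] = (w1 * occ1[s1] + w2 * occ2[s2]) * d
    return weighted
```

Averaging the two models' distribution information is the special case w1 = w2 = 0.5; using only one model's information corresponds to setting the other weight to zero.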
Then, at step S204, the above model component pairs are sorted according to the weighted distances, so as to obtain the nearest model components of the first and second acoustic models. The sorting can be performed in ascending or descending order; the present invention imposes no particular restriction on this.
Specifically, for example, for the nearest states among the states of the first and second acoustic models, the nearest state pair of the two acoustic models can be determined by the minimum sum of the distances between the Gaussian mixtures of two states, one from each of the two models. Alternatively, the nearest states of the two acoustic models can be determined by directly comparing, for example, the Gaussian mixtures of two states, or the hidden Markov models (HMMs) of two modeling units (such as phonemes), of the first and second acoustic models. For example, if the state pair (i_320, i_314) is a nearest state pair, then for state i_314 of the second acoustic model, the distance between state i_320 of the first acoustic model and state i_314 is smaller than that between any other state of the first acoustic model and state i_314. In other words, for state i_320 of the first acoustic model, the distance between state i_314 of the second acoustic model and state i_320 is smaller than that between any other state of the second acoustic model and state i_320.
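One possible reading of the minimum-sum criterion above can be sketched as follows. The exact formula is an assumption, since the text leaves it open: here the distance between two states (each a list of Gaussian component means) is the sum, over the components of one state, of the distance to the closest component of the other state, and the nearest pairs are found by sorting all cross-model pairs.

```python
def component_min_sum(gaussians1, gaussians2):
    """Distance between two states, each given as a list of Gaussian mean
    vectors: for each component of state 1, take the distance to its closest
    component of state 2, and sum. This is one plausible reading of the
    'minimum sum' criterion, not necessarily the patent's exact formula."""
    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(min(euclidean(g1, g2) for g2 in gaussians2) for g1 in gaussians1)

def rank_state_pairs(model1, model2):
    """Sort all (model-1 state, model-2 state) pairs in ascending order of
    distance; the head of the list holds the nearest state pairs."""
    pairs = [(s1, s2, component_min_sum(gm1, gm2))
             for s1, gm1 in model1.items()
             for s2, gm2 in model2.items()]
    return sorted(pairs, key=lambda p: p[2])
```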
Then, at step S205, the first acoustic model and the second acoustic model are combined according to the sorting result, to obtain the merged acoustic model.
Here, the second acoustic model can be combined into the first acoustic model without changing the first acoustic model, thereby preserving the speech recognition accuracy of the first acoustic model serving as the main acoustic model. Of course, the first acoustic model can also be combined into the second acoustic model.
In addition, as mentioned above, the first acoustic model may also serve as the auxiliary acoustic model and the second acoustic model as the main acoustic model. In this case, too, the first acoustic model can be combined into the second acoustic model, or the second acoustic model into the first acoustic model. Therefore, in the present invention, the acoustic models can be merged in any manner, that is, any merging manner is applicable to the present invention; hence, the present invention need not restrict the manner in which acoustic models are merged.
Preferably, in the present invention, according to the sorting result, model components in the second acoustic model can be replaced by model components in the first acoustic model and the first and second acoustic models thus combined, thereby obtaining the merged acoustic model.
Of course, similarly to the above merging of acoustic models, model components in the first acoustic model can also be replaced by model components in the second acoustic model to combine the first and second acoustic models, thereby obtaining the merged acoustic model.
In addition, besides the preferable replacement-based realization, the acoustic model merging operation can also be realized in other ways. For example, the acoustic models can be merged by computing a weighted average of the parameters of the first and second acoustic models (i.e., the above model components, such as means, variances, Gaussian mixtures, states, hidden Markov models, etc.). Specifically, this weighted-averaging manner is as follows: weights are assigned to a parameter of the first acoustic model and the corresponding parameter of the second acoustic model, the weighted average of the two parameters is computed, and this weighted average is taken as the parameter of the new acoustic model (i.e., the merged acoustic model).
The present invention will be further described below with replacement taken as an example only; however, those of ordinary skill in the art will understand that the present invention is not limited to such a manner, and that this manner is merely one of the preferred embodiments.
For example, the nearest states in the first acoustic model obtained at step S204 can be used to replace the corresponding states in the second acoustic model (i.e., the states of the second acoustic model nearest to those states of the first acoustic model); then, the first acoustic model and the replaced second acoustic model (i.e., the modified second acoustic model) are combined, obtaining the merged acoustic model (also referred to as the bound acoustic model). Thus, at least one state in the second acoustic model is replaced by a state in the first acoustic model, and the replaced states in the second acoustic model can be deleted. Thereby, the size of the second acoustic model, and hence the size of the bound acoustic model, can be reduced.
Here, the number of states replaced in the second acoustic model can be determined according to the practical application. Preferably, the number of replaced states is not greater than a preset threshold. If more states are replaced, the size of the bound acoustic model decreases more, but the performance of the bound acoustic model may degrade. For example, if 950 states in the second acoustic model are replaced by states in the first acoustic model, the 950 corresponding original states in the second acoustic model are deleted, and the size of the bound acoustic model is reduced by 950 states.
Therefore, preferably, a suitable threshold can be set for the number of states replaced in the second acoustic model according to the actual conditions.
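The thresholded replacement step can be sketched as follows — a minimal illustration assuming a ranked pair list of the form (model-1 state, model-2 state, distance) sorted nearest-first, with states stored in plain dicts; all names are hypothetical:

```python
def replace_states(model1, model2, ranked_pairs, threshold):
    """Replace at most `threshold` states of model 2 by their nearest
    model-1 states, walking the ranked pair list from nearest to farthest.
    Each replaced model-2 state now shares the model-1 parameters, so the
    bound model shrinks by the number of replacements."""
    modified = dict(model2)
    replaced = []
    for s1, s2, _ in ranked_pairs:
        if len(replaced) >= threshold:
            break  # the preset threshold caps how many states are replaced
        if s2 in replaced:
            continue  # each model-2 state is replaced at most once
        modified[s2] = model1[s1]
        replaced.append(s2)
    return modified, replaced
```

With a larger threshold the bound model gets smaller, at the possible cost of recognition performance, matching the trade-off described above.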
With the above first embodiment of the present invention, acoustic models can be merged so as to appropriately and effectively reduce the size of the acoustic model without significantly degrading its performance, thereby making it possible to perform speech recognition accurately and efficiently with the obtained multilingual acoustic model. In particular, the present invention is realized by selecting and replacing only the less important model components in the acoustic model, based on consideration of the importance of the modeling units.
[Second Embodiment]
In the first embodiment, the distribution information of the modeling units, for example the state occupancy probability, is used as the weight for weighting the distances of model component pairs formed by model components of a single category of the acoustic models. The key difference of the second embodiment from the first embodiment is that the distribution information of the modeling units can be used as the weight for weighting, respectively, the distances of model component pairs formed by model components of at least two categories of the acoustic models.
Below, the second embodiment of the present invention will be described in detail with reference to Fig. 3.
Fig. 3 is a flowchart exemplarily illustrating an acoustic model merging method according to the second embodiment of the present invention.
First, at step S301, similarly to step S201 in the first embodiment, the distribution information of the modeling units is obtained. Specifically, the distribution information of the modeling units can be obtained based on experience, or by statistics over a training corpus.
As in the first embodiment, the distribution information of the modeling units can be obtained not only by statistics over a training corpus but also from the data of the modeling units themselves. In short, the present invention imposes no restriction on the manner of obtaining the distribution information of the modeling units, as long as it reflects the importance of the modeling units in the language to be recognized.
Then, at step S302, for the model components of one category, similarly to step S202 in the first embodiment, the distances of the model component pairs of that category are calculated. Specifically, for example, the model components of said one category can be states; that is, the distances of the state pairs can be computed, i.e., the distances between the states of the first acoustic model serving as the main acoustic model and the states of the second acoustic model serving as the auxiliary acoustic model. Similarly to the first embodiment, here, preferably, each state of the first acoustic model is paired with each state of the second acoustic model, and then the distance of each state pair is calculated.
Here, as mentioned above, the present invention imposes no restriction on the method of calculating the distances of the model component pairs.
At step S303, similarly to step S203 in the first embodiment, the distribution information of the modeling units obtained at step S301 is used to weight the distances of the model component pairs of said category obtained at step S302, to obtain weighted distances. For example, the distribution information of the modeling units, serving as the weight, can be multiplied with the distances of the model component pairs of said category to obtain the weighted distances.
As described above, the distances of the model component pairs can be weighted with the distribution information of the modeling units of the second acoustic model, with that of the first acoustic model, or with the distribution information of the modeling units of both the first and second acoustic models (obtained, for example, by averaging the distribution information of the two, or by weighted summation). In short, in the present invention, there is no restriction on the manner in which the distances of the model component pairs are weighted with the distribution information of the modeling units.
Then, at step S304, similarly to step S204 in the first embodiment, the model component pairs of said category are sorted according to the weighted distances, to obtain the nearest model components of the first and second acoustic models. The sorting can be performed in ascending or descending order; the present invention imposes no particular restriction on this.
Then, at step S305, the first acoustic model and the second acoustic model are combined to obtain the merged acoustic model. Preferably, the second acoustic model is combined into the first acoustic model without changing the first acoustic model, thereby preserving the speech recognition accuracy of the first acoustic model serving as the main acoustic model.
The difference between the second embodiment and the first embodiment lies mainly in the acoustic model merging step S305. Step S305 will be described in detail below to show this difference.
At step S3051, according to the sorting result of the model component pairs of said category, the corresponding model components in the second acoustic model are replaced by the model components of said category in the first acoustic model, to obtain a first modified second acoustic model.
As described in the first embodiment, when the model components of said category are states, for example, the nearest states in the first acoustic model obtained at step S304 can be used to replace the corresponding states in the second acoustic model (i.e., the states of the second acoustic model nearest to those states of the first acoustic model), thereby obtaining the replaced second acoustic model (i.e., the first modified second acoustic model).
Here, preferably, the number of states replaced in the above second acoustic model is not greater than a preset threshold Th1.
Then, at step S3052, for the model components of other categories different from said category, the distances of the model component pairs of those other categories, formed by the model components of said other categories of the first acoustic model and those of the second acoustic model, are calculated respectively. Here, said other categories can comprise at least one category. For example, when the model components of said one category are states, the model components of said other categories can be hidden Markov models, Gaussian mixtures, variances and/or means, etc.
At step S3053, similarly to step S303, the distribution information of the modeling units obtained at step S301 is used to weight the distance of each corresponding model component pair of the other categories.
At step S3054, similarly to step S304, the model component pairs of said other categories are sorted according to the distances weighted at step S3053.
At step S3055, similarly to step S3051, according to the sorting results of the model component pairs of said other categories, the corresponding model components of said other categories in the second acoustic model are replaced by the model components with the minimum distances in said other categories in the first acoustic model, thereby obtaining at least one second modified second acoustic model.
At step S3056, the first modified second acoustic model and said at least one second modified second acoustic model are weighted by weights, and then the weighted first modified second acoustic model and said at least one second modified second acoustic model are combined, to obtain a mixed second acoustic model.
Here, the weights assigned to the first modified second acoustic model and the second modified second acoustic model(s) can, for example, lie between 0 and 1, but the present invention is not limited thereto.
At step S3057, the first acoustic model is combined with the mixed second acoustic model, to obtain the merged acoustic model (i.e., the bound acoustic model).
Here, it should be noted that the execution order of the steps is not necessarily that shown in the flowchart and described above, but can be adjusted flexibly according to the actual conditions; that is, the present invention should not be limited by the execution order of the steps shown in the flowchart.
For example, step S3051 for obtaining the first modified second acoustic model can be located before steps S3052~S3056 for obtaining the second modified second acoustic model(s), as in Fig. 3 and as described in this specification, or after said steps S3052~S3056; this does not affect the essence of the present invention.
In addition, through steps S3052~S3056, one second modified second acoustic model can be obtained, or a plurality of second modified second acoustic models can be obtained; those of ordinary skill in the art will understand the variations therein. Moreover, for convenience of description, Fig. 3 shows steps S3052~S3056 being performed once to obtain at least one (i.e., one or more) second modified second acoustic model; in fact, each execution of steps S3052~S3056 or similar steps may obtain one second modified second acoustic model.
Here, it should be noted that each second modified second acoustic model corresponds to model components of a different category.
In addition, the number of weights used at step S3056 to weight the first modified second acoustic model and the at least one second modified second acoustic model corresponds to the number of second modified second acoustic models. That is, when one second modified second acoustic model is obtained at step S3055, the number of weights used at step S3056 is 2 (i.e., the number of second modified second acoustic models + 1). When two second modified second acoustic models are obtained at step S3055, the number of weights used at step S3056 is 3.
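The mixing at step S3056 can be sketched as a per-state weighted combination of the modified models. This is a hedged illustration: representing each state as a flat parameter vector is an assumption, but the weight count equals the number of second modifications plus one, as stated above.

```python
def mix_modified_models(first_mod, second_mods, weights):
    """Weighted per-state mix of the first modified second acoustic model and
    the second modified second acoustic model(s). The dict-of-parameter-vectors
    representation is illustrative, not taken from the patent."""
    # number of weights = number of second modified models + 1, per the text
    assert len(weights) == len(second_mods) + 1
    models = [first_mod] + list(second_mods)
    mixed = {}
    for sid in first_mod:
        dim = len(first_mod[sid])
        mixed[sid] = tuple(
            sum(w * m[sid][i] for w, m in zip(weights, models))
            for i in range(dim)
        )
    return mixed
```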
In the present embodiment, by using model components of different categories, a more accurate speech recognition result can be obtained than by using model components of only one category as in the first embodiment.
[Third Embodiment]
In the first and second embodiments, the first acoustic model and the second acoustic model are merged to obtain the bound acoustic model. The key difference of the third embodiment from the first and second embodiments is that more than two acoustic models can be merged to obtain the bound acoustic model.
Below, the third embodiment of the present invention will be described in detail with reference to Fig. 4.
Fig. 4 is a flowchart exemplarily illustrating an acoustic model merging method according to the third embodiment of the present invention.
Steps S401~S405 in Fig. 4 can be similar to steps S201~S205 in the first embodiment, or similar to steps S301~S305 in the second embodiment.
At step S406, other acoustic models, different from the first and second acoustic models, can further be merged. The merging with said other acoustic models can adopt the methods described above in the first or second embodiment of the present invention, or other methods, for example conventional methods in the prior art; the present invention is not limited in this respect.
In addition, said other acoustic models can comprise at least one acoustic model. In the case where said other acoustic models are a plurality of acoustic models, they can be merged one by one with the merged acoustic model obtained at step S405. The present invention imposes no restriction on this, either.
In addition, in the case of merging more than two acoustic models, the manner of merging is similar to that of merging two acoustic models; that is, as mentioned above, the present invention imposes no restriction on the manner of merging acoustic models, and any merging manner is applicable to the present invention.
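Merging more than two models one by one, as described for step S406, reduces to folding a two-model merge over the list — a minimal sketch assuming any two-model merge function of the kind discussed in the first or second embodiment is available:

```python
def merge_many(models, merge_two):
    """Fold a pairwise merge over a list of acoustic models: merge the first
    two, then merge each remaining model into the running result, as in the
    one-by-one merging of step S406. `merge_two` is any two-model merge
    function; its signature here is an illustrative assumption."""
    if not models:
        raise ValueError("at least one acoustic model is required")
    merged = models[0]
    for other in models[1:]:
        merged = merge_two(merged, other)
    return merged
```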
[Fourth Embodiment]
Hereinafter, with reference to Fig. 5~Fig. 7, an exemplary configuration of an acoustic model merging device 1000 according to an embodiment of the present invention, for merging a plurality of acoustic models comprising a first acoustic model and a second acoustic model, is described.
Fig. 5 is a schematic block diagram illustrating an exemplary configuration of the acoustic model merging device according to an embodiment of the present invention. Fig. 6 is a schematic block diagram illustrating an exemplary configuration of the merging unit in the acoustic model merging device according to an embodiment of the present invention. Fig. 7 is a schematic block diagram illustrating another exemplary configuration of the merging unit in the acoustic model merging device according to an embodiment of the present invention.
The acoustic model merging device 1000 according to an embodiment of the present invention can comprise: a distribution information obtaining unit 1001 configured to obtain the distribution information of the modeling units of at least the first and/or second acoustic model, wherein said distribution information can reflect the importance of said modeling units in the language to be recognized; a distance calculating unit 1002 configured to calculate, respectively, the distances of the model component pairs of one category, formed by the model components of the same category of the first acoustic model and the second acoustic model; a weighting unit 1003 configured to weight the distance of each corresponding model component pair of said category with said distribution information; a sorting unit 1004 configured to sort the model component pairs of said category according to the weighted distances; and a merging unit 1005 configured to combine the first acoustic model and the second acoustic model according to the sorting result, to obtain the merged acoustic model.
Wherein, said weighting unit 1003 can perform said weighting in any one of the following ways:
multiplying said distribution information of the modeling units of the first acoustic model with the distances of the corresponding model component pairs of said category;
multiplying said distribution information of the modeling units of the second acoustic model with the distances of the corresponding model component pairs of said category;
averaging the distribution information of the modeling units of the first and second acoustic models, thereby obtaining the mean value of the distribution information of the modeling units of the first and second acoustic models, and multiplying said mean value with the distances of the corresponding model component pairs of said category; and
weighting and summing the distribution information of the modeling units of the first and second acoustic models by predetermined different weights, thereby obtaining the weighted sum of the distribution information of the modeling units of the first and second acoustic models, and multiplying said weighted sum with the distances of the corresponding model component pairs of said category.
In addition, as shown in Fig. 6, said merging unit 1005 according to an embodiment of the present invention can comprise: a replacing component 10051 configured to, according to the sorting result, replace the corresponding model components in the model component pairs of said category in the second acoustic model by the model components with the minimum distances in said category in the first acoustic model, to obtain a first modified second acoustic model; and a combining component 10052 configured to combine the first acoustic model with the first modified second acoustic model, to obtain the merged acoustic model.
Alternatively, as shown in Fig. 7, said merging unit according to an embodiment of the present invention can comprise: a first replacing component 10051' configured to, according to the sorting result, replace the corresponding model components in the model component pairs of said category of the second acoustic model by the model components with the minimum distances in said category of the first acoustic model, to obtain a first modified second acoustic model; a second distance calculating component 10052' configured to calculate, respectively, for the model components of other categories different from said category, the distances of the model component pairs of said other categories, formed by the model components of said other categories of the first acoustic model and the second acoustic model, wherein said other categories comprise at least one category; a second weighting component 10053' configured to weight the distance of each corresponding model component pair of said other categories with said distribution information; a second sorting component 10054' configured to sort the model component pairs of said other categories according to the weighted distances; a second replacing component 10055' configured to, according to the sorting results of the model component pairs of said other categories, replace the corresponding model components of said other categories in the second acoustic model by the model components with the minimum distances in said other categories in the first acoustic model, thereby obtaining at least one second modified second acoustic model; a mixed weighting component 10056' configured to weight the first modified second acoustic model and said at least one second modified second acoustic model by weights, and then combine the weighted first modified second acoustic model and said at least one second modified second acoustic model, to obtain a mixed second acoustic model; and a combining component 10057' configured to combine the first acoustic model with the mixed second acoustic model, to obtain the merged acoustic model.
In addition, said merging unit can be further configured to merge the merged acoustic model with acoustic models other than the first and second acoustic models.
With the above acoustic model merging device according to an embodiment of the present invention, acoustic models can be merged so as to appropriately and effectively reduce the size of the acoustic model without significantly degrading its performance, thereby making it possible to perform speech recognition accurately and efficiently with the obtained merged acoustic model.
[Fifth Embodiment]
Hereinafter, a speech recognition system 10 according to an embodiment of the present invention is described with reference to Fig. 8.
Fig. 8 is a schematic block diagram showing the configuration of the speech recognition system according to an exemplary embodiment of the present invention.
The speech recognition system 10 of an embodiment of the present invention can comprise the acoustic model merging device 1000 according to the present invention.
In addition, the speech recognition method according to an embodiment of the present invention can perform speech recognition with the acoustic model obtained by the acoustic model merging method according to the present invention.
With the above speech recognition system and speech recognition method according to embodiments of the present invention, acoustic models can be merged so as to appropriately and effectively reduce the size of the acoustic model without significantly degrading its performance, thereby making it possible to perform speech recognition accurately and efficiently with the obtained merged acoustic model.
In addition, the present invention can be applied to various electronic devices comprising a speech recognition system, said electronic devices including, but not limited to, audio devices (e.g., MP3/MP4 players), video devices, tablet computers, computers, PDAs, mobile phones, etc.
In addition, note that the acoustic model merging method and acoustic model merging device of the present invention can be implemented in many ways, for example by software, hardware, firmware or any combination thereof. The order of the above method steps is merely exemplary, and the method steps of the present invention are not limited to the order specifically described above, unless otherwise clearly stated. Furthermore, in some embodiments, the present invention can also be implemented as a program recorded in a recording medium, comprising machine-readable instructions for realizing the method according to the present invention. Thus, the present invention also covers the recording medium storing the program for realizing the method according to the present invention.
Although some specific embodiments of the present invention have been shown in detail by way of example, it should be understood by those skilled in the art that the above examples are intended to be merely exemplary and not to limit the scope of the present invention. It should be understood by those skilled in the art that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims (20)

1. An acoustic model merging method for merging a plurality of acoustic models comprising a first acoustic model and a second acoustic model, comprising:
a distribution information obtaining step of obtaining the distribution information of the modeling units of at least the first and/or second acoustic model, wherein said distribution information can reflect the importance of said modeling units in the language to be recognized;
a distance calculating step of calculating, respectively, the distances of the model component pairs of one category, formed by the model components of the same category of the first acoustic model and the second acoustic model;
a weighting step of weighting the distance of each corresponding model component pair of said category with said distribution information;
a sorting step of sorting the model component pairs of said category according to the weighted distances; and
a combining step of combining the first acoustic model and the second acoustic model according to the sorting result, to obtain the merged acoustic model.
2. The acoustic model merging method according to claim 1, wherein said model components are at least one of means, variances, Gaussian mixtures, states and hidden Markov models.
3. The acoustic model merging method according to claim 1, wherein the distribution information of said modeling units is the frequency or duration with which the modeling units occur in the training corpus of the corresponding acoustic model.
4. The acoustic model merging method according to claim 1, wherein said distance is one of the Euclidean distance, K-L divergence, Mahalanobis distance and Bhattacharyya distance.
5. The acoustic model merging method according to any one of claims 1~4, wherein said weighting can be performed in any one of the following ways:
multiplying said distribution information of the modeling units of the first acoustic model with the distances of the corresponding model component pairs of said category;
multiplying said distribution information of the modeling units of the second acoustic model with the distances of the corresponding model component pairs of said category;
averaging the distribution information of the modeling units of the first and second acoustic models, thereby obtaining the mean value of the distribution information of the modeling units of the first and second acoustic models, and multiplying said mean value with the distances of the corresponding model component pairs of said category; and
weighting and summing the distribution information of the modeling units of the first and second acoustic models by predetermined different weights, thereby obtaining the weighted sum of the distribution information of the modeling units of the first and second acoustic models, and multiplying said weighted sum with the distances of the corresponding model component pairs of said category.
6. The acoustic model merging method according to claim 1, wherein the merging step comprises:
a replacement step of, according to the result of the sorting, replacing the corresponding model-composing element of that class in the second acoustic model with the model-composing element of that class in the first acoustic model having the smallest distance, to obtain a first modified second acoustic model; and
a combination step of combining the first acoustic model with the first modified second acoustic model to obtain the merged acoustic model.
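A minimal sketch of the replacement step, under our own simplifying assumptions: each model is a mapping from a shared element identifier to a model-composing element, and `weighted_dist` maps the same identifiers to the weighted pair distances. The closest pairs (up to a replacement count, cf. the threshold of claim 9) have their element in the second model replaced by the element of the first model.

```python
def merge_by_replacement(model1, model2, weighted_dist, num_replace):
    """Sketch: replace the num_replace closest (by weighted distance)
    model-composing elements of model2 with those of model1."""
    # Sort element ids by ascending weighted distance (the claimed sorting).
    order = sorted(weighted_dist, key=weighted_dist.get)
    modified2 = dict(model2)  # the "first modified second acoustic model"
    for key in order[:num_replace]:
        modified2[key] = model1[key]
    # The merged model combines model1 with the modified model2.
    return {"first": model1, "modified_second": modified2}
```

With this scheme the elements the two models already agree on (smallest weighted distance) are shared, which is what shrinks the combined model relative to keeping both models whole.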
7. The acoustic model merging method according to claim 1, wherein the merging step comprises:
a first replacement step of, according to the result of the sorting, replacing the corresponding model-composing element of that class in the second acoustic model with the model-composing element of that class in the first acoustic model having the smallest distance, to obtain a first modified second acoustic model;
a second distance calculation step of, for model-composing elements of other classes different from that class, respectively calculating the distances of the model-composing element pairs of the other classes, each pair being formed by the model-composing elements of the other classes of the first acoustic model and the second acoustic model, wherein the other classes comprise at least one class;
a second weighting step of weighting, by using the distribution information, the distance of each corresponding model-composing element pair of the other classes;
a second sorting step of sorting the model-composing element pairs of each of the other classes according to the weighted distances;
a second replacement step of, according to the sorting result of the model-composing element pairs of each of the other classes, replacing the corresponding model-composing elements of the other classes in the second acoustic model with the model-composing elements of the other classes in the first acoustic model having the smallest distances, thereby obtaining at least one second modified second acoustic model;
a mixture weighting step of weighting the first modified second acoustic model and the at least one second modified second acoustic model by respective weights, and then combining the weighted first modified second acoustic model and the weighted at least one second modified second acoustic model to obtain a mixed second acoustic model; and
a combination step of combining the first acoustic model with the mixed second acoustic model to obtain the merged acoustic model.
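The mixture weighting step can be sketched as an element-wise weighted combination of the modified second acoustic models; the exact combination operator is not specified in the claim, so the element-wise sum below is our assumption for numeric elements such as Gaussian means.

```python
def mix_modified_models(first_mod, second_mods, weights):
    """Sketch of the mixture weighting step of claim 7: combine the first
    modified and the second modified second acoustic models element-wise,
    each scaled by its weight, into a mixed second acoustic model."""
    models = [first_mod] + list(second_mods)
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    # Assumes all models share the same element identifiers.
    return {key: sum(w * m[key] for w, m in zip(weights, models))
            for key in first_mod}
```

The mixed second acoustic model is then combined with the first acoustic model exactly as in claim 6.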
8. The acoustic model merging method according to claim 6 or 7, further comprising: merging the merged acoustic model with an acoustic model other than the first and second acoustic models.
9. The acoustic model merging method according to claim 6 or 7, wherein the replacement of model-composing elements proceeds until the number of replaced model-composing elements reaches a corresponding predetermined threshold.
10. A speech recognition method, comprising:
performing speech recognition by using the acoustic model obtained by the acoustic model merging method according to any one of claims 1 to 9.
11. An acoustic model merging apparatus for merging a plurality of acoustic models including a first acoustic model and a second acoustic model, comprising:
a distribution information obtaining unit configured to obtain distribution information of the modeling units of at least the first and/or second acoustic model, wherein the distribution information reflects the importance of the modeling units in the language to be recognized;
a distance calculation unit configured to respectively calculate the distances of model-composing element pairs of a class, each pair being formed by model-composing elements of the same class of the first acoustic model and the second acoustic model;
a weighting unit configured to weight, by using the distribution information, the distance of each corresponding model-composing element pair of that class;
a sorting unit configured to sort the model-composing element pairs of that class according to the weighted distances; and
a merging unit configured to combine, according to the result of the sorting, the first acoustic model and the second acoustic model to obtain a merged acoustic model.
12. The acoustic model merging apparatus according to claim 11, wherein the model-composing element is at least one of a mean, a variance, a Gaussian mixture, a state, and a hidden Markov model.
13. The acoustic model merging apparatus according to claim 11, wherein the distribution information of the modeling unit is the frequency or duration with which the modeling unit occurs in the training corpus of the acoustic model corresponding to it.
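As a concrete illustration of claim 13 (our own sketch, with phones as the modeling units), the frequency-based distribution information can be estimated by counting occurrences in the training transcripts of the corresponding acoustic model:

```python
from collections import Counter

def modeling_unit_frequencies(transcripts):
    """Sketch: relative occurrence frequency of each modeling unit
    (here, a phone) in a training corpus given as a list of
    phone-sequence transcripts."""
    counts = Counter(phone for utterance in transcripts
                     for phone in utterance)
    total = sum(counts.values())
    return {phone: n / total for phone, n in counts.items()}
```

Duration-based distribution information would be obtained analogously by summing per-unit segment durations from a forced alignment instead of counting occurrences.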
14. The acoustic model merging apparatus according to claim 11, wherein the distance is one of the Euclidean distance, the K-L divergence, the Mahalanobis distance, and the Bhattacharyya distance.
15. The acoustic model merging apparatus according to any one of claims 11 to 14, wherein the weighting unit is further configured to perform the weighting in any one of the following ways:
multiplying the distribution information of the modeling unit of the first acoustic model by the distance of the corresponding model-composing element pair of that class;
multiplying the distribution information of the modeling unit of the second acoustic model by the distance of the corresponding model-composing element pair of that class;
averaging the distribution information of the modeling units of the first acoustic model and the second acoustic model to obtain a mean value of the distribution information, and multiplying the mean value by the distance of the corresponding model-composing element pair of that class; and
weighting and summing, with predetermined different weights, the distribution information of the modeling units of the first acoustic model and the second acoustic model to obtain a weighted sum of the distribution information, and multiplying the weighted sum by the distance of the corresponding model-composing element pair of that class.
16. The acoustic model merging apparatus according to claim 11, wherein the merging unit comprises:
a replacement component configured to, according to the result of the sorting, replace the corresponding model-composing element of that class in the second acoustic model with the model-composing element of that class in the first acoustic model having the smallest distance, to obtain a first modified second acoustic model; and
a combination component configured to combine the first acoustic model with the first modified second acoustic model to obtain the merged acoustic model.
17. The acoustic model merging apparatus according to claim 11, wherein the merging unit comprises:
a first replacement component configured to, according to the result of the sorting, replace the corresponding model-composing element of that class in the second acoustic model with the model-composing element of that class in the first acoustic model having the smallest distance, to obtain a first modified second acoustic model;
a second distance calculation component configured to, for model-composing elements of other classes different from that class, respectively calculate the distances of the model-composing element pairs of the other classes, each pair being formed by the model-composing elements of the other classes of the first acoustic model and the second acoustic model, wherein the other classes comprise at least one class;
a second weighting component configured to weight, by using the distribution information, the distance of each corresponding model-composing element pair of the other classes;
a second sorting component configured to sort the model-composing element pairs of each of the other classes according to the weighted distances;
a second replacement component configured to, according to the sorting result of the model-composing element pairs of each of the other classes, replace the corresponding model-composing elements of the other classes in the second acoustic model with the model-composing elements of the other classes in the first acoustic model having the smallest distances, thereby obtaining at least one second modified second acoustic model;
a mixture weighting component configured to weight the first modified second acoustic model and the at least one second modified second acoustic model by respective weights, and then combine the weighted first modified second acoustic model and the weighted at least one second modified second acoustic model to obtain a mixed second acoustic model; and
a combination component configured to combine the first acoustic model with the mixed second acoustic model to obtain the merged acoustic model.
18. The acoustic model merging apparatus according to claim 16 or 17, wherein the merging unit is further configured to merge the merged acoustic model with an acoustic model other than the first and second acoustic models.
19. The acoustic model merging apparatus according to claim 16 or 17, wherein the replacement of model-composing elements proceeds until the number of replaced model-composing elements reaches a corresponding predetermined threshold.
20. A speech recognition system, comprising the acoustic model merging apparatus according to any one of claims 11 to 19.
CN201310182399.5A 2013-05-17 2013-05-17 Acoustic model merging method and apparatus, and speech recognition method and system Active CN104167206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310182399.5A CN104167206B (en) 2013-05-17 2013-05-17 Acoustic model merging method and apparatus, and speech recognition method and system

Publications (2)

Publication Number Publication Date
CN104167206A true CN104167206A (en) 2014-11-26
CN104167206B CN104167206B (en) 2017-05-31

Family

ID=51910987

Country Status (1)

Country Link
CN (1) CN104167206B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971735A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Method and system for periodically updating cached voiceprint-recognition training sentences
CN108305619A (en) * 2017-03-10 2018-07-20 腾讯科技(深圳)有限公司 Speech data set training method and apparatus
CN108305619B (en) * 2017-03-10 2020-08-04 腾讯科技(深圳)有限公司 Speech data set training method and apparatus
CN109559749A (en) * 2018-12-24 2019-04-02 苏州思必驰信息科技有限公司 Joint decoding method and system for a speech recognition system
CN109559749B (en) * 2018-12-24 2021-06-18 思必驰科技股份有限公司 Joint decoding method and system for a speech recognition system
CN110364162A (en) * 2018-11-15 2019-10-22 腾讯科技(深圳)有限公司 Remapping method and apparatus for artificial intelligence, and storage medium
CN113272518A (en) * 2018-11-29 2021-08-17 Bp探索操作有限公司 DAS data processing to identify fluid inflow locations and fluid types

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167377A (en) * 1997-03-28 2000-12-26 Dragon Systems, Inc. Speech recognition language models
WO2004055780A1 (en) * 2002-12-16 2004-07-01 Koninklijke Philips Electronics N.V. Method of creating an acoustic model for a speech recognition system
US6868381B1 (en) * 1999-12-21 2005-03-15 Nortel Networks Limited Method and apparatus providing hypothesis driven speech modelling for use in speech recognition
CN101114449A (en) * 2006-07-26 2008-01-30 大连三曦智能科技有限公司 Model training method for speaker-independent isolated words, and recognition system and method
CN101727901A (en) * 2009-12-10 2010-06-09 清华大学 Chinese-English bilingual speech recognition method for an embedded system

Also Published As

Publication number Publication date
CN104167206B (en) 2017-05-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant