WO2000051107A1 - Speech recognition and signal analysis by straight search of subsequences with maximal confidence measure - Google Patents

Speech recognition and signal analysis by straight search of subsequences with maximal confidence measure

Info

Publication number
WO2000051107A1
Authority
WO
WIPO (PCT)
Prior art keywords
subsequences
recognition
confidence measure
phoneme
normalization
Prior art date
Application number
PCT/IB2000/000189
Other languages
French (fr)
Inventor
Marius Calin Silaghi
Original Assignee
Marius Calin Silaghi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Marius Calin Silaghi filed Critical Marius Calin Silaghi
Publication of WO2000051107A1 publication Critical patent/WO2000051107A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/12Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/10Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting

Definitions

  • KWS — keyword spotting
  • DTW — Dynamic Time Warping
  • IVD — Iterating Viterbi Decoding
  • Next, 'Beam search' algorithms are implemented according to the description in the corresponding sections. For each pair P = (sample, state), one computes for each corresponding path the sum and length in the last phoneme, as well as the sum of the normalized cumulated posteriors of the previous phonemes (and their number). The entrance and exit samples of the HMM M are also computed and propagated as in the previous method, in order to ensure the localization of the subsequence.
  • The 'threshold' is chosen at the desired point of the ROC curve obtained in tests.
  • Successful alternatives can undergo higher-level tests, for example a confirmation question for speech recognition, the opinion of an operator, etc.
  • Posteriors are obtained by computing a distance between the color of the model and that of the element in the section of the image. If the context requires it, the image is preprocessed to ensure a certain normalization (e.g., changing lighting conditions make a histogram-based transformation necessary).
  • The phonemes of speech recognition correspond to parts of the object.
  • The structure: the existence of transitions and their probabilities.
  • One direction is scanned to detect the best fit; afterwards, other directions are scanned to discover new fits, as well as to test the previous ones.
  • The final test is certified by classical methods such as cross-correlation or by analysis of the contours at the hypothesized position.
  • Keyword recognition is beginning to be used in automated answering systems for banking and telephony, as well as in automata for control, sales, and information.
  • The method offers the possibility to recognize keywords in spontaneous speech with multiple speakers.

Abstract

The invention belongs to the technical domain of decoding, classification, alignment and matching of data. The invention refers to new methods for keyword spotting in utterances, detection of subsequences in chains of organic matter (DNA) and recognition of objects in images. The proposed methods search, in an optimized way, for the matching that maximizes, over all possible matchings, certain confidence measures based on normalized posteriors. Three such confidence measures are used: two are inspired by prior work in speech recognition, and the third is new. Application fields for this invention are: man-machine interfaces (using speech recognition, e.g., control systems, banking, flight services, etc.), coordination systems (for industrial robots and automata) and development systems for pharmaceutical products.

Description

Speech Recognition and Signal Analysis by straight Search of Subsequences with Maximal Confidence
Measure
1 Field of the invention
The invention relates to a common component of:
• Speech Recognition
• Keyword Spotting
• Segments Alignment for DNA and proteins (Human Genome)
• Recognition of Objects in Images
2 Background Art
This invention addresses the problem of keyword spotting (KWS) in unconstrained speech without explicit modeling of non-keyword segments (typically done by using filler HMM models or an ergodic HMM composed of context-dependent or context-independent phone models without lexical constraints). Although several algorithms (sometimes referred to as "sliding model methods") tackling this type of problem have already been proposed in the past, e.g., by using Dynamic Time Warping (DTW) [4] or Viterbi matching [9] allowing relaxation of the (begin and endpoint) constraints, these are known to require an "appropriate" normalization of the matching scores, since segments of different lengths then have to be compared. However, given this normalization and the relaxation of begin/endpoints, straightforward Dynamic Programming (DP) is no longer optimal (or, in other words, the DP optimality principle is no longer valid) and has to be adapted, involving more memory and CPU. Indeed, at any possible ending time e, the match score of the best warp and start time b of the reference has to be computed [4] (for all possible start times b associated with unpruned paths). Moreover, in [9], and in the same spirit as what is presented here, for all possible ending times e, the average observation likelihood along the most likely state sequence is used as the scoring criterion. Finally, this adapted DP quickly becomes even more complex (or intractable) for more advanced scoring criteria (such as the confidence measures mentioned below).
More recently, in work on confidence levels in the framework of hybrid HMM/ANN systems, it was shown [1] that the use of accumulated local posterior probabilities (as obtained at the output of a multilayer perceptron), normalized by the length of the word segment (or, better, involving a double normalization over the number of phones and the number of acoustic frames in each phone), yields good confidence measures and good scores for the re-estimation of N-best hypotheses. Similar work, where this kind of confidence measure was compared to several alternative approaches, was reported in [8] and confirmed this conclusion. However, so far, the evaluation of such confidence measures has involved the estimation and rescoring of N-best hypotheses. Similar work and conclusions (also using N-best rescoring) were also reported using likelihood-ratio rescoring and non-keyword rejection [7].
2.1 KWS without filler models
Let X = {x_1, ..., x_n, ..., x_N} denote the sequence of acoustic vectors in which we want to detect a keyword, and let M be the HMM model of a keyword, consisting of L states Q = {q_1, q_2, ..., q_l, ..., q_L}. Assuming that M is matched to a subsequence X_b^e = {x_b, ..., x_e} (1 ≤ b < e < N) of X, and that we have an implicit (not modeled) garbage/filler state q_G preceding and following X_b^e, we define (approximate) the log posterior of a model M given a subsequence X_b^e as the average posterior probability along the optimal path, i.e.:

$$-\log P(M|X_b^e) \approx \frac{1}{e-b+1} \min_{Q} \Big\{ -\log P(q_b|q_G) - \sum_{n=b}^{e} \big[ \log P(q_n|x_n) + \log P(q_{n+1}|q_n) \big] \Big\} \qquad (1)$$

where Q = {q_b, q_{b+1}, ..., q_e} represents one of the possible paths of length (e − b + 1) in M, and q_n the HMM state visited at time n along Q, with q_n ∈ Q. In this expression, q_G represents the "garbage" (filler) state, which is simply used here as the non-emitting initial and final state of M. The transition probabilities P(q_b|q_G) and P(q_G|q_e) can be interpreted as the keyword entrance and exit penalties, as optimized in [3], but these have not been optimized here. In our case, local posteriors P(q_l|x_n) were estimated as output values of a multilayer perceptron (MLP) used in a hybrid HMM/ANN system [2].
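The per-frame normalization in (1) can be sketched as follows; the data layout (a list of per-frame posterior dictionaries and a transition-probability dictionary) and the function name are assumptions of this sketch, and the entrance/exit penalties are ignored:

```python
import math

def avg_neg_log_posterior(path, local_post, trans):
    """Average -log posterior along a fixed state path q_b..q_e (eq. 1).
    path[n]       : state visited for the n-th frame of the segment
    local_post[n] : dict state -> P(q | x_n)
    trans         : dict (q, q') -> P(q' | q)
    Entrance/exit penalties are omitted for simplicity."""
    total = 0.0
    for n, q in enumerate(path):
        total -= math.log(local_post[n][q])              # -log P(q_n | x_n)
        if n + 1 < len(path):
            total -= math.log(trans[(q, path[n + 1])])   # -log P(q_{n+1} | q_n)
    return total / len(path)                             # normalize by e - b + 1
```

The division by the path length is what makes segments of different lengths comparable, and is also what breaks plain DP later on.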
For a specific sub-sequence Xξ, expression (1) can easily be estimated by dynamic programming since the sub-sequence and the associated normalizing factor (e — b+ 1) are given. However, in the case of keyword spotting, this expression should be estimated for all possible begin/endpoint pairs {b, e} (as well as for all possible word models), and we define the matching score of X on M as:
$$S(M, X) = \min_{\{Q, b, e\}} \frac{-\log P(Q|X_b^e)}{e - b + 1} \qquad (2)$$

where the optimal begin/endpoints {b*, e*}, and the associated optimal path Q*, are the ones yielding the lowest average local posterior:

$$(Q^*, b^*, e^*) = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-\log P(Q|X_b^e)}{e - b + 1} \qquad (3)$$
Of course, in the case of several keywords, all possible models will have to be evaluated.
As shown in [1, 8], a double averaging involving the number of frames per phone and the number of phones will usually yield slightly better performance:
$$(Q^*, b^*, e^*) = \operatorname*{argmin}_{\{Q, b, e\}} \frac{1}{J} \sum_{j=1}^{J} \frac{1}{e_j - b_j + 1} \sum_{n=b_j}^{e_j} -\log P(q_j|x_n) \qquad (4)$$

where J represents the number of phones in the hypothesized keyword model, q_j the hypothesized phone for input frame x_n, and [b_j, e_j] the frame range matched to phone q_j.
However, given the time normalization and the relaxation of begin/endpoints, straightforward DP is no longer optimal and has to be adapted, usually involving more memory and CPU. A new (and simple) solution to this problem is proposed in Section 3.1.
2.2 Filler-based KWS
Although various solutions have been proposed towards the direct optimization of (2) as, e.g., in [4, 9], most of the keyword spotting approaches today prefer to preserve the optimality and simplicity of Viterbi DP by modeling the complete input [5] and explicitly [6] or implicitly [3] modeling non-keyword segments by using so-called filler or garbage models as additional reference models. In this case, we assume that non-keyword segments are modeled by extraneous garbage models/states q_G (and grammatical constraints ruling the possible keyword/non-keyword sequences).
Let us consider only the case of detecting one keyword per utterance at a time. In this case, the keyword spotting problem amounts to matching the whole sequence X of length N onto an extended HMM model M' consisting of the states {q_G, q_1, ..., q_L, q_G}, in which a path (of length N) is denoted Q = {q_G, ..., q_G, q_b, q_{b+1}, ..., q_e, q_G, ..., q_G}, with (b − 1) garbage states q_G preceding q_b and (N − e) garbage states q_G following q_e, respectively emitting the vector sequences X_1^{b-1} and X_{e+1}^N associated with the non-keyword segments.
Given some estimation of P(q_G|x_n) (e.g., using probability density functions trained on non-keyword utterances), the optimal path Q* (and, consequently, b* and e*) is then given by:

$$Q^* = \operatorname*{argmin}_{\forall Q \in M'} -\log P(Q|X) = \operatorname*{argmin} \Big\{ -\log P(Q|X_b^e) - \sum_{n=1}^{b-1} \log P(q_G|x_n) - \sum_{n=e+1}^{N} \log P(q_G|x_n) \Big\} \qquad (5)$$

which can be solved by straightforward DP (since all paths have the same length). The main problem of filler-based keyword spotting approaches is then to find ways to best estimate P(q_G|x_n) in order to minimize the error introduced by the approximations. In [3], this value was defined as the average of the N best local scores while, in other approaches, this value is generated from explicit filler HMMs. However, these approaches will usually not lead to the "optimal" solution given by (2).
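Under the simplifying assumptions of a strictly left-to-right keyword model and a constant garbage cost ε per frame, the filler-based decoding of (5) reduces to a single Viterbi pass; the following sketch (all names hypothetical) illustrates this:

```python
def filler_viterbi(neg_log_post, eps):
    """Viterbi over the extended model {qG, q1..qL, qG} of eq. (5).
    neg_log_post[n][l] = -log P(q_l | x_n); each frame is either absorbed
    by the garbage state (cost eps) or matched left-to-right to the keyword.
    Returns the best total -log cost over all begin/endpoint choices."""
    N, L = len(neg_log_post), len(neg_log_post[0])
    INF = float("inf")
    g_pre = 0.0            # cost of staying in the leading garbage state
    cost = [INF] * L       # cost[l] = best path currently in keyword state l
    done = INF             # best path already in the trailing garbage state
    for n in range(N):
        new = [INF] * L
        for l in range(L):
            stay = cost[l]                         # self-loop in state l
            enter = g_pre if l == 0 else cost[l - 1]  # enter from left
            new[l] = min(stay, enter) + neg_log_post[n][l]
        done = min(done, cost[L - 1]) + eps        # trailing garbage eats frame n
        g_pre += eps                               # leading garbage eats frame n
        cost = new
    return min(done, cost[L - 1])
```

Because every path covers all N frames, the per-frame garbage cost ε plays exactly the role of −log P(q_G|x_n) in (5).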
3 Disclosure of Invention
3.1 Iterating Viterbi Decoding (IVD)
In the following, we show that it is possible to define an iterative process, referred to as Iterating Viterbi Decoding (IVD), with good/fast convergence properties, estimating the value of P(q_G|x_n) such that straightforward DP (5) yields exactly the same segmentation (and recognition results) as (3). While the same result could be achieved through a modified DP in which all possible combinations (all possible begin/endpoints) would be taken into account, it is possible to show that the algorithm proposed below is more efficient (in terms of both CPU and memory requirements).
Here, we use a similar scoring technique for keyword spotting without an explicit filler model. Compared to previously devised "sliding model" methods (such as [4, 9]), the first algorithm proposed here is based on:
1. A matching score defined as the average observation posterior along the most likely state sequence. It is indeed believed that local posteriors (or likelihood ratios, as in [7]) are more appropriate to the task.
2. The iteration of a Viterbi decoding algorithm, which does not require scoring for all begin/endpoints or N-best rescoring, and which can be proved to (quickly) converge to the "optimal" (from the point of view of the chosen scoring functions) solution without requiring any specific filler models, using straightforward Viterbi alignments (similar to regular filler-based KWS, but at the cost of a few iterations).
3.2 IVD: Description
The IVD algorithm is based on the same criterion as the filler-based approaches (5), but rather than looking for explicit (and empirical) estimates of P(q_G|x_n), we aim at mathematically estimating its value (which will be different and adapted to each utterance) such that solving (5) is equivalent to solving (3). Thus, we perform an iterative estimation of P(q_G|x_n), such that the segmentation resulting from (5) is the same as what would be obtained from (3). Defining ε = −log P(q_G|x_n),
the proposed algorithm can be summarized as follows:
1. Start from an initial value ε_0 = ε. It can be proven that the iterative process presented here always converges to the same solution (in more or fewer cycles, with a worst-case upper bound of N iterations) independently of this initialization, e.g., with ε equal to a cheap estimate of the score of a "match". In the experiments reported below, ε was initialized to −log of the maximum of the local probabilities P(q_l|x_n) for each frame x_n.

An alternative choice could be to initialize ε_0 to a pre-defined score that expression (1) should reach in order to declare a keyword "matching" (see point 4 below). In this last case, if ε increases at the first iteration, then we can (as proven) directly infer that the match will be rejected; otherwise it will be accepted.

2. Given the current estimate ε_t at iteration t, find the optimal path (Q_t, b_t, e_t) according to (5), matching the complete input.
3. Update (t = t+1) the estimated value of εt, defined as the average of the local posteriors along the optimal path Qt (matching the Xb e t t resulting of (5) on the keyword model) i.e.:
4. Return to (2) and iterate until convergence. If we are not interested in the optimal segmentation, this process could also be stopped as soon as ε reaches a (pre-defined) minimum threshold below which we can declare that a keyword has been detected.
Correctness and convergence proof of this process and generalization to other criteria, are available: each IVD iteration (from the second iteration) will decrease the value of εt, and the final path yields the same solution than (3) .
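The iteration loop can be sketched as follows, with the filler decoding of (5) abstracted behind a `decode(eps)` callback (a hypothetical interface returning the matched segment's summed −log posterior and its length):

```python
def ivd(eps0, decode, max_iter=50):
    """Iterating Viterbi Decoding (sketch): re-estimate the per-frame
    garbage cost eps until the filler decoding of (5) stops changing.
    `decode(eps)` must return (segment -log cost, segment length) of the
    optimal path under the current eps; both names are assumptions."""
    eps = eps0
    for _ in range(max_iter):
        seg_cost, seg_len = decode(eps)   # step 2: Viterbi under current eps
        new_eps = seg_cost / seg_len      # step 3: average along the path
        if abs(new_eps - eps) < 1e-12:    # step 4: converged, (5) solves (3)
            return eps
        eps = new_eps
    return eps
```

At convergence the returned ε equals the optimal length-normalized score of (3), since the decoding no longer changes the segmentation.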
3.3 One-pass keyword spotting
3.3.1 General Description
The above algorithm has a very good experimental convergence speed (3-5 iterations in our tests). However, the worst-case theoretical convergence speed of the process is N. For this reason, a one-step computation is potentially interesting. In the next subsection we show that standard DP cannot be used for solving equation (3).
3.3.2 The Principle of Optimality
Let us define T(M, X) as the DP table of emission probabilities for an utterance X and the states of the hypothesized word W. When solving by standard DP, we would compute for each entry of the table T(M, X), at frame k of X and state s of M, three values: S_ks, L_ks and C_ks, where S_ks corresponds to the sum of the posteriors on the optimal path that leads to the entry, L_ks holds the length of the optimal path computed so far, and C_ks is the estimation of the cost on the optimal expanded path.
By a path leading to an entry T(k, s) we mean a sequence of entries in the table T such that there is exactly one entry for each time frame t ≤ k. At each entry T(k, s), DP selects a locally optimal path, noted P_ks. At each step k, we consider all pairs of entries of table T(M, X) of type T(k, s), T(k − 1, t).
We update, for each such pair, the current cost C_ks (initially ∞) by comparing it with the alternative given by:

$$S_{ks} = S_{(k-1)t} + s_{ks}, \qquad L_{ks} = L_{(k-1)t} + 1, \qquad C_{ks} = \frac{S_{ks}}{L_{ks}} \qquad (7)$$

(where s_ks is the local score of state s at frame k), wanting to have at step k the path P_ks, among the paths P_(k−1)t, that minimizes C_NL. With DP, one would choose the P_ks with minimal C_ks.

In order for the previous computation to be correct, the optimality principle needs to be respected. The optimality principle of Dynamic Programming requires that the path to frame k − 1 that minimizes C_NL also minimizes C_ks for an entry at frame k of table T(M, X). We have proved that expression (7) does not respect the optimality principle of Dynamic Programming.
3.3.3 Pruning with beam search
Dynamic Programming can be viewed as a set of safe prunings that are applied at each entry of the DP table, with the property that only one alternative is maintained. We have thus shown that Dynamic Programming cannot be used, since the principle of optimality is not respected. We therefore try to determine the types of safe pruning that can be done.
We have proved that if at a frame a we have two paths P'_a and P''_a with S''_a ≤ S'_a and L'_a ≤ L''_a, then at no frame c > a will a path P''_c be forsaken for a path P'_c if P'_a ⊂ P'_c, P''_a ⊂ P''_c and P'_c \ P'_a ≡ P''_c \ P''_a. We will note this order relation as P''_a ⪯ P'_a. We have further shown that a path P' may be discarded only for a lower-cost one, P'':

P''_a ⪯ P'_a ⇒ C''_k ≤ C'_k (8)
Thus, algorithm 1 computes S(M, X) and Q* from equation (3).
By ordering the set of paths according to Equation (8), we only need to check line 1.2 of Algorithm 1 up to the eventual insertion place. The last paths are candidates for pruning in line 1.1. In order for the pruning to be acceptable, we will prune only paths that were too long in the last state. An additional counter is needed for storing the state length. This counter is reset when the state is changed and is incremented at each advance by one frame.

procedure OneStep(M, X)
    SetOfPaths(1..N, 1..K) ← ∅
    for frame = 1; frame <= N; frame++ do
        for state = 1; state <= K; state++ do
            for all candidate p ∈ SetOfPaths(frame − 1, 1..K) do
                Add(p, SetOfPaths[frame, state])
            end
        end
    end
    return the best of the candidates in SetOfPaths[N, K]
end

procedure Add(path, set-of-paths)
    for all P ∈ set-of-paths do
        discard the dominated one of {path, P} under the order relation (8)
    end
end

Algorithm 1: One-Step Algorithm
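For a single-state keyword model, the one-pass search can be sketched in a few lines: each frame keeps every non-dominated (sum, length) pair rather than one DP winner, pruning only with the safe rule above (sum no worse and length no shorter); the function name and data layout are assumptions:

```python
def one_step(neg_log_post):
    """One-pass search for the best length-normalized segment cost over a
    single-state keyword (sketch of the Sec. 3.3.3 idea).
    neg_log_post[n] = -log P(q | x_n). A pair (s1, l1) is pruned by
    (s2, l2) only if s2 <= s1 and l2 >= l1 (safe under any common suffix)."""
    best = float("inf")
    front = []                       # non-dominated (sum, length) pairs
    for c in neg_log_post:
        # extend every surviving path with this frame, or start a new segment
        cand = [(s + c, l + 1) for s, l in front] + [(c, 1)]
        front = [p for p in cand
                 if not any(q[0] <= p[0] and q[1] >= p[1] and q != p
                            for q in cand)]
        best = min([best] + [s / l for s, l in front])
    return best
```

The pruning is safe because a path with smaller sum and greater length keeps a smaller average under any identical extension, which is exactly the dominance relation (8).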
3.4 One-pass confidence-based keyword spotting

3.4.1 The Method of Double Normalization
The corresponding confidence measure is defined as:

$$\frac{1}{N_{VP}} \sum_{p \in VP} \frac{\sum_{x_t \in p} -\log(p_{st})}{\mathrm{length}(p)} \qquad (9)$$

where N_VP stands for the number of visited phonemes and VP stands for the set of visited phonemes. An average is computed over all posteriors p_st of the emission probabilities for the time frames matched to the visited phoneme p. The function length(p) gives the number of time frames matched against p.
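Given a hypothesized segmentation, the double normalization of (9) can be sketched as follows (input format assumed: one list of frame posteriors per visited phoneme; the function name is ours):

```python
import math

def double_normalized_score(phones):
    """Double normalization of eq. (9) (sketch): average over phones of
    the per-phone average -log posterior, so long phones do not dominate
    the score. `phones` is a list of lists of emission posteriors."""
    per_phone = [sum(-math.log(p) for p in ph) / len(ph) for ph in phones]
    return sum(per_phone) / len(per_phone)
```

Compared to the simple normalization of (3), each phone contributes equally regardless of how many frames it absorbed.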
This method consists of a breadth-first Beam Search algorithm. It relies on a set of reduction rules and certain normalizations:
For the state q_G, in this method, the logarithm of the emission posterior is equal to zero. For each frame e and for each state s, the set of paths/probabilities of having frame e in state s is computed as the first N maxima (N can be finite) of the confidence measure over all paths in HMM M of length e and ending in state s. The paths that, according to the reduction rules, will lose the final race when compared with another already-known path are deleted as well.
We note a_1, p_1, l_1, a_2, p_2 and l_2 the confidence measure for the previous phonemes, the posterior in the current phoneme and the length in the current phoneme for the path Q_1, respectively the path Q_2. The rules that may be used for the reduction of the search space by discarding a path Q_1 for a path Q_2 are in this case any of the following:

1. l_2 ≥ l_1, A > 0, B < 0 and L_c²A + L_cB + C > 0

2. l_2 ≥ l_1, A > 0, B ≥ 0 and C > 0

3. l_2 ≥ l_1, A < 0, C > 0 and L²A + LB + C > 0

4. l_2 > l_1, A = 0, B < 0 and LB + C > 0

where A = a_1 − a_2, B = (a_1 − a_2)(l_1 + l_2) + p_1 − p_2, C = (a_1 − a_2) l_1 l_2 + p_1 l_2 − p_2 l_1, L = L_max − max{l_1, l_2}, L_c = −B/2A > 0, and L_max is the maximum acceptable length for a phoneme.
By discarding paths only if one of the above rules is satisfied, the optimum defined by the confidence measure with double normalization can be guaranteed, provided that no phone may be skipped in the HMM M. Any HMM may be decomposed into HMMs with this property. The 4th rule is included in the 3rd, and its test is useless if the latter was already checked.
The first test, l_2 ≥ l_1, tells us whether Q_2 has a chance to eliminate Q_1; otherwise we check whether Q_1 eliminates Q_2. These tests were inferred from the conditions of maintaining the final maximal confidence measure while reduction takes place. In order to use the method of double normalization without decomposing HMMs that skip some phonemes, the previous rules are modified taking into account the number of visited phonemes F_1 and F_2 for each path, and the number of phonemes that may follow the current state.
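As a sketch only — the printed rules are partly garbled, so the exact tests below are our reading of them — the discard decision can be implemented as:

```python
def discard_q1_for_q2(a1, p1, l1, a2, p2, l2, L_max):
    """Reduction rules 1-4 of the double-normalization method (sketch;
    the quadratic forms are our reconstruction of the garbled source).
    Returns True if path Q1 may be discarded for Q2."""
    A = a1 - a2
    B = (a1 - a2) * (l1 + l2) + p1 - p2
    C = (a1 - a2) * l1 * l2 + p1 * l2 - p2 * l1
    L = L_max - max(l1, l2)
    if l2 >= l1 and A > 0 and B < 0:          # rule 1: check at vertex Lc
        Lc = -B / (2 * A)
        if Lc > 0 and Lc * Lc * A + Lc * B + C > 0:
            return True
    if l2 >= l1 and A > 0 and B >= 0 and C > 0:   # rule 2
        return True
    if l2 >= l1 and A < 0 and C > 0 and L * L * A + L * B + C > 0:  # rule 3
        return True
    if l2 > l1 and A == 0 and B < 0 and L * B + C > 0:              # rule 4
        return True
    return False
```

The quadratic A·x² + B·x + C captures how the two paths' final double-normalized scores compare for any admissible remaining length x.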
A simplified test may be:
• l_2 ≥ l_1, a_1 ≥ a_2, p_1 ≥ p_2, respectively F_2 ≥ F_1 for the HMMs that skip phonemes. This test is weaker than the 2nd reduction rule. For example, a path is eliminated by a second path if the first one has an inferior confidence measure (higher in value) for the previous phonemes, a shorter length, and the minus of the logarithm of the cumulated posterior in the current phoneme also inferior (higher in value) to that of the second one.
An additional confidence measure, based on the maximal length, L_max, and on the maximum of the minus of the logarithm of the cumulated and normalized posterior in a phoneme, P_max, can be used in order to limit the number of stored paths:
• p > L_max · P_max in any state

• p/l > P_max at the output from a phoneme

where p and l are the values in the current phoneme for the minus of the logarithm of the cumulated posterior and for the length of the path that is discarded. These tests allow for the elimination of paths that are too long without being outstanding, respectively of paths with phonemes having unacceptable scores otherwise compensated by very good scores in other phonemes.
If N is chosen equal to one, the aforementioned rules are no longer needed, and we always propagate the path with the maximal current estimate of the confidence measure. The obtained results are very good, even if the defined optimum is guaranteed for this method only when N is bigger than the length of the sequence allowed by L_max or than that of the tested sequence.
The same approach is valid for the simple normalization, where the HMM of the searched word is grouped into a single phoneme.
3.4.2 The Method of Real Fitting
We have also defined a new confidence measure that represents the exigencies of the recognition differently. Since the phonemes and the absent states can be modeled by the HMMs used, we find it interesting to request the fitting of each phoneme in the model with a section of the sequence. Therefore, we measure the confidence level of a subsequence as the maximum, over all phonemes, of the minus of the logarithm of the cumulated posterior of the phone, normalized by its length:

$$\max_{\text{phoneme} \in \text{VisitedPhonemes}} \frac{\sum_{\text{phoneme}} -\log(\text{posteriors})}{\mathrm{length}(\text{phoneme})} \qquad (10)$$

The rule that may be used in this framework for the reduction of the number of visited paths is:

• Q_2 is discarded in favor of another path Q_1 if the confidence measure of the Real Fitting for the previous phonemes is inferior (higher in value) for Q_2 compared with Q_1, and if p_1 < p_2 and l_2 < l_1, where p_1, l_1, p_2, l_2 represent the minus of the logarithm of the cumulated posterior, respectively the number of frames in the current phoneme, for the path Q_1, respectively Q_2.
Similarly to the previous method, the set of visited paths can be pruned by discarding those for which:

• p > L_max · P_max in any state

• p/l > P_max at the output from a phoneme

where p and l are the values in the current phoneme for the minus of the logarithm of the cumulated posterior and for the length of the path that is discarded. We recall the meaning of the constants: the maximal length L_max, respectively the accepted maximum of the minus of the logarithm of the cumulated and normalized posterior in a phoneme, P_max.
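The real-fitting measure of (10) takes the worst normalized phoneme score; a minimal sketch (input format assumed: a mapping from each visited phoneme to the posteriors of its matched frames; the function name is ours):

```python
import math

def real_fitting_confidence(segments):
    """Confidence measure of eq. (10) (sketch): the worst phoneme decides.
    `segments` maps each visited phoneme to the list of emission
    posteriors of the frames matched to it."""
    return max(
        sum(-math.log(p) for p in posts) / len(posts)  # normalized phone cost
        for posts in segments.values()
    )
```

Because a single badly matched phoneme dominates the score, a good overall average can no longer hide one poorly fitted phone.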
3.5 Conclusions
We have thus proposed a new method for keyword spotting, based on recent advances in confidence measures, using local posterior probabilities, but without requiring the explicit use of filler models.
A new algorithm, referred to as Iterating Viterbi Decoding (IVD), was proposed to solve the above optimization problem with a simple DP process (not requiring the storage of pointers and scores for all possible starting and ending times), at the cost of a few iterations. Three other beam-search algorithms, corresponding to three different confidence measures, were also described.
While the proposed approach allows for an easy generalization to more complex criteria, preliminary results obtained on the basis of 100 keywords (and without any specific tuning) appear to be particularly competitive with alternative approaches.
3.6 The object of the invention consists of:
• Method of recognition of a subsequence using a direct maximization of confidence measures.

• The method of IVD for directly maximizing the confidence measures based on simple normalization.
• The use of the confidence measure and method of recognition named 'Real Fitting', based on individual fitting for each phoneme.
• Methods of recognition using simple and double normalization by:
• combining these measures with the additional confidence measures mentioned here, namely the maximal length and the real-matching limitation.
• The use of the aforementioned methods in keyword recognition.
• The use of the aforementioned methods in subsequence recognition in organic matter (e.g. DNA and protein chains).
• The use of the aforementioned methods in recognition of objects in images.
4 Best Mode for Carrying Out the Invention
Execution: The method is normally executed on a computer, but it can also be implemented in hardware.
1. A representation under the form of an HMM is obtained for the subsequences that are looked for (word, protein profile, section of an image of the object).
2. A tool is obtained (possibly trained, e.g. for speech recognition) for the estimation of the posteriors: for example multi-Gaussians, neural networks, clustering, or databases with Generalized Profiles and mutation matrices (PAM, BLOSUM, etc.).
3. One of the proposed algorithms should be implemented. They yield close performance, but the method of Real Fitting, coupled with a well-verified dictionary, should perform best.
For the first algorithm (IVD)
(a) The classic Viterbi algorithm is implemented with the modification that, for each pair P = (sample, state), one propagates the moments of transition between the state qG and the states of the HMM M for the path that arrives at P. These are inherited from the path that wins the entrance into the pair P, except for the moment when their decision is taken, namely when they receive the index of the corresponding sample.
(b) w = −log P(M|X_b^e) / (e − b + 1) is computed by subtracting, from the cumulated posterior that is returned by the Viterbi algorithm for the path Q1, the value (N − (e − b + 1)) · ε corresponding to the contribution of the states qG, and dividing the result by e − b + 1. The factor e − b + 1 in the previous formula can be factorized outside the fraction.
(c) The initialization of ε is made with an expected mean value. One can use the w that is computed when the state qG is associated with an emission posterior equal to the average of the best K emission probabilities of the current sample, as done in the well-known "garbage on-line model". In this case, K is trained using the corresponding technique.
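The IVD loop of steps (a)–(c) can be sketched on a deliberately simplified model: a single "keyword" state whose per-frame cost is the minus-log posterior, with a Kadane-style minimum-sum pass standing in for the Viterbi decoding at a fixed filler score ε. Function and variable names are hypothetical; the re-estimation ε ← w is the point being illustrated:

```python
def ivd_best_segment(costs, eps0=None, max_iter=50, tol=1e-9):
    """Toy Iterative Viterbi Decoding on a single-state 'keyword' model.

    costs[t] is the minus-log posterior of frame t under the keyword model.
    Filler (garbage) frames are scored with eps; each iteration re-estimates
    eps as the previous iteration's normalized confidence w.
    Returns (b, e, w): segment bounds (inclusive) and its mean cost.
    """
    n = len(costs)
    eps = sum(costs) / n if eps0 is None else eps0  # init with an expected mean value
    best = None
    for _ in range(max_iter):
        # Minimum-sum contiguous segment of (costs - eps): this Kadane-style DP
        # plays the role of the Viterbi pass at fixed filler score eps.
        best_sum, best_b, best_e = costs[0] - eps, 0, 0
        cur_sum, cur_b = best_sum, 0
        for t in range(1, n):
            v = costs[t] - eps
            if cur_sum > 0:
                cur_sum, cur_b = v, t       # restart the segment at frame t
            else:
                cur_sum += v                # extend the current segment
            if cur_sum < best_sum:
                best_sum, best_b, best_e = cur_sum, cur_b, t
        # Normalized confidence of the winning segment (step (b)).
        w = sum(costs[best_b:best_e + 1]) / (best_e - best_b + 1)
        if best is not None and abs(w - best[2]) < tol:
            return best_b, best_e, w
        best = (best_b, best_e, w)
        eps = w  # step (c): next filler emission score = current confidence
    return best
```

Each iteration re-runs the decoding with the filler score set to the previous confidence, and the procedure stabilizes on the segment with minimal mean cost.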
The next 'beam search' algorithms are implemented according to the description in the corresponding sections. For each pair P = (sample, state), one computes, for each corresponding path, the sum and the length in the last phoneme, as well as the sum over the normalized cumulated posteriors of the previous phonemes (and their number). Also, the entrance and exit samples into the HMM M are computed and propagated as in the previous method, in order to ensure the localization of the subsequence.
4. If one searched entity (keyword, sequence, object) can have several HMM models, all of them are taken into consideration as competitors. This is the case for words with several pronunciations (or for objects that have different structures in different states, for recognition in images).
After the computation of the confidence measure for each model of the subsequences, one eliminates those with a confidence measure that fails a 'threshold' trained for the configuration and the goal of the given application. For example, for speech recognition with neural networks and the minus of the logarithm of the posteriors, the 'threshold' is chosen at the desired point of the ROC curve obtained in tests.
5. The remaining alternatives are extracted in the order of their confidence measure, with the elimination of the conflicting alternatives, until exhaustion. Each time an alternative is eliminated, the searched entity with the corresponding HMM is re-estimated for the remaining sections of the sequence in which the search is performed. If the new confidence measure passes the test of the 'threshold', then it is inserted at the position corresponding to its score in the queue of alternatives.
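Step 5 can be sketched as a priority queue over scored alternatives, with a hypothetical `reestimate` callback standing in for the re-run of the search on the sections not yet claimed (lower score = better, i.e. minus-log confidence):

```python
import heapq

def extract_alternatives(candidates, threshold, reestimate):
    """Greedy extraction sketch (hypothetical interfaces).

    candidates: list of (score, begin, end, model), lower score = better.
    reestimate(model, occupied_spans) -> (score, begin, end) or None re-runs
    the search for a model outside the spans already claimed by accepted hits.
    """
    heap = list(candidates)
    heapq.heapify(heap)
    accepted = []
    while heap:
        score, b, e, model = heapq.heappop(heap)
        if score > threshold:
            break  # the queue is ordered: nothing better remains
        if any(not (e < ab or b > ae) for _, ab, ae, _ in accepted):
            # Conflicts with an accepted span: re-estimate on what is left
            # and, if it still passes the threshold, re-queue it by score.
            redo = reestimate(model, [(ab, ae) for _, ab, ae, _ in accepted])
            if redo is not None and redo[0] <= threshold:
                heapq.heappush(heap, (redo[0], redo[1], redo[2], model))
            continue
        accepted.append((score, b, e, model))
    return accepted
```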
6. The successful alternatives can undergo tests at superior levels, for example a confirmation question for speech recognition, the opinion of an operator, etc.
7. For object recognition in images:

Posteriors are obtained by computing a distance between the color of the model and that of each element in the section of the image. If the context requires it, the image is preprocessed to ensure a certain normalization (e.g. changing lighting conditions make necessary a transformation based on the histogram).
The phonemes of speech recognition correspond to parts of the object. The structure (the existence of transitions and their probabilities) can be modified as a function of the characteristics detected along the current path. For example, after detecting regions of the object with certain lengths, one can estimate the expected length of the remaining regions. Thus, the number of expected samples for the future states can be established and the HMM attached to the object configured accordingly.
One direction is scanned for the detection of the best fitting; afterwards, other directions are scanned for discovering new fittings, as well as for testing the previous ones. The final test can be certified by classical methods such as cross-correlation or by the analysis of the contours in the hypothesized position.
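One possible realization of the color-distance posterior of step 7 (the Gaussian kernel and the `sigma` tolerance are assumptions, not prescribed by the text):

```python
import math

def color_posterior(model_rgb, pixel_rgb, sigma=25.0):
    """Turn a squared Euclidean distance in RGB space into a pseudo-posterior
    via a Gaussian kernel; sigma is an illustrative tolerance parameter."""
    d2 = sum((m - p) ** 2 for m, p in zip(model_rgb, pixel_rgb))
    return math.exp(-d2 / (2.0 * sigma ** 2))
```

An exact color match yields posterior 1.0, and the score decays smoothly with distance, which is what the HMM emission model needs.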
5 Industrial Applicability
Here we present some examples of applications of the proposed method in industry:
• The recognition of keywords is beginning to be used in the answering machines of banking systems, as well as in telephone services and automated systems for control, sales or information. The method offers the possibility to recognize keywords in spontaneous speech with multiple speakers.
• The recognition of DNA sequences is important for the study of the human genome. One of the biggest problems of the involved techniques is the high quantity of data that has to be processed.

• The recognition of objects in images is used, among others, in cartography and in the coordination of industrial robots. The method allows a quick estimation of the position of the objects in scenes and can be validated with extra tests, using classical methods of cross-correlation.

Claims

Independent Claim 1.
Preamble: Recognizes subsequences, represented as Hidden Markov Models (HMM), that are searched for in a given sequence.
We refer to the confidence measures that are used for the re-classification of the winning hypotheses in speech recognition. These are some examples of such measures:
simple normalization: the accumulated posterior, normalized with the length of the subsequence.
double normalization: double normalization of the accumulated posterior, over the number of phonemes and over the number of acoustic samples in each phoneme.
characterized by: It allows the additional confidence measure based on the extremes of the values of the logarithm of the accumulated posterior in each phoneme, normalized with its length. We call this measure 'real fitting'.
max_{phoneme ∈ Visited Phonemes} [ −Σ_phoneme log(posteriors) / (phoneme length) ]

characterized by: It searches the subsequences that offer the maximization of one of the mentioned confidence measures, over all possible matchings.

characterized by: It allows the re-evaluation of the alternatives that offer the highest of any mentioned confidence measure on the basis of another confidence measure.

characterized by: It computes the alternative that maximizes the 'simple normalization' by using the method that we have called 'Iterative Viterbi Decoding', which estimates the emission probability of the filler states, in an iterative manner, as being equal to the confidence measure of the previous iteration.

characterized by: It computes the alternative that maximizes the 'simple normalization', 'double normalization' or 'real fitting' using an algorithm that considers the emission probability of the filler state as zero. This method computes progressively, for each pair of sample and state of the HMM, a set of possible alternative paths that reach it. The computation of this set is based on the sets of paths that lead to the states that can be associated with the previous sample.
This set can be reduced by using the given appropriate rules for the given confidence measure, ensuring the correctness of the inference.
This set can also be reduced by using heuristics based on the aforementioned rules, for speeding up the computation despite the risk of reducing the theoretical quality of the recognition.
Dependent Claim 2.
Preamble: It is based on the Claim 1.
It estimates the existence of keywords and their position in utterances.
characterized by: It uses the methods described in Claim 1, for recognition of subsequences represented by Hidden Markov Models.
Dependent Claim 3.
Preamble: It is based on the Claim 1.
It estimates the existence of biomolecular subsequences and their position in the chains of DNA using models like generalized profiles.
characterized by: The estimation of their existence and position is made according to the methods described in the Claim 1, for recognition of subsequences represented by Hidden Markov Models.
Dependent Claim 4.
Preamble: It is based on the Claim 1. It carries out the estimation of the existence of objects and their position in images.
characterized by: It uses the methods described in Claim 1, for the recognition of subsequences represented by Hidden Markov Models (HMM).

characterized by: Sections through views of virtual objects are modeled by sets of Hidden Markov Models.

characterized by: It uses a probabilistic model based on a distance computed between colors.

characterized by: The Hidden Markov Models that model the objects can be structured into distinct regions that play, within the method, the role of the phonemes.

characterized by: The models of the objects can be modified in a dynamic manner with respect to the transition properties (existence and probability), on the basis of the information accumulated during the fitting process.
PCT/IB2000/000189 1999-02-25 2000-02-22 Speech recognition and signal analysis by straight search of subsequences with maximal confidence measure WO2000051107A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RO9900214 1999-02-25
RO99/00214 1999-02-25

Publications (1)

Publication Number Publication Date
WO2000051107A1 true WO2000051107A1 (en) 2000-08-31

Family

ID=20107220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2000/000189 WO2000051107A1 (en) 1999-02-25 2000-02-22 Speech recognition and signal analysis by straight search of subsequences with maximal confidence measure

Country Status (1)

Country Link
WO (1) WO2000051107A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402984A (en) * 2011-09-21 2012-04-04 哈尔滨工业大学 Cutting method for keyword checkout system on basis of confidence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5535305A (en) * 1992-12-31 1996-07-09 Apple Computer, Inc. Sub-partitioned vector quantization of probability density functions
US5638489A (en) * 1992-06-03 1997-06-10 Matsushita Electric Industrial Co., Ltd. Method and apparatus for pattern recognition employing the Hidden Markov Model
US5764851A (en) * 1996-07-24 1998-06-09 Industrial Technology Research Institute Fast speech recognition method for mandarin words




Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 09647300

Country of ref document: US

122 Ep: pct application non-entry in european phase