CA1203906A

CA1203906A - Variable frame length vocoder

Info

Publication number: CA1203906A
Application number: CA000439473A
Authority: CA
Inventors: Tetsu Taguchi
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1982-10-21
Filing date: 1983-10-21
Publication date: 1986-04-29
Also published as: US4701955A

Abstract

ABSTRACT OF THE DISCLOSURE
A variable frame length vocoder comprising means for obtaining a feature vector from an input speech signal at every given frame, means for extracting the feature vectors in a given section having a predetermined number of frames, means for approximating a change in said feature vectors in said given section with a given number of flat sections, variable in section length and separated from neighboring flat sections by a constant time length portion, and inclined sections connecting said neighboring flat sections with inclined lines in said constant time length portion, means for outputting the feature vector of a given frame in each flat section as a representative vector of said flat section, and means for outputting the number of frames present in said flat section as a repeat signal.

Description


VARIABLE FRAME LENGTH VOCODER

BACKGROUND OF THE INVENTION

This invention relates to a variable frame length vocoder, and more particularly to improvements in the dynamic characteristic of the synthesis filter and in the compression of the data rate.
An LSP vocoder, which uses the so-called LSP (Line Spectrum Pair) parameters as speech spectrum information, is highly regarded because the LSP parameters themselves can be handled intuitively as frequency domain information and because high quality synthesized speech is obtainable with a small amount of information. Its principle and examples are given in detail in the paper by Fumitada Itakura et al. entitled "A HARDWARE IMPLEMENTATION OF A NEW NARROW TO MEDIUM BAND SPEECH CODING", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1982, pp. 1964 to 1967.
The time change of a parameter indicating the spectrum information of speech, such as the LSP parameter, is relatively gentle but not necessarily uniform. For example, while the parameter changes abruptly at a transition between a vowel and a consonant, it changes extremely gently in a voiced sound part. Consequently, by changing the frame length in accordance with the time change characteristic of the parameters, further information compression is attainable as compared with a vocoder having a fixed frame length. A vocoder of this type is called a variable frame length vocoder; it is proposed in the paper by John M. Turner and Bradley W. Dickinson entitled "A VARIABLE FRAME LENGTH LINEAR PREDICTIVE CODER", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1978, pp. 454 to 457, and in the report by Katsunobu Fushikida, "A VARIABLE FRAME RATE SPEECH ANALYSIS-SYNTHESIS METHOD USING OPTIMUM SQUARE WAVE APPROXIMATION", Acoustics Institute of Japan, May 1978, pp. 385 to 386.
The variable frame length vocoder proposed in the former paper uses a long frame interval for a portion with gentle change and a short frame interval for a portion with abrupt change, in accordance with the change characteristic of the spectral power envelope. The latter report describes a technique that builds on the former vocoder by using an optimum rectangular approximation based on dynamic programming (DP). In this technique a predetermined number of frames are classified into a plurality of groups so as to minimize the error of the rectangular approximation, and a representative frame is obtained for each group. In either system, however, the parameter changes sharply, in a stepwise manner, between adjacent representative frames, which may cause the following problems.


In the variable frame length vocoder, the spectrum information parameter analyzed on the analysis side is given to the synthesis filter as a filter coefficient, changing the transfer function of the synthesis filter from time to time. The quality of the speech synthesized by the synthesis filter is determined not only by the instantaneous transfer function of the synthesis filter (its static characteristic) but also, to a large extent, by the change in that transfer function (its dynamic characteristic). When the transfer function changes steeply, so that the change is nearly stepwise, a so-called "echo sound" is generated and degrades the quality of the synthesized speech. To suppress the echo sound, the parameters of the representative frame sections obtained on the analysis side are conventionally linearly interpolated to smooth the time change of the parameter, thereby improving the dynamic characteristic of the synthesis filter.
With this method, however, the spectral characteristic of the synthesized speech does not coincide precisely with that of the input speech signal, so that the synthesized speech sounds unnatural.
Among LSP vocoders, moreover, an LSP type pattern matching vocoder is available for further information compression. The concept of such a pattern matching vocoder is disclosed, for example, in the report by HOMER DUDLEY entitled "Phonetic Pattern Recognition Vocoder for Narrow-Band Speech Transmission", THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, Vol. 30, No. 8, August 1958, pp. 733 to 739, and in the report by Raj Reddy and Robert Watkins, "USE OF SEGMENTATION AND LABELING IN ANALYSIS-SYNTHESIS OF SPEECH", International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1977, pp. 28 to 32.
The LSP type pattern matching vocoder selects, from predetermined reference patterns, the reference pattern most similar to an input pattern by collating (matching) the LSP coefficients analyzed by an LSP analyzer with those of the reference patterns, and transmits the selected pattern to the synthesis side together with sound source information.
This method has recently become well known as one capable of further information compression, and such a vocoder can easily be constituted by adding a pattern matching function and a decoding function to an LPC vocoder.
A parameter space distance is employed as a pattern matching measure in the LSP type pattern matching vocoder.
The LSP coefficients can be regarded as a space vector, as in the case of LPC or PARCOR coefficients, and the reference pattern closest to the LSP coefficients of the input speech signal is selected by evaluating the distances. The distance between two items of LSP information, each a space vector, is indicated by a spectral distance $E_{i,j}$ given by the following expression:


$$E_{i,j} = \int_{0}^{\pi} \left| S_i(\omega) - S_j(\omega) \right|^2 \, d\omega \qquad (1)$$

where $S_i(\omega)$ and $S_j(\omega)$ are the logarithmic spectra of frames i and j, expressed as functions of the frequency $\omega$. To select, from a reference pattern group registered beforehand, the reference pattern closest to the spectral envelope of the input speech signal, the spectral distance of expression (1) would have to be calculated for all frames, and the amount of arithmetic would be enormous. Therefore, the spectral distance $E_{i,j}$ given by the following expression (2) is generally used as the matching measure:

$$E_{i,j} = \sum_{k=1}^{S} W_k \left( p_k^{(i)} - p_k^{(j)} \right)^2 \qquad (2)$$

where $p_k^{(i)}$ and $p_k^{(j)}$ are the k-th elements of the S-dimensional LSP coefficient vectors of frames i and j, respectively, and $W_k$ is a weighting coefficient, the LSP spectral sensitivity, determined for each LSP coefficient $p_k$.
The degree of the LSP coefficients corresponds to the degree of the all-pole digital filter constituting the vocal tract filter realized by the LSP coefficients.
In an all-pole digital filter of degree S, S line spectrum frequencies, called LSP frequencies, are used. The LSP spectral sensitivity $W_k$ indicates the degree of spectral change caused by an infinitesimal change in the k-th of the S LSP coefficients; in practice, an LSP frequency spectral sensitivity determined from the LSP frequency is normally used.
The distance calculation of expression (2) is carried out by taking, for each degree k, the difference between the k-th LSP coefficient $p_k^{(i)}$ of the space feature vector of the analyzed input speech signal and the k-th coefficient $p_k^{(j)}$ of the space feature vector registered as the reference pattern, squaring it, multiplying it by the weight $W_k$ predetermined for the LSP frequency corresponding to that degree, and summing the results over all degrees.
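The weighted distance of expression (2) can be sketched in a few lines of Python. This is an illustration only: the function name `lsp_distance`, the toy 4th-order vectors and the weight values are assumptions, and in the vocoder the weights $W_k$ would be LSP frequency spectral sensitivities, which this sketch does not compute.

```python
from typing import Sequence


def lsp_distance(p_i: Sequence[float], p_j: Sequence[float],
                 w: Sequence[float]) -> float:
    """Weighted spectral distance E_{i,j} of expression (2).

    p_i, p_j: S-dimensional LSP coefficient vectors of frames i and j.
    w: per-degree weights W_k (assumed to be given, e.g. spectral sensitivities).
    """
    if not (len(p_i) == len(p_j) == len(w)):
        raise ValueError("vectors and weights must all have dimension S")
    # Square the difference for each degree k, weight it by W_k, and sum over k.
    return sum(wk * (a - b) ** 2 for a, b, wk in zip(p_i, p_j, w))


# Hypothetical 4th-order example (S = 4), LSP frequencies in radians.
p_in = [0.30, 0.55, 1.20, 1.90]
p_ref = [0.32, 0.50, 1.25, 1.85]
weights = [1.0, 1.0, 0.5, 0.5]
print(lsp_distance(p_in, p_ref, weights))
```

In pattern matching, this distance would be evaluated against every registered reference pattern and the one with the smallest value selected.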
As described above, the conventional distance calculation of expression (2) uses as the weighting coefficient $W_k$ an LSP frequency spectral sensitivity determined by the LSP frequency alone. However, it has been confirmed that the LSP frequency spectral sensitivity also depends on the LSP frequency interval. Therefore, a spectral distance calculated simply according to expression (2) is not satisfactory as a matching measure and degrades the quality of the synthesized voice.

SUMMARY OF THE INVENTION
An object of this invention is, therefore, to provide a variable frame length vocoder capable of producing synthesized speech that sounds more natural.

Another object of this invention is to provide a vocoder in which information can be further compressed.
In accordance with the present invention, a variable frame length vocoder comprises, on an analysis side, means for obtaining a feature vector from an input speech signal at every given time length (frame) and for extracting and storing the feature vectors in a given section having a predetermined number of frames. The vocoder is characterized in that the change in the feature vectors in the given section is approximated with a given number of flat sections, variable in section length and separated from neighboring flat sections by a constant time length portion, and inclined sections connecting the neighboring flat sections with inclined lines in the constant time length portion; the feature vector of a given frame in each flat section is output as the representative vector of that flat section, and the number of frames present in the flat section is output as a repeat signal.
The other objects and features of the present invention will become more apparent from the following description when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1A and Fig. 1B illustrate the principle of the present invention;


Fig. 2 is a diagram explaining procedures to determine the representative frames and frame intervals;
Fig. 3 is a block diagram of one embodiment of the present invention;
Fig. 4A and Fig. 4B are partial block diagrams of the vocoder according to another embodiment of the present invention; and
Figs. 5 and 6 are partial block diagrams of the vocoder according to the present invention.

The time change characteristic of the speech waveform varies from speaker to speaker, and also from utterance to utterance of the same speaker. However, this variation is caused chiefly by changes in the duration of the steady part of a speech sound.
The duration of a consonant portion and of the transition portion between a consonant and a vowel is comparatively stable. A portion in which the features of the speech change quickly is, in most cases, such a transition portion, whose length is comparatively constant as mentioned above. Accordingly, in terms of the dynamic characteristic of the LSP synthesis filter, the portions where the transfer function changes abruptly, and which therefore cause problems when no interpolation is carried out, lie in the transition portion in the majority of cases.


In the present invention, a predetermined section of the input speech signal, for example 200 mSEC, is divided on the analysis side into a plurality of inclined sections and a plurality of non-inclined sections. The time length of the transition portion between a consonant and a vowel is assumed to be constant, and the length of each inclined section is made to correspond to this assumed time length. The length of each non-inclined section, on the other hand, is made variable so as to follow the varying duration of the steady speech portion. The predetermined section is thus subjected on the analysis side to an optimum trapezoidal approximation consisting of inclined and non-inclined sections, and on the synthesis side the LSP synthesis filter coefficients, that is, the LSP parameter vectors, are interpolated trapezoidally as required by that approximation.
This invention has the effect that the optimum trapezoidal approximation on the analysis side yields an approximation that closely follows the actual spectral change characteristic of speech, and that, because the LSP synthesis filter coefficients are interpolated on the synthesis side in accordance with this approximation, the spectrum of the synthesized speech coincides well with that of the analyzed speech, giving synthesized speech that sounds more natural. A further effect is that the transfer function of the LSP synthesis filter changes comparatively slowly because of the linear approximation in the inclined sections on the synthesis side, which suppresses the occurrence of the so-called "echo sound".
The segmental optimum trapezoidal approximation will be described next. Fig. 1A is a waveform drawing illustrating the concept of the segmental optimum trapezoidal approximation. In the drawing, a curve R represents an actual change of the LSP parameter vectors, and a trapezoidal stepping segment group A is the result of subjecting the curve R to the optimum trapezoidal approximation. The hatched zone bounded by the curve R and the trapezoidal stepping segment group A represents the spectral distortion arising from the trapezoidal approximation. The optimum trapezoidal approximation consists in obtaining the trapezoidal stepping segment group that minimizes the area of this zone.
Fig. 1B is a waveform drawing illustrating an actual segmental optimum trapezoidal approximation process.
In the drawing, FR(1) to FR(20) denote the LSP parameter vectors of 20 frames analyzed, for example, every 10 mSEC.
The segmental optimum trapezoidal approximation process obtains five frames, and the sections each represented by one of these five frames, that approximate the 20 frames most accurately through the trapezoidal approximation (consisting of inclined sections and flat sections). The inclined section length of the trapezoid is fixed at a constant value, for example 20 mSEC, and the non-inclined section length is variable.
In executing the trapezoidal approximation, the total sum of the distortions along the time axis over the non-inclined sections and the inclined sections is used as the evaluation value for selecting the trapezoidal stepping segment group.
The distortion in an inclined section arises because the LSP parameter vector of each frame included in the inclined section is replaced by the LSP parameter vector obtained through linear interpolation between the two representative frames adjacent to that inclined section. The total sum of distortions in the time direction is obtained for all representative frame candidacies, all section candidacies represented by them, and all inclined sections between two adjacent section candidacies, and the combination that minimizes the total sum is selected as the optimum combination.
In the drawing, the representative frames are the five frames FR(2), FR(5), FR(9), FR(13) and FR(18); the frame sections represented by these representative frames are FR(2) to FR(3), FR(5) to FR(6), FR(8) to FR(10), FR(12) to FR(14) and FR(16) to FR(20); and the frames included in the inclined sections are FR(1), FR(4), FR(7), FR(11) and FR(15).
The total distortion G between the measured parameter curve R and the approximate parameter line A for the frames thus obtained is expressed by the following expression:

$$G = E_1 + E_{2,3} + E_4 + E_{5,6} + E_7 + E_{8,9} + E_{9,10} + E_{11} + E_{12,13} + E_{13,14} + E_{15} + E_{16,18} + E_{17,18} + E_{18,19} + E_{18,20} \qquad (3)$$

where $E_{i,j}$ is the distance defined by expression (2), and $E_k$ is the distance between the actual parameter and the interpolated parameter at the frame FR(k).
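The evaluation value of expression (3) can be sketched for any candidate segmentation as follows. This is a hypothetical illustration rather than the patent's implementation: `total_distortion`, the `(rep, first, last)` section tuples and the one-dimensional toy track are assumptions, and each one-frame inclined section is assumed to take the midpoint of its neighbouring representative vectors, as in expression (5).

```python
from typing import List, Optional, Sequence, Tuple

Vec = Sequence[float]


def dist(a: Vec, b: Vec, w: Vec) -> float:
    """Weighted squared distance of expression (2)."""
    return sum(wk * (x - y) ** 2 for x, y, wk in zip(a, b, w))


def total_distortion(frames: List[Vec], prev_rep: Vec,
                     sections: List[Tuple[int, int, int]], w: Vec) -> float:
    """Expression (3): E_{f,rep} over flat frames plus E_f over inclined frames.

    frames[0] is FR(1). sections holds (rep, first, last) 1-based frame numbers
    for each flat section in time order; every frame not covered is inclined.
    """
    owner: List[Optional[int]] = [None] * (len(frames) + 1)
    for s, (rep, first, last) in enumerate(sections):
        if not first <= rep <= last:
            raise ValueError("representative must lie inside its section")
        for f in range(first, last + 1):
            owner[f] = s
    g = 0.0
    for f in range(1, len(frames) + 1):
        s = owner[f]
        if s is not None:
            # Flat frame replaced by its section's representative: E_{f,rep}.
            g += dist(frames[f - 1], frames[sections[s][0] - 1], w)
            continue
        # Inclined frame: midpoint of the representatives on either side (E_f).
        nxt = next((i for i, sec in enumerate(sections) if sec[1] > f), None)
        if nxt is None:
            raise ValueError("the last frame must lie in a flat section")
        left = prev_rep if nxt == 0 else frames[sections[nxt - 1][0] - 1]
        right = frames[sections[nxt][0] - 1]
        g += dist(frames[f - 1], [(a + b) / 2 for a, b in zip(left, right)], w)
    return g


# Toy 1-D track that follows the Fig. 1B segmentation exactly (previous rep = 0.0).
levels = [0.5, 1, 1, 1.5, 2, 2, 2.5, 3, 3, 3, 3.5, 4, 4, 4, 4.5, 5, 5, 5, 5, 5]
frames = [[float(v)] for v in levels]
fig1b = [(2, 2, 3), (5, 5, 6), (9, 8, 10), (13, 12, 14), (18, 16, 20)]
print(total_distortion(frames, [0.0], fig1b, [1.0]))
```

Perturbing any frame away from the trapezoid makes the returned value positive; the DP search described next looks for the segmentation that minimizes it.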
The optimum representative frames, the frame sections they represent, and the inclined sections between adjacent representative frame sections can be obtained efficiently by the dynamic programming technique proposed in the report by Fushikida.
An example will now be described.
Fig. 2 shows the flow of the processing for replacing 20 frames analyzed continuously in time, as shown in Fig. 1B, with the 5 frames that represent them most effectively. (The basic frame period is set at 10 mSEC in this embodiment, so the 20 frames occupy 200 mSEC.) The invention uses the above-mentioned trapezoidal approximation: the non-inclined section length is made variable according to the analyzed frames, and the inclined section is fixed at one frame.
Now, for convenience, let the 20 frames be denoted FR(1), FR(2), ..., FR(20) in order from the earliest in time. In this embodiment, the frame FR(1) is always placed in an inclined section, and the frames FR(2) and FR(20) are always placed in non-inclined sections. In Fig. 2, the numerals ... shown as the 1st FRAME CANDIDACY indicate that the frame candidacies representing the first non-inclined section are the frames FR(2), FR(3), ..., FR(7).
For example, if the frame FR(2) represents the first non-inclined section, the frame FR(1) is replaced by a linear interpolation parameter $\mathbf{P}_{p,2}$ between the parameter $\mathbf{P}_p$ representing the last non-inclined section of the preceding 20 frames and the parameter $\mathbf{P}_2$ of the frame FR(2). The distortion arising from this substitution is denoted G(1,2). Here, the first numeral "1" in the parentheses denotes the first non-inclined section, and the second numeral "2" indicates that the frame representing that section is FR(2). G(1,2) is obtained from the difference between the measured parameter $p_k^{(1)}$ of the frame FR(1) and the interpolation parameter $p_k^{(p,2)}$ through expression (4):

$$G(1,2) = \sum_{k=1}^{S} W_k \left( p_k^{(1)} - p_k^{(p,2)} \right)^2 \qquad (4)$$

Here, $p_k^{(1)}$ is an element of the parameter vector $\mathbf{P}_1 = (p_1^{(1)}, p_2^{(1)}, \ldots, p_S^{(1)})$ of the frame FR(1), and $p_k^{(p,2)}$ is an element of the linear interpolation parameter $\mathbf{P}_{p,2} = (p_1^{(p,2)}, p_2^{(p,2)}, \ldots, p_S^{(p,2)})$ of the parameters $\mathbf{P}_p$ and $\mathbf{P}_2$. Each element of $\mathbf{P}_{p,2}$ is calculated from $\mathbf{P}_p = (p_1^{(p)}, \ldots, p_k^{(p)}, \ldots, p_S^{(p)})$ and $\mathbf{P}_2 = (p_1^{(2)}, \ldots, p_S^{(2)})$ according to the following expression (5):

$$p_k^{(p,2)} = \frac{1}{2} \left( p_k^{(p)} + p_k^{(2)} \right) \qquad (5)$$

$W_k$ in expression (4) is a weighting coefficient.
Similarly, if FR(3) is the frame representing the first non-inclined section, the frame FR(1) is replaced by a linear interpolation parameter $\mathbf{P}_{p,3}$ between the parameter $\mathbf{P}_p$ and the parameter $\mathbf{P}_3$ of the frame FR(3), calculated as in expression (5); and since the frame FR(2) is included in the non-inclined section represented by the frame FR(3), the parameter $\mathbf{P}_2$ is replaced by $\mathbf{P}_3$. The distortion arising from this substitution is therefore given by the following expression (6):


$$G(1,3) = \sum_{k=1}^{S} W_k \left( p_k^{(1)} - p_k^{(p,3)} \right)^2 + \sum_{k=1}^{S} W_k \left( p_k^{(2)} - p_k^{(3)} \right)^2 \qquad (6)$$

Further, if the frame FR(7) represents the first non-inclined section, the frame FR(1) is replaced by a linear interpolation parameter $\mathbf{P}_{p,7}$ between the parameter $\mathbf{P}_p$ and the parameter $\mathbf{P}_7$ of the frame FR(7), calculated as in expression (5); and since the frames FR(2), FR(3), FR(4), FR(5) and FR(6) are included in the non-inclined section represented by the frame FR(7), the parameters $\mathbf{P}_2$, $\mathbf{P}_3$, $\mathbf{P}_4$, $\mathbf{P}_5$ and $\mathbf{P}_6$ are replaced by the parameter $\mathbf{P}_7$. The distortion G(1,7) arising from this substitution is likewise given by the following expression (7):

$$G(1,7) = \sum_{k=1}^{S} W_k \left( p_k^{(1)} - p_k^{(p,7)} \right)^2 + \sum_{FR=2}^{6} \sum_{k=1}^{S} W_k \left( p_k^{(FR)} - p_k^{(7)} \right)^2 \qquad (7)$$

In Fig. 2, the numerals ... shown as the 2nd FRAME CANDIDACY indicate that the candidacies of the frame representing the second non-inclined section are FR(4), FR(5), ..., FR(14).
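Expressions (4), (6) and (7) are instances of one rule for G(1, m): FR(1) is compared with the midpoint of $\mathbf{P}_p$ and $\mathbf{P}_m$, and FR(2) to FR(m-1) are compared with $\mathbf{P}_m$. A minimal sketch, assuming plain Python lists for the vectors; `first_section_cost` and the toy values are hypothetical.

```python
from typing import List, Sequence

Vec = Sequence[float]


def dist(a: Vec, b: Vec, w: Vec) -> float:
    """Weighted squared distance of expression (2)."""
    return sum(wk * (x - y) ** 2 for x, y, wk in zip(a, b, w))


def first_section_cost(frames: List[Vec], prev_rep: Vec, m: int, w: Vec) -> float:
    """G(1, m) of expressions (4), (6) and (7).

    frames[0] is FR(1), always inclined; FR(m), m >= 2, represents the first
    flat section, which at this stage covers FR(2)..FR(m).
    """
    if m < 2:
        raise ValueError("the first representative is FR(2) or later")
    rep = frames[m - 1]
    interp = [(a + b) / 2 for a, b in zip(prev_rep, rep)]  # P_{p,m}, expression (5)
    cost = dist(frames[0], interp, w)                     # inclined frame FR(1)
    for f in range(2, m):                                 # FR(2)..FR(m-1) -> P_m
        cost += dist(frames[f - 1], rep, w)
    return cost


frames = [[1.0], [3.0], [2.0]]  # hypothetical 1-D values for FR(1)..FR(3)
print(first_section_cost(frames, [0.0], 2, [1.0]))  # G(1,2)
print(first_section_cost(frames, [0.0], 3, [1.0]))  # G(1,3)
```

With m = 2 the loop is empty and the function reduces to expression (4); with m = 7 it reproduces the double sum of expression (7).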


For example, assume that FR(4) represents the second non-inclined section. Then the frame representing the first non-inclined section is necessarily FR(2), and FR(3) is included in the inclined section. In Fig. 2, the straight line connecting FR(4) in the 2nd FRAME CANDIDACY with FR(2) in the 1st FRAME CANDIDACY indicates this relation. If FR(4) is the frame representing the second non-inclined section, the distortion G(2,4) arising from the frame substitution due to the selection of FR(4) is obtained through the following expression (8), using G(1,2) given above:

$$G(2,4) = G(1,2) + D_{2,4} \qquad (8)$$

where $D_{2,4}$ is the distortion due to the substitution of the frames FR(2) to FR(4), that is, the substitution of the parameter $\mathbf{P}_3$ of the frame FR(3) by the linear interpolation parameter $\mathbf{P}_{2,4}$ between the parameter $\mathbf{P}_2$ of FR(2) and the parameter $\mathbf{P}_4$ of FR(4).
Next, assuming that the frame FR(5) represents the second non-inclined section, the frames FR(2) and FR(3) are conceivable as frame candidacies to represent the first non-inclined section. The straight lines in Fig. 2 connecting FR(5) in the 2nd FRAME CANDIDACY with FR(2) and FR(3) in the 1st FRAME CANDIDACY represent this relation. When FR(5) is selected as the frame candidacy representing the second non-inclined section, whichever of the frames FR(2) and FR(3) gives the smaller distortion is selected as the frame candidacy representing the first non-inclined section.
The distortion G(2,5) is given by the following expression (9):

$$G(2,5) = \min \left\{ G(1,2) + D_{2,5},\; G(1,3) + D_{3,5} \right\} \qquad (9)$$

where $D_{3,5}$ is a distortion determined in the same way as $D_{2,4}$, and $D_{2,5}$ is the minimum distortion arising from the substitution of the frames FR(2) to FR(5).
This minimum distortion is the smaller of the distortions obtained by the frame substitutions in which the inclined section is taken to be FR(3) or FR(4), respectively; that is, it is the distortion given by the following expression (10):

$$D_{2,5} = \min \left\{ \sum_{k=1}^{S} W_k \left( p_k^{(3)} - p_k^{(2,5)} \right)^2 + \sum_{k=1}^{S} W_k \left( p_k^{(4)} - p_k^{(5)} \right)^2,\; \sum_{k=1}^{S} W_k \left( p_k^{(4)} - p_k^{(2,5)} \right)^2 + \sum_{k=1}^{S} W_k \left( p_k^{(3)} - p_k^{(2)} \right)^2 \right\} \qquad (10)$$

Here, the first term in each alternative on the right side of expression (10) indicates the substitution distortion of the frame FR(3) or FR(4) included in the inclined section, and the second term indicates the distortion arising when the frame FR(4) or FR(3) included in the non-inclined section is replaced by the frame FR(5) or FR(2), respectively. Then, if the frame candidacy representing the second non-inclined section is FR(5), the frame representing the first non-inclined section is determined according to expressions (9) and (10). The section represented by the frame thus determined is also readily determined.
Similarly, if the frame FR(6) is taken as the frame candidacy representing the second non-inclined section, the distortion G(2,6) is given by the following expression (11), as in the case of expression (9):
$$G(2,6) = \min \left\{ G(1,2) + D_{2,6},\; G(1,3) + D_{3,6},\; G(1,4) + D_{4,6} \right\} \qquad (11)$$

$D_{2,6}$ is given by the following expression (12) as the minimum of the distortions arising when the frame placed in the inclined section is, as in expression (10), taken to be FR(3), FR(4) or FR(5):

$$D_{2,6} = \min \left\{ \begin{aligned} & \sum_{k=1}^{S} W_k \left( p_k^{(3)} - p_k^{(2,6)} \right)^2 + \sum_{FR=4,5} \sum_{k=1}^{S} W_k \left( p_k^{(FR)} - p_k^{(6)} \right)^2, \\ & \sum_{k=1}^{S} W_k \left( p_k^{(4)} - p_k^{(2,6)} \right)^2 + \sum_{k=1}^{S} W_k \left( p_k^{(3)} - p_k^{(2)} \right)^2 + \sum_{k=1}^{S} W_k \left( p_k^{(5)} - p_k^{(6)} \right)^2, \\ & \sum_{k=1}^{S} W_k \left( p_k^{(5)} - p_k^{(2,6)} \right)^2 + \sum_{FR=3,4} \sum_{k=1}^{S} W_k \left( p_k^{(FR)} - p_k^{(2)} \right)^2 \end{aligned} \right\} \qquad (12)$$

Here, the first term in each alternative on the right side of expression (12) indicates the substitution distortion of the frame FR(3), FR(4) or FR(5) included in the inclined section, and the remaining terms indicate the distortion arising when (1) FR(4) and FR(5), (2) FR(3) and FR(5), or (3) FR(3) and FR(4), included in the non-inclined sections, are replaced by (1) FR(6), (2) FR(2) and FR(6), or (3) FR(2), respectively.

$D_{3,6}$ and $D_{4,6}$ are determined in the same way as in expressions (10) and (8), respectively.
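Expressions (8), (10) and (12) follow one pattern: between the representatives FR(i) and FR(j), exactly one frame t is inclined and takes the midpoint of $\mathbf{P}_i$ and $\mathbf{P}_j$, the frames before t are replaced by $\mathbf{P}_i$, those after t by $\mathbf{P}_j$, and $D_{i,j}$ is the minimum over t. A minimal sketch of that generalization; the name `between_cost` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def dist(a: Vec, b: Vec, w: Vec) -> float:
    """Weighted squared distance of expression (2)."""
    return sum(wk * (x - y) ** 2 for x, y, wk in zip(a, b, w))


def between_cost(frames: List[Vec], i: int, j: int, w: Vec) -> Tuple[float, int]:
    """Return (D_{i,j}, t): least distortion and the inclined frame t, i < t < j.

    frames[0] is FR(1); i and j are 1-based representative frame numbers.
    """
    if j - i < 2:
        raise ValueError("at least one frame must lie between FR(i) and FR(j)")
    p_i, p_j = frames[i - 1], frames[j - 1]
    mid = [(a + b) / 2 for a, b in zip(p_i, p_j)]
    best = (float("inf"), -1)
    for t in range(i + 1, j):
        cost = dist(frames[t - 1], mid, w)                               # inclined frame
        cost += sum(dist(frames[f - 1], p_i, w) for f in range(i + 1, t))  # -> P_i
        cost += sum(dist(frames[f - 1], p_j, w) for f in range(t + 1, j))  # -> P_j
        if cost < best[0]:
            best = (cost, t)
    return best


# Hypothetical 1-D values for FR(1)..FR(6); FR(1) is unused here.
frames = [[9.0], [0.0], [0.0], [1.0], [2.0], [2.0]]
print(between_cost(frames, 2, 6, [1.0]))  # D_{2,6}: the three cases of (12)
print(between_cost(frames, 2, 4, [1.0]))  # D_{2,4}: the single case of (8)
```

Returning the minimizing t as well as the distortion is a design choice of this sketch: it is what allows the inclined frames, and hence the section lengths, to be recovered after the search.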


When FR(6) is taken as the frame candidacy representing the second non-inclined section, the calculation of $D_{3,6}$ and $D_{4,6}$ together with expressions (11) and (12) determines simultaneously the frame representing the first non-inclined section and the section represented by that frame.
Similarly, when FR(7), FR(8), ..., FR(14) are taken as the frame candidacies representing the second non-inclined section, the distortions G(2,7), G(2,8), ..., G(2,14) of the respective frame substitutions, the frames representing the first non-inclined section, and the sections represented by those frames are determined successively.
Furthermore, for the frame candidacies FR(6), FR(7), ..., FR(16) shown as the 3rd FRAME CANDIDACY in Fig. 2, the distortions G(3,6), G(3,7), ..., G(3,16) of the respective frame substitutions, the corresponding frames representing the second non-inclined section, and the sections represented by those frames are determined successively.
Next, after the 4th FRAME CANDIDACY has been processed in the same way, the distortions G(5,14), G(5,15), ..., G(5,20) corresponding to the frame candidacies FR(14), FR(15), ..., FR(20) representing the fifth (last) non-inclined section, shown as the 5th FRAME CANDIDACY, the corresponding frames representing the fourth non-inclined section, and the sections represented by those frames are determined successively.
Lastly, an optimum frame is determined from among frame candidacies FR(14), FR(15), ..., FR(20) representing the fifth non-inclined section according to the following expression (13):

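$$G_{\min} = \min \left\{ \begin{aligned} & G(5,14) + \sum_{FR=15}^{20} \sum_{k=1}^{S} W_k \left( p_k^{(FR)} - p_k^{(14)} \right)^2, \\ & G(5,15) + \sum_{FR=16}^{20} \sum_{k=1}^{S} W_k \left( p_k^{(FR)} - p_k^{(15)} \right)^2, \\ & \qquad \vdots \\ & G(5,18) + \sum_{FR=19}^{20} \sum_{k=1}^{S} W_k \left( p_k^{(FR)} - p_k^{(18)} \right)^2, \\ & G(5,19) + \sum_{k=1}^{S} W_k \left( p_k^{(20)} - p_k^{(19)} \right)^2, \\ & G(5,20) \end{aligned} \right\} \qquad (13)$$

Here, the second term in each alternative on the right side of expression (13) indicates the distortion arising when the frames FR(15) to FR(20), FR(16) to FR(20), ..., are replaced by the frame candidacies FR(14), FR(15), ..., respectively, representing the fifth non-inclined section.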
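The whole search of expressions (4) to (13) can be combined into one dynamic programming routine. This is a sketch under stated assumptions, not the patent's processor: `segment_dp` and its helpers are hypothetical names, each inclined section holds exactly one frame, and, unlike Fig. 2, no maximum frame interval is imposed, so every feasible candidacy is tried rather than only FR(2) to FR(7) for the first section and so on.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vec = Sequence[float]


def dist(a: Vec, b: Vec, w: Vec) -> float:
    """Weighted squared distance, as in expression (2)."""
    return sum(wk * (x - y) ** 2 for x, y, wk in zip(a, b, w))


def mid(a: Vec, b: Vec) -> List[float]:
    """Midpoint interpolation for a one-frame inclined section, as in (5)."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def segment_dp(frames: List[Vec], prev_rep: Vec, n_sec: int, w: Vec):
    """Return (G_min, representative frame numbers, repeat counts M).

    frames[0] is FR(1), always inclined; FR(I) always lies in the last flat section.
    """
    n_frames = len(frames)

    def P(f: int) -> Vec:  # 1-based access: P(1) is FR(1)
        return frames[f - 1]

    def D(i: int, j: int) -> Tuple[float, int]:
        # Expressions (8), (10), (12): one inclined frame t between FR(i) and FR(j).
        m = mid(P(i), P(j))
        best = (float("inf"), -1)
        for t in range(i + 1, j):
            c = dist(P(t), m, w)
            c += sum(dist(P(f), P(i), w) for f in range(i + 1, t))
            c += sum(dist(P(f), P(j), w) for f in range(t + 1, j))
            if c < best[0]:
                best = (c, t)
        return best

    # G[n][m]: least distortion of FR(1)..FR(m) when FR(m) represents section n + 1.
    G: List[Dict[int, float]] = [{} for _ in range(n_sec)]
    back: List[Dict[int, Tuple[Optional[int], int]]] = [{} for _ in range(n_sec)]
    for m in range(2, n_frames + 1):  # G(1, m), expressions (4) to (7)
        G[0][m] = dist(P(1), mid(prev_rep, P(m)), w) + sum(
            dist(P(f), P(m), w) for f in range(2, m))
        back[0][m] = (None, 1)
    for n in range(1, n_sec):  # G(n + 1, m), expressions (9) and (11)
        for m in range(2, n_frames + 1):
            for mp, g in G[n - 1].items():
                if m - mp < 2:
                    continue
                d, t = D(mp, m)
                if m not in G[n] or g + d < G[n][m]:
                    G[n][m] = g + d
                    back[n][m] = (mp, t)
    # Expression (13): frames after the last representative are replaced by it.
    best_m, g_min = None, float("inf")
    for m, g in G[n_sec - 1].items():
        total = g + sum(dist(P(f), P(m), w) for f in range(m + 1, n_frames + 1))
        if total < g_min:
            best_m, g_min = m, total
    if best_m is None:
        raise ValueError("segment too short for the requested number of sections")
    # Backtrack the representatives and the inclined frame preceding each one.
    reps: List[int] = []
    incl: List[int] = []
    m = best_m
    for n in range(n_sec - 1, -1, -1):
        mp, t = back[n][m]
        reps.append(m)
        incl.append(t)
        m = mp
    reps.reverse()
    incl.reverse()
    ends = [t - 1 for t in incl[1:]] + [n_frames]
    repeats = [e - t for t, e in zip(incl, ends)]  # section spans FR(t+1)..FR(e)
    return g_min, reps, repeats


# Toy 1-D track that follows the Fig. 1B segmentation exactly (previous rep = 0.0).
levels = [0.5, 1, 1, 1.5, 2, 2, 2.5, 3, 3, 3, 3.5, 4, 4, 4, 4.5, 5, 5, 5, 5, 5]
frames = [[float(v)] for v in levels]
g_min, reps, repeats = segment_dp(frames, [0.0], 5, [1.0])
print(g_min, reps, repeats)
```

On this toy track the search recovers the Fig. 1B section lengths (repeat counts 2, 2, 3, 3 and 5) with zero distortion; where several frames of a flat section are identical, any of them may be returned as its representative.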


Frames representing the fifth, fourth, third, second, and first non-inclined sections are determined through the above processing, and section lengths represented by each representative frame are also determined.
In other words, the frames included in the inclined sections are also determined. Thus, the parameter signal $\mathbf{P}$ of the representative frames and a repeat bit signal giving the number M of frames included in the section represented by each of them are obtained.
It is noted here that the setting of FR(2) to FR(7) as the 1st FRAME CANDIDACY and FR(4) to FR(14) as the 2nd FRAME CANDIDACY is determined automatically by limiting the maximum frame interval, and frame candidacies different from those of Fig. 2 can easily be set by choosing the maximum frame interval as desired.
Now, the constitution of the vocoder according to one embodiment of this invention will be described with reference to Fig. 3. Its components may be those used in known vocoders such as the LSP vocoder (disclosed, for example, in the report by Itakura et al.).
An analysis side 302 is constituted of a low-pass filter & A/D converter 303, a window processor 304, an LSP parameter analyzer 305, a sound source analyzer 306, a DP processor 307, an LSP parameter memory 308, and a coder 309. A synthesis side 311 is constituted of a decoder 312, a pulse generator 313, a noise generator 314, a V-UV change-over switch 315, a sound source amplitude regulator 316, an LSP synthesis filter 317, a D/A converter & low-pass filter 318, and an interpolator 319.
A speech signal coming through an input terminal 301 is band-limited by the low-pass filter & A/D converter 303, for example to 3.4 kHz, and is then sampled at 8 kHz and quantized. The sampled signal is supplied to the window processor 304, which multiplies it by a predetermined window function, stores the result temporarily, and outputs it to the LSP parameter analyzer 305 and the sound source analyzer 306 in blocks of 240 samples. A block is produced, for example, every 10 mSEC. The LSP parameter analyzer 305 determines an LSP parameter vector from the speech signal supplied every 10 mSEC by a known technique, such as that given in the report by Itakura et al. mentioned above.
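The block framing just described can be sketched as follows: 240-sample blocks of the 8 kHz signal taken every 10 mSEC, that is every 80 samples, so that consecutive blocks overlap by 160 samples. The text specifies only "a predetermined window function"; the Hamming window here is an assumption, as are the names `windowed_blocks`, `BLOCK` and `HOP`.

```python
import math
from typing import List, Sequence

FS = 8000          # sampling rate in Hz, as in the embodiment
BLOCK = 240        # samples per analysis block, as in the embodiment
HOP = FS // 100    # 10 mSEC frame period -> 80 samples


def hamming(n: int) -> List[float]:
    """Assumed window; the text only calls for a predetermined window function."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_blocks(samples: Sequence[float]) -> List[List[float]]:
    """Cut the sampled signal into overlapping windowed blocks, one per frame."""
    win = hamming(BLOCK)
    blocks = []
    for start in range(0, len(samples) - BLOCK + 1, HOP):
        segment = samples[start:start + BLOCK]
        blocks.append([s * c for s, c in zip(segment, win)])
    return blocks


# 200 mSEC of a 500 Hz tone (1600 samples) as hypothetical input.
x = [math.sin(2 * math.pi * 500 * t / FS) for t in range(1600)]
blocks = windowed_blocks(x)
print(len(blocks), len(blocks[0]))
```

Each block would then go to the LSP parameter analyzer, producing one LSP parameter vector per 10 mSEC frame; the partial block at the end of the input is simply dropped in this sketch.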
The DP processor 307 handles a run of I consecutive LSP parameter vectors (I being 20, for example) from the sequence supplied by the LSP parameter analyzer 305 as one segment, obtains N representative frames (N being 5, for example) through the operations of expressions (4) to (13) above, together with a repeat bit signal indicating the number M of frames present in the non-inclined section represented by each representative frame, and outputs the results to the coder 309. Note that the start frame of a segment is located in an inclined section and the end frame in a non-inclined section.
Consequently, the LSP parameter vector of the N-th representative frame of the preceding segment is needed for the DP operation on the present segment.
The LSP parameter memory 308 temporarily stores the LSP parameter vector of the N-th representative frame of the preceding segment selected by the DP processor 307, and outputs the stored LSP parameter vector during the DP processing of the present segment.
The coder 309 quantizes the N LSP parameter vectors and the repeat numbers M supplied from the DP processor 307, and supplies the quantized signals, together with the sound source information parameters, to the synthesis side 311 through a transmission path 310.
The sound source analyzer 306 extracts pitch information, V-UV information, power information and the like from the voice signal supplied from the window processor 304 according to a known technique, and outputs them to the coder 309.
The decoder 312 decodes the coded LSP parameter vectors and the other parameters, and outputs the pitch information of the sound source information to the pulse generator 313, the V-UV information to the V-UV change-over switch 315, and the power information to the sound source amplitude regulator 316. Through the interpolator 319, the decoder 312 further supplies each LSP parameter vector to the known LSP synthesis filter 317 for the number M of frames of the section it represents, and the interpolator 319 also supplies to the LSP synthesis filter 317 LSP parameter vectors interpolated over the fixed inclined section length.
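On the synthesis side, the decoded representative vectors and repeat numbers M can be expanded back into a per-frame parameter track. The sketch below assumes, as in the embodiment above, that each inclined section holds exactly one frame, which receives the midpoint of the two neighbouring representative vectors; `reconstruct` is a hypothetical name, and any finer interpolation inside the synthesis filter is not modelled.

```python
from typing import List, Sequence

Vec = Sequence[float]


def reconstruct(prev_rep: Vec, reps: List[Vec], repeats: List[int]) -> List[List[float]]:
    """Expand representative vectors and repeat counts M into a per-frame track.

    Each flat section is preceded by one inclined frame that takes the midpoint
    of the previous representative vector and the current one.
    """
    if len(reps) != len(repeats):
        raise ValueError("one repeat count per representative vector")
    track: List[List[float]] = []
    left = list(prev_rep)
    for rep, m in zip(reps, repeats):
        track.append([(a + b) / 2 for a, b in zip(left, rep)])  # inclined frame
        track.extend(list(rep) for _ in range(m))               # flat section, M frames
        left = list(rep)
    return track


track = reconstruct([0.0], [[1.0], [2.0], [3.0], [4.0], [5.0]], [2, 2, 3, 3, 5])
print([v[0] for v in track])
```

The last representative vector of one segment would be passed as `prev_rep` for the next, mirroring the role the LSP parameter memory 308 plays on the analysis side.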
The pulse generator 313 supplies a sequence of pitch pulses based on the pitch information to the V-UV change-over switch 315. The noise generator 314 generates white noise and outputs it to the switch 315.
The switch 315 supplies the output of the pulse generator 313 to the sound source amplitude regulator 316 when the V-UV information indicates a voiced sound, and the output of the noise generator 314 when it indicates an unvoiced sound. The sound source amplitude regulator 316 regulates the amplitude of the signal supplied from the switch 315 according to the power information and outputs the result to the LSP synthesis filter 317 as its sound source signal.
One example of the LSP synthesis filter 317 is shown in Fig. 9.2 and Fig. 9.3 of Paragraph 9.2, "Line Spectrum Pair", of "BASIS OF SOUND INFORMATION" by Shuzo Saito and Kazuo Nakata, published by OHM-SHA on November 30, 1981.


The D/A converter & low-pass filter 318 converts the digital speech signal thus obtained into a continuous (analogue) speech waveform, removes unnecessary frequency components, and outputs the synthesized speech to an output terminal 320.
Next, another embodiment, applied to a pattern matching vocoder using LSP parameters, will be described.
As described above, in a pattern matching vocoder using LSP parameters as the spectrum information of speech, the spectral sensitivity is used as the weighting coefficient Wk to obtain the spectral distance of expression (2). However, it has been confirmed experimentally that the spectral sensitivity also varies with the LSP frequency interval. Therefore, using a weighting coefficient specified as a function of the spectral sensitivity alone degrades the synthesized speech.
Therefore, in this embodiment, more practical pattern matching is achieved by specifying the weighting coefficient as a function not only of the LSP spectral sensitivity but also of the LSP frequency interval, which improves the quality of the synthesized speech. It has further been confirmed that the frequency interval has a marked influence on the weighting coefficient only where the interval is short. The LSP frequency intervals of an analysis frame are therefore checked beforehand when the weighting coefficient is determined, and the frequency interval sensitivity is taken into account only where an interval below a given value is present.
Figs. 4A and 4B are block diagrams of the analysis side and the synthesis side, respectively, of an embodiment of this invention. In the drawings, like members are identified by the same reference numerals as in Fig. 3. The difference from Fig. 3 is that the analysis side has a pattern matching portion which outputs a reference pattern label selected through pattern matching using the LSP parameters obtained from the DP processor 307, comprising a pattern matching processor 410, a reference pattern memory 411, a spectral sensitivity memory 412, a frequency interval memory 413, a minimum length register 414 and a label register 415; and that the synthesis side has a pattern decoder 420 which receives the label decoded by the decoder 312 and outputs to the interpolator 319 the LSP parameters constituting the reference pattern specified by the label, read from a reference pattern memory 421 storing the same contents as the reference pattern memory 411.
The pattern matching portion on the analysis side will now be described in detail with reference to Fig. 4A. The reference pattern memory 411 stores the distribution of standard LSP coefficients of speech, obtained beforehand through LSP analysis of prepared speech data. This operation is normally called "clustering" and is described in particular as "segmentation" in the report by Raj Reddy and Robert Watkins. It is outlined as follows:
First, the prepared speech data is preprocessed through LPC analysis or the like: silent sections are removed, unnecessary neighboring frames are removed, and the frames are classified into voiced sound, unvoiced sound and silence.
In this case, the frame period is set, for example, at 10 ms, and each frame is given a tag code for voiced sound, unvoiced sound, silence, or transition sound between voiced and unvoiced sound.
Next, the silent frames are removed and the remaining frames are separated into voiced and unvoiced sound, the transition sound being included in either or both of them. Furthermore, frames that are close in time and small in spectral distance are removed to reduce the number of necessary samples, and the samples are then classified by spectral distances set beforehand according to a known reference pattern selecting technique, and registered and stored as reference patterns.

In the reference pattern selecting technique mentioned above, suppose that the space U of ten-dimensional LSP coefficients consists of, for example, N patterns in this embodiment. For each of the N patterns, the spectral distance to every pattern is measured and the number Mi (i = 1, 2, ..., N) of patterns whose distance lies below a spectral distance threshold set beforehand is obtained; the pattern PL with the maximum Mi is then determined. The patterns whose spectral distance from PL falls below the preset threshold are removed from the space U, PL is registered as a reference pattern, and this operation is repeated until no pattern remains in the space U. The reference patterns thus obtained normally number several thousand and are stored in the memory 411, each with an address (label) assigned to it.
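The registration procedure just described is a greedy covering of the space U, and a minimal sketch is given below. It assumes a caller-supplied spectral distance function and a strictly positive threshold (so every pattern covers at least itself); the names are illustrative, and ties in Mi are broken in favour of the first pattern found, a choice the text does not specify.

```python
def select_reference_patterns(patterns, dist, threshold):
    """Greedy registration of reference patterns from the training space U.

    patterns  -- list of LSP vectors (the space U)
    dist      -- spectral distance function dist(p, q), assumed dist(p, p) == 0
    threshold -- preset spectral distance threshold (must be > 0)
    """
    remaining = list(range(len(patterns)))
    references = []
    while remaining:
        best, best_cover = None, []
        for i in remaining:
            # M_i: patterns still in U whose distance from pattern i is below threshold.
            cover = [j for j in remaining
                     if dist(patterns[i], patterns[j]) < threshold]
            if len(cover) > len(best_cover):
                best, best_cover = i, cover
        references.append(patterns[best])   # register P_L
        covered = set(best_cover)           # remove P_L and its neighbours from U
        remaining = [j for j in remaining if j not in covered]
    return references
```

With one-dimensional patterns 0.0, 0.1, 0.2, 5.0, 5.1, squared distance and threshold 0.02, the pattern 0.1 covers the most neighbours and is registered first, then 5.0 covers the rest.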
The frequency spectral sensitivity Ws and the frequency interval sensitivity Ww of the LSP parameters read out of the reference pattern memory 411 for pattern matching are stored in the spectral sensitivity memory 412 and the frequency interval sensitivity memory 413, respectively.
Both sensitivities Ws and Ww are obtained experimentally beforehand.


Data are read out of the reference pattern memory 411, the spectral sensitivity memory 412 and the frequency interval sensitivity memory 413 as follows:
For example, let the r-th of two thousand reference patterns be expressed as the S-dimensional vector

$$\mathbf{P}^{(r)} = \left(p^{(r)}_1, p^{(r)}_2, \ldots, p^{(r)}_S\right).$$

To read out the ℓ-th member $p^{(r)}_\ell$ of the r-th reference pattern vector from the reference pattern memory 411, signals indicating r and ℓ are selected as the readout signal. Likewise, by supplying the signal ℓ to the spectral sensitivity memory 412 and the frequency interval memory 413, the sensitivities Ws and Ww determined by the frequency corresponding to the ℓ-th LSP vector member are output from those memories.
Pattern matching is the processing of determining the spectral distance between an input pattern from the DP processor 307 and each reference pattern read out sequentially from the reference pattern memory 411, and of selecting the reference pattern that gives the minimum distance. The processing is carried out by the pattern matching processor 410, the minimum length register 414 and the label register 415. In this embodiment the spectral distance is calculated according to the following expression (14), rather than by the conventional expression (2) alone.

D~J = Dij = a- Ww2 ABS(p(r) p(r) ...... (14)

where Dij is the distance given by expression (2),

$$D_{ij} = \sum_{k} W_k \left(P^{(i)}_k - P^{(j)}_k\right)^2,$$

and a is a weighting coefficient that determines whether the frequency spectral sensitivity or the frequency interval sensitivity should be favored to obtain a better result in selecting the reference pattern; its optimum value is determined experimentally.
Wwℓ is the frequency interval sensitivity relating to the vector member pℓ, ABS( ) denotes the absolute value of the quantity in parentheses, and b is a constant, obtained experimentally, corresponding to the interval threshold below which the frequency interval sensitivity must be taken into consideration.
First, in response to the frame period signal, the minimum length register 414 and the label register 415 are initialized to the maximum value and to "0", respectively.
The LSP parameter vector of the representative frame from the DP processor 307 is supplied to the processor 410.
An address signal r for reading out the reference patterns sequentially and a vector member specifying signal ℓ are supplied from the processor 410 to the reference pattern memory 411. According to this readout signal, the members $p^{(r)}_\ell$ of the r-th reference pattern vector $\mathbf{P}^{(r)}$ are read out sequentially from the memory 411. All the reference patterns are read out by changing r from 1 to the number of prepared reference patterns and, for each r, changing ℓ from 1 to S. The vector member specifying signal ℓ is also supplied to the spectral sensitivity memory 412 and the frequency interval memory 413, so that the sensitivity constants Ws and Ww corresponding to the specified member $p^{(r)}_\ell$ are read out.
Thus, the distance of expression (14) is first calculated for the first reference pattern by changing ℓ from 1 to S, and the calculated distance is compared with the content of the minimum length register 414. Where the calculated distance is smaller, it replaces the content of the register 414, and the label of that reference pattern (r, for example, for the r-th reference pattern) is written into the label register 415.
The label rR held in the label register 415 after the above processing has been carried out on all the reference patterns is the label of the reference pattern most similar to the pattern consisting of the LSP parameters of the representative frame supplied to the processor 410, and the label signal rR is supplied to the coder 309. The repeat bit signal M output from the DP processor 307 is also supplied to the coder 309. The above processing is carried out on the pattern of the representative frame in each representative frame section of the variable length frame.
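The register-based search above amounts to a linear minimum-distance scan. The sketch below mirrors it, assuming expression (2) has the weighted squared-difference form and, as a simplification, using one fixed weight per vector index instead of the frequency-dependent sensitivities Ws and Ww read from memories 412 and 413. Expression (14) itself is not reproduced here; all names are illustrative.

```python
import math

def weighted_distance(x, ref, weights):
    # Assumed form of expression (2): sum over k of W_k * (x_k - ref_k)^2.
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, ref))

def match_label(x, references, weights):
    """Return (label, distance) of the closest reference pattern (1-based label)."""
    min_dist = math.inf   # plays the role of minimum length register 414
    label = 0             # plays the role of label register 415, initialized to "0"
    for r, ref in enumerate(references, start=1):
        d = weighted_distance(x, ref, weights)
        if d < min_dist:  # smaller distance found: update both registers
            min_dist, label = d, r
    return label, min_dist
```

For an input `[0.21, 0.61]` and references `[0.1, 0.5]`, `[0.2, 0.6]`, `[0.3, 0.9]` with unit weights, the second reference is selected.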
The signals transmitted from the analysis side are decoded by the decoder 312 on the synthesis side, and all except the label signal rR are supplied to the respective members as in Fig. 3. As shown in Fig. 4B, the reference pattern specified by rR, identical to that on the analysis side, is read out of the reference pattern memory 421 and decoded by the pattern decoder 420.
The decoded pattern is supplied to the interpolator 319 as the representative frame vector (rR). The constitution and operation of the other elements are the same as in Fig. 3.
The above embodiment uses expression (14), in which the frequency interval spectral sensitivity Ww is taken into consideration, for all the reference patterns when obtaining the spectral distance. However, as mentioned above, Ww has little influence on the spectral distance unless the frequency interval is small. Therefore, when the spectral distance is calculated, it may be decided for each reference pattern whether an interval below a predetermined frequency interval is present; if not, the conventional spectral distance expression (2) may be used, and if so, expression (14) may be used.
In this case, a predetermined number of reference patterns are selected as pattern candidates in ascending order of the distance obtained through expression (2), and the spectral distance according to expression (14) is calculated only for the selected candidates.
This method is advantageous in that it reduces the amount of computation.
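The saving comes from applying the costlier distance only to a short list. A minimal sketch of this two-stage search follows; the coarse distance stands in for expression (2) and the fine distance for expression (14), both passed in as hypothetical callables since the exact form of (14) is not reproduced here.

```python
def two_stage_match(x, references, coarse_dist, fine_dist, num_candidates):
    """Preselect candidates with coarse_dist, then pick the best by fine_dist.

    Returns a 1-based label of the selected reference pattern.
    """
    # Stage 1: rank every reference pattern by the cheaper distance.
    ranked = sorted(range(len(references)),
                    key=lambda r: coarse_dist(x, references[r]))
    candidates = ranked[:num_candidates]
    # Stage 2: evaluate the costlier distance only on the candidate list.
    best = min(candidates, key=lambda r: fine_dist(x, references[r]))
    return best + 1
```

Only `num_candidates` evaluations of the fine distance are needed instead of one per stored reference pattern, which is where the reduction in computation comes from.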
Such an embodiment will now be described.
In this embodiment, the constitution shown in Fig. 4A is replaced by that shown in Fig. 5.
In the drawing, a reference pattern memory 511, a frequency spectral sensitivity memory 512, a frequency interval spectral sensitivity memory 513, minimum length registers 514 and 514', and label registers 515 and 515' have functions similar to those of the corresponding members shown in Fig. 4A; the difference is that the registers 514 and 515 store the above predetermined number of distances and labels.
Pattern candidate registers 516 and 517 store the above predetermined number of pattern candidates.
A first processor 510 decides whether an interval below a predetermined value (obtained experimentally; for example, 0.025 rad) is present in the sequence of LSP frequencies of the vector constituting the reference pattern read out of the reference pattern memory 511. If not, the first processor 510 calculates the spectral distance according to expression (2) using the frequency spectral sensitivity only, and supplies the label signal rR of the most similar reference pattern to the coder 309 by a technique similar to that of Fig. 4A. As described, the term in parentheses in expression (14) is represented by the sensitivity W determined by the frequency interval between the first and second LSP parameters.
On the other hand, if such an interval is present, a predetermined number (2, for example) of pattern candidates are preselected by the first processor 510 from among the prepared reference patterns. In other words, using the distance information obtained according to expression (2), the predetermined number of reference patterns with the smallest distances are taken, in ascending order, as pattern candidates.
The spectral distances of the candidates thus selected are denoted, in ascending order, by D1, D2, ..., Di. If D1 ≪ D2, the frequency interval spectral sensitivity need not be used, and the reference pattern giving the distance D1 is supplied to the coder 309. Otherwise, the ratio

$$R_j = \frac{D_j}{D_1} \qquad (j = 2, 3, \ldots, i)$$

is computed, only the reference patterns whose ratio lies within a threshold value (set experimentally, normally between 1.2 and 3.0) are kept as pattern candidates, and this information is stored in the pattern candidate memory 517.
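The two decisions made by the first processor 510 can be sketched as below: a check for any adjacent LSP frequencies closer than the preset interval, and the ratio test on the preselected distances. The 0.025 rad default comes from the text; the default ratio threshold of 1.5 is an assumed value picked from the stated 1.2 to 3.0 range, and the function names are hypothetical.

```python
def has_narrow_interval(lsp, min_interval=0.025):
    """True if any adjacent LSP frequencies (radians, ascending) are closer than min_interval."""
    return any(b - a < min_interval for a, b in zip(lsp, lsp[1:]))

def prune_by_ratio(candidates, ratio_threshold=1.5):
    """Keep candidates whose ratio R_j = D_j / D_1 is within ratio_threshold.

    candidates -- (label, distance) pairs in ascending order of distance
    """
    d1 = candidates[0][1]
    if d1 == 0:
        # Exact match: no other candidate can be closer.
        return [candidates[0]]
    return [(label, d) for label, d in candidates if d / d1 <= ratio_threshold]
```

In this sketch the case D1 ≪ D2 shows up as only the first candidate surviving the ratio test, so its label can be sent to the coder directly without the second matching stage.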
A second processor 520 has almost the same function as the pattern matching processor 410 in Fig. 4A: pattern matching is performed between the LSP information from the DP processor 307 and that of the pattern candidates read out of the pattern candidate memory 517, and the candidate giving the minimum distance is taken as the pattern for the above-mentioned representative frame.
The label rR indicating that pattern is supplied to the coder 309. The spectral distance here is calculated according to expression (14), in which the frequency interval spectral sensitivity Ww is taken into consideration.

Fig. 6 shows the constitution of the analysis side in another embodiment of this invention, intended to determine the reference patterns efficiently. Here the reference pattern memory on the analysis side of the embodiment shown in Fig. 4A is constituted by a plurality of reference pattern files classified according to the LSP frequency interval of the speech data. A reference pattern file is first selected using as a criterion the frequency interval of the LSP parameters obtained by LSP analysis of the input speech signal; the reference pattern is then determined by measuring the spectral distance between the LSP frequencies stored in that reference pattern file and the LSP frequencies obtained from the input speech signal; and means are provided for transmitting designation code data of the selected reference pattern file and designation code data of the reference pattern from the analysis side to the synthesis side.
In Fig. 6, the reference pattern files 611(1), 611(2), 611(3), ..., 611(I) each correspond to one of a plurality of LSP frequency intervals set beforehand according to the speech data.
For the LSP information supplied from the DP processor 307, an LSP period instrument 613 measures a preset LSP frequency interval, specifically, in this embodiment, the interval between ω1 and ω2 of the 10-dimensional LSP frequencies ω1, ω2, ..., ω10, and sends it to a reference pattern selector 612.

The reference pattern selector 612 reads the contents stored in the reference pattern files 611(1) to 611(I), determines the reference pattern file having the closest LSP frequency interval, and sends reference pattern file designation code data, which designates the number of that file, to the coder 309.
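The file selection step might look like the sketch below, assuming each reference pattern file is summarized by a single preset interval value and that "most approximate" means the smallest absolute difference from the measured ω2 − ω1 interval. The function name and the list-of-intervals representation are hypothetical.

```python
def select_pattern_file(lsp, file_intervals):
    """Return the 1-based number of the reference pattern file whose preset
    LSP interval is closest to the measured omega_2 - omega_1 interval.

    lsp            -- ascending LSP frequencies omega_1, omega_2, ... (radians)
    file_intervals -- preset interval for files 611(1) ... 611(I) (assumed)
    """
    measured = lsp[1] - lsp[0]   # interval measured by the LSP period instrument
    best = min(range(len(file_intervals)),
               key=lambda i: abs(file_intervals[i] - measured))
    return best + 1              # file designation code (1-based)
```

The returned number corresponds to the reference pattern file designation code; the subsequent spectral distance search then runs only over the patterns in that file.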
The reference pattern selector 612 then sends the contents of the selected reference pattern file to a spectral distance instrument 610. The instrument 610 carries out pattern matching by measuring the spectral distance to the LSP information of the input speech signal supplied from the DP processor 307, using an arithmetic operation in which the frequency spectral sensitivity in expression (2) is replaced by the frequency interval spectral sensitivity; it selects the closest reference pattern in the selected reference pattern file and sends reference pattern designation code data, which designates that reference pattern, to the coder 309. In this spectral distance operation in the spectral distance instrument 610, the frequency interval spectral sensitivity stored in the frequency interval spectral sensitivity memory 614 is used as the weighting coefficient in expression (2).
The reference pattern file designation code data and the reference pattern designation code data transmitted from the analysis side to the synthesis side through the coder 309 are used on the synthesis side, together with the sound source information and the repeat bit data, to reproduce the input speech signal.
The synthesis side (not illustrated) is constituted with the reference pattern memory 421 of Fig. 4B replaced by the reference pattern files 611(1) to 611(I) of Fig. 6. The reference pattern is reproduced and decoded by supplying both the reference pattern file designation code data and the reference pattern designation code data to the decoder 312, and the synthesis processing is otherwise carried out exactly as described with reference to Fig. 4B.
In an LSP pattern matching vocoder, this embodiment of the present invention is fundamentally characterized in that the LSP frequency interval spectral sensitivity is used as a weighting coefficient in the spectral distance measurement in addition to the LSP frequency spectral sensitivity used hitherto, so that the input speech signal can be synthesized faithfully when the spectral distance between the LSP information of the reference pattern and the LSP information obtained by analyzing the input speech signal is used as the matching measure. Many other variants are also conceivable. For example, in each embodiment described above the LSP information obtained by the LSP analyzer 18 on the analysis side is computed by solving a high-order equation, but it can also be obtained by the zero-point search process, which is well known alongside the high-order equation process; and although the LSP information is analyzed and extracted for every variable length frame, a fixed length frame may be used instead as occasion demands.

Claims

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:

1. A variable frame length vocoder comprising means for obtaining a feature vector from an input speech signal at every given frame, means for extracting the feature vectors in a given section having a predetermined number of frames, means for approximating a change in said feature vectors in said given section with a given number of flat sections variable in the section length and separated from neighboring sections by a constant time length portion and inclined sections connecting said neighboring flat sections with inclined lines in said constant time length portion, means for outputting the feature vector of a given frame in each flat section as a representative vector of said flat section and means for outputting the number of frames present in said flat section as a repeat signal.

2. The variable frame length vocoder according to Claim 1, further comprising, on a synthesis side, means for producing the feature vector in each of said inclined sections through interpolating by means of the representative vectors of the flat sections present on both sides of said inclined section.

3. The variable frame length vocoder according to Claim 1, wherein said flat sections and their representative vectors are obtained through a dynamic programming process carried out between a feature vector change expressed by said flat section and inclined section and a feature vector change of actual input speech.

4. The variable frame length vocoder according to Claim 1, wherein said feature vector is an LSP parameter vector.

5. The variable frame length vocoder according to Claim 1, further comprising, on the synthesis side, a synthetic filter driven by said representative vector and said repeat signal.

6. The variable frame length vocoder according to Claim 4, further comprising a memory storing, as reference patterns, LSP information obtained for every frame of given length from speech data prepared beforehand, and a pattern matching means for calculating a distance between the LSP information of said representative frame and the LSP information of said reference patterns to output a label signal indicating the reference pattern having the minimum distance.

7. The variable frame length vocoder according to Claim 6, wherein distance calculation in said pattern matching means is carried out by means of a weighting coefficient dependent on frequency of said LSP information.

8. The variable frame length vocoder according to Claim 6, wherein distance calculation in said pattern matching means is carried out by means of a weighting coefficient dependent on frequency interval of said LSP information.

9. The variable frame length vocoder according to Claim 7, wherein distance calculation in said pattern matching means is carried out by means of a weighting coefficient dependent on frequency and frequency interval of said LSP
information.

10. The variable frame length vocoder according to Claim 6, further comprising, on the synthesis side, means for receiving said label signal, and means for outputting the reference pattern designated by the label.

11. The variable frame length vocoder according to Claim 9, wherein said pattern matching means includes:
a first pattern matching means for carrying out the pattern matching by means of the weighting coefficient dependent on frequency of said LSP information, means for deciding whether or not the frequency interval of said LSP information exceeds a predetermined threshold value, means for outputting the label signal indicating the reference pattern obtained through said first pattern matching means when the frequency interval is equal to or exceeds said threshold value, and for outputting a predetermined number of reference patterns as candidate patterns, such that the reference pattern having the minimum distance and those having distances close to the minimum distance are output successively in that order, when the frequency interval comes below said threshold value, and a second pattern matching means for carrying out pattern matching with the weighting coefficient dependent on LSP frequency interval by means of distance information, to output the label signal indicating the pattern having the minimum distance among said candidate patterns.

12. The variable frame length vocoder according to Claim 4, further comprising:
a memory for storing a plurality of reference patterns having a given frequency interval, means for obtaining the frequency interval from said obtained LSP information, a reference pattern selecting means for selecting a given reference pattern from said plurality of reference patterns in response to the obtained frequency interval, and a pattern matching means for carrying out pattern matching with the weighting coefficient dependent on the frequency interval from said input LSP information and LSP information of said selected reference pattern to output the label signal indicating the obtained reference pattern having the minimum distance.