EP1197952A1

EP1197952A1 - Coding method of the prosody for a very low bit rate speech encoder

Info

Publication number: EP1197952A1
Application number: EP01402684A
Authority: EP
Inventors: Philippe Gournay; Yves-Paul Nakache
Original assignee: Thales SA
Current assignee: Thales SA
Priority date: 2000-10-18
Filing date: 2001-10-17
Publication date: 2002-04-17
Anticipated expiration: 2021-10-17
Also published as: CA2359411C; FR2815457B1; JP2002207499A; ATE450856T1; EP1197952B1; DE60140651D1; KR20020031305A; IL145992A0; CA2359411A1; US20020065655A1; FR2815457A1; ES2337020T3; US7039584B2

Abstract

The speech coding-decoding method comprises a learning step that identifies representatives of the speech signal, and a coding step that segments the speech signal and determines the best representative associated with each recognized segment. It includes at least one step of coding-decoding at least one of the prosody parameters of the recognized segments, such as the energy and/or the pitch and/or the voicing and/or the segment length, using prosody information from the best representatives.

Description

The present invention relates to a very low bit rate speech coding method and to the associated system. It applies in particular to speech coding-decoding systems based on the indexing of variable-size units.

The speech coding method used at low bit rates, for example on the order of 2400 bits/s, is generally that of the vocoder, which uses a fully parametric model of the speech signal. The parameters used are the voicing, which describes the periodic or random character of the signal; the fundamental frequency of voiced sounds, also known as the pitch; the temporal evolution of the energy; and the spectral envelope of the signal, generally modeled by an LPC (Linear Predictive Coding) filter.

These parameters are estimated periodically on the speech signal, typically every 10 to 30 ms. They are computed in an analysis device and are generally transmitted to a remote synthesis device, which reproduces the speech signal from the quantized values of the model parameters.

Until now, the lowest standardized bit rate for a speech coder using this technique has been 800 bits/s. This coder, standardized in 1994, is described by the NATO standard STANAG 4479 and in the article entitled "NATO STANAG 4479: A standard for an 800 bps vocoder and channel coding in HF-ECCM system", IEEE Int. Conf. on ASSP, Detroit, pp 480-483, May 1995, by Mouy, B., De La Noue, P., and Goudezeune, G. It relies on a frame-by-frame (22.5 ms) analysis technique of the LPC 10 type and exploits the temporal redundancy of the speech signal as much as possible by grouping the frames 3 by 3 before encoding the parameters.
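As a rough illustration of what grouping frames 3 by 3 means for the bit budget, the figures quoted above can simply be multiplied out. This is only arithmetic on the numbers given in the text; it says nothing about how the standard actually allocates the bits between parameters.

```python
# Bit budget of an 800 bits/s coder that groups 22.5 ms frames 3 by 3.
BIT_RATE_BPS = 800        # from the text
FRAME_MS = 22.5           # from the text
FRAMES_PER_GROUP = 3      # from the text

group_ms = FRAME_MS * FRAMES_PER_GROUP              # duration of one group of frames
bits_per_group = BIT_RATE_BPS * group_ms / 1000.0   # bits available for one group

print(group_ms, bits_per_group)
```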

Although intelligible, the speech reproduced by these coding techniques is of fairly poor quality and is no longer acceptable once the bit rate falls below 600 bits/s.

One way of reducing the bit rate is to use phonetic segmental vocoders with variable-duration segments, which combine principles of speech recognition and speech synthesis.

The encoding procedure essentially uses a continuous automatic speech recognition system, which segments and "labels" the speech signal into a number of variable-size speech units. These phonetic units are coded by indexing into a small dictionary. Decoding relies on the principle of speech synthesis by concatenation, from the index of the phonetic units and from the prosody. The term "prosody" mainly covers the following parameters: the energy of the signal, the pitch, voicing information and, optionally, the temporal rhythm.

However, developing phonetic coders requires substantial knowledge of phonetics and linguistics, as well as a phonetic transcription phase for a training database, which is costly and can be a source of errors. In addition, phonetic coders are difficult to adapt to a new language or a new speaker.

Another technique, described for example in the thesis of J. Cernocky entitled "Speech Processing Using Automatically Derived Segmental Units: Applications to very Low Rate Coding and Speaker Verification", Université Paris XI Orsay, December 1998, gets around the problems related to the phonetic transcription of the training database by determining the speech units automatically and independently of the language.

The operation of this type of coder breaks down mainly into two steps: a learning step and a coding-decoding step, described in figure 1.

During the learning step (figure 1), an automatic procedure determines, for example after a parametric analysis 1 and a segmentation step 2, a set of 64 classes of acoustic units, designated "UA". Each of these classes of acoustic units is associated with a statistical model 3, of the hidden Markov model (HMM) type, and with a small number of units representative of the class, designated by the term "representatives" 4. In the current system, the representatives are simply the 8 longest units belonging to the same acoustic class. They can also be determined as the N most representative units of the acoustic unit. When a speech signal is coded, after a parametric analysis step 5 which yields in particular the spectral parameters, the energies and the pitch, a recognition procedure (6, 7) using a Viterbi algorithm determines the succession of acoustic units of the speech signal and identifies the "best representative" to be used for speech synthesis. This choice is made, for example, using a spectral distance criterion, such as the DTW (Dynamic Time Warping) algorithm.
The number of the acoustic class, the index of this representative unit, the length of the segment, the DTW content and the prosodic information resulting from the parametric analysis are transmitted to the decoder. Speech synthesis is performed by concatenating the best representatives, possibly using a parametric synthesizer of the LPC type.
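The selection of a "best representative" by a DTW-based spectral distance can be sketched as follows. This is a minimal illustration, not the coder's actual implementation: the function names `dtw` and `best_representative` are hypothetical, the local distance is assumed to be Euclidean, and the basic three-predecessor DTW recursion is assumed because the text does not specify the path constraints.

```python
import math

def dtw(a, b):
    """Return (cost, path) of a basic DTW between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # assumed Euclidean local distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

def best_representative(segment, representatives):
    """Index of the representative with the smallest DTW cost to the segment."""
    costs = [dtw(segment, rep)[0] for rep in representatives]
    return min(range(len(costs)), key=costs.__getitem__)

# Toy one-dimensional "spectral" vectors, for illustration only.
segment = [[0.0], [1.0], [2.0]]
representatives = [[[5.0], [5.0]], [[0.0], [1.0], [1.0], [2.0]]]
print(best_representative(segment, representatives))
```

The path returned by `dtw` is the kind of alignment information that the text refers to as the DTW content sent to the decoder.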

To concatenate the representatives during speech decoding, a parametric speech analysis/synthesis method is used, for example. Compared with a simple concatenation of waveforms, this parametric method notably allows prosody modifications, such as modifications of the temporal evolution or of the fundamental frequency (pitch).

The parametric speech model used by the analysis/synthesis method may use a binary voiced/unvoiced excitation of the LPC 10 type, as described in the document entitled "The government standard linear predictive coding algorithm: LPC-10" by T. Tremain, published in the journal Speech Technology, vol. 1, no. 2, pp 40-49.

This technique makes it possible to code the spectral envelope of the signal at approximately 185 bits/s for a single-speaker system, for an average of approximately 21 segments per second.
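The order of magnitude of this figure can be checked with a quick calculation, assuming (this is not stated in the text) that the class number and the representative index are each sent as a fixed-length binary index. With 64 classes and 8 representatives per class, this gives a value close to the quoted rate of about 185 bits/s.

```python
N_CLASSES = 64            # from the text
N_REPRESENTATIVES = 8     # from the text
SEGMENTS_PER_SECOND = 21  # average quoted in the text

# Assumption: fixed-length indices, so ceil(log2(N)) bits each.
bits_class = (N_CLASSES - 1).bit_length()
bits_rep = (N_REPRESENTATIVES - 1).bit_length()
bits_per_segment = bits_class + bits_rep
bit_rate = bits_per_segment * SEGMENTS_PER_SECOND

print(bits_per_segment, bit_rate)
```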

In the remainder of the description, the following terms have the following meanings:

the term "representative" corresponds to one of the segments of the training base which has been judged representative of one of the classes of acoustic units,
the expression "recognized segment" corresponds to a segment of speech which has been identified by the coder as belonging to one of the acoustic classes,
the expression "best representative" designates the representative, determined at the coding stage, which best represents the recognized segment.

The object of the present invention is a method for coding and decoding the prosody for a very low bit rate speech coder that uses, in particular, the best representatives.

It also relates to data compression.

The invention relates to a speech coding-decoding method using a very low bit rate coder, comprising a learning step for identifying "representatives" of the speech signal and a coding step for segmenting the speech signal and determining the "best representative" associated with each recognized segment. It is characterized in that it includes at least one step of coding-decoding at least one of the prosody parameters of the recognized segments, such as the energy and/or the pitch and/or the voicing and/or the length of the segments, using prosody information from the "best representatives".

The prosody information of the representatives that is used is, for example, the energy contour, the voicing, the length of the segments or the pitch.

The step of coding the length of the recognized segments consists, for example, in coding the difference between the length of a recognized segment and the length of the "best representative" multiplied by a given factor.

According to one embodiment, the method includes a step of coding the time alignment of the best representatives by using the DTW path and searching for the nearest neighbor in a table of shapes.

The energy coding step may include a step of determining, for each start of a "recognized segment", the difference ΔE(j) between the energy value E_rd(j) of the "best representative" and the energy value E_sd(j) of the start of the "recognized segment". The decoding step may include, for each recognized segment, a first step consisting in translating the energy contour of the best representative by a quantity ΔE(j) so as to make the first energy E_rd(j) of the "best representative" coincide with the first energy E_sd(j+1) of the recognized segment of index j+1.
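The translation step can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, energies are lists of dB values per frame, the sign of ΔE is chosen so that adding it to the representative's contour yields the segment's starting energy, and the quantization of ΔE is left out.

```python
def encode_energy_offset(segment_energy_db, representative_energy_db):
    """Encoder side: offset between the first energy of the recognized segment
    and the first energy of its best representative (assumed sign convention)."""
    return segment_energy_db[0] - representative_energy_db[0]

def decode_energy_contour(representative_energy_db, delta_e):
    """Decoder side: translate the whole contour of the representative by delta_e."""
    return [e + delta_e for e in representative_energy_db]

segment = [60.0, 62.0, 61.0]                 # toy values, in dB
representative = [50.0, 53.0, 52.0, 51.0]    # toy values, in dB
delta_e = encode_energy_offset(segment, representative)
decoded = decode_energy_contour(representative, delta_e)
print(delta_e, decoded)
```

After the translation the decoded contour starts at the same energy as the recognized segment, while its shape is still that of the representative.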

The voicing coding step includes, for example, a step of determining, for each end of a voicing zone of index k, the difference ΔT_k between the voicing curve of the recognized segments and that of the best representatives. The decoding step includes, for example, for each end of a voicing zone of index k, a step of correcting the time position of this end by the corresponding value ΔT_k and/or a step of deleting or inserting a transition.
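A minimal sketch of the ΔT_k correction is given below. It makes several simplifying assumptions that are not in the text: the voicing curves are binary lists (0 or 1 per frame) of equal length, both curves start in the same state and contain the same number of transitions, and the corrected transitions stay in order. The deletion or insertion of a transition is not handled, and all function names are hypothetical.

```python
def voicing_edges(voiced):
    """Frame indices at which the binary voicing curve changes value."""
    return [t for t in range(1, len(voiced)) if voiced[t] != voiced[t - 1]]

def encode_voicing(segment_voiced, representative_voiced):
    """Encoder side: one time offset per end of voicing zone."""
    seg_edges = voicing_edges(segment_voiced)
    rep_edges = voicing_edges(representative_voiced)
    if len(seg_edges) != len(rep_edges):
        raise ValueError("this sketch only handles equal numbers of transitions")
    return [s - r for s, r in zip(seg_edges, rep_edges)]

def decode_voicing(representative_voiced, deltas):
    """Decoder side: shift each transition of the representative by its offset."""
    rep_edges = voicing_edges(representative_voiced)
    new_edges = {r + d for r, d in zip(rep_edges, deltas)}
    state = representative_voiced[0]
    out = []
    for t in range(len(representative_voiced)):
        if t in new_edges:
            state = 1 - state  # toggle between unvoiced (0) and voiced (1)
        out.append(state)
    return out

representative = [0, 0, 1, 1, 1, 0, 0, 0]
segment = [0, 0, 0, 1, 1, 1, 1, 0]
deltas = encode_voicing(segment, representative)
print(deltas, decode_voicing(representative, deltas))
```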

The invention also relates to a speech coding-decoding system comprising at least one memory for storing a dictionary comprising a set of representatives of the speech signal, and a microprocessor suitable for determining the recognized segments, for reconstructing the speech from the "best representatives" and for implementing the steps of the method according to one of the aforementioned characteristics.

The dictionary of representatives is, for example, common to the coder and the decoder of the coding-decoding system.

The method and the system according to the invention can be used for speech coding-decoding at bit rates below 800 bits/s and preferably below 400 bits/s.

The coding-decoding method and system according to the invention offer in particular the advantage of coding the prosody at a very low bit rate and thus of providing a complete coder in this field of application.

Other characteristics and advantages will become apparent on reading the detailed description of an embodiment given by way of non-limiting example and illustrated by the appended drawings, in which:

figure 1 shows a diagram of speech learning, coding and decoding according to the prior art,
figures 2 and 3 describe examples of coding the length of the recognized segments,
figure 4 shows schematically a model of the time alignment of the "best representatives",
figures 5 and 6 show curves of the energies of the signal to be coded and of the aligned representatives, as well as the initial and decoded energy contours obtained by implementing the method according to the invention,
figure 7 shows schematically the coding of the voicing of the speech signal, and
figure 8 is an example of pitch coding.

The coding principle according to the invention relies on using the "best representatives", in particular their prosody information, to code and/or decode at least one of the prosody parameters of a speech signal, for example the pitch, the energy of the signal, the voicing or the length of the recognized segments.

To compress the prosody at a very low bit rate, the principle implemented uses the segmentation of the coder as well as the prosodic information of the "best representatives".

The following description, given by way of illustration and in no way limiting, describes a method for coding the prosody in a low bit rate speech coding-decoding device that includes a dictionary obtained automatically, for example during the learning step described in figure 1.

The dictionary includes the following information:

several classes of acoustic units UA, each class being determined from a statistical model,
for each class of acoustic units, a set of representatives.

This dictionary is known to the coder and the decoder. It corresponds, for example, to one or more languages and to one or more speakers.

The coding-decoding system includes, for example, a memory for storing the dictionary and a microprocessor suitable for determining the recognized segments, for implementing the various steps of the method according to the invention and for reconstructing the speech from the best representatives.

The method according to the invention implements at least one of the following steps: coding of the length of the segments, coding of the time alignment of the "best representatives", coding and/or decoding of the energy, coding and/or decoding of the voicing information, and/or coding and/or decoding of the pitch, and/or decoding of the length of the segments and of the time alignment.

Segment length coding

The coding system determines on average a number Ns of segments per second, for example 21 segments. The size of these segments varies according to the class of acoustic units UA. It appears that, for the majority of UAs, the number of segments decreases according to a relation 1/x^2.6, where x is the length of the segment.

One variant embodiment of the method according to the invention consists in coding, with a variable-length code, the difference between the length of the "recognized segment" and the length of the "best representative", according to a scheme described in figure 2.

In this scheme, the left-hand column gives the length of the code word to be used, and the right-hand column gives the difference between the length of the segment recognized by the coder for the speech signal and that of the best representative.

According to another embodiment, shown in figure 3, the absolute length of a recognized segment is coded using a variable-length code similar to a Huffman code, known to those skilled in the art, which makes it possible to obtain a bit rate on the order of 55 bits/s.

Using long code words to code the lengths of long recognized segments makes it possible, in particular, to keep the bit rate within a limited range of variation. Indeed, these long segments reduce the number of recognized segments per second and hence the number of lengths to be coded.
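To illustrate why a Huffman-like code assigns short code words to short segments and long code words to long ones, the sketch below builds a standard Huffman code over a segment-length distribution proportional to 1/x^2.6. The range of lengths (1 to 32 frames) and the use of a plain Huffman code are assumptions made for the illustration; the resulting numbers are not those of figure 3 and are not meant to reproduce the quoted 55 bits/s.

```python
import heapq

def huffman_code_lengths(probabilities):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries: (probability, unique tie-breaker, {symbol: depth in the tree}).
    heap = [(p, k, {sym: 0}) for k, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol they contain one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

MAX_LENGTH = 32  # assumption: segment lengths range from 1 to 32 frames
weights = {x: x ** -2.6 for x in range(1, MAX_LENGTH + 1)}  # 1/x^2.6 law from the text
total = sum(weights.values())
probs = {x: w / total for x, w in weights.items()}

lengths = huffman_code_lengths(probs)
avg_bits = sum(probs[x] * lengths[x] for x in probs)
print(lengths[1], lengths[MAX_LENGTH], round(avg_bits, 2))
```

Multiplying the average code length by the number of segments per second gives the corresponding bit rate for a given distribution.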

In summary, a variable-length code is used, for example, to code the difference between the length of the recognized segment and the length of the best representative multiplied by a certain factor, this factor lying between 0 (absolute coding) and 1 (coding of the difference).
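The role of the factor can be made concrete with a small sketch. The function names are hypothetical and the rounding of the scaled length to an integer number of frames is an assumption; the value returned by the encoder is what would then be passed to the variable-length code.

```python
def encode_length(segment_len, representative_len, factor):
    """Residual to be coded: segment length minus the scaled representative length."""
    return segment_len - round(factor * representative_len)

def decode_length(residual, representative_len, factor):
    """Inverse operation, using the representative length known to the decoder."""
    return residual + round(factor * representative_len)

segment_len, representative_len = 9, 12   # toy lengths, in frames
for factor in (0.0, 0.5, 1.0):
    residual = encode_length(segment_len, representative_len, factor)
    print(factor, residual, decode_length(residual, representative_len, factor))
```

With a factor of 0 the residual is the absolute length, and with a factor of 1 it is the plain difference between the two lengths.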

Coding of the time alignment of the best representatives

The time alignment is carried out, for example, by following the DTW (Dynamic Time Warping) path that was determined during the search for the "best representative" used to code the "recognized segment".

Figure 4 shows the path (C) of the DTW, corresponding to the temporal contour that minimizes the distortion between the parameter to be coded (x-axis), for example the vector of cepstral coefficients, and the "best representative" (y-axis). This approach is described in the book entitled "Traitement de la parole" by René Boite and Murat Kunt, published by Presses Polytechnique Romandes, 1987 edition.

The alignment of the "best representatives" is coded by searching for the nearest neighbor in a table containing typical shapes. These typical shapes are chosen, for example, by a statistical approach, such as learning on a speech database, or by an algebraic approach, for example a description by parameterizable mathematical equations, these various methods being known to those skilled in the art.
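A nearest-neighbor search in a table of shapes could look like the sketch below. Everything specific in it is an assumption made for illustration: the warping path is taken to be already normalized and resampled to a fixed number of points in [0, 1], the table is generated algebraically from the hypothetical family y = x**gamma, and the distance is a squared Euclidean distance.

```python
def make_shape(gamma, n_points=5):
    """Hypothetical parametric shape y = x**gamma sampled evenly on [0, 1]."""
    return [(k / (n_points - 1)) ** gamma for k in range(n_points)]

# Illustrative table of three typical shapes (gamma = 1.0 is the diagonal).
SHAPE_TABLE = [make_shape(g) for g in (0.5, 1.0, 2.0)]

def nearest_shape(warp, table=SHAPE_TABLE):
    """Index of the table entry closest to the normalized warping curve."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(table)), key=lambda k: sq_dist(warp, table[k]))

print(nearest_shape([0.0, 0.25, 0.5, 0.75, 1.0]))
print(nearest_shape([0.0, 0.1, 0.3, 0.6, 1.0]))
```

Only the index of the selected shape would need to be transmitted, so the cost per segment depends on the size of the table.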

According to another approach, valid when a large proportion of the segments are short, the method aligns the segments along the diagonal rather than along the exact DTW path. The bit rate is then zero.
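The diagonal alignment needs no transmitted information because the decoder can recompute it from the two lengths alone, as the sketch below suggests. The function name and the linear index mapping with rounding are assumptions made for illustration.

```python
def diagonal_alignment(segment_len, representative_len):
    """For each frame of the recognized segment, the index of the frame of the
    representative that lies on the diagonal of the alignment grid."""
    if segment_len == 1:
        return [0]
    scale = (representative_len - 1) / (segment_len - 1)
    return [round(t * scale) for t in range(segment_len)]

print(diagonal_alignment(5, 5))
print(diagonal_alignment(4, 7))
```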

Energy coding and decoding

When the segments of the speech database belonging to each class of acoustic units are classified and analyzed, a certain consistency emerges in the shape of the energy contours. In addition, the energy contours of the best representatives aligned by DTW resemble the energy contours of the signal to be coded.

Energy coding is described below with reference to FIGS. 5 and 6, where the ordinate axis corresponds to the energy of the speech signal to be coded, expressed in dB, and the abscissa axis to time, expressed in frames.

FIG. 5 shows curve (III), which groups the energy contours of the aligned best representatives, and curve (IV), which groups the energy contours of the recognized segments, separated by * in the figure. A recognized segment of index j is delimited by two points with respective coordinates [E_sd(j); T_sd(j)] and [E_sf(j); T_sf(j)], where E_sd(j) is the energy at the start of the segment and E_sf(j) the energy at the end of the segment, for the corresponding instants T_sd(j) and T_sf(j). The references E_rd(j) and E_rf(j) denote the energy values at the start and at the end of a "best representative", and ΔE(j) denotes the translation determined for a recognized segment of index j.

Energy coding

The method comprises a first step of determining the translation to be applied.

To this end, for each start of a "recognized segment", the method determines the difference ΔE(j) between the energy value E_rd(j) of the best representative (curve III) and the energy value E_sd(j) at the start of the recognized segment (curve IV). This yields a set of values ΔE(j), which are quantized, for example uniformly, so that the translation to be applied at decoding is known. The quantization is carried out, for example, using methods known to those skilled in the art.
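The determination and uniform quantization of the translations can be sketched as follows. This is only an illustration: the sign convention ΔE(j) = E_sd(j) - E_rd(j), the 3 dB step, and the function names are assumptions, while the 4-bit width is taken from the example bit rate given in the description.

```python
def quantize_uniform(value, step, n_bits):
    """Map a value to a signed integer index on a uniform grid of width `step`."""
    lo = -(1 << (n_bits - 1))
    hi = (1 << (n_bits - 1)) - 1
    index = round(value / step)
    return max(lo, min(hi, index))  # clip to the range representable on n_bits

def encode_energy_offsets(e_sd, e_rd, step_db=3.0, n_bits=4):
    """One index per recognized segment j, coding dE(j) = E_sd(j) - E_rd(j).

    e_sd: start energies (dB) of the recognized segments
    e_rd: start energies (dB) of the aligned best representatives
    step_db and the sign convention are assumed, not taken from the text.
    """
    return [quantize_uniform(s - r, step_db, n_bits) for s, r in zip(e_sd, e_rd)]

def decode_energy_offsets(indices, step_db=3.0):
    """Recover the quantized translations (dB) from the transmitted indices."""
    return [i * step_db for i in indices]
```

With these assumed settings, offsets outside roughly -24 dB to +21 dB are clipped, so a real coder would choose the step from the statistics of ΔE(j) on its training database.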

Speech signal energy decoding

The method notably consists in using the energy contours of the best representatives (curve III) to reconstruct the energy contours of the signal to be coded (curve IV).

For each recognized segment, a first step consists in translating the energy contour of the best representative by the translation ΔE(j), defined for example at the coding step, so that its first energy E_rd(j) coincides with the value E_sd(j). After this translation step, the method comprises a step of modifying the slope of the energy contour of the best representative in order to link the last energy value E_rf(j) of the "best representative" to the first energy E_sd(j+1) of the next segment, of index j+1.
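The two decoding steps can be sketched as below. The description does not say how the slope is modified, so this sketch assumes a linear tilt that leaves the first frame untouched and brings the last frame to the start energy of the next segment; it also assumes that ΔE(j) is added to the representative's contour. The function name is hypothetical.

```python
def decode_segment_energy(rep_contour, delta_e, next_start=None):
    """Rebuild the energy contour (dB per frame) of one recognized segment.

    rep_contour: energy contour of the aligned best representative
    delta_e:     decoded translation for this segment (assumed additive)
    next_start:  decoded start energy of the following segment, or None
                 for the last segment
    """
    # Step 1: translate the whole contour
    shifted = [e + delta_e for e in rep_contour]
    n = len(shifted)
    if next_start is None or n < 2:
        return shifted
    # Step 2 (assumed form): linear tilt, zero at the first frame and
    # equal to the full gap at the last frame
    gap = next_start - shifted[-1]
    return [e + gap * k / (n - 1) for k, e in enumerate(shifted)]
```

For example, `decode_segment_energy([50, 54, 52], 6, next_start=62)` first shifts the contour to 56, 60, 58 and then tilts it so that it ends at 62.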

FIG. 6 shows curves (VI) and (VII), corresponding respectively to the original energy contour of the speech signal to be coded and to the energy contour decoded after the steps described above.

For example, coding the start energy of each segment on 4 bits gives a bit rate of the order of 80 bits/s for the segmental coding of the energy.

Coding of the voicing information

FIG. 7 shows the time evolution of binary voicing information over four successive segments 35, 36, 37 for the signal to be coded (curve VII) and for the best representatives (curve VIII) after time alignment by DTW.

Voicing information coding

During coding, the method codes the voicing information, for example by scanning the time evolution of the voicing information of the recognized segments and that of the aligned best representatives (curve VIII), and by coding the differences ΔT_k between these two curves. These differences ΔT_k can be: a frame advance a, a frame delay b, or the absence and/or presence of a transition, referenced c (k is the index of an end of a voicing zone).

For this purpose, a variable-length code, an example of which is given in Table I below, can be used to code the correction to be made to each voicing transition of each recognized segment. Since not all segments contain a voicing transition, the bit rate associated with voicing can be reduced by coding only the voicing transitions that exist in the voicing to be coded and in the best representatives.

With this method, the voicing information is coded at approximately 22 bits per second.

Table I. Example of a coding table for voicing transitions:

Code   Interpretation
000    Transition to be deleted
001    Shift of 1 frame to the right
010    Shift of 1 frame to the left
011    Shift of 2 frames to the right
100    Shift of 2 frames to the left
101    Insert a transition (a code specifying the location of the transition follows this one)
110    No shift
111    Displacement greater than 3 frames (another code follows this one)
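As a rough illustration of how the codes of Table I could be packed and unpacked, the sketch below handles the six self-contained entries only. The two escape codes (101 and 111) are left out because the format of the code that follows them is not given; the symbolic names are hypothetical.

```python
# Bit patterns copied from Table I; the symbolic names are illustrative.
# Codes 101 (insert) and 111 (large displacement) are omitted because the
# format of their follow-up code is not specified.
CODE_TABLE = {
    "delete": "000",
    "right1": "001",
    "left1": "010",
    "right2": "011",
    "left2": "100",
    "none": "110",
}
DECODE_TABLE = {bits: name for name, bits in CODE_TABLE.items()}

def encode_corrections(corrections):
    """Concatenate the 3-bit code of each voicing-transition correction."""
    return "".join(CODE_TABLE[c] for c in corrections)

def decode_corrections(bitstring):
    """Split the bit string into 3-bit codes and look each one up."""
    return [DECODE_TABLE[bitstring[i:i + 3]] for i in range(0, len(bitstring), 3)]
```

For example, `encode_corrections(["none", "right1", "delete"])` gives `"110001000"`, and `decode_corrections` returns the original list.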

For mixed voicing information such as:

the voicing rate per sub-band, the analysis of this information uses a method described for example in the following document: "Multiband Excitation Vocoders", by D.W. Griffin and J.S. Lim, IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, 1988;
the transition frequency between a voiced low band and an unvoiced high band, the coding uses a method such as the one described in the document by C. Laflamme, R. Salami, R. Matmti, and J-P. Adoul, entitled "Harmonic Stochastic Excitation (HSX) speech coding below 4 kbits/s", IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, May 1996, pp. 204-207.

In both cases, the coding of the voicing information also includes coding the variation of the voicing proportion.

Decoding of voicing information

The decoder has the voicing information of the "aligned best representatives" obtained at the coder.

The correction is carried out, for example, as follows:

Each time the end of a voicing zone is detected on the best representatives chosen for the synthesis, the method provides the decoder with an additional item of information, namely the correction to be made to this end. The correction can be an advance a or a delay b to be applied to this end. This time shift is expressed, for example, in number of frames, so as to obtain the exact position of the voicing end in the original speech signal. The correction can also take the form of a deletion or an insertion of a transition.

Pitch coding

Experience shows that, on speech recordings, the number of voiced zones per second is on average of the order of 3 or 4. To render the pitch variations faithfully, one way of proceeding is to transmit several pitch values per voiced zone. To limit the bit rate, instead of transmitting the whole succession of pitch values over a voiced zone, the pitch contour is approximated by a succession of linear segments.

Pitch coding

For each voiced zone of the speech signal, the method includes a step of searching for the pitch values to be transmitted. The pitch values at the start and at the end of the voiced zone are always transmitted. The other values to be transmitted are determined as follows:

the method considers only the pitch values at the start of the recognized segments. Starting from the straight line Di joining the pitch values at the two ends of the voiced zone, the method searches for the segment start whose pitch value is furthest from this line, which corresponds to a distance d_max. It compares this value d_max with a threshold value d_threshold. If the distance d_max is greater than d_threshold, the method splits the initial line Di into two lines D_i1 and D_i2, taking the segment start found as a new pitch value to be transmitted. This operation is repeated on the two new voiced zones delimited by the lines D_i1 and D_i2 until the distance d_max found is less than the distance d_threshold.
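The recursive search described above can be sketched as follows. The sketch assumes that the distance to the line is measured vertically, along the pitch axis, which the description does not state; the function name and the data layout are hypothetical.

```python
def select_pitch_points(points, threshold):
    """Choose the pitch values to transmit for one voiced zone.

    points: (frame, pitch) pairs sorted by frame; the first and last are the
            ends of the voiced zone, the others are starts of recognized
            segments.
    threshold: d_threshold, in the same unit as the pitch values.
    Returns the retained (frame, pitch) pairs in time order.
    """
    if len(points) < 2:
        return list(points)

    def split(lo, hi):
        """Indices strictly between lo and hi that must be transmitted."""
        (t0, p0), (t1, p1) = points[lo], points[hi]
        worst, d_max = None, 0.0
        for k in range(lo + 1, hi):
            t, p = points[k]
            on_line = p0 + (p1 - p0) * (t - t0) / (t1 - t0)
            d = abs(p - on_line)  # assumed: vertical distance to the line
            if d > d_max:
                worst, d_max = k, d
        if worst is None or d_max <= threshold:
            return []
        # The line is split in two at the furthest segment start
        return split(lo, worst) + [worst] + split(worst, hi)

    last = len(points) - 1
    kept = [0] + split(0, last) + [last]
    return [points[k] for k in kept]
```

With a large threshold only the two ends of the voiced zone are kept; lowering the threshold adds intermediate points where the contour departs most from a straight line.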

To code the pitch values thus determined, the method uses, for example, a predictive scalar quantizer, on 5 bits for example, applied to the logarithm of the pitch.

The prediction is, for example, the first pitch value of the best representative corresponding to the position of the pitch to be decoded, multiplied by a prediction factor lying, for example, between 0 and 1.
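A predictive scalar quantizer of this kind might look like the sketch below. Several choices here are assumptions rather than details from the description: the prediction factor is applied to the logarithm of the representative's pitch, the residual is quantized uniformly with an unsigned 5-bit index, and the step size (0.05 on the natural-log scale) and factor (0.8) are arbitrary illustrative values.

```python
import math

N_BITS = 5    # index width mentioned in the description
STEP = 0.05   # assumed quantizer step on the natural-log scale
FACTOR = 0.8  # assumed prediction factor, between 0 and 1

def encode_pitch(pitch, rep_pitch, factor=FACTOR):
    """Quantize log(pitch) minus its prediction to an index in [0, 2**N_BITS)."""
    predicted = factor * math.log(rep_pitch)  # assumed form of the prediction
    index = round((math.log(pitch) - predicted) / STEP)
    return max(0, min((1 << N_BITS) - 1, index))  # clip to the 5-bit range

def decode_pitch(index, rep_pitch, factor=FACTOR):
    """Rebuild the pitch from the index and the same prediction."""
    predicted = factor * math.log(rep_pitch)
    return math.exp(predicted + index * STEP)
```

As long as the index is not clipped, the decoded value differs from the original by at most half a step on the log scale; the decoder needs only the index, since it can form the same prediction from the best representative.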

According to another way of proceeding, the prediction can be the minimum pitch value of the speech recording to be coded. In this case, this value can be transmitted to the decoder by scalar quantization, on 8 bits for example.

Once the pitch values to be transmitted have been determined and coded, the method includes a step in which the time spacing between successive pitch values is specified, for example in number of frames. A variable-length code makes it possible, for example, to code these spacings on 2 bits on average.

This way of proceeding gives a bit rate of approximately 65 bits per second for a maximum distance on the pitch period of 7 samples.

Pitch decoding

The decoding step first comprises a step of decoding the time spacing between the transmitted pitch values, in order to recover the pitch update instants, as well as the pitch value at each of these instants. The pitch value for each frame of the voiced zone is then reconstructed, for example by linear interpolation between the transmitted values.
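The reconstruction by linear interpolation can be sketched as follows, assuming the spacings and pitch values have already been decoded; the function name and argument layout are hypothetical.

```python
def interpolate_pitch(spacings, values):
    """Rebuild one pitch value per frame of a voiced zone.

    spacings[k]: number of frames between values[k] and values[k + 1]
    values:      decoded pitch values at the update instants
    """
    if len(values) != len(spacings) + 1:
        raise ValueError("expected one more pitch value than spacings")
    contour = [values[0]]
    for gap, start, end in zip(spacings, values, values[1:]):
        # Linear interpolation between two consecutive update instants
        for k in range(1, gap + 1):
            contour.append(start + (end - start) * k / gap)
    return contour
```

For example, spacings of 2 and 4 frames with values 100, 110 and 90 give a seven-frame contour that rises to 110 and then falls linearly to 90.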

Claims

1. Speech coding-decoding method using a very low bit rate coder, comprising a learning step for identifying "representatives" of the speech signal and a coding step for segmenting the speech signal and determining the "best representative" associated with each recognized segment, characterized in that it comprises at least one coding-decoding step of at least one of the prosody parameters of the recognized segments, such as the energy and/or the pitch and/or the voicing and/or the length of the segments, using prosody information from the "best representatives".

2. Method according to claim 1, characterized in that the prosody information of the representatives used is the energy contour or the voicing or the length of the segments or the pitch.

3. Method according to claim 1, characterized in that it includes a step of coding the length of the recognized segments, consisting in coding the difference between the length of a recognized segment and the length of the "best representative" multiplied by a given factor.

4. Method according to claim 1, characterized in that it includes a step of coding the time alignment of the best representatives using the DTW path and searching for the nearest neighbor in a table of shapes.

5. Method according to one of claims 1 to 4, characterized in that the energy coding step comprises a step of determining, for each start of a "recognized segment", the difference ΔE(j) between the energy value E_rd(j) of the "best representative" and the energy value E_sd(j) of the start of the "recognized segment".

6. Method according to claim 5, characterized in that the energy decoding step comprises, for each recognized segment, a first step consisting in translating the energy contour of the best representative by a quantity ΔE(j) to make the first energy E_rd(j) of the "best representative" coincide with the first energy E_sd(j+1) of the recognized segment of index j+1.

7. Method according to one of claims 1 to 4, characterized in that the voicing coding step comprises a step of determining the differences ΔT_k existing, for each end of a voicing zone of index k, between the voicing curve of the recognized segments and that of the best representatives.

8. Method according to claim 7, characterized in that the decoding step comprises, for each end of a voicing zone of index k, a step of correcting the time position of this end by a corresponding value ΔT_k and/or a step of deleting or inserting a transition.

9. Speech coding-decoding system comprising at least one memory for storing a dictionary comprising a set of representatives of the speech signal, and a microprocessor suitable for determining the recognized segments, for reconstructing speech from the "best representatives" and for implementing the steps of the method according to one of claims 1 to 8.

10. System according to claim 9, characterized in that the dictionary of representatives is common to the coder and to the decoder of the coding-decoding system.

11. Use of the method according to one of claims 1 to 8 or of the system according to one of claims 9 and 10 for coding-decoding speech at bit rates lower than 800 bits/s and preferably lower than 400 bits/s.