WO2007076279A2 - Method for classifying speech data - Google Patents

Method for classifying speech data Download PDF

Info

Publication number
WO2007076279A2
WO2007076279A2 PCT/US2006/062032 US2006062032W
Authority
WO
WIPO (PCT)
Prior art keywords
data
vowel
speech
amplitude
normalized
Prior art date
Application number
PCT/US2006/062032
Other languages
French (fr)
Other versions
WO2007076279A3 (en)
Inventor
Yi-Qing Zu
Jian-Cheng Huang
Kai-Zhi Wang
Original Assignee
Motorola Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc. filed Critical Motorola Inc.
Publication of WO2007076279A2 publication Critical patent/WO2007076279A2/en
Publication of WO2007076279A3 publication Critical patent/WO2007076279A3/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present invention relates generally to speech recognition.
  • the invention relates to analyzing and classifying speech data to assist in animating avatars.
  • Speech recognition is a process that converts acoustic signals, which are received for example at a microphone, into components of language such as phonemes, words and sentences. Speech recognition is useful for many functions including dictation, where spoken language is translated into written text, and computer control, where software applications are controlled using spoken commands.
  • a further emerging application of speech recognition technology is the control of computer generated avatars.
  • according to Hindu mythology, an avatar is an incarnation of a god that functions as a mediator with humans.
  • avatars are cartoon-like, "two dimensional" or "three dimensional" graphical representations of people or various types of creatures.
  • a "talking head” an avatar can enliven an electronic communication such as a voice call or email by providing a visual image that presents the communication to a recipient.
  • text of an email can be "spoken" to a recipient through an avatar using speech synthesis technology.
  • a conventional telephone call which transmits only acoustic data from a caller to a callee, can be converted to a quasi video conference call using speaking avatars.
  • Such quasi video conference calls can be more entertaining and informative for participants than conventional audio-only conference calls, but require much less bandwidth than actual video data transmissions.
  • Quasi video conferences using avatars employ speech recognition technology to identify language components in received audio data.
  • an avatar displayed on a screen of a mobile phone can animate the voice of a caller in real-time.
  • speech recognition software in the phone identifies language components in the caller's voice and maps the language components to changes in the graphical representation of a mouth of the avatar. The avatar thus appears to a user of the phone to be speaking, using the voice of the caller in real-time.
  • the invention is a method for classifying speech data.
  • the method includes identifying a voiced speech segment of the speech data.
  • a high-amplitude spectrum is then determined by performing a spectral analysis on a high-amplitude component of the voiced speech segment.
  • the high-amplitude spectrum is then classified as a vowel phoneme, where the vowel phoneme is selected from a reduced vowel set.
  • the methods of the present invention are less computationally intensive than most conventional speech recognition methods, which enables the methods of the present invention to be executed faster while using fewer processor resources.
  • FIG. 1 is a schematic diagram illustrating a mobile device in the form of a radio telephone that performs a method of the present invention
  • FIG. 2 is a graph and associated spectral plots illustrating speech data that are received and processed, for example at a mobile device, according to an embodiment of the present invention
  • FIG. 3 is a schematic block diagram illustrating functional components of a speech classification and mouth movement mapping process, according to an embodiment of the present invention.
  • FIG. 4 is a general flow diagram illustrating a method for classifying speech data according to an embodiment of the present invention.
  • referring to FIG. 1, a schematic diagram illustrates a mobile device in the form of a radio telephone 100 that performs a method of the present invention.
  • the telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a processor 103.
  • the telephone 100 also has a keypad 106 and a display screen 105 coupled to be in communication with the processor 103.
  • screen 105 may be a touch screen thereby making the keypad 106 optional.
  • the processor 103 includes an encoder/decoder 111 with an associated code Read Only Memory (ROM) 112 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 100.
  • the processor 103 also includes a micro-processor 113 coupled, by a common data and address bus 117, to the encoder/decoder 111, a character Read Only Memory (ROM) 114, a Random Access Memory (RAM) 104, static programmable memory 116 and a SIM interface 118.
  • the static programmable memory 116 and a SIM operatively coupled to the SIM interface 118 each can store, amongst other things, selected incoming text messages and a Telephone Number Database TND (phonebook) comprising a number field for telephone numbers and a name field for identifiers associated with one of the numbers in the name field.
  • one entry in the Telephone Number Database TND may be 91999111111 (entered in the number field) with an associated identifier "Steven C! at work" in the name field.
  • the micro-processor 113 has ports for coupling to the keypad 106 and screen 105 and an alert 115 that typically contains an alert speaker, vibrator motor and associated drivers. Also, micro-processor 113 has ports for coupling to a microphone 135 and communications speaker 140.
  • the character Read only memory 114 stores code for decoding or encoding text messages that may be received by the communications unit 102. In this embodiment the character Read Only Memory 114 also stores operating code (OC) for the micro-processor 113 and code for performing functions associated with the radio telephone 100.
  • the radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna 107.
  • the communications unit 102 has a transceiver 108 coupled to antenna 107 via a radio frequency amplifier 109.
  • the transceiver 108 is also coupled to a combined modulator/demodulator 110 that couples the communications unit 102 to the processor 103.
  • a graph 200 and associated spectral plots 205-n illustrate speech data that are received and processed, for example at the radio telephone 100, according to an embodiment of the present invention.
  • the graph 200 plots the speech data as sound amplitude versus time.
  • Those skilled in the art will recognize three primary peak waveform envelopes 210-n as voiced speech, and the relatively low amplitude intervals between the peak waveform envelopes 210-n as unvoiced speech.
  • Speech recognition is generally a statistical process that requires computationally intensive analysis of speech data. Such analysis includes recognition of acoustic variabilities like background noise and transducer-induced noise, and recognition of phonetic variabilities like the acoustic differences in individual phonemes.
  • the present invention is a method, which is significantly less computationally intensive than conventional speech recognition methods, for classifying speech data to enable believable and authentic-looking animation of the mouth features of an avatar.
  • an avatar can be displayed on the screen 105 of the phone 100, and appear to be speaking in real-time the words of a caller that are received by the transceiver 108 and amplified over the communications speaker 140.
  • a method is described in detail below.
  • speech data such as that illustrated in the graph 200 are filtered by identifying voiced speech segments of the speech data. Identifying voiced speech segments can be performed using various techniques known in the art such as energy analyses and zero crossing rate analyses. High energy components of speech data are generally associated with voiced sounds, and low to medium energy speech data are generally associated with unvoiced sounds. Very low energy components of speech data are generally associated with silence or background noise.
  • Zero crossing rates are a simple measure of the frequency content of speech data. Low frequency components of speech data are generally associated with voiced speech, and high frequency components of speech data are generally associated with unvoiced speech.
  • a high-amplitude spectrum is determined for each segment.
  • normalized Fast Fourier Transform (FFT) data are determined by normalizing according to amplitude an FFT of a high-amplitude component of each voiced speech segment.
  • the high amplitude component for each segment is identified as a "key frame" that includes a peak amplitude within each segment.
  • the key frames typically have a fixed time window (about 30ms) and the number of samples may vary depending on the sample rate of the speech data.
  • the normalized FFT data are then filtered so as to accentuate peaks in the data.
  • a high-pass filter having a threshold setting of 0.1 can be applied, which sets all values in the FFT data that are below the threshold setting to zero.
  • the normalized and filtered FFT data are then processed by one or more peak detectors.
  • the peak detectors detect various attributes of peaks such as a number of peaks, a peak distribution and a peak energy.
  • the normalized and filtered FFT data which likely represent a high-amplitude spectrum of a main vowel sound, are then divided into sub-bands. For example, according to one embodiment of the present invention four sub-bands are used, which are indexed from 0 to 3. If the energy of a high-amplitude spectrum is concentrated in sub-band 1 or 2, the spectrum is classified as most likely corresponding to a main vowel phoneme /a/.
  • FIG. 2 illustrates a normalized spectrum 205-1 corresponding to the peak waveform envelope 210-1 and to an /a/ main vowel phoneme, a normalized spectrum 205-2 corresponding to the peak waveform envelope 210-2 and an /i/ main vowel phoneme, and a normalized spectrum 205-3 corresponding to the peak waveform envelope 210-3 and a /u/ main vowel phoneme.
  • the classified spectra are used to animate features of an avatar so as to create the impression that the avatar is actually "speaking" the speech data.
  • Such animation is performed by mapping the classified spectra to discrete mouth movements.
  • discrete mouth movements can be replicated by an avatar using a series of visemes, which essentially are basic speech units mapped into the visual domain.
  • Each viseme represents a static, visually contrastive mouth shape, which generally corresponds to a mouth shape that is used when a person pronounces a particular phoneme.
  • the present invention can efficiently perform such phoneme-to-viseme mapping by exploiting the fact that the number of phonemes in a language is much greater than the number of corresponding visemes. Further, the main vowel phonemes /a/, /i/, and /u/ each can be mapped to one of three very distinct visemes. By using only these three distinct visemes—coupled with image frames of a mouth moving from a closed to an open and then again to a closed position—cartoon-like, believable mouth movements can be created. Because only three main vowel phonemes are recognized in the speech data, the speech recognition of embodiments of the present invention is significantly less processor intensive than prior art speech recognition.
  • various vowel phonemes in the English language are all grouped, according to an embodiment of the present invention, into reduced vowel sets using the three main vowel phonemes of /a/, /i/, and /u/, as shown in Table 1 below.
  • Speech data that are classified according to the teachings of the present invention can be used to control the motion of mouth and lip graphics on an avatar using techniques such as mouth width mapping according to speech energy, or mouth shape mapping according to a spectrum structure of the speech data.
  • mouth width mapping concerns the opening and closing of a mouth during a peak waveform envelope 210-n.
  • Mouth width mapping first sets a beginning unvoiced segment of the peak waveform envelope 210-n to zero, thus representing a closed mouth.
  • Remaining data frames in the peak waveform envelope 210-n are then mapped to the images 1 to i - 1 according to the speech energy in each respective frame.
  • post processing of mouth and lip graphics is performed to provide a smooth transition between images.
  • a schematic block diagram 300 illustrates functional components of a speech classification and mouth movement mapping process, according to an embodiment of the present invention. The process is divided into three primary functional blocks: a key frame identification block 305, a reduced vowel set classification block 310, and an animation synthesis block 315.
  • input speech data are received and processed in parallel in an energy analysis block 320 and a rate of cross-zero block 325.
  • Data from the energy analysis block 320 and rate of cross-zero block 325 are then supplied to a voice/unvoice detection block 330, which separates voiced and unvoiced speech data.
  • Data from the voice/unvoice detection block 330 is then processed in a voiced envelope generator block 335, which identifies voiced speech segments of the speech data.
  • raw data from the energy analysis block 320 is also used in the voiced envelope generator block 335.
  • the voiced speech segments are provided to a key frame spectrum analysis block 340, which performs a spectral analysis on high-amplitude components of the voiced speech segments and determines high-amplitude spectra.
  • in a classification block 345, the high-amplitude spectra are classified as main vowel phonemes /a/, /i/ or /u/.
  • in the animation synthesis block 315, the main vowel phonemes output from the classification block 350 are mapped to visemes that are used to animate an avatar.
  • the animation synthesis block 315 retrieves information from an animation material database 355, including for example viseme definitions and information concerning conventional mouth opening and closing positions used when transitioning between phonemes.
  • a final output of the animation synthesis block 315 is thus speech-synchronized animation.
  • a general flow diagram illustrates a method 400 for classifying speech data according to an embodiment of the present invention.
  • speech data are received at a mobile wireless communication device such as the radio phone 100.
  • a voiced speech segment of the speech data is identified.
  • a high-amplitude spectrum is determined by performing a spectral analysis on a high-amplitude component of the voiced speech segment.
  • Step 415 can comprise the following sub-steps:
  • normalized FFT data are determined by normalizing according to amplitude an FFT of the high-amplitude component of the voiced speech segment.
  • the normalized FFT data are filtered to create normalized and filtered FFT data.
  • peaks are detected in the normalized and filtered FFT data.
  • the normalized and filtered FFT data are classified into sub-bands based on attributes of detected peaks such as a total peak number, peak distribution, and peak energy.
  • the high-amplitude spectrum is then classified as a vowel phoneme, such as one of the main vowel phonemes /a/, /i/, or /u/.
  • the main vowel phoneme is mapped to a viseme associated, for example, with a mouth position of an avatar.
  • Advantages of the present invention therefore include improved animations of avatars using real-time speech data. Further, the methods of the present invention are less computationally intensive than most conventional speech recognition methods, which enables the methods of the present invention to be executed faster while using fewer processor resources. Embodiments of the present invention are thus particularly suited to mobile communication devices that have limited processor and memory resources.
  • the non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method for classifying speech data. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein.

Abstract

A computationally non-intensive method for classifying real-time speech data is useful for improved animations of avatars. The method includes identifying a voiced speech segment of the speech data (step 410). A high-amplitude spectrum is then determined by performing a spectral analysis on a high-amplitude component of the voiced speech segment (step 415). The high-amplitude spectrum is then classified as a vowel phoneme, where the vowel phoneme is selected from a reduced vowel set (step 440).

Description

METHOD FOR CLASSIFYING SPEECH DATA
FIELD OF THE INVENTION
The present invention relates generally to speech recognition. In particular, although not exclusively, the invention relates to analyzing and classifying speech data to assist in animating avatars.
BACKGROUND OF THE INVENTION
Speech recognition is a process that converts acoustic signals, which are received for example at a microphone, into components of language such as phonemes, words and sentences. Speech recognition is useful for many functions including dictation, where spoken language is translated into written text, and computer control, where software applications are controlled using spoken commands.
A further emerging application of speech recognition technology is the control of computer generated avatars. According to Hindu mythology, an avatar is an incarnation of a god that functions as a mediator with humans. In the virtual world of electronic communications, avatars are cartoon-like, "two dimensional" or "three dimensional" graphical representations of people or various types of creatures. As a "talking head", an avatar can enliven an electronic communication such as a voice call or email by providing a visual image that presents the communication to a recipient. For example, text of an email can be "spoken" to a recipient through an avatar using speech synthesis technology. Also, a conventional telephone call, which transmits only acoustic data from a caller to a callee, can be converted to a quasi video conference call using speaking avatars. Such quasi video conference calls can be more entertaining and informative for participants than conventional audio-only conference calls, but require much less bandwidth than actual video data transmissions.
Quasi video conferences using avatars employ speech recognition technology to identify language components in received audio data. For example, an avatar displayed on a screen of a mobile phone can animate the voice of a caller in real-time. As the caller's voice is projected over a speaker of the phone, speech recognition software in the phone identifies language components in the caller's voice and maps the language components to changes in the graphical representation of a mouth of the avatar. The avatar thus appears to a user of the phone to be speaking, using the voice of the caller in real-time.
SUMMARY OF THE INVENTION
According to one aspect, the invention is a method for classifying speech data. The method includes identifying a voiced speech segment of the speech data. A high-amplitude spectrum is then determined by performing a spectral analysis on a high-amplitude component of the voiced speech segment. The high-amplitude spectrum is then classified as a vowel phoneme, where the vowel phoneme is selected from a reduced vowel set.
Thus, using the present invention, improved animations of avatars are possible using real-time speech data. The methods of the present invention are less computationally intensive than most conventional speech recognition methods, which enables the methods of the present invention to be executed faster while using fewer processor resources.
BRIEF DESCRIPTION OF THE DRAWINGS
In order that the invention may be readily understood and put into practical effect, reference now will be made to exemplary embodiments as illustrated with reference to the accompanying figures, wherein like reference numbers refer to identical or functionally similar elements throughout the separate views. The figures, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate the embodiments and explain various principles and advantages, in accordance with the present invention, where:
FIG. 1 is a schematic diagram illustrating a mobile device in the form of a radio telephone that performs a method of the present invention;
FIG. 2 is a graph and associated spectral plots illustrating speech data that are received and processed, for example at a mobile device, according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram illustrating functional components of a speech classification and mouth movement mapping process, according to an embodiment of the present invention; and
FIG. 4 is a general flow diagram illustrating a method for classifying speech data according to an embodiment of the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
DETAILED DESCRIPTION
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to methods for classifying speech data. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as left and right, first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a . . ." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to FIG. 1, a schematic diagram illustrates a mobile device in the form of a radio telephone 100 that performs a method of the present invention. The telephone 100 comprises a radio frequency communications unit 102 coupled to be in communication with a processor 103. The telephone 100 also has a keypad 106 and a display screen 105 coupled to be in communication with the processor 103. As will be apparent to a person skilled in the art, screen 105 may be a touch screen thereby making the keypad 106 optional.
The processor 103 includes an encoder/decoder 111 with an associated code Read Only Memory (ROM) 112 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 100. The processor 103 also includes a micro-processor 113 coupled, by a common data and address bus 117, to the encoder/decoder 111, a character Read Only Memory (ROM) 114, a Random Access Memory (RAM) 104, static programmable memory 116 and a SIM interface 118. The static programmable memory 116 and a SIM operatively coupled to the SIM interface 118 each can store, amongst other things, selected incoming text messages and a Telephone Number Database TND (phonebook) comprising a number field for telephone numbers and a name field for identifiers associated with one of the numbers in the name field. For instance, one entry in the Telephone Number Database TND may be 91999111111 (entered in the number field) with an associated identifier "Steven C! at work" in the name field.
The micro-processor 113 has ports for coupling to the keypad 106 and screen 105 and an alert 115 that typically contains an alert speaker, vibrator motor and associated drivers. Also, micro-processor 113 has ports for coupling to a microphone 135 and communications speaker 140. The character Read only memory 114 stores code for decoding or encoding text messages that may be received by the communications unit 102. In this embodiment the character Read Only Memory 114 also stores operating code (OC) for the micro-processor 113 and code for performing functions associated with the radio telephone 100.
The radio frequency communications unit 102 is a combined receiver and transmitter having a common antenna 107. The communications unit 102 has a transceiver 108 coupled to antenna 107 via a radio frequency amplifier 109. The transceiver 108 is also coupled to a combined modulator/demodulator 110 that couples the communications unit 102 to the processor 103.
Referring to FIG. 2, a graph 200 and associated spectral plots 205-n illustrate speech data that are received and processed, for example at the radio telephone 100, according to an embodiment of the present invention. The graph 200 plots the speech data as sound amplitude versus time. Those skilled in the art will recognize three primary peak waveform envelopes 210-n as voiced speech, and the relatively low amplitude intervals between the peak waveform envelopes 210-n as unvoiced speech.
Conventional speech recognition processes address the complex technical problem of identifying phonemes, which are the smallest vocal sound units that are used to create words. Speech recognition is generally a statistical process that requires computationally intensive analysis of speech data. Such analysis includes recognition of acoustic variabilities like background noise and transducer-induced noise, and recognition of phonetic variabilities like the acoustic differences in individual phonemes. According to one embodiment, the present invention is a method, which is significantly less computationally intensive than conventional speech recognition methods, for classifying speech data to enable believable and authentic-looking animation of the mouth features of an avatar. For example, an avatar can be displayed on the screen 105 of the phone 100, and appear to be speaking in real-time the words of a caller that are received by the transceiver 108 and amplified over the communications speaker 140. Such a method is described in detail below.
First, speech data such as that illustrated in the graph 200 are filtered by identifying voiced speech segments of the speech data. Identifying voiced speech segments can be performed using various techniques known in the art such as energy analyses and zero crossing rate analyses. High energy components of speech data are generally associated with voiced sounds, and low to medium energy speech data are generally associated with unvoiced sounds. Very low energy components of speech data are generally associated with silence or background noise.
Zero crossing rates are a simple measure of the frequency content of speech data. Low frequency components of speech data are generally associated with voiced speech, and high frequency components of speech data are generally associated with unvoiced speech.
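As a rough illustration of the voiced/unvoiced filtering described above, the following sketch computes short-time energy and zero-crossing rate per frame and keeps contiguous high-energy, low-rate regions as voiced segments. It assumes 16-bit PCM samples held in a NumPy array; the frame size, hop size, and thresholds are illustrative values, not figures taken from this publication.

```python
import numpy as np

def frame_signal(samples, frame_len=240, hop=80):
    """Slice the signal into overlapping analysis frames (240 samples = 30 ms at 8 kHz)."""
    if len(samples) < frame_len:
        return np.empty((0, frame_len))
    starts = range(0, len(samples) - frame_len + 1, hop)
    return np.stack([samples[s:s + frame_len] for s in starts])

def is_voiced(frame, energy_thresh=1e6, zcr_thresh=0.25):
    """High short-time energy and a low zero-crossing rate suggest voiced speech."""
    energy = float(np.sum(frame.astype(np.float64) ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return energy > energy_thresh and zcr < zcr_thresh

def voiced_segments(samples, frame_len=240, hop=80):
    """Return (start, end) sample indices of contiguous voiced regions."""
    flags = [is_voiced(f) for f in frame_signal(samples, frame_len, hop)]
    segments, start = [], None
    for i, voiced in enumerate(flags):
        if voiced and start is None:
            start = i * hop
        if not voiced and start is not None:
            segments.append((start, i * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```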
After voiced speech segments are identified, a high-amplitude spectrum is determined for each segment. Thus, for each segment, normalized Fast Fourier Transform (FFT) data are determined by normalizing according to amplitude an FFT of a high-amplitude component of each voiced speech segment. For example, in the graph 200, the high-amplitude component for each segment is identified as a "key frame" that includes a peak amplitude within each segment. The key frames typically have a fixed time window (about 30 ms) and the number of samples may vary depending on the sample rate of the speech data. For example, a typical key frame can include a length L = 256 samples at a sample rate of 8 kHz, or L = 512 samples at a sample rate of 16 kHz.
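A minimal sketch of the key-frame extraction and amplitude normalization just described, using the 256-samples-at-8-kHz figure above. The Hann window and the function names are this sketch's own choices rather than details from the publication.

```python
import numpy as np

def key_frame(segment, length=256):
    """Cut a fixed-length window centred on the peak amplitude of a voiced segment."""
    peak = int(np.argmax(np.abs(segment)))
    start = max(0, min(peak - length // 2, len(segment) - length))
    return segment[start:start + length]

def normalized_spectrum(frame):
    """FFT magnitude of the key frame, scaled so the largest bin equals 1.0."""
    windowed = frame.astype(np.float64) * np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(windowed))
    peak = mag.max()
    return mag / peak if peak > 0 else mag
```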
The normalized FFT data are then filtered so as to accentuate peaks in the data. For example, a high-pass filter having a threshold setting of 0.1 can be applied, which sets all values in the FFT data that are below the threshold setting to zero.
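The following sketch combines this threshold filtering with the peak-energy measurement and sub-band decision rules described in the next paragraph. Only the 0.1 threshold and the four sub-band indices come from the text; equal-width sub-bands, the energy-share cut-offs, and the function names are illustrative assumptions.

```python
import numpy as np

def classify_main_vowel(norm_spectrum, threshold=0.1, n_bands=4):
    """Classify a normalized key-frame spectrum as '/a/', '/i/' or '/u/'."""
    # Threshold filter: zero out every value below 0.1, accentuating the peaks.
    spec = np.where(norm_spectrum < threshold, 0.0, norm_spectrum)

    # Split the remaining spectrum into four equal-width sub-bands (indexed 0..3)
    # and measure the share of peak energy falling into each band.
    bands = np.array_split(spec ** 2, n_bands)
    energy = np.array([band.sum() for band in bands])
    total = energy.sum()
    share = energy / total if total > 0 else energy

    # Decision rules paraphrased from the description: energy concentrated in
    # sub-band 1 or 2 -> /a/; in sub-bands 0 and 2 -> /i/; in sub-band 0 alone -> /u/.
    # The 0.5 / 0.2 cut-offs below are assumptions, not values from the publication.
    if share[0] > 0.5 and share[2] < 0.2:
        return "/u/"
    if share[0] > 0.2 and share[2] > 0.2:
        return "/i/"
    return "/a/"
```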
The normalized and filtered FFT data are then processed by one or more peak detectors. The peak detectors detect various attributes of peaks such as a number of peaks, a peak distribution and a peak energy. Using data from the peak detectors, the normalized and filtered FFT data, which likely represent a high-amplitude spectrum of a main vowel sound, are then divided into sub-bands. For example, according to one embodiment of the present invention four sub-bands are used, which are indexed from 0 to 3. If the energy of a high-amplitude spectrum is concentrated in sub-band 1 or 2, the spectrum is classified as most likely corresponding to a main vowel phoneme /a/. If the energy of the high-amplitude spectrum is concentrated in sub-band 0 and 2, the spectrum is classified as most likely corresponding to a main vowel phoneme /i/. Finally, if the energy of the high-amplitude spectrum is concentrated in sub-band 0, the spectrum is classified as most likely corresponding to a main vowel phoneme /u/.
FIG. 2 illustrates a normalized spectrum 205-1 corresponding to the peak waveform envelope 210-1 and to an /a/ main vowel phoneme, a normalized spectrum 205-2 corresponding to the peak waveform envelope 210-2 and an /i/ main vowel phoneme, and a normalized spectrum 205-3 corresponding to the peak waveform envelope 210-3 and a /u/ main vowel phoneme.
According to one embodiment of the present invention, the classified spectra are used to animate features of an avatar so as to create the impression that the avatar is actually "speaking" the speech data. Such animation is performed by mapping the classified spectra to discrete mouth movements. As is well known in the art, discrete mouth movements can be replicated by an avatar using a series of visemes, which essentially are basic speech units mapped into the visual domain. Each viseme represents a static, visually contrastive mouth shape, which generally corresponds to a mouth shape that is used when a person pronounces a particular phoneme.
The present invention can efficiently perform such phoneme-to-viseme mapping by exploiting the fact that the number of phonemes in a language is much greater than the number of corresponding visemes. Further, the main vowel phonemes /a/, /i/, and /u/ each can be mapped to one of three very distinct visemes. By using only these three distinct visemes—coupled with image frames of a mouth moving from a closed to an open and then again to a closed position—cartoon-like, believable mouth movements can be created. Because only three main vowel phonemes are recognized in the speech data, the speech recognition of embodiments of the present invention is significantly less processor intensive than prior art speech recognition. For example, various vowel phonemes in the English language are all grouped, according to an embodiment of the present invention, into reduced vowel sets using the three main vowel phonemes of /a/, /i/, and /u/, as shown in Table 1 below.
Table 1 - Reduced Vowel Sets in English
Main vowel phoneme    English vowel phonemes grouped under it
/a/                   /ax/, /aa/, /ae/, /ao/, /aw/, /er/, /ay/, /eh/, /ey/
/i/                   /ih/, /iy/
/u/                   /ow/, /oy/, /uh/, /uw/
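The grouping in Table 1 can be expressed as a plain lookup table, as in the sketch below. The groups follow claim 3; the /ow/ entry in the /u/ group is an assumption (the corresponding symbol in the source text is unclear), and the dictionary names are this sketch's own.

```python
# Reduced vowel sets of Table 1, keyed by main vowel phoneme (groups per claim 3;
# the /ow/ entry is assumed).
REDUCED_VOWEL_SET = {
    "/a/": ["/ax/", "/aa/", "/ae/", "/ao/", "/aw/", "/er/", "/ay/", "/eh/", "/ey/"],
    "/i/": ["/ih/", "/iy/"],
    "/u/": ["/ow/", "/oy/", "/uh/", "/uw/"],
}

# Inverted view: find the main vowel for any English vowel phoneme.
MAIN_VOWEL = {p: main for main, group in REDUCED_VOWEL_SET.items() for p in group}

assert MAIN_VOWEL["/iy/"] == "/i/"
assert MAIN_VOWEL["/ao/"] == "/a/"
```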
Speech data that are classified according to the teachings of the present invention can be used to control the motion of mouth and lip graphics on an avatar using techniques such as mouth width mapping according to speech energy, or mouth shape mapping according to a spectrum structure of the speech data. For example, mouth width mapping concerns the opening and closing of a mouth during a peak waveform envelope 210-n. Consider where i images, numbered from 0 to i - 1, are used to describe a peak waveform envelope 210-n. Mouth width mapping first sets a beginning unvoiced segment of the peak waveform envelope 210-n to zero, thus representing a closed mouth. Remaining data frames in the peak waveform envelope 210-n are then mapped to the images 1 to i - 1 according to the speech energy in each respective frame. Finally, to make the perceived motion of a mouth and lips on an avatar appear more natural, post processing of mouth and lip graphics is performed to provide a smooth transition between images.
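As one possible reading of mouth width mapping, the sketch below maps per-frame speech energies inside a voiced envelope onto image indices 0 to i - 1 in proportion to energy, forces the first frame to the closed-mouth image, and smooths the sequence with a short moving average as a stand-in for the post-processing step. The linear scaling and the smoothing window are assumptions, not details from the publication.

```python
import numpy as np

def mouth_image_indices(frame_energies, n_images, smooth=3):
    """Map frame energies to mouth image indices 0 .. n_images - 1 (0 = closed)."""
    energies = np.asarray(frame_energies, dtype=float)
    if energies.size == 0:
        return np.array([], dtype=int)
    peak = energies.max() if energies.max() > 0 else 1.0
    idx = (energies / peak) * (n_images - 1)
    idx[0] = 0.0                      # beginning unvoiced frame drawn as a closed mouth
    # Simple moving-average smoothing so the mouth does not jump between frames.
    kernel = np.ones(smooth) / smooth
    idx = np.convolve(idx, kernel, mode="same")
    return np.rint(idx).astype(int)
```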
Referring to FIG. 3, a schematic block diagram 300 illustrates functional components of a speech classification and mouth movement mapping process, according to an embodiment of the present invention. The process is divided into three primary functional blocks: a key frame identification block 305, a reduced vowel set classification block 310, and an animation synthesis block 315.
In the key frame identification block 305, input speech data are received and processed in parallel in an energy analysis block 320 and a rate of cross-zero block 325. Data from the energy analysis block 320 and rate of cross-zero block 325 are then supplied to a voice/unvoice detection block 330, which separates voiced and unvoiced speech data. Data from the voice/unvoice detection block 330 is then processed in a voiced envelope generator block 335, which identifies voiced speech segments of the speech data. As shown in the key frame identification block 305, raw data from the energy analysis block 320 is also used in the voiced envelope generator block 335.
In the reduced vowel sets classification block 310, the voiced speech segments are provided to a key frame spectrum analysis block 340, which performs a spectral analysis on high-amplitude components of the voiced speech segments and determines high-amplitude spectra. Next, in a classification block 345, the high-amplitude spectra are classified as main vowel phonemes /a/, /i/ or /u/.
Finally, in the animation synthesis block 315, the main vowel phonemes output from the classification block 350 are mapped to visemes that are used to animate an avatar. The animation synthesis block 315 retrieves information from an animation material database 355, including for example viseme definitions and information concerning conventional mouth opening and closing positions used when transitioning between phonemes. A final output of the animation synthesis block 315 is thus speech-synchronized animation.
Referring to FIG. 4, a general flow diagram illustrates a method 400 for classifying speech data according to an embodiment of the present invention. First, at step 405 speech data are received at a mobile wireless communication device such as the radio phone 100. At step 410 a voiced speech segment of the speech data is identified. Next, at step 415 a high-amplitude spectrum is determined by performing a spectral analysis on a high-amplitude component of the voiced speech segment. Step 415 can comprise the following sub-steps: At step 420 normalized FFT data are determined by normalizing according to amplitude an FFT of the high-amplitude component of the voiced speech segment. At step 425 the normalized FFT data are filtered to create normalized and filtered FFT data. Then at step 430 peaks are detected in the normalized and filtered FFT data. At step 435 the normalized and filtered FFT data are classified into sub-bands based on attributes of detected peaks such as a total peak number, peak distribution, and peak energy. At step 440 the high-amplitude spectrum is then classified as a vowel phoneme, such as one of the main vowel phonemes /a/, /i/, or /u/. Finally, at step 445, the main vowel phoneme is mapped to a viseme associated, for example, with a mouth position of an avatar.
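Putting the pieces together, a sketch of the flow of method 400 might compose the helpers sketched earlier (voiced_segments, key_frame, normalized_spectrum, classify_main_vowel) with an assumed phoneme-to-viseme lookup. The viseme names are placeholders, and the function is an illustration of the flow rather than the publication's reference implementation.

```python
# Placeholder viseme identifiers for the three main vowel phonemes.
VISEME_FOR = {"/a/": "viseme_open", "/i/": "viseme_spread", "/u/": "viseme_round"}

def classify_speech(samples):
    """Steps 410-445: voiced segments -> key-frame spectra -> main vowels -> visemes."""
    events = []
    for start, end in voiced_segments(samples):                        # step 410
        spectrum = normalized_spectrum(key_frame(samples[start:end]))  # steps 415-425
        vowel = classify_main_vowel(spectrum)                          # steps 430-440
        events.append((start, vowel, VISEME_FOR[vowel]))               # step 445
    return events
```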
Advantages of the present invention therefore include improved animations of avatars using real-time speech data. Further, the methods of the present invention are less computationally intensive than most conventional speech recognition methods, which enables the methods of the present invention to be executed faster while using fewer processor resources. Embodiments of the present invention are thus particularly suited to mobile communication devices that have limited processor and memory resources.
The above detailed description provides an exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the present invention. Rather, the detailed description of the exemplary embodiment provides those skilled in the art with an enabling description for implementing the exemplary embodiment of the invention. It should be understood that various changes can be made in the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the appended claims. It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of classifying speech data as described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method for classifying speech data. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any elements that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all of the claims. The invention is defined solely by the appended claims including any amendments made during the pendency of this application and all equivalents of those claims.

Claims

We Claim:
1. A method for classifying speech data, comprising: identifying a voiced speech segment of the speech data; determining a high-amplitude spectrum, by performing a spectral analysis on a high-amplitude component of the voiced speech segment; and classifying the high-amplitude spectrum as a vowel phoneme, wherein the vowel phoneme is selected from a reduced vowel set.
2. The method of claim 1, wherein the reduced vowel set consists of only the main vowel phonemes /a/, /i/ and /u/.
3. The method of claim 2, wherein the main vowel phoneme /a/ comprises English phonemes selected from the group /ax/, /aa/, /ae/, /ao/, /aw/, /er/, /ay/, /eh/, and /ey/; the main vowel phoneme /i/ comprises English phonemes selected from the group /ih/ and /iy/; and the main vowel phoneme /u/ comprises English phonemes selected from the group /ow/, /oy/, /uh/, and /uw/.
4. The method of claim 1, further comprising mapping the vowel phoneme to a viseme for animating an avatar.
5. The method of claim 1, further comprising receiving the speech data at a mobile wireless communication device.
6. The method of claim 5, wherein classifying the high-amplitude component of the voiced speech segment is performed in real time as the speech data are received.
7. The method of claim 1, wherein determining the high-amplitude spectrum comprises: determining normalized Fast Fourier Transform (FFT) data, by normalizing according to amplitude a Fast Fourier Transform (FFT) of the high-amplitude component of the voiced speech segment; filtering the normalized FFT data to create normalized and filtered FFT data; detecting peaks in the normalized and filtered FFT data; and classifying the normalized and filtered FFT data into sub-bands based on detected peaks.
8. The method of claim 7, wherein detecting peaks in the normalized and filtered FFT data comprises counting a number of peaks, measuring peak distributions, and measuring peak energies.
9. The method of claim 7, wherein the sub-bands are indexed from 0 to 3 and the reduced vowel set comprises the following main vowel phonemes:
/a/, where an energy of the high-amplitude spectrum is concentrated in sub-bands 1 or 2;
/i/, where an energy of the high-amplitude spectrum is concentrated in sub-bands 0 and 2; and /u/, where an energy of the high-amplitude spectrum is concentrated in sub-band 0.
10. The method of claim 1, wherein the high amplitude component of the voiced speech segment comprises a key frame of the speech data.
PCT/US2006/062032 2005-12-29 2006-12-13 Method for classifying speech data WO2007076279A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200510121718.7 2005-12-29
CNA2005101217187A CN1991981A (en) 2005-12-29 2005-12-29 Method for voice data classification

Publications (2)

Publication Number Publication Date
WO2007076279A2 true WO2007076279A2 (en) 2007-07-05
WO2007076279A3 WO2007076279A3 (en) 2008-04-24

Family

ID=38214193

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/062032 WO2007076279A2 (en) 2005-12-29 2006-12-13 Method for classifying speech data

Country Status (2)

Country Link
CN (1) CN1991981A (en)
WO (1) WO2007076279A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2468140A (en) * 2009-02-26 2010-09-01 Dublin Inst Of Technology A character animation tool which associates stress values with the locations of vowels
WO2016154800A1 (en) * 2015-03-27 2016-10-06 Intel Corporation Avatar facial expression and/or speech driven animations
US11176960B2 (en) * 2018-06-18 2021-11-16 University Of Florida Research Foundation, Incorporated Method and apparatus for differentiating between human and electronic speaker for voice interface security

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087629A (en) * 2018-08-24 2018-12-25 苏州玩友时代科技股份有限公司 A kind of mouth shape cartoon implementation method and device based on speech recognition
CN111326143B (en) * 2020-02-28 2022-09-06 科大讯飞股份有限公司 Voice processing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884267A (en) * 1997-02-24 1999-03-16 Digital Equipment Corporation Automated speech alignment for image synthesis
US20030117485A1 (en) * 2001-12-20 2003-06-26 Yoshiyuki Mochizuki Virtual television phone apparatus

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884267A (en) * 1997-02-24 1999-03-16 Digital Equipment Corporation Automated speech alignment for image synthesis
US20030117485A1 (en) * 2001-12-20 2003-06-26 Yoshiyuki Mochizuki Virtual television phone apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BALABKO P.: 'Speech and Music Discrimination Based on Signal Modulation Spectrum', [Online] 24 June 1999, Retrieved from the Internet: <URL:http://www.lamspeople.epfl.ch/balabko/Projects/IDIAP/report.pdf> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2468140A (en) * 2009-02-26 2010-09-01 Dublin Inst Of Technology A character animation tool which associates stress values with the locations of vowels
WO2016154800A1 (en) * 2015-03-27 2016-10-06 Intel Corporation Avatar facial expression and/or speech driven animations
US11176960B2 (en) * 2018-06-18 2021-11-16 University Of Florida Research Foundation, Incorporated Method and apparatus for differentiating between human and electronic speaker for voice interface security

Also Published As

Publication number Publication date
WO2007076279A3 (en) 2008-04-24
CN1991981A (en) 2007-07-04

Similar Documents

Publication Publication Date Title
Gabbay et al. Visual speech enhancement
US20230230572A1 (en) End-to-end speech conversion
US8554550B2 (en) Systems, methods, and apparatus for context processing using multi resolution analysis
US8725507B2 (en) Systems and methods for synthesis of motion for animation of virtual heads/characters via voice processing in portable devices
KR20160032138A (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
EP1974337A2 (en) Method for animating an image using speech data
US20090096796A1 (en) Animating Speech Of An Avatar Representing A Participant In A Mobile Communication
EP0860811A2 (en) Automated speech alignment for image synthesis
CN112099628A (en) VR interaction method and device based on artificial intelligence, computer equipment and medium
CN108184032B (en) Service method and device of customer service system
JP2005202854A (en) Image processor, image processing method and image processing program
CN110663080A (en) Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants
WO2007076279A2 (en) Method for classifying speech data
CN112053702B (en) Voice processing method and device and electronic equipment
CN111554281B (en) Vehicle-mounted man-machine interaction method for automatically identifying languages, vehicle-mounted terminal and storage medium
Hegde et al. Visual speech enhancement without a real visual stream
CN112133277A (en) Sample generation method and device
CN1932976B (en) Method and system for realizing caption and speech synchronization in video-audio frequency processing
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
JPH10293860A (en) Person image display method and device using voice drive
KR100849027B1 (en) Synchronization Method and Apparatus of Lip-sync to Voice Signal
Balasubramanian et al. Estimation of ideal binary mask for audio-visual monaural speech enhancement
KR101095867B1 (en) Apparatus and method for producing speech
Sonawane et al. Review on Speech Synthesis from Lip Movements using Deep Neural Network.
Barker et al. Audio-visual speech fragment decoding.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06846602

Country of ref document: EP

Kind code of ref document: A2