US3511932A - Self-oscillating vocal tract excitation source - Google Patents

Self-oscillating vocal tract excitation source

Info

Publication number
US3511932A
Authority
US
United States
Prior art keywords
signal
vocal
glottal
area
vocal tract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US664130A
Inventor
James L Flanagan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
Bell Telephone Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bell Telephone Laboratories Inc
Application granted
Publication of US3511932A
Anticipated expiration
Expired - Lifetime

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • This invention relates to apparatus for synthesizing speech and, more specifically, to apparatus for generating a vocal tract excitation signal.
  • Another object of the invention is to generate a natural vocal tract excitation signal in response to coded input signals.
  • Still another object of this invention is to generate, automatically, a vocal tract excitation signal in accordance with physiological parameters of the human glottal system.
  • these and other objects are accomplished by turning to account certain properties of the human articulatory system.
  • phonation takes place in the human articulatory system through the vibratory action of the vocal cords.
  • Several forces, acting upon the vocal cords cooperate to vary the vocal cord orifice, or glottal area, thereby controlling air flow into the vocal tract.
  • the air flow through a properly configured tract gives rise to audible speech.
  • control signals developed to represent the momentary glottal area and the driving air force for a particular sound may be used to control the synthesis of artificial speech.
  • a function proportional to the glottal area may be generated automatically by representing the vocal cords as a second order oscillatory system that is driven by a nonlinear forcing function.
  • symmetrical vocal cords may be represented as a mechanical oscillator of mass M, spring constant K, and viscous damping B, which is driven by a forcing function F.
  • Mass M is fixed at a value representative of human cords
  • spring constant K is selected to correspond to vocal cord tension
  • viscous damping B is valued to represent rubbing and collision of the cords.
  • Forcing function F is chosen to represent orifice inlet and outlet pressures acting upon the vocal cord surface area. These pressures are related to the subglottal pressure and the resultant Bernoulli pressure within the orifice.
  • a glottal area function is developed which may be used, in turn, to generate a vocal tract excitation signal capable of controlling the synthesis of natural-sounding speech.
  • a vocal tract excitation signal is generated by representing the human vocal cord orifice by a series circuit connection of two variable resistors and a variable inductor, and by controlling the values of the circuit components in accordance with variations in glottal area.
  • a feature of the present invention is the determination of pitch inflection, glottal waveform, source-system interaction, stress, and irregular temporal changes in signal detail according to the physiological parameters of subglottal pressure, vocal cord tension and vocal tract shape.
  • FIG. 1 illustrates a speech synthesizer that employs the present invention
  • FIG. 2 shows in block schematic form the details of the excitation signal generator of FIG. 1;
  • FIG. 3A shows diagrammatically a mechanical model of the vocal cords for elastic collision
  • FIG. 3B shows a mechanical model of the vocal cords for a viscous collision
  • FIG. 3C shows diagrammatically another view of the model of FIG. 3B
  • FIG. 4 shows graphically the distribution of pressure within the vocal cord orifice
  • FIG. 5A shows graphically a typical subglottal pressure characteristic
  • FIG. 5B shows graphically a typical fundamental frequency characteristic
  • FIG. 6A shows graphically variations in glottal area and an excitation signal during an elastic collision of the vocal cords
  • FIG. 6B shows graphically variations in glottal area and an excitation signal during a viscous collision of the vocal cords
  • FIG. 7 shows in block schematic form the details of the glottal area function generator of FIG. 2;
  • FIG. 8 shows in block schematic form the details of a variable resistor 21 suitable for use in the apparatus of FIG. 2.
  • a spoken word is composed of a sequence of linguistic elements called phonemes.
  • Certain words are thus usually stressed when spoken, in the fashion indicated by punctuation in writing.
  • phonemes, or sequence of phonemes, together with suitable stressing must be specified, preferably in a form directly acceptable by a synthesizer.
  • FIG. 1 shows in block schematic form apparatus which responds to a stressed phonemic input to develop signals useful for producing synthetic speech.
  • Phoneme symbol sequence generator 10 is utilized to produce appropriate signals for initiating synthesis of a spoken word.
  • symbols corresponding to the phonemes of word groups and their associated stress (punctuation) are stored, for example, on a tape in coded form.
  • this information is supplied, preferably, although not necessarily, in digital form, to vocal tract area function generator 11, subglottal pressure function generator 12, and vocal cord tension function generator 13.
  • Operation of phoneme symbol sequence generator 10 is conventional; the generator may be of any desired form. For example, it may be of the type described in detail in Dudley-Harris Pat. 2,771,509, or in Gerstman-Kelly Pat. 3,158,685. If so desired, either of the codes described in the respective patents may be utilized in the practice of the present invention.
  • each phoneme of a particular sequence is employed to control the generation of a plurality of vocal tract area control signals, Av through Av which represent the changes in area of respective segments of the vocal tract, and a vocal tract excitation signal U
  • Av through Av represent the changes in area of respective segments of the vocal tract
  • U vocal tract excitation signal
  • Area signals Av through Av for controlling the synthesis of voiced sounds by vocal tract synthesizer 30 are generated by vocal tract area function generator 11.
  • In order to obtain natural sounding speech, it is preferable that apparatus 11 generate vocal tract area control signals in accordance with certain physiological characteristics of the human vocal tract. Physiological information is therefore used to obtain the proper interpolation between the area control signals generated for one phoneme in a particular sequence and the area control signals generated for the next phoneme in the sequence. This information is included in the area control signals Av delivered by generator 11 to synthesizer 30.
  • Function generator 11 also develops signals which represent the source (lv) and terminating (lv) lengths of the vocal tract. These data are used to control corresponding adjustments in synthesizer 30. Finally, a signal which denotes the duration t of the entire sequence of area control signals for a particular word group is also generated by apparatus 11. Use of the time duration signal will be described later in connection with descriptions of subglottal pressure generator 12 and vocal cord tension generator 13.
  • Vocal tract area function generating apparatus which provides all of the necessary synthesizer control signals, and which is satisfactory for use in the practice of the invention, is described in detail in a copending application of C. H. Coker, filed Aug. 29, 1967, Ser. No. 664,129.
  • an excitation signal is generated automatically that is related to a corresponding word group.
  • Subglottal pressure function generator 12 and vocal cord tension function generator 13 are used for this purpose.
  • subglottal pressure function generator 12 and vocal cord tension function generator 13 are utilized to supply information which is later used in the generation of the vocal tract excitation signal in excitation generator 20, to be explained later in connection with the details as shown in FIG. 2.
  • Typical examples of this supersegmental information are shown in FIGS. 5A and 5B.
  • Subglottal pressure and fundamental frequency (vocal cord tension is proportional to the square of fundamental frequency), corresponding to a particular word group or phrase, are shown.
  • This supersegmental information may be stored in any of a number of ways for future use, as required.
  • the magnitude of the respective information relating to each of n segments of time Δt may be established, for example, by the adjustment of a potentiometer.
  • the subglottal pressure signal P and vocal cord tension signal T(t) can be generated by sweeping through n potentiometers during time duration t of the associated phonemic sequence making up the desired word group.
  • a signal identifying time duration t is supplied from the vocal tract area function generator 11 to both generators 12 and 13, and the associated phonemic sequence, in coded form, is supplied to both generators from phoneme symbol sequence generator 10.
  • This information i.e., subglottal pressure and vocal cord tension
  • the digital output from such a system may readily be converted to analog form in any of a number of ways well known in the digital-to-analog conversion art.
  • Apparatus for accumulating the necessary physiological data, such as subglottal pressure and fundamental frequency relating to any desired word group or phrase, is disclosed in chapter 4 of Intonation, Perception and Language, by Philip Lieberman, MIT Press 1967. Diagrams representative of subglottal pressure and fundamental frequency related to certain word groups are also to be found in the Lieberman reference. These data are stored for future use by function generators 12 and 13.
  • The outputs of generators 12 and 13, i.e., P and T(t), respectively, are supplied to excitation signal generator 20 where, in accordance with this invention, a vocal tract excitation signal, U, is generated. Details of generator 20 and its operation are discussed below with reference to FIG. 2.
  • Excitation signal U, vocal tract area signals Av through Av, and vocal tract length control signals lv and lv are applied to vocal tract synthesizer 30. Each of the applied vocal tract signals controls the value of a circuit element in the synthesizer which represents an element of the human vocal tract. Excitation signal U represents the air flow into the human vocal tract. Synthesizer 30, of known construction, responds to the applied control signals and develops artificial speech sounds which may be used to energize a transducer such as loudspeaker 32.
  • an unvoiced signal output must be generated. This is accomplished by inserting an appropriate noise source within vocal tract synthesizer 30 in a fashion also well known in the art.
  • a suitable transmission line type of synthesizer which may be used in the practice of the present invention is discussed in detail in Dynamic Analog Speech Synthesizer, by George Rosen in The Journal of the Acoustical Society of America, March 1958.
  • FIG. 2 shows in block schematic form the detail of excitation generator 20. Apparatus is utilized including: an orifice circuit which comprises variable resistor 21, variable resistor 22, and variable inductor 23; delay circuit 25; and glottal area function generator 24.
  • the vocal cord orifice, or glottal impedance may be approximated by a series combination of resistors and an inductor that vary in value in proportion to the glottal area and excitation signal. Accordingly:
  • variable resistor can be either a rheostat driven by a servomechanism or an appropriately biased field effect transistor.
  • Variable inductor 23, again, may be one which is controlled by a servomechanism and incorporates the proper amplifying means for scaling the glottal area signal to meet the conditions of Equation 3.
  • the vocal cords may be represented as a second order mechanical system that is driven by a nonlinear forcing function.
  • a typical example of such a mechanical system is shown in FIG. 3A.
  • Mass M is set at a value for a typical set of human vocal cords
  • B is a viscous damping factor selected to simulate cord rubbing
  • K is the equivalent of vocal cord tension.
  • An oscillating system of this type may be expressed mathematically as follows:
  • F(t) represents a forcing function
  • (w·d) is the vocal cord surface area
  • P1 and P2 are the inlet and outlet pressures, respectively, of the vocal cord orifice.
  • Pressures P1 and P2 are related to the subglottal pressure Ps and Bernoulli pressure PB in the vocal cord orifice according to the graph shown in FIG. 4, which is a diagram of pressure versus distance through the vocal cord orifice.
  • FIG. 4 is a diagram of pressure versus distance through the vocal cord orifice.
  • the glottal excitation signal is generated as follows: an initial value, neutral glottal area A, is supplied to control variable resistors 21 and 22 and variable inductor 23 in order to produce an initial current value for excitation signal U. Area signal A is delayed by time delay circuit 25, which may be simply a conventional delay line, and is fed back as delayed signal A to glottal area function generator 24. The delay is necessary so that the proper value of F(t) is generated. Thus, by feeding back A and the corresponding value of U, F(t) is generated, from which displacement x is obtained and, subsequently, a new value of A. This process is iterated, thereby obtaining a continuous vocal tract excitation signal.
  • Generator 20 is self-oscillating, requiring only the input of signals representative of subglottal pressure and vocal cord tension, plus the loading effect of vocal tract synthesizer 30, to produce automatically an excitation signal U. Such a signal is shown in FIG. 6A, where glottal area A and excitation signal U are shown with respect to time.
  • A more accurate mechanical representation of the human vocal cords is shown in FIG. 3B and FIG. 3C.
  • This approximation again assumes symmetrical vocal cords but in this instance the collision of the cords is assumed to be viscous, i.e., inelastic, or semielastic.
  • the collision of mass M with dashpot 71 creates an additional damping B which causes the glottal area A and excitation signal U to remain at a zero value for a portion of each cycle.
  • the resultant variations in the glottal area and excitation signals, as shown in FIG. 6B, are believed to be a more accurate representation of the corresponding human functions.
  • the mathematical representation of such a system for a viscous collision is as follows:
  • FIG. 7 shows in block schematic form the details of glottal area function generator 24.
  • the delayed area function signal A is squared in squaring circuit 40, scaled in amplifier 41, and applied to one input of divider 42.
  • Excitation signal U corresponding to A is squared in squarer 43, scaled in amplifier 44, and applied to the other input of divider 42.
  • the resultant signal k P is applied to one input of subtracter 45.
  • Subglottal pressure signal P is applied to the other input of subtracter 45, where the applied signals are subtracted and scaled to generate a signal which represents forcing function F(t).
  • the forcing function signal and a signal representing vocal cord tension T(t) are fed into computing network 50, preferably an analog computer, where they are utilized to generate a displacement signal x.
  • the voiced fundamental frequency f (voiced) is approximately f = (1/2π)·sqrt(K/M). The coefficient relating to vocal cord tension, that is, K, can be varied in accordance with principles well known in the analog computation art.
  • the resultant output from network 50 is displacement signal x.
  • Displacement signal x is scaled by the factor w in amplifier 51 and added to area function signal A in summing amplifier 52.
  • the resultant sum represents the desired glottal area A
  • Level detector 53 and switching transistors 54 and 55 are utilized to ensure that the glottal area signal and the excitation signal are equal to 0 for values of x at which the cords are closed. This feature may be employed, if desired, in addition to the limiters usually incorporated in computing network 50.
  • FIG. 8 shows in block schematic form the details of variable resistor 21 (of FIG. 2).
  • glottal area signal A is squared in squaring network and the resultant is applied to one input of divider 61.
  • Excitation signal U is applied to circuit 62, such as a full wave rectifier, where its absolute magnitude |U| is established. This value is applied to the second input of divider 61.
  • the output of divider 61 controls variable resistor 63.
  • Variable resistor 63 may be any of a number of types; for example, it can be a rheostat that is driven by a servomechanism, or a field effect transistor that is appropriately biased to be varied by an input signal corresponding to Equation 1.
  • the vocal cords may be represented as a dual mechanical oscillatory system in which each cord is simulated by an individual mechanical oscillator or further, each cord may be simulated by a matrix of interconnected masses and springs which are driven by an appropriate forcing function.
  • Apparatus for generating a vocal tract excitation signal which comprises,
  • Apparatus for generating a vocal tract excitation signal which comprises, in combination,
  • function generator means responsive to said pressure and tension signals for generating a signal representative of glottal area
  • controllable impedance means responsive to said area signal for varying a signal applied to a vocal tract synthesizer.
  • Apparatus as defined in claim 2 further including means for delaying said glottal area signal a predetermined interval and wherein said function generator means comprises,
  • computing means supplied with said signal representation of forcing function F(t) and with a signal representative of said vocal cord tension for generating a glottal displacement signal x, said computing means being preset in accordance with preestablished vocal cord mass and viscous damping parameters, and
  • Apparatus for generating an excitation signal for energizing a vocal tract synthesizer which comprises, in combination:
  • first means responsive to said phonemic sequence signal for generating a signal representative of subglottal pressure
  • second means responsive to said phonemic sequence signal for generating a signal representative of vocal cord tension
  • means responsive to said pressure and tension signals for generating a controlling signal
  • means responsive to said controlling signal for varying signals supplied as excitation to a vocal tract synthesizer.
  • control signal means comprises,
  • varying means comprises a series circuit including: first controllable resistive means, second controllable resistive means, and controllable inductor means.
  • controllable impedance means supplied with said subglottal pressure signal and responsive to said glottal area signal for controlling the excitation potential applied to said vocal tract synthesizer, said controllable impedance means including a first variable resistor, a second variable resistor and a variable inductor,
  • said subglottal area signal generating means including means for delaying said glottal area signal a predetermined interval
  • computational means responsive to said forcing function signal F(t) and said vocal cord tension signal T(t) for generating a glottal displacement signal x, said computational means being preset in accordance with preestablished vocal cord mass and viscous damping parameters,
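
The FIG. 7 signal path listed above (squarers 40 and 43, divider 42, subtracter 45, amplifier 51, summing amplifier 52) can be sketched in software. This is a minimal illustration, not the patented circuit: the function names, the constants `k_bernoulli` and `scale`, the sign convention of the subtraction, and the handling of a closed orifice are all assumptions made for the example.

```python
def forcing_function(p_sub, area_delayed, flow, k_bernoulli=1.0, scale=1.0):
    """Forcing function from subglottal pressure and a Bernoulli-type term.

    Assumed form: F = scale * (Ps - k * U^2 / A^2), where A is the delayed
    glottal area and U the corresponding excitation (flow) signal.
    """
    if area_delayed <= 0.0:
        # Assumption: with the orifice closed there is no flow term.
        return scale * p_sub
    bernoulli = k_bernoulli * flow ** 2 / area_delayed ** 2
    return scale * (p_sub - bernoulli)


def glottal_area(neutral_area, width, displacement):
    """New glottal area: neutral area plus w times x, held at zero when closed."""
    return max(0.0, neutral_area + width * displacement)


# Illustrative values in arbitrary units.
force = forcing_function(p_sub=10.0, area_delayed=2.0, flow=4.0)
area_open = glottal_area(neutral_area=0.05, width=1.0, displacement=0.1)
area_closed = glottal_area(neutral_area=0.05, width=1.0, displacement=-0.1)
```

Feeding `force` into an oscillator model and the resulting displacement back into `glottal_area` would close the loop described in the list; the sketch stops short of that and makes no claim about which parameter values produce sustained oscillation.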

Description

May 12, 1970. J. L. FLANAGAN. 3,511,932. SELF-OSCILLATING VOCAL TRACT EXCITATION SOURCE. Filed Aug. 29, 1967. 3 Sheets.
[Sheet 1 of 3: FIG. 1, FIG. 2, FIG. 3A, FIG. 3B, FIG. 3C]
[Sheet 2 of 3: plots against distance and time]
United States Patent 3,511,932
SELF-OSCILLATING VOCAL TRACT EXCITATION SOURCE
James L. Flanagan, Warren Township, Somerset County, N.J., assignor to Bell Telephone Laboratories, Incorporated, Murray Hill and Berkeley Heights, N.J., a corporation of New York
Filed Aug. 29, 1967, Ser. No. 664,130
Int. Cl. G10l 1/00
U.S. Cl. 179-1
10 Claims
ABSTRACT OF THE DISCLOSURE
Speech synthesizers often produce unnatural sounding speech. This is due in large measure to the problem of developing a suitable vocal tract excitation signal. This problem is overcome by representing the human vocal cords as a second order oscillatory system that is driven by a nonlinear forcing function of subglottal pressure. Accordingly, apparatus is utilized that generates an excitation signal in response to the physiological parameters of subglottal pressure, vocal cord tension, and vocal tract configuration.
BACKGROUND OF THE INVENTION
Field of the invention
This invention relates to apparatus for synthesizing speech and, more specifically, to apparatus for generating a vocal tract excitation signal.
Description of the prior art
Problems arise in producing synthetic speech of acceptable quality due to difficulties in duplicating human vocal tract excitation information. Several systems have been proposed which attempt to produce natural-sounding speech through synthesis. For example, a system which reproduces prerecorded phonetic sequences to form word groups in response to a coded input signal is disclosed in U.S. Pat. 2,771,509, issued to H. W. Dudley and C. M. Harris on Nov. 20, 1956. An improvement of the Dudley-Harris apparatus is disclosed in U.S. Pat. 3,158,685, issued to L. J. Gerstman and J. L. Kelly, Jr. on Nov. 24, 1964. In the Gerstman-Kelly apparatus, phonetic symbols are converted to electrical signals which control a formant synthesizer to reproduce speech. Synthesized speech produced by these systems, however, does not attain a quality comparable to human speech; it is generally characterized as computer talk or machine talk. Thus, although speech so generated may be acceptable functionally, it is unsatisfactory for general application due to its unnatural sound.
Still another suggested approach to improving the quality of synthesized speech is disclosed in chapter 3 of Speech Analysis, Synthesis and Perception, by J. L. Flanagan, Academic Press, 1965. There, in an attempt further to simulate the human articulatory system, the vocal cord orifice is represented by a series circuit connection of a variable resistor and a variable inductor. A vocal tract excitation signal is produced by supplying a signal representative of subglottal pressure to the circuit and varying the values of the circuit elements in dependence upon glottal area. The glottal area information is obtained for this system from photographs of the vocal cords of a human subject as he utters a sound. Obviously, such a system is limited in application. Further, the system still does not yield a versatile excitation signal and, moreover, requires vocal cord area data as an additional input.
Therefore, although such a system may be advantageously employed to analyze certain sounds, it is unsatisfactory for practical use in synthesizing total human speech. Furthermore, difficulty is experienced in developing a vocal tract excitation signal in accordance with a given phonemic sequence.
SUMMARY OF THE INVENTION
Therefore, it is an object of this invention to generate natural-sounding speech sounds.
Another object of the invention is to generate a natural vocal tract excitation signal in response to coded input signals.
Still another object of this invention is to generate, automatically, a vocal tract excitation signal in accordance with physiological parameters of the human glottal system.
In accordance with this invention, these and other objects are accomplished by turning to account certain properties of the human articulatory system. Important among these, is the realization that phonation takes place in the human articulatory system through the vibratory action of the vocal cords. Several forces, acting upon the vocal cords, cooperate to vary the vocal cord orifice, or glottal area, thereby controlling air flow into the vocal tract. The air flow through a properly configured tract gives rise to audible speech. In like manner, control signals developed to represent the momentary glottal area and the driving air force for a particular sound, may be used to control the synthesis of artificial speech.
I have discovered that a function proportional to the glottal area may be generated automatically by representing the vocal cords as a second order oscillatory system that is driven by a nonlinear forcing function. For example, symmetrical vocal cords may be represented as a mechanical oscillator of mass M, spring constant K, and viscous damping B, which is driven by a forcing function F. Mass M is fixed at a value representative of human cords, spring constant K is selected to correspond to vocal cord tension, and viscous damping B is valued to represent rubbing and collision of the cords. Forcing function F is chosen to represent orifice inlet and outlet pressures acting upon the vocal cord surface area. These pressures are related to the subglottal pressure and the resultant Bernoulli pressure within the orifice.
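
The mass, spring, and damper description above corresponds to the standard second-order equation M·x'' + B·x' + K·x = F. The following sketch integrates that equation numerically; it is an illustration under stated assumptions, not the apparatus of the patent. The function name, the semi-implicit Euler scheme, the arbitrary units, and the constant test force are all choices made for the example, and a real forcing function would depend on the glottal pressures.

```python
def simulate_oscillator(mass, damping, stiffness, force, dt, steps):
    """Integrate M*x'' + B*x' + K*x = F(t, x, v) with semi-implicit Euler.

    `force` is any callable taking (t, x, v); all names are illustrative.
    Returns the list of displacements, one per time step.
    """
    x, v = 0.0, 0.0
    history = []
    for i in range(steps):
        t = i * dt
        accel = (force(t, x, v) - damping * v - stiffness * x) / mass
        v += accel * dt  # update velocity first
        x += v * dt      # then position, using the new velocity
        history.append(x)
    return history


# With a constant force the displacement rings and then settles near F/K.
trace = simulate_oscillator(mass=1.0, damping=2.0, stiffness=100.0,
                            force=lambda t, x, v: 5.0, dt=1e-3, steps=20000)
```

Raising `stiffness` raises the ringing frequency, which is the role the text assigns to vocal cord tension; raising `damping` shortens the ringing.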
By supplying signals representative of subglottal pressure to an oscillatory system of this sort in accordance with parameter values dictated by sequences of phonemes, word groups, or the like, a glottal area function is developed which may be used, in turn, to generate a vocal tract excitation signal capable of controlling the synthesis of natural-sounding speech.
Accordingly, in the practice of this invention, a vocal tract excitation signal is generated by representing the human vocal cord orifice by a series circuit connection of two variable resistors and a variable inductor, and by controlling the values of the circuit components in accordance with variations in glottal area.
Therefore, a feature of the present invention is the determination of pitch inflection, glottal waveform, source-system interaction, stress, and irregular temporal changes in signal detail according to the physiological parameters of subglottal pressure, vocal cord tension and vocal tract shape.
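
A series connection of two resistors and an inductor driven by a pressure P obeys P = (R1 + R2)·U + L·dU/dt, with the flow U playing the role of current. The sketch below steps that relation forward in time. The way each element depends on the glottal area is an assumption made for illustration (one resistance proportional to |U|/A², one to 1/A³, and the inductance to 1/A), as are the function names and the unit constants; the load presented by the vocal tract synthesizer is ignored.

```python
def orifice_elements(area, flow, c_kinetic=1.0, c_viscous=1.0, c_inertia=1.0):
    """Return (r_kinetic, r_viscous, inductance) for an open orifice.

    Assumed scalings: r_kinetic ~ |U| / A^2, r_viscous ~ 1 / A^3, L ~ 1 / A.
    The constants are hypothetical placeholders in arbitrary units.
    """
    r_kinetic = c_kinetic * abs(flow) / area ** 2
    r_viscous = c_viscous / area ** 3
    inductance = c_inertia / area
    return r_kinetic, r_viscous, inductance


def step_flow(flow, pressure, area, dt):
    """Advance the flow U by one explicit Euler step of the series circuit."""
    if area <= 0.0:
        return 0.0  # closed orifice: no flow
    r1, r2, inductance = orifice_elements(area, flow)
    return flow + dt * (pressure - (r1 + r2) * flow) / inductance


# Constant area and pressure: the flow settles where P = (R1 + R2) * U.
u = 0.0
for _ in range(10000):
    u = step_flow(u, pressure=2.0, area=1.0, dt=1e-3)
```

With unit constants, unit area, and a pressure of 2, the balance condition is U² + U = 2, so the flow settles at U = 1.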
These and other objects and advantages of the invention will be more fully understood from the following detailed description of an illustrative embodiment thereof taken in connection with the appended drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a speech synthesizer that employs the present invention;
FIG. 2 shows in block schematic form the details of the excitation signal generator of FIG. 1;
FIG. 3A shows diagrammatically a mechanical model of the vocal cords for elastic collision;
FIG. 3B shows a mechanical model of the vocal cords for a viscous collision;
FIG. 3C shows diagrammatically another view of the model of FIG. 3B;
FIG. 4 shows graphically the distribution of pressure within the vocal cord orifice;
FIG. 5A shows graphically a typical subglottal pressure characteristic;
FIG. 5B shows graphically a typical fundamental frequency characteristic;
FIG. 6A shows graphically variations in glottal area and an excitation signal during an elastic collision of the vocal cords;
FIG. 6B shows graphically variations in glottal area and an excitation signal during a viscous collision of the vocal cords;
FIG. 7 shows in block schematic form the details of the glottal area function generator of FIG. 2; and
FIG. 8 shows in block schematic form the details of a variable resistor 21 suitable for use in the apparatus of FIG. 2.
DETAILED DESCRIPTION OF THE INVENTION
A spoken word is composed of a sequence of linguistic elements called phonemes. The inflection given to a word, or a group, often determines meaning. Certain words are thus usually stressed when spoken, in the fashion indicated by punctuation in writing. For the production of synthetic speech, phonemes, or sequences of phonemes, together with suitable stressing must be specified, preferably in a form directly acceptable by a synthesizer.
FIG. 1 shows in block schematic form apparatus which responds to a stressed phonemic input to develop signals useful for producing synthetic speech. Phoneme symbol sequence generator 10 is utilized to produce appropriate signals for initiating synthesis of a spoken word. In generator 10, symbols corresponding to the phonemes of word groups and their associated stress (punctuation) are stored, for example, on a tape in coded form. Upon request, this information is supplied, preferably, although not necessarily, in digital form, to vocal tract area function generator 11, subglottal pressure function generator 12, and vocal cord tension function generator 13.
Operation of phoneme symbol sequence generator 10 is conventional; the generator may be of any desired form. For example, it may be of the type described in detail in Dudley-Harris Pat. 2,771,509, or in Gerstman-Kelly Pat. 3,158,685. If so desired, either of the codes described in the respective patents may be utilized in the practice of the present invention.
Once a stored word group, that is, a phonemic sequence, has been read out in coded form, it is necessary to generate additional signals that are utilized to produce synthetic speech. In accordance with this invention, each phoneme of a particular sequence is employed to control the generation of a plurality of vocal tract area control signals, Av through Av, which represent the changes in area of respective segments of the vocal tract, and a vocal tract excitation signal U. These signals are sufficient to energize and control vocal tract synthesis apparatus 30.
Area signals Av through Av for controlling the synthesis of voiced sounds by vocal tract synthesizer 30 are generated by vocal tract area function generator 11. In order to obtain natural sounding speech, it is preferable that apparatus 11 generate vocal tract area control signals in accordance with certain physiological characteristics of the human vocal tract. Physiological information is therefore used to obtain the proper interpolation between the area control signals generated for one phoneme in a particular sequence and the area control signals generated for the next phoneme in the sequence. This information is included in the area control signals Av delivered by generator 11 to synthesizer 30.
Function generator 11 also develops signals which represent the source (lv) and terminating (lv) lengths of the vocal tract. These data are used to control corresponding adjustments in synthesizer 30. Finally, a signal which denotes the duration t of the entire sequence of area control signals for a particular word group is also generated by apparatus 11. Use of the time duration signal will be described later in connection with descriptions of subglottal pressure generator 12 and vocal cord tension generator 13.
Vocal tract area function generating apparatus which provides all of the necessary synthesizer control signals, and which is satisfactory for use in the practice of the invention, is described in detail in a copending application of C. H. Coker, filed Aug. 29, 1967, Ser. No. 664,129.
In developing natural sounding synthetic speech, it is additionally necessary to generate a suitable vocal tract excitation signal. Thus, in the practice of this invention, an excitation signal is generated automatically that is related to a corresponding word group. Subglottal pressure function generator 12 and vocal cord tension function generator 13 are used for this purpose.
In speaking a particular word group, or phrase, one gives it a certain inflection. For example, it may be a declarative statement, a question, or an exclamation, which one intends another to perceive. Therefore, in order to produce natural sounding speech through synthesis, it is necessary to generate signals which exhibit these characteristics. Thus, information is required in addition to the segmental signals of the phoneme sequence. Consequently, subglottal pressure function generator 12 and vocal cord tension function generator 13 are utilized to supply information which is later used in the generation of the vocal tract excitation signal in excitation generator 20, to be explained later in connection with the details shown in FIG. 2. Typical examples of this supersegmental information are shown in FIGS. 5A and 5B. Subglottal pressure and fundamental frequency (vocal cord tension is proportional to the square of fundamental frequency), corresponding to a particular word group or phrase, are shown.
This supersegmental information may be stored in any of a number of ways for future use, as required. For example, the magnitude of the respective information relating to each of n segments of time Δt may be established by the adjustment of a potentiometer. Thus, the subglottal pressure signal P_s(t) and vocal cord tension signal T(t) can be generated by sweeping through n potentiometers during time duration t of the associated phonemic sequence making up the desired word group. A signal identifying time duration t is supplied from the vocal tract area function generator 11 to both generators 12 and 13, and the associated phonemic sequence, in coded form, is supplied to both generators from phoneme symbol sequence generator 10. This information, i.e., subglottal pressure and vocal cord tension, may also be stored in digital form, for example, in an associative storage system which permits certain information, corresponding to the particular input phonemic sequence plus punctuation, to be read out. The digital output from such a system may readily be converted to analog form in any of a number of ways well known in the digital-to-analog conversion art.
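The potentiometer sweep described above amounts to a stepwise function generator. The following is a minimal sketch under stated assumptions: the function name `sweep_stored_values` and the sample numbers are hypothetical, and each stored value is simply held for its segment with no interpolation between segments.

```python
def sweep_stored_values(values, duration, t):
    """Return the stored value for the time segment that contains t.

    values   -- n stored magnitudes, one per segment (the potentiometer settings)
    duration -- total duration of the phonemic sequence, in seconds
    t        -- query time, with 0 <= t <= duration
    """
    if not values:
        raise ValueError("at least one stored value is required")
    if duration <= 0:
        raise ValueError("duration must be positive")
    if not 0 <= t <= duration:
        raise ValueError("t must lie within the sequence duration")
    n = len(values)
    segment = duration / n                # length of one segment of time
    index = min(int(t / segment), n - 1)  # t == duration maps to the last segment
    return values[index]


# Hypothetical pressure contour: three segments spread over a 0.3 s word group.
pressure_contour = [600.0, 800.0, 400.0]
mid_phrase_pressure = sweep_stored_values(pressure_contour, 0.3, 0.15)
```

The same routine could serve for both the pressure and the tension signals, since the text drives both generators from the same duration signal.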
Apparatus for accumulating the necessary physiological data, such as subglottal pressure and fundamental frequency relating to any desired word group or phrase, is disclosed in chapter 4 of Intonation, Perception and Language, by Philip Lieberman, MIT Press 1967. Diagrams representative of subglottal pressure and fundamental frequency related to certain word groups are also to be found in the Lieberman reference. These data are stored for future use by function generators 12 and 13.
The outputs of generators 12 and 13, i.e., P_s(t) and T(t), respectively, are supplied to excitation signal generator 20 where, in accordance with this invention, a vocal tract excitation signal, U_g, is generated. Details of generator 20 and its operation are discussed below with reference to FIG. 2.
Excitation signal U_g, vocal tract area signals A_v1 through A_vn, and vocal tract length control signals l_v are applied to vocal tract synthesizer 30. Each of the applied vocal tract signals controls the value of a circuit element in the synthesizer which represents an element of the human vocal tract. Excitation signal U_g represents the air flow into the human vocal tract. Synthesizer 30, of known construction, responds to the applied control signals and develops artificial speech sounds which may be used to energize a transducer such as loudspeaker 32.
In addition to voiced output signals generated in response to vocal tract and excitation control signals, an unvoiced signal output must be generated. This is accomplished by inserting an appropriate noise source within vocal tract synthesizer 30 in a fashion also well known in the art. A suitable transmission line type of synthesizer which may be used in the practice of the present invention is discussed in detail in Dynamic Analog Speech Synthesizer, by George Rosen in The Journal of the Acoustical Society of America, March 1958.
FIG. 2 shows in block schematic form the detail of excitation generator 20. Apparatus is utilized including: an orifice circuit which comprises variable resistor 21, variable resistor 22, and variable inductor 23; delay circuit 25; and glottal area function generator 24.
The vocal cord orifice, or glottal impedance, may be approximated by a series combination of resistors and an inductor that vary in value in proportion to the glottal area and excitation signal. Accordingly:
The variable resistors can be either rheostats driven by a servomechanism or appropriately biased field effect transistors. Variable inductor 23, again, may be one which is controlled by a servomechanism and incorporates the proper amplifying means for scaling the glottal area signal to meet the conditions of Equation 3. Glottal area function generator 24, described in detail with reference to FIG. 7, develops automatically, in accordance with the invention, area function A_g, which, in turn, is applied to the orifice circuit components.
To generate glottal area function signal A_g, certain physiological characteristics of the human glottal system must be considered. I have discovered that the vocal cords may be represented as a second order mechanical system that is driven by a nonlinear forcing function. A typical example of such a mechanical system is shown in FIG. 3A. In this instance, it is assumed that the vocal cords are symmetrical and that there is an elastic collision between them. The elastic collision is simulated by mass M striking block 70. Mass M is set at a value for a typical set of human vocal cords, B is a viscous damping factor selected to simulate cord rubbing, and K is the equivalent of vocal cord tension. The mass is shown, in FIG. 3A, at the phonation neutral position, that is, the position of the vocal cords just before commencing phonation. In this position, displacement is equal to x=0. An oscillating system of this type may be expressed mathematically as follows:
and
where F(t) represents a forcing function, (w·d) is the vocal cord surface area, and P_1 and P_2 are the inlet and outlet pressures, respectively, of the vocal cord orifice. Pressures P_1 and P_2 are related to the subglottal pressure P_s and the Bernoulli pressure P_B in the vocal cord orifice according to the graph shown in FIG. 4, which is a diagram of pressure versus distance through the vocal cord orifice. Thus, as shown in FIG. 4, at distance 0 the vocal cord orifice pressure P_1 is (P_s - 1.37 P_B) and, at distance d, the vocal cord orifice output pressure P_2 is equal to (-0.5 P_B). The pressures can be negative in magnitude due to the Bernoulli pressure in the orifice. Substituting these values from the graph of FIG. 4 for P_1 and P_2 in Equation 5 expresses F(t) in terms of P_s and P_B. Glottal area may therefore be obtained by solving Equation 4 for displacement x and substituting in A_g = A_g0 + w·x, where A_g0 represents glottal area with the vocal cords in the neutral position, w represents the width of the vocal cord orifice, and A_g = U_g = 0 when x reaches the closure displacement.
Applying a signal representative of subglottal pressure P_s, as shown in FIG. 5A, and vocal cord tension T(t), which is proportional to the square of fundamental frequency, as shown in FIG. 5B, to glottal excitation source 20 (FIG. 2), the glottal excitation signal is generated as follows: an initial value, neutral glottal area A_g0, is supplied to control variable resistors 21 and 22 and variable inductor 23 in order to produce an initial current value for excitation signal U_g. Area signal A_g is delayed by time delay circuit 25, which may be simply a conventional delay line, and is fed back as a delayed area signal to glottal area function generator 24. The delay is necessary so that the proper value of F(t) is generated. Thus, feeding back the delayed A_g and the corresponding value of U_g, F(t) is generated, from which displacement x is obtained, and subsequently, a new value of A_g. This process is iterated, thereby obtaining a continuous vocal tract excitation signal.
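As a reading aid, the relations described in this passage can be written out in one place. This is a hedged reconstruction from the surrounding definitions rather than the patent's own typography. The subscripts, the factor 1/2 (taken here to average the inlet and outlet pressures over the cord surface), and the air density ρ in the Bernoulli pressure are assumptions. The sum P_1 + P_2 follows directly from the FIG. 4 values quoted in the text.

```latex
\begin{aligned}
M\ddot{x} + B\dot{x} + Kx &= F(t) \\
F(t) &= \tfrac{1}{2}\,(w\,d)\,(P_1 + P_2) \\
     &= \tfrac{1}{2}\,(w\,d)\,(P_s - 1.87\,P_B),
        \qquad P_B = \tfrac{1}{2}\,\rho\,\frac{U_g^{2}}{A_g^{2}} \\
A_g &= A_{g0} + w\,x, \qquad A_g = U_g = 0 \ \text{at closure}
\end{aligned}
```

The proportionality of P_B to the squared ratio of U_g to A_g agrees with the squarer and divider chain described for FIG. 7.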
Generator 20 is self-oscillating, requiring only the input of signals representative of subglottal pressure and vocal cord tension plus the loading effect of vocal tract synthesizer 30 to produce automatically an excitation signal U_g. Such a signal is shown in FIG. 6A, where glottal area A_g and excitation signal U_g are shown with respect to time.
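The iteration order of the feedback loop (area, flow, delayed feedback, forcing function, displacement, new area) can be sketched numerically. This is an illustrative sketch, not the patent's circuit: every numeric default is an assumed SI value, the function name `simulate_excitation` is hypothetical, the factor 1/2 in the force and the Bernoulli expression are assumptions, and a quasi-steady orifice flow stands in for resistors 21 and 22 and inductor 23. The loading of synthesizer 30 is omitted, so the sketch is not claimed to reproduce the waveforms of FIG. 6A.

```python
import math
from collections import deque

RHO = 1.2  # air density in kg/m^3 (assumed)


def simulate_excitation(p_sub, stiffness, steps=2000, dt=1e-5, delay_steps=10,
                        mass=1e-4, damping=0.015, width=0.015, depth=0.003,
                        area_neutral=5e-6):
    """Iterate the area -> flow -> force -> displacement -> area loop.

    p_sub must be non-negative and delay_steps at least 1.
    Returns two lists: glottal area A_g and volume flow U_g for each step.
    """
    x_closed = -area_neutral / width   # displacement at which A_g reaches zero
    x, v = 0.0, 0.0                    # start at the phonation neutral position
    area = area_neutral
    flow = area * math.sqrt(2.0 * p_sub / RHO)
    # Delay line of (A_g, U_g) pairs, pre-filled with the initial values.
    history = deque([(area, flow)] * delay_steps, maxlen=delay_steps)
    areas, flows = [], []
    for _ in range(steps):
        area_d, flow_d = history[0]    # oldest entry holds the delayed signals
        # Bernoulli pressure from delayed area and flow (zero while closed).
        p_b = 0.5 * RHO * (flow_d / area_d) ** 2 if area_d > 0.0 else 0.0
        # Forcing function on the cord surface (w*d) using the FIG. 4 values.
        force = 0.5 * width * depth * (p_sub - 1.87 * p_b)
        # Semi-implicit Euler step of M*x'' + B*x' + K*x = F(t).
        v += (force - damping * v - stiffness * x) / mass * dt
        x += v * dt
        if x <= x_closed:              # elastic collision: rebound from the stop
            x, v = x_closed, abs(v)
        area = max(0.0, area_neutral + width * x)
        # Quasi-steady orifice flow (assumption replacing elements 21 to 23).
        flow = area * math.sqrt(2.0 * p_sub / RHO)
        history.append((area, flow))
        areas.append(area)
        flows.append(flow)
    return areas, flows


areas, flows = simulate_excitation(p_sub=800.0, stiffness=56.8, steps=500)
```

The delay line is modeled with a fixed-length `deque`, so appending a new pair discards the oldest one, which mirrors a conventional delay line of fixed length.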
A more accurate mechanical representation of the human vocal cords is shown in FIG. 3B and FIG. 3C. This approximation again assumes symmetrical vocal cords, but in this instance the collision of the cords is assumed to be viscous, i.e., inelastic or semielastic. The collision of mass M with dashpot 71 creates an additional damping B, which causes the glottal area A_g and excitation signal U_g to remain at a zero value for a portion of each cycle. The resultant variations in the glottal area and excitation signals, as shown in FIG. 6B, are believed to be a more accurate representation of the corresponding human functions. The mathematical representation of such a system for a viscous collision is as follows:
and A_g = U_g = 0 for x at or beyond the closure displacement, whose magnitude is A_g0/w. The sequence for generating the glottal area signal and, in turn, the excitation signal, is as discussed above in relation to Equation 4.
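One way to picture the viscous collision is a single integration step in which extra damping is switched in once the mass reaches the dashpot. The sketch below is an assumption-laden illustration: the names `step_viscous`, `glottal_area` and `b_collision` are hypothetical, the closure displacement is taken as negative so that the area A_g0 + w·x falls to zero there, and a semi-implicit Euler step is used in place of the analog computer.

```python
def step_viscous(x, v, force, mass, b, b_collision, k, x_closed, dt):
    """Advance displacement x and velocity v by one time step dt.

    The additional damping b_collision acts only while x is at or beyond
    the closure displacement x_closed (the dashpot of FIG. 3B and 3C).
    """
    damping = b + b_collision if x <= x_closed else b
    accel = (force - damping * v - k * x) / mass
    v_new = v + accel * dt
    x_new = x + v_new * dt
    return x_new, v_new


def glottal_area(x, area_neutral, width):
    """Glottal area from displacement, held at zero while the cords are closed."""
    return max(0.0, area_neutral + width * x)


# Hypothetical state just past closure, moving further inward.
damped = step_viscous(-0.001, -0.1, 0.0, 1e-4, 0.01, 0.1, 50.0, -0.0005, 1e-5)
undamped = step_viscous(-0.001, -0.1, 0.0, 1e-4, 0.01, 0.0, 50.0, -0.0005, 1e-5)
```

Because the mass keeps moving past the closure point instead of rebounding, the clamped area stays at zero for part of each cycle, which is the behavior the text attributes to the viscous model.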
FIG. 7 shows in block schematic form the details of glottal area function generator 24. The delayed area function signal A_g is squared in squaring circuit 40, scaled in amplifier 41, and applied to one input of divider 42. Excitation signal U_g, corresponding to A_g, is squared in squarer 43, scaled in amplifier 44, and applied to the other input of divider 42. The resultant signal k P_B is applied to one input of subtracter 45. Subglottal pressure signal P_s is applied to the other input of subtracter 45, where the applied signals are accordingly subtracted and scaled to generate a signal which represents forcing function F(t). The forcing function signal and a signal representing vocal cord tension T(t) are fed into computing network 50, preferably an analog computer, where they are utilized to generate a displacement signal x. Computing circuits of the type utilized in the present invention are discussed in chapter 5 of Analog Computation, by Albert S. Jackson, McGraw-Hill 1960. Particularly, the instances of elastic and viscous collisions are discussed at pages 200 through 203 in the Analog Computation reference. In computing network 50, the vocal cord tension signal T(t) is used to control the coefficient K in Equations 4 and 9. This variation is in accordance with the voiced fundamental frequency f shown in FIG. 5B as follows:
T ≡ K
As a first approximation, the voiced fundamental frequency f (voiced) varies as the square root of K, consistent with vocal cord tension being proportional to the square of fundamental frequency. The coefficient relating to vocal cord tension, that is, K, can be varied in accordance with principles well known in the analog computation art. The resultant output from network 50 is displacement signal x. Displacement signal x is scaled by the factor w in amplifier 51 and added to area function signal A_g0 in summing amplifier 52. The resultant sum represents the desired glottal area A_g. Level detector 53 and switching transistors 54 and 55 are utilized to ensure that the glottal area signal and the excitation signal are equal to 0 for values of x at or beyond the closure displacement. This feature may be employed, if desired, in addition to the limiters usually incorporated in computing network 50.
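The link between the tension coefficient K and the voiced pitch can be illustrated with the natural frequency of an undamped mass and spring. Treating that textbook relation, f = (1/2π)·sqrt(K/M), as the first approximation meant here is an assumption. The function names and the mass value are hypothetical.

```python
import math


def stiffness_for_pitch(mass, f0):
    """Spring constant K (N/m) whose natural frequency is f0 (Hz) for mass (kg)."""
    if mass <= 0 or f0 < 0:
        raise ValueError("mass must be positive and f0 non-negative")
    return mass * (2.0 * math.pi * f0) ** 2


def pitch_for_stiffness(mass, stiffness):
    """Natural frequency (Hz) of an undamped mass-spring system."""
    if mass <= 0 or stiffness < 0:
        raise ValueError("mass must be positive and stiffness non-negative")
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)


# Hypothetical cord mass of 0.1 g tuned to a 120 Hz fundamental.
k_120 = stiffness_for_pitch(1e-4, 120.0)
```

Doubling the target frequency quadruples K, which matches the statement that vocal cord tension is proportional to the square of fundamental frequency.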
FIG. 8 shows in block schematic form the details of variable resistor 21 (of FIG. 2). According to the conditions of Equation 1, glottal area signal A_g is squared in squaring network 60 and the resultant is applied to one input of divider 61. Excitation signal U_g is applied to circuit 62, such as a full wave rectifier, where its absolute magnitude |U_g| is established. This value is applied to the second input of divider 61. The output of divider 61, therefore, k|U_g|/A_g^2, controls variable resistor 63. Variable resistor 63 may be any of a number of types; for example, it can be a rheostat that is driven by a servomechanism, or a field effect transistor that is appropriately biased to be varied by an input signal corresponding to Equation 1.
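The squarer, rectifier and divider chain of FIG. 8 computes a single control value, which can be sketched as follows. The function name `resistor_control`, the default scale factor, and the choice to return infinity for a closed glottis are assumptions made for illustration.

```python
import math


def resistor_control(area, flow, k=1.0):
    """Control value k*|U_g| / A_g**2 for the flow-dependent variable resistor.

    area -- glottal area signal A_g (non-negative)
    flow -- excitation signal U_g (either sign; only its magnitude is used)
    k    -- scale factor of the divider output (assumed constant)
    """
    if area < 0.0:
        raise ValueError("glottal area cannot be negative")
    if area == 0.0:
        return math.inf                   # closed glottis treated as an open circuit
    squared_area = area * area            # squaring network
    rectified_flow = abs(flow)            # full wave rectifier
    return k * rectified_flow / squared_area   # divider output


control_value = resistor_control(2.0, -3.0, k=4.0)
```

Taking the magnitude of the flow keeps the resistance positive whichever way the air moves, which is presumably why the text routes U_g through a rectifier.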
The above-described arrangements are, of course, merely illustrative of the application of the principles of this invention. Numerous other arrangements may be devised by those skilled in the art without departing from the spirit and scope of the invention. For example, the vocal cords may be represented as a dual mechanical oscillatory system in which each cord is simulated by an individual mechanical oscillator or further, each cord may be simulated by a matrix of interconnected masses and springs which are driven by an appropriate forcing function.
What is claimed is:
1. Apparatus for generating a vocal tract excitation signal which comprises,
means for generating signals representative of an ordered phonemic sequence,
means responsive to said signals representative of said phonemic sequence for generating a control signal representative of sub-glottal pressure,
means responsive to said signals representative of said phonemic sequence for generating a control signal representative of vocal cord tension, and
generating means responsive to said subglottal pressure and vocal cord tension control signals for generating a vocal tract excitation signal.
2. Apparatus for generating a vocal tract excitation signal which comprises, in combination,
means for generating coded signals representative of phonemes in a desired sequence,
first means responsive to said coded signals for deriving a signal representative of subglottal pressure for the duration of said sequence,
second means responsive to said coded signals for deriving a signal representative of vocal cord tension for the duration of said sequence,
function generator means responsive to said pressure and tension signals for generating a signal representative of glottal area, and
controllable impedance means responsive to said area signal for varying a signal applied to a vocal tract synthesizer.
3. Apparatus as defined in claim 2 further including means for delaying said glottal area signal a predetermined interval and wherein said function generator means comprises,
means responsive to said delayed glottal area signal and said signal applied to said vocal tract synthesizer for generating a signal representative of the Bernoulli pressure within the vocal cord orifice,
means responsive to said signal representative of subglottal pressure and to said Bernoulli pressure signal for generating a signal representation of a forcing function F(t),
computing means supplied with said signal representation of forcing function F(t) and with a signal representative of said vocal cord tension for generating a glottal displacement signal x, said computing means being preset in accordance with preestablished vocal cord mass and viscous damping parameters, and
means responsive to said glottal displacement signal x for generating a signal representative of glottal area.
4. Apparatus for generating an excitation signal for energizing a vocal tract synthesizer which comprises, in combination:
means for generating a signal representative of a sequence of phonemic symbols,
first means responsive to said phonemic sequence signal for generating a signal representative of subglottal pressure,
second means responsive to said phonemic sequence signal for generating a signal representative of vocal cord tension,
means responsive to said pressure and tension signals for generating a controlling signal, and
means responsive to said controlling signal for varying signals supplied as excitation to a vocal tract synthesizer.
5. Apparatus as defined in claim 4 wherein said phonemic sequence signal generator comprises:
means for reading said phonemic sequences from a tape, and
means for translating said sequence in coded signal form.
6. Apparatus as defined in claim 4 wherein said first signal generating means comprises:
means for storing information representative of subglottal pressure, and
means for selectively reading out said information in response to said phonemic sequence signal to produce a signal.
7. Apparatus as defined in claim 4 wherein said second generating means comprises:
means for storing information representative of vocal cord tension, and
means for selectively reading out said information in response to said phonemic sequence signal to produce a signal.
8. Apparatus as defined in claim 4 wherein said control signal means comprises,
means responsive to a signal representative of said glottal area and to said excitation signal for generating a signal proportional to the Bernoulli pressure in the vocal cord orifice,
means responsive to said subglottal pressure signal and to said Bernoulli pressure signal for generating a forcing function signal F(t),
computing means for generating a glottal displacement signal x in response to said forcing function and said vocal cord tension, and
means for generating a glottal area signal A_g in response to said glottal displacement signal.
9. Apparatus as defined in claim 4 wherein said varying means comprises a series circuit including:
first controllable resistive means,
second controllable resistive means, and
controllable inductor means.
10. In combination:
means for generating coded signals representative of an ordered sequence of phonemes,
storage means responsive to said coded signals for generating, for each sequence of phonemes, signals representative of subglottal pressure and vocal cord tension,
means responsive to said subglottal pressure and vocal cord tension signals for generating signals representative of subglottal area corresponding to said sequence of phonemes,
a vocal tract synthesizer, and
controllable impedance means supplied with said subglottal pressure signal and responsive to said glottal area signal for controlling the excitation potential applied to said vocal tract synthesizer, said controllable impedance means including a first variable resistor, a second variable resistor and a variable inductor,
said subglottal area signal generating means including means for delaying said glottal area signal a predetermined interval,
means responsive to said delayed glottal area signal and to said excitation potential for generating a signal proportional to the Bernoulli pressure within the vocal cord orifice,
means responsive to said subglottal pressure signal and said Bernoulli pressure signal for generating a signal representative of a forcing function F(t),
computational means responsive to said forcing function signal F(t) and said vocal cord tension signal T(t) for generating a glottal displacement signal x, said computational means being preset in accordance with preestablished vocal cord mass and viscous damping parameters,
means responsive to said glottal displacement signal x for generating said glottal area signal A_g, and
means for limiting said glottal area signal to predetermined values.
No references cited.
KATHLEEN H. CLAFFY, Primary Examiner R. P. MYERS, Assistant Examiner
US664130A 1967-08-29 1967-08-29 Self-oscillating vocal tract excitation source Expired - Lifetime US3511932A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US66413067A 1967-08-29 1967-08-29

Publications (1)

Publication Number Publication Date
US3511932A true US3511932A (en) 1970-05-12

Family

ID=24664663

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4337375A (en) * 1980-06-12 1982-06-29 Texas Instruments Incorporated Manually controllable data reading apparatus for speech synthesizers
US6470308B1 (en) * 1991-09-20 2002-10-22 Koninklijke Philips Electronics N.V. Human speech processing apparatus for detecting instants of glottal closure
WO1996033487A2 (en) * 1995-04-20 1996-10-24 Philips Electronics N.V. A method for speech synthesis hardware, and speech synthesis apparatus
WO1996033487A3 (en) * 1995-04-20 1996-11-21 Philips Electronics Nv A method for speech synthesis hardware, and speech synthesis apparatus
