WO2000033552A1

WO2000033552A1 - System and method for IP-based communication having speech generated text

Info

Publication number: WO2000033552A1
Application number: PCT/US1999/028215
Authority: WO
Inventors: Farzad Hiri
Original assignee: Ericsson Inc.
Priority date: 1998-11-30
Filing date: 1999-11-29
Publication date: 2000-06-08
Also published as: EP1135921B1; US6490550B1; AU1747200A; DE69923346T2; ES2232188T3; DE69923346D1; EP1135921A1

Abstract

A system and method for IP-based telephone communication utilizing speech-generated text is disclosed. The present invention includes an interface to the Internet for sending and receiving voice signals. In addition, the present invention generates text signals corresponding to voice signals generated by the user. A transmission signal is generated from the text signals and the voice signals for transmission over the Internet. Further, the present invention includes an application which is capable of receiving speech-generated data transmitted by another device, and concurrently displaying the speech-generated data to a user. In this way, a user is capable of easily communicating speech-generated information to another user during periods of voice signal loss.

Description

SYSTEM AND METHOD FOR IP-BASED COMMUNICATION HAVING SPEECH GENERATED TEXT

BACKGROUND OF THE INVENTION

Technical Field of the Invention

The present invention relates, in general, to improved IP-based communication and, particularly, to a system and method for providing speech-generated text to IP-based telephone communication.

Background and Objects of the Invention

There is currently a movement towards enhancing the capabilities of computer networks, such as the Internet, to support traditional telephony operations. The goal is to provide quality voice communications over the packetized Internet. This capability is often referred to as voice over Internet Protocol (VoIP). With the current Internet, for example, VoIP can provide acceptable voice communications at a greatly reduced cost to the user, when compared to traditional telephone tolls.

Recently, IP-based telephone systems have become capable of providing video along with voice communication. Standard personal computer (PC) plug-in hardware, such as Picturephone by 3Com, and various software applications provided for Internet telephony, such as CU-SeeMe software by White Pine and software by Vocaltech and Microsoft, allow transport of both voice and image data across the Internet. In particular, such systems typically include a microphone, a speaker and a PC plug-in sound card for providing audio data to a user's PC, and a video camera and a PC plug-in video capture card for providing video data thereto. Upon the establishment of a connection between two or more PCs over the Internet, the audio and video data generated in one PC is packetized and transported over the Internet for display on the other PC, such as within a browser framework. In this way, PC users may view each other while simultaneously speaking to each other.

Existing IP-based telephone systems having video communication capability typically allow a PC user to communicate text to the other PC user in communication therewith. Text entered via the keyboard of a first PC may be transported with the audio and video data over the Internet and displayed as text in a browser window on another PC in communication with the first PC.

These IP-based telephone systems are not without shortcomings. For instance, VoIP usually provides a markedly reduced quality of service relative to conventional long distance telephone services. Poorer voice quality, intermittent fading, and other interruptions are commonly encountered, especially during international calls. In response to periods of reduced quality of service, callers utilizing IP-based telephone systems resort, in frustration, to communicating with typed text instead of attempting to communicate with voice. To further compound the problem, typed text data is transported at a slower rate than voice data. As a result, it is oftentimes quite difficult to engage in communication when one caller is providing voice data and the other caller is providing typed text data, because the transmissions of voice and typed text are out of synch.

In the context of IP-based telephone systems having video communication capabilities, having to communicate with typed text during periods of reduced quality of voice communication may result in a relatively inexperienced typist being forced to look away from the video display when entering text, thereby reducing the value in being provided a real-time video display of the other caller. As a result, there is a need for an IP-based telephone system which addresses the inherent problems associated with VoIP telephone communication.

It is an object of the present invention to provide an IP-based telephone system which selectively and automatically provides text during periods of substandard voice communication.

Another object of the present invention is to provide an IP-based telephone system which allows a caller to view received video data while concurrently transmitting video and speech-generated data.

SUMMARY OF THE INVENTION

The present invention overcomes the shortcomings in the above-identified systems and satisfies a significant need for an IP-based telephone system having enhanced and easily utilized communication features.

According to a first embodiment of the present invention, there is provided an improved IP-based telephone system. The telephone system includes a combination of hardware, software and/or firmware employed in association with a conventional PC in order to perform VoIP communication. A video capture card communicatively connected to a PC preferably receives video input data from a video source as well as from a VoIP transmission. A sound card communicatively connected to the PC preferably receives audio input from a microphone as well as from a VoIP transmission. A speech recognition device operatively associated with the PC preferably receives the microphone audio data, recognizes speech patterns therein and generates text data representing the recognized speech patterns. The generated text data is included with the microphone-provided audio data and the video camera-generated video data for transport over the Internet to another PC. In this way, a PC user may communicate video, voice and voice-generated text data to another PC user.

The present system preferably further includes an application which receives VoIP video data and speech-generated text data which were transmitted by another PC. In the event the other PC transmits a signal or set of signals having video data, audio/voice data and speech-generated text data, the application preferably displays the video data and produces audible signals from the audio/voice data using a speaker. In addition, the IP application preferably presents the speech-generated text concurrently with the displayed video data. In this way, the PC user is able to read the speech-generated text while viewing the video data during periods when the VoIP audio signal transmission falters.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of the system and method of the present invention may be obtained by reference to the following Detailed Description when taken in conjunction with the accompanying Drawings wherein:

Figure 1 is a functional block diagram of the present invention;

Figure 2 is the resulting display generated by the present invention;

Figure 3 is a flow chart illustrating the transmission of VoIP data according to the present invention; and

Figure 4 is a flow chart illustrating the reception of VoIP data according to the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EXEMPLARY EMBODIMENTS

The present invention will now be described more fully hereinafter with reference to the accompanying drawings in which a preferred embodiment of the invention is shown. This invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiment set forth herein. Rather, the embodiment is provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Referring to Figures 1 and 2, there is shown an improved IP-based telephone system 1 according to the present invention. Telephone system 1 preferably includes hardware, software and/or firmware operatively associated with a PC so as to provide VoIP communication having enhanced features. Telephone system 1 preferably includes an interface 2 for connection to a telephone network 3, such as a public switched telephone network (PSTN). Interface 2 preferably includes a modem (not shown) or other interface circuitry for suitably transmitting signals generated by telephone system 1 and receiving signals transmitted thereto. Interface 2 is utilized for accessing Internet 4 via Internet service provider (ISP) 5. Consequently, interface 2 may include circuitry for converting data, such as voice and video data, into packet switched or packetized signals for transmission over Internet 4.

Telephone system 1 preferably further includes one or more devices for generating voice data signals relating to the voice of the PC user for transport over Internet 4 and for receiving voice data signals therefrom for presentation to the PC user. System 1 may include a sound card 8 having a first port 8A for receiving voice data generated by microphone 9 and circuitry for suitably conditioning the received voice data for Internet transmission. Sound card 8 further includes a second port 8B for transmitting the conditioned voice data signals for transport over Internet 4 and for receiving VoIP signals therefrom. Circuitry for conditioning the received VoIP signals is included in sound card 8 for applying to speaker(s) 10 via sound card port 8C. It is understood that sound card 8 may be a PC plug-in device or another hardware/firmware device associated with a PC.

As stated above, it is desirable to be able to communicate to another caller during periods of voice signal loss. Accordingly, the present invention includes a device for selectively generating text data from a received VoIP signal. In a preferred embodiment of the invention, the text generating device is a speech recognition engine 11. Speech recognition engine 11 preferably is an object which receives from sound card 8 the audio signal generated by microphone 9, processes the audio signal using grammar database object 12 and generates text data corresponding to the processed audio signal. The generated text data is then available for transmission with the audio signal over Internet 4. When the transmitted audio and corresponding text signal is received at the receiving PC, the audio signal and the text data corresponding thereto are presented to the user thereof. By sending both voice and corresponding text to the receiving caller, the sending caller is able to fully and clearly communicate therewith.
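By way of illustration only, the combination of an audio signal and its corresponding text data into a single transmittable payload might be sketched as follows. The disclosure does not specify a wire format; the header layout, the class name `VoiceTextPacket` and its field names are assumptions introduced for this sketch, and the `text` field merely stands in for the output of speech recognition engine 11.

```python
import struct
from dataclasses import dataclass

# Hypothetical header layout (an assumption, not taken from the disclosure):
#   4-byte sequence number | 2-byte text length | 4-byte audio length
# followed by the UTF-8 text and then the raw audio bytes.
_HEADER = struct.Struct("!IHI")


@dataclass(frozen=True)
class VoiceTextPacket:
    seq: int       # position of this chunk in the call
    text: str      # text generated from the audio by the speech engine
    audio: bytes   # conditioned voice data from the sound card

    def pack(self) -> bytes:
        """Serialize voice and its text into one payload for transport."""
        encoded = self.text.encode("utf-8")
        header = _HEADER.pack(self.seq, len(encoded), len(self.audio))
        return header + encoded + self.audio

    @classmethod
    def unpack(cls, payload: bytes) -> "VoiceTextPacket":
        """Recover the voice data and the text at the receiving PC."""
        seq, text_len, audio_len = _HEADER.unpack_from(payload)
        start = _HEADER.size
        text = payload[start:start + text_len].decode("utf-8")
        audio = payload[start + text_len:start + text_len + audio_len]
        return cls(seq, text, audio)
```

Because the text travels in the same payload as the audio it was generated from, a receiver that cannot render the audio intelligibly still holds the matching words.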

As previously mentioned, sound card 8 receives and conditions a VoIP signal transmitted over Internet 4 by another caller, and applies the VoIP signal to speaker 10. In order for a caller to discern text data generated by a speech engine 11 at another PC and transmitted thereby over Internet 4, telephone system 1 preferably includes an application 13 which displays the text data to the PC user. Application 13 preferably displays the text data substantially in concert with the application of its corresponding audio signal to speaker 10. By way of one example, application 13 displays the text data as scrolling text in a window within a browser framework. As a result, the PC user at the receiving PC is able to both hear and see the voice message generated by the transmitting PC. Significantly, in the event the voice data becomes temporarily distorted, the PC user at the receiving PC is still able to discern the voice message by simply reading the text as the text is displayed on the receiving PC.
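The scrolling text presentation described above might, purely as a sketch, be modeled as a fixed-height window that discards its oldest line whenever a new one arrives. The class name `ScrollingTextWindow` and its `width` and `height` parameters are hypothetical; the disclosure does not state how application 13 lays out the text.

```python
import textwrap
from collections import deque


class ScrollingTextWindow:
    """Holds only the most recent lines of speech-generated text."""

    def __init__(self, width: int = 40, height: int = 3) -> None:
        self.width = width
        # A bounded deque drops the oldest line automatically when full,
        # which gives the scrolling behavior.
        self._lines = deque(maxlen=height)

    def append(self, text: str) -> None:
        """Wrap incoming text to the window width and scroll it in."""
        for line in textwrap.wrap(text, self.width):
            self._lines.append(line)

    def render(self) -> str:
        """Return the currently visible lines, oldest first."""
        return "\n".join(self._lines)
```

A caller would invoke `append` each time text data arrives with its audio, then redraw the window from `render`.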

In addition to communicating voice and voice-generated text, telephone system 1 may preferably further include the capability to communicate video between two or more callers over Internet 4. To this end, telephone system 1 may include a video capture card 14 having a first port 14A and corresponding receiver for receiving video data from a video source 7, such as a video camera, and a second port 14B which is coupled to interface 2.

Video data captured by video source 7 is transmitted to video capture card 14 and suitably conditioned for sending over Internet 4. Sound card 8, speech engine 11 and video capture card 14 are preferably synchronized so that voice signals, voice-generated text and video signals are substantially concurrently presented to the user at the receiving PC. Video capture card 14 preferably further includes a receiver coupled to port 14B for receiving video signals transmitted by another PC over Internet 4, circuitry for conditioning and/or extracting video data from the received video signals, and a third port 14C for sending the conditioned video data to application 13. Upon reception by application 13, application 13 preferably concurrently displays to the PC user the conditioned video corresponding to the video signal and images corresponding to the voice-generated text. In a preferred embodiment, the displayed video may be presented in one window 20 and the voice-generated text images may be presented in another window 21 on PC monitor 22, as shown in Figure 2. This arrangement may be in accordance with a browser format generated by application 13. Alternatively, both the displayed video and the voice-generated text images may be presented in the same window on monitor 22.
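One possible way to keep voice, voice-generated text and video aligned for concurrent presentation is to tag each with a shared capture timestamp and release them together. The following is a minimal sketch under that assumption; the disclosure does not describe its synchronization mechanism, and `SyncBuffer`, `timestamp_ms` and the media kind names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MEDIA = ("voice", "text", "video")


@dataclass
class SyncBuffer:
    """Collects media components by capture timestamp (assumed scheme)."""

    pending: Dict[int, Dict[str, bytes]] = field(default_factory=dict)

    def add(self, timestamp_ms: int, kind: str,
            data: bytes) -> Optional[Dict[str, bytes]]:
        """Store one component; return the full set once all have arrived."""
        if kind not in MEDIA:
            raise ValueError(f"unknown media kind: {kind}")
        slot = self.pending.setdefault(timestamp_ms, {})
        slot[kind] = data
        if all(k in slot for k in MEDIA):
            # Voice, text and video for this instant are all present,
            # so they can be presented substantially concurrently.
            return self.pending.pop(timestamp_ms)
        return None
```

A practical receiver would also need a timeout so that text and video are still presented when the voice component is lost, since that is the case the text is meant to cover; the sketch omits it for brevity.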

Further, application 13 may present typed text generated from a sending PC user using the keyboard of the sending PC. The typed text may be presented in a third window 23 on monitor 22.

The present invention may further include a device which receives a text signal from a sending PC and generates audible speech corresponding thereto. The device preferably comprises speech generation circuitry 15 which receives as its input text data, such as typed text, and generates audio signals for application to speaker 10. In this way, a speech-impaired individual may orally communicate to another PC user by typing text into the individual's PC keyboard.

The operation of the present invention in transmitting information to another PC will now be described with reference to Figure 3. In response to the PC user speaking into microphone 9, a voice signal is generated which is received by sound card 8 at step 30. Video camera 7 generates a video signal which is received by video capture card 14 at step 31. The video signal is generated substantially simultaneously with the generated voice signal such that sound card 8 receives the voice signal at substantially the same time as video capture card 14 receives the video signal. Next, speech engine 11 processes the voice data at step 32 to recognize speech patterns therein. Speech engine 11 utilizes grammar database object 12 at step 33 to develop text data corresponding to the recognized speech patterns. Thereafter, sound card 8, video capture card 14 and interface 2 generate an IP-based signal at step 34 using the received voice and video signals as well as the generated text data. Subsequent to its generation, the IP-based signal is transmitted at step 35 for transport over Internet 4 to a receiving PC.

The operation of the present invention in receiving information from another PC will now be described with reference to Figure 4. Initially, telephone system 1 receives an IP-based signal from another PC (the sending PC) at step 40. Next, interface 2, sound card 8 and video capture card 14 extract the voice signal, video signal and speech-generated text data from the received IP-based signal (step 41). Thereafter, a series of three operations are executed substantially simultaneously. The extracted voice signal is applied to speaker 10 at step 42 to generate audible signals based thereon. The video data and the speech-generated text data are conditioned and displayed to the PC user (the user at the receiving PC) at steps 43 and 44, respectively. In a preferred embodiment, the video data is displayed in one browser window and the text data is displayed in a second browser window (Figure 2). Alternatively, the video and text data are displayed in the same browser window. The text data is preferably displayed as scrolling text, but alternatively the text data may be displayed and updated in other forms.
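The three substantially simultaneous operations of steps 42 through 44 might be sketched with a small thread pool, as shown below. This is an illustration under stated assumptions rather than the disclosed implementation: the function `present`, the dictionary keys, and the three callback parameters are hypothetical, and the extraction of step 41 is assumed to have already produced the `signal` dictionary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def present(signal: Dict[str, bytes],
            play_audio: Callable[[bytes], None],
            show_video: Callable[[bytes], None],
            show_text: Callable[[str], None]) -> None:
    """Hand each extracted component to its output at the same time."""
    text = signal["text"].decode("utf-8")
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(play_audio, signal["voice"]),  # step 42: speaker
            pool.submit(show_video, signal["video"]),  # step 43: video window
            pool.submit(show_text, text),              # step 44: text window
        ]
        for future in futures:
            future.result()  # surface any error raised by an output handler
```

Passing the outputs as callbacks keeps the sketch independent of any particular sound or display hardware; a caller would supply functions that drive speaker 10 and the windows on the monitor.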

The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:

1. A method of communicating in a telephone system, said method comprising the steps of receiving first voice data from an audio source, selectively generating a first voice text from said first voice data, and converting said first voice data and said selectively generated first voice text into a first packetized signal; and transmitting said first packetized signal over a packet switched network.

2. The method of claim 1, further comprising the steps of receiving a second packetized signal; extracting a second voice text from said second packetized signal, and displaying said second voice text.

3. The method of claim 2, further comprising the step of generating an audible signal from said second packetized signal during said displaying step.

4. The method of claim 3, wherein said step of generating an audible signal comprises the steps of extracting a second voice data from said second packetized signal; and applying said second voice data to a speaker.

5. The method of claim 2, wherein said second packetized signal is a voice over IP (VoIP) signal.

6. The method of claim 2, wherein said displaying step comprises the step of displaying said second voice text on a monitor.

7. The method of claim 2, further comprising the steps of extracting a second video data from said second packetized signal, and displaying said second video data with said second voice text during said step of displaying.

8. The method of claim 2, further comprising the steps of generating a second voice data from said second voice text, and applying said second voice data to a speaker during said step of displaying said second voice text.

9. The method of claim 1, wherein the step of selectively generating said first voice text further comprises the steps of processing the first voice data, applying a word list to the processed first voice data, and creating said first voice text responsive to the step of applying said word list.

10. The method of claim 1, wherein said step of generating said first voice text further comprises the step of recognizing one or more speech patterns within said first voice data.

11. The method of claim 1, further comprising the step of receiving a first video data from a video source, said converting step converting said first voice data, said first video data and said first voice text into said first packetized signal.

12. A telephone system, comprising a first receiver for receiving a first voice signal from an audio source, a speech recognition device, in communication with said first receiver, for selectively generating a first voice text signal based upon said first voice signal, an interface, coupled to said speech recognition device, for converting said first voice signal and said selectively generated first voice text signal into a first packetized signal; and a transmitter, in communication with said interface, for transmitting said first packetized signal over a packet switched network.

13. The telephone system of claim 12, further comprising: a second receiver for receiving a second packetized signal; and circuitry for extracting a second voice text signal from said second packetized signal for display on a monitor.

14. The telephone system of claim 13, wherein said circuitry extracts an audio signal from said second packetized signal for application to a speaker.

15. The telephone system of claim 13, wherein said second packetized signal is a voice over IP (VoIP) signal.

16. The telephone system of claim 13, wherein said circuitry extracts a video signal from said second packetized signal.

17. The telephone system of claim 13, further comprising: a text application for presenting text associated with said second voice text signal on said monitor.

18. The telephone system of claim 17, wherein: said circuitry extracts a video signal from said second packetized signal; and said text application presents video associated with said extracted video signal on said monitor concurrently with said associated text presented thereon.

19. The telephone system of claim 12, wherein said speech recognition device comprises a speech engine and a language object accessible thereby.

20. The telephone system of claim 12, further comprising: a second receiver for receiving a second packetized signal; circuitry for extracting a second voice text signal from said second packetized signal; a speech generation device for generating a speech signal from said second voice text signal; and a speaker operatively coupled to said speech generation device for producing an audible signal from said speech signal.

21. An IP-based telephone system, comprising: a first receiver for receiving an audio signal from an audio source; first circuitry for generating a first packet switched signal from said audio signal received by said first receiver; a transmitter for transmitting said first packet switched signal; a second receiver for receiving a second packet switched signal having an audio text signal therein; and a text application for displaying a plurality of text images corresponding to said audio text signal.

22. The IP-based telephone system of claim 21, wherein said second packet switched signal includes a second audio signal, said IP-based telephone system further comprising a speaker and circuitry for transmitting said second audio signal thereto.

23. The IP-based telephone system of claim 21, wherein said second packet switched signal includes video data, said text application concurrently displaying said video data and said audio text signal to a system user.

24. The IP-based telephone system of claim 21, further comprising a speech recognition device, in communication with said first receiver, for generating text data from said audio signal, said first packet switched signal comprising said audio signal and said text data.

25. The IP-based telephone system of claim 21, further comprising a third receiver, in communication with said first circuitry, for receiving video data from a video device, said first packet switched signal comprising said audio signal and said video data generated by the video device.

26. A method for communicating over the Internet, said method comprising the steps of receiving an audio signal from an audio source, generating a first IP-based signal from said received audio signal, transmitting the first IP-based signal, receiving a second IP-based signal having an audio text signal therein, and displaying text images corresponding to said audio text signal.

27. The method of claim 26, wherein said second IP-based signal further comprises a second audio signal, said method further comprising the step of transmitting said second audio signal to a speaker.

28. The method of claim 26, wherein said second IP-based signal includes video data, and said method further comprises the step of displaying said video data and said audio text signal corresponding to said second IP-based signal.

29. The method of claim 26, further comprising the step of: generating text data based upon said audio signal received from said audio source, said first IP-based signal being based upon said received audio signal and said generated text data.