WO2000018100A2 - Interactive voice dialog application platform and methods for using the same - Google Patents


Info

Publication number
WO2000018100A2
Authority
WO
WIPO (PCT)
Prior art keywords
list
voice command
user
voice
responses
Prior art date
Application number
PCT/US1999/022145
Other languages
French (fr)
Other versions
WO2000018100A3 (en)
WO2000018100A9 (en)
Inventor
William D. Livingston
Peter John Dingus
John D. Miller
Greg Dupertuis
Original Assignee
Crossmedia Networks Corporation
Priority date
Filing date
Publication date
Application filed by Crossmedia Networks Corporation
Priority to AU64997/99A
Publication of WO2000018100A2
Publication of WO2000018100A3
Publication of WO2000018100A9

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/06 Message adaptation to terminal or network requirements
    • H04L51/066 Format adaptation, e.g. format conversion or compression
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/42204 Arrangements at the exchange for service or number selection by voice
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/53 Centralised arrangements for recording incoming messages, i.e. mailbox systems
    • H04M3/5307 Centralised arrangements for recording incoming messages, i.e. mailbox systems for recording messages comprising any combination of audio and non-audio components
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/53 Centralised arrangements for recording incoming messages, i.e. mailbox systems
    • H04M3/533 Voice mail systems
    • H04M3/53333 Message receiving aspects
    • H04M3/53341 Message reply
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/224 Monitoring or handling of messages providing notification on incoming messages, e.g. pushed notifications of received messages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M2203/45 Aspects of automatic or semi-automatic exchanges related to voicemail messaging
    • H04M2203/4509 Unified messaging with single point of access to voicemail and other mail or messaging systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M2203/45 Aspects of automatic or semi-automatic exchanges related to voicemail messaging
    • H04M2203/4536 Voicemail combined with text-based messaging

Definitions

  • the present application includes material which is subject to copyright protection.
  • the copyright owner of the material in the present application has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office files or records, but otherwise reserves all copyrights whatsoever.
  • the present invention relates generally to a user interface platform which provides interactive voice controlled user access to a telephony or other computer-based application.
  • a specific application of the platform provides dial-in telephone access to a user's electronic mail, with advanced operation in response to voice commands.
  • Voice mail systems collect and deliver voice telephone messages.
  • e-mail systems generally receive text mail messages and deliver them to the intended recipient. In a business context, these systems may be readily accessible from the user's own office via telephone or computer, respectively. Access to incoming messages from other locations, such as while the user is traveling, may be more difficult.
  • Remote retrieval of e-mail has been particularly difficult since retrieval typically requires access to a computer connected to the same network as the user's mail server.
  • a typical solution to this problem has been to carry a portable computer when leaving the office, find a telephone jack at a remote location, and dial into the office network or the Internet to retrieve mail.
  • Another solution is to use an e-mail account which may be accessed using a web browser, and to find someone at the remote location who will give the user access to an Internet-connected browser on equipment at that remote location.
  • while DTMF tones may be useful for simple commands (e.g., play message, delete message, and the like), they do not readily allow a user to enter more complex commands (e.g., "forward message to John Smith").
  • a traditional voice recognition system, in order to interpret such commands, may require recognition of a complete dictionary of spoken words, and then a command interpreter to interpret the converted text. If a word is misunderstood or not recognized, the complex command may not be executed.
  • such systems generally work best with high quality audio signals (e.g., high end sound card, microphone, quiet office environment) rather than a noisy, limited bandwidth (e.g., POTS) signal generated from a noisy environment (e.g., pay telephone).
  • a system based on interaction with user-generated voice commands can be awkward and difficult to use.
  • a user may not have the patience to hear extended lists of menu items in order to respond to each message or decision.
  • any pauses or extended delays in processing voice data from a user may cause frustration or mis-communication on the part of the user.
  • if a voice command entered by a user is not understood or is incorrect, a method of quickly telling the user that a mistake has been made may be required.
  • the present invention provides a system and method for allowing text-based e-mail messages (as well as sound files such as .WAV files) to be played to a user over an ordinary telephone.
  • the invention allows the user to respond to such messages with voice commands to generate reply messages or reply sound files, as well as to input other e-mail commands (e.g., forward, delete, save, and the like).
  • the system and method of the present invention provides such voice input by determining, in advance, possible voice commands or responses which may be generated in response to a played message or input prompt.
  • Voice input signals are then compared to this limited list of possible responses (or "grammar") and the system generates a list of guesses as to which response has been spoken. Confidence levels are assigned to these guesses based upon the relative match between the actual response and the possible expected response. If the confidence level is above a dynamic threshold value, a match between the spoken response and the corresponding possible response is determined.
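  • The following C++ sketch of this constrained-grammar decision is illustrative only and is not part of the original disclosure; the names (Guess, ResolvePhrase, the threshold parameter) are invented:

    #include <string>
    #include <vector>

    struct Guess {
        std::string phrase;      // candidate phrase from the limited response list
        double      confidence;  // relative match score between utterance and phrase
    };

    // Return the accepted phrase, or an empty string to signal a mis-recognition.
    std::string ResolvePhrase(const std::vector<Guess>& guesses,
                              double dynamic_threshold) {
        const Guess* best = nullptr;
        for (const Guess& g : guesses)           // pick the highest-confidence guess
            if (best == nullptr || g.confidence > best->confidence)
                best = &g;
        if (best != nullptr && best->confidence >= dynamic_threshold)
            return best->phrase;                 // confidence clears the dynamic threshold
        return std::string();                    // below threshold: treat as not understood
    }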
  • the system need not compare spoken responses with an entire English language dictionary of words in order to understand the user.
  • the present invention provides for additional responses to be input, including responses which may be unique to each user.
  • a user's e-mail address list may be input as text and then converted into corresponding speech patterns or models using a text-to-speech conversion program. These patterns may be concatenated with speech models for a list of expected command responses, compiled, and loaded into the system.
  • the system may understand a voice command of "forward to John Smith" as two sub-commands; the term "forward", which is one of the possible expected responses to a message (along with "delete", "save", and the like), and the term "John Smith", which may be from the user's e-mail address book and which, as described above, may be converted to speech models and concatenated with the expected responses.
  • the system may dynamically learn, based upon frequency of use by a user, which phrases or commands are used more often. Based upon such usage, the mode of operation of the system may be dynamically tuned to minimize extraneous instructions and prompts. Thus, for example, when a user first uses the system, extensive prompts may be provided (e.g., "to save the message, say 'save' or press ' 1') . Once a user has used that command several times, the prompt may be shortened or deleted entirely.
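  • A hypothetical C++ sketch of such usage-based tuning follows; the class and the prompt wording are illustrative assumptions, not the patent's code:

    #include <map>
    #include <string>

    class PromptTuner {
        std::map<std::string, int> use_count_;  // command -> times the user has issued it
    public:
        void RecordUse(const std::string& cmd) { ++use_count_[cmd]; }

        // Verbose prompt for a new user, shorter after a few uses, none thereafter.
        std::string PromptFor(const std::string& cmd) {
            int n = use_count_[cmd];
            if (n == 0) return "To save the message, say 'save' or press '1'.";
            if (n < 5)  return "Say 'save' or press '1'.";
            return "";                          // prompt deleted entirely for frequent users
        }
    };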
  • the system and method of the present invention also provides a technique whereby replies may be generated to e-mail messages and transmitted to the sender.
  • a user may select one of a number of stored replies which the user may have previously generated.
  • the user may generate a voice reply which may be stored and transmitted as a sound file (e.g., a .WAV file) which an e-mail recipient may play over a computer system.
  • the present invention provides a means of notifying a user that e-mail messages have been received.
  • a user may selectively program the system (or the system may be pre-programmed) to notify a user, via pager or telephone, that a message or messages have been received. Notification may be made for some or all messages. For example, if a high priority message, or a message from a particular sender, is received, the user may be paged or otherwise notified that e-mail has been received.
  • Figure 1 is a block schematic diagram of system architecture in the preferred embodiment.
  • Figure 2 is a block diagram providing an overview of a virtual session provided to the user in the preferred embodiment.
  • Figure 3 is a block schematic diagram illustrating flow of control from Telephony Voice Server 310 to the applications back-end for a single virtual session.
  • Figure 4 is a block schematic diagram illustrating flow of control for a 24-channel (VS) front-end box from Telephony Voice Server 310 to the applications back-end.
  • Figures 5a and 5b are flow charts illustrating flow of control for a Virtual Session voice recognition system.
  • Figure 6 is a flow chart for Virtual Session context switching modules according to the preferred embodiment of the present invention.
  • Figure 7 is a flow chart illustrating a process for a Virtual Session initialization of a new context.
  • Figure 8 is a flow chart for a process of initialization and use of static grammar tokens according to the preferred embodiment.
  • Figure 9 is a flow chart for a Virtual Session spontaneous compile and loading of application-supplied information into current grammars.
  • Figure 10 is a flow chart for a process of adding grammar token definitions to current grammar in a Virtual Session, according to the preferred embodiment of the present invention.
  • Figures 11a and 11b are flow charts illustrating a phrase comparison process between spoken phrase and current Virtual Session grammar.
  • Figures 12a and 12b are flow charts illustrating an overview of a Command Resolver process for script interpretation.
  • Figures 13a and 13b are more detailed flow charts illustrating Command Resolver initialization and function.
  • Figures 14a, 14b, and 14c are flow charts illustrating a process for Virtual Session DTMF/IVR processing according to the preferred embodiment of the present invention.
  • Figure 15 is a state transition diagram for the telephony server process according to the present invention.
  • Figure 16 is a flow diagram illustrating the relationship between the processes in the message polling subsystem of the preferred embodiment of the present invention.
  • Figure 17 is a flow diagram illustrating the relationship between processes in the message receiving subsystem of the preferred embodiment of the present invention.
  • Figure 18 is a flow diagram illustrating the relationship between processes in the message sending subsystem of the preferred embodiment of the present invention.
  • Figure 19 is a system diagram illustrating the e-mail delivery systems of the preferred embodiment of the present invention.
  • Figure 20 is a flowchart illustrating a process of updating a user profile through a web-based interface, according to a preferred embodiment of the present invention.
  • Figure 21 is a dialog topology diagram.
  • Figure 1 is a schematic diagram of a computer network architecture useful for providing access to messages by voice control from a remote location.
  • the invention is preferably implemented through a computer network 100 including a voice interface server 102, database subsystem 104, file server subsystem 106, polling computer 108, mail sending computer 110, and web server 112.
  • Database subsystem 104 incorporates database server 120 and massive database 122, which may be an Oracle™ database.
  • File server subsystem 106 incorporates file server 124 and message storage system 126 which may be a RAID disk array.
  • Voice interface server 102 may be coupled to database subsystem 104 via hub 114, and database subsystem 104 may be coupled to file server subsystem 106, polling computer 108, mail sending and receiving computer 110, and web server 112 through hub 116. Polling computer 108, mail sending and receiving computer 110, and web server 112 may be coupled via hub 118 to Internet 130.
  • the remaining computers and servers may also be coupled through hub 118 to Internet 130 to provide externally accessible network connections used for system administration.
  • Voice interface computer 102 may be coupled via conventional telephone system interface and switching equipment to the Public Switched Telephone Network (PSTN) 101 or to another telephone network (not shown).
  • voice interface computer 102 provides a voice interface to system users through PSTN 101. Through this interface, users may dial in to the system, retrieve messages and take actions based on the messages such as placing telephone calls or replying to the messages.
  • Voice interface computer 102 may run the Telephony Voice Server 310 software program (Figure 3), which connects a user to the system via telephone and processes user requests via speech or IVR under script control.
  • Voice interface computer 102 also runs Application Proxy 330 and Automation Server 350 which contains the application API.
  • Database server 120 runs software programs implementing an e-mail message store and message delivery agent.
  • Polling computer 108 performs a POP3 mail polling function.
  • Mail sending and receiving computer 110 receives forwarded electronic mail for storage and delivery to users and forwards and sends e-mail in response to user commands.
  • Web server 112 implements a personal profiling information system which allows users to create and modify a personal profile for system operation from any Internet-connected computer using an industry-standard web browser or from the telephone using specific voice commands.
  • Figure 2 illustrates control flow for a preferred embodiment of the Telephony Server (TS).
  • the TS establishes as many Virtual Sessions as there are telephone lines capable of supporting digital speech.
  • Each Virtual Session (VS) interacts with a user under control of a script which the Command Resolver is currently running.
  • Each Virtual Session has a dedicated voice recognizer, speech synthesizer, and applications interface.
  • Figure 3 is a block diagram illustrating flow control from Telephony Voice Server 310 to the applications back-end for a single virtual session.
  • PSTN 101 may be coupled to Telephony Voice Server 310 which may be running on voice interface server 102 of Figure 1.
  • the system implements an application interface which enables Telephony Voice Server 310 to have a network as its point of integration with an application.
  • Communications conduits connecting Telephony Voice Server 310 may be Local Named Pipe 320, for example, under TCP/IP.
  • the communications interface may be implemented on the same machine as Telephony Voice Server 310 but run as a separate multi-threaded NT Applications Proxy 330 (the NT service).
  • Figure 4 is a flow chart illustrating the flow of control for a 24-channel (VS) front-end box from Telephony Voice Server 310 to the applications back-end.
  • the Interactive Telephony Dialog Interface of the present invention presents a user with a flexible voice dialog system, accessible over the telephone, which allows a user to navigate and retrieve information by voice phrases and voiced connected digits, as well as by DTMF keypad strokes.
  • the system may be configured to operate with more than one particular application, such as e-mail.
  • the interaction between a user and the system may be completely scripted using a script interpreter and an easy-to-use language specified as part of the present invention.
  • the system provides a user with a unique person/machine dialog-based interface on half-duplex (one-at-a-time conversation) or full-duplex (the system can be interrupted) telephone connections.
  • the rhythm of conversation between the user and system is maintained by a tight coupling between speech elements (Speech Recognizer, Text To Speech, and Digital File Playback and Record) and the Command Resolver which implements an Object Oriented State Machine.
  • the service provider can quickly reconfigure the system for new applications or interactions (i.e., greetings, on-line help, application process, and the like).
  • This functionality is implemented in a unique multi-threaded Virtual Session architecture which allows multiple users to simultaneously have independent dialogs with the system.
  • the system allows the user to mirror voice commands using keypad strokes under script control.
  • the system allows mirror image functionality at the complete discretion of the script writer.
  • the system also allows data entry and flow control via DTMF, all internally synchronized by the Virtual Session state machine.
  • the system implements a dynamic context-based hierarchy which allows the user to jump around within the tree structure of an application either under voice or DTMF control .
  • the result is a smaller active command phrase set which allows greater accuracy in noise and quicker response.
  • Tokens may be placed anywhere in a command script and have the properties of script variables as well as dialog enhancements. Tokens may be parsed as regular text string expressions for content, enabling quick phrase-action resolution.
  • the system may implement fully integrated User Interface features: a double tone when commands are not understood, and a dynamic help facility which is context-dependent and script programmable.
  • the system implements an application interface which enables Telephony Voice Server 310 to have a network as its point of integration with the application.
  • the network may be local or it may be Internet 130.
  • the communications conduits connecting Telephony Voice Server 310 may be Named Pipes or Sockets under TCP/IP.
  • the communications interface may be implemented on the same machine as Telephony Voice Server 310 but run as a separate multi-threaded NT service.
  • the present invention may also include a communications proxy (the NT service) and an applications protocol.
  • the system may run on dual Pentium computers under Windows NT 4.0 or higher (a multi-tasking OS with threads and events).
  • the system may use TAPI (the Microsoft telephony API) as a telephony interface and SAPI (the Microsoft speech API) as a speech interface.
  • the speech engines may comprise the AT&T Watson speech engines for Automatic Speech Recognition (ASR) and Text-To-Speech (TTS).
  • the voice interface computer may incorporate various telephony boards.
  • appropriate telephone interface boards include the following: Rhetorex/Octel RDSP 432, Rhetorex/Octel RDSP 24000, Rhetorex/Octel VRS-24, Rhetorex/Octel RTNI-ATI/ASI 24 Trunk, Rhetorex/Octel RTNI-2T1, Natural Microsystems AG-24, Natural Microsystems T, Connect-24, or the like.
  • Telephony Voice Server 310 is a multi-threaded application written completely in C++. It may comprise three fundamental parts: The Dialog Thread, Telephony Monitor 240, and the Virtual Session.
  • the Dialog Thread is the Primary Thread in which the entire Server initializes itself, once launched.
  • the Server may be configured to operate in two modes at startup, depending on how the system administrator wishes the Server to run.
  • the default Server runs as an NT service.
  • the Server runs on the NT desktop when launched with the
  • the Initialization procedure comprises the following functions (a code sketch follows this list):
  • a) Initialize the system log used to store statistical usage and runtime data.
  • b) Determine the number of available telephone lines having the required Media Modes and telephone control sets.
  • c) Based on the number of available telephone lines, create a Virtual Session Data Storage Class to accommodate thread-safe session data for each Virtual Session.
  • d) Launch a Virtual Session to service each system-usable telephone line.
  • e) Launch a Telephony Monitor 240 Thread to capture and dispatch Telephone line control messages to the Virtual Session Threads.
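  • A skeletal C++ illustration of this startup sequence, assuming std::thread in place of the NT threading calls; every helper name here is invented for the sketch:

    #include <thread>
    #include <vector>

    struct SessionData { /* telephone info, state-machine flags, call statistics */ };

    static void VirtualSessionThread(int line_id, SessionData* sd) {
        // (d) services one telephone line: ASR, TTS, script interpretation
    }
    static void TelephonyMonitorThread() {
        // (e) catches TAPI call control messages and dispatches them to sessions
    }

    int main() {
        // (a) initialize the system log (omitted in this sketch)
        const int lines = 24;                          // (b) usable telephone lines found
        std::vector<SessionData> store(lines);         // (c) thread-safe per-session data
        std::vector<std::thread> sessions;
        for (int id = 0; id < lines; ++id)             // (d) one Virtual Session per line
            sessions.emplace_back(VirtualSessionThread, id, &store[id]);
        std::thread monitor(TelephonyMonitorThread);   // (e) the Telephony Monitor thread
        monitor.join();
        for (auto& t : sessions) t.join();
    }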
  • Telephony Monitor 240 catches and dispatches messages associated with telephone control for each individual Virtual Session.
  • Telephone messages which the system monitors may comprise TAPI call control messages:
  • Telephony Monitor 240 catches call control messages using Event Wait States.
  • the Telephone Service Provider/Driver is configured to alert the application through NT Kernel Object Events.
  • Telephony Monitor 240 is not attached to the Primary thread of the Server, thus freeing it from blocking if the Primary Thread is processing windows messages while communicating with the user (Primary Thread contains the UI to the system administrator) .
  • the Service Provider Driver, in this case the TAPI Service Provider (TSP), generates these call control messages; the system catches them in a Notification Event, decodes the message type as given above, then sends the corresponding Virtual Session a windows message.
  • the types of messages usually processed by Telephony Monitor 240 and dispatched to the appropriate Virtual Session are Connect, Disconnect, and DTMF. Call control handshakes including lineOffering and lineAnswer may be processed in Telephony Monitor 240. Only after a call has been established does Telephony Monitor 240 alert, via a Windows Message, the Virtual Session servicing that particular line.
  • More advanced call control functions such as outbound dialing and drop and insert functions, used to conference calls together, may be supported by the Virtual Session in a Telephony Class, Ccall. Therefore, in response to user commands via text-to-speech, the Virtual Session servicing the user may initiate telephony events on the line.
  • the response of the telephony interface, by means of handshake messages, is processed by Telephony Monitor 240.
  • Telephony handshakes in the TAPI model always include lineReply and lineCallState messages which are caught by Telephony Monitor 240.
  • the Virtual Session is the top level thread which handles all interactions with the user. There may be as many Virtual Sessions as active telephone lines.
  • the Virtual Session may be indexed and identified by a lineID, which may correspond to the Voice Processing device associated with a telephone line.
  • In order for an application to transfer digital speech to a physical telephony device, there must be a Voice Processing system in place which performs the following functions:
  • a) Provide half-duplex input/output ports for each telephone line with associated Codec compression modes required by the ASR/TTS/wavefile components. Formats which may be supported include mu-law and 128 kbps PCM, 16-bit, little-endian digital format.
  • b) Provide a full-duplex input/output port for each telephone line with associated Codec compression as above and echo-cancel.
  • c) Provide an interface to switch voice ports on and off and provide switching capabilities for outbound calling and data stream switching.
  • Each Virtual Session has thread-safe session data, which may contain: a) Telephone Information; b) Multi-Media device information; c) Virtual Session State Machine flags; d) Virtual Session Data Store; and e) Call Statistics information.
  • Each Virtual Session:
  • a) Creates an associated ASR engine via an ASREngineObject class;
  • b) Creates an associated TTS engine proxy via the TTSEngineObject class;
  • c) Creates a SubWorker communications thread which processes communications events from a remote TTS daughter process via bi-directional message pipes;
  • d) Creates a RunScript Thread which processes NT events, executing the CmdResolver to correlate speech-to-text user command phrases with associated actions embodied in the dialog scripts the session is currently running;
  • e) Creates an associated hidden window and message pump which provides the Virtual Session with the ability to process windows messages; and
  • f) Sets up a bi-directional message mode pipe which serves as a communication channel from the Server to an e-mail (or any other) applications Proxy.
  • the voice dialog system is context-based.
  • a context is defined as the set of phrases the system is configured to currently understand. All contexts available to the system at initialization time are dependent on initialization files.
  • the initialization file may contain the following scripts:
  • Each script may have a set of associated grammars and IVR maps which may be correlated to Exchanges which the user may have with the system.
  • the structure of command grammar, IVR map, and the associated Exchange is the following:
  • When phrase1, phrase2, !3, phrase4
    {
        ...
    }
  • the "when” line denotes the set of command phrases
  • the "!3" indicates that key pad 3 is associated with this Exchange
  • the Exchange itself is contained between the outermost curly brackets.
  • the Exchange correlates the command phrase and IVR map to the actions which the system will take if it decodes one of the phrases or the appropriate DTMF tone.
  • the contexts available to the system are stored in the array of context classes:
  • the context object above contains all information necessary for the Virtual Session to conduct the scripted exchanges with the user.
  • the command resolver uses the context object to correlate the request with the appropriate action.
  • Each script context has an associated context object.
  • the system has been configured so that IVR Mapping, i.e., telephone keypad keys, are correlated to Exchanges in a Context in exactly the same way that phrases are correlated.
  • the only difference is that the origin of the user request via voice is through the Voice/Dialog system and the IVR is via the DTMF interface, which will be explained in more detail below.
  • the source of speech to text is ASR engine 510, a commercially available speech-to-text recognition system.
  • ASR engine 510 may be normalized into a standard interface for the system, which may be notification driven.
  • the notification system is modeled to be consistent with SAPI, which is the Microsoft speech standard.
  • Phrase Finish 512 is a function which is called when ASR engine 510 has a result to test.
  • Phrase Start 548 is a function which is called when ASR engine 510 begins to process a digital stream to try to correlate sounds with phrases in its active context.
  • Engine Idle 554 is a function which is called when ASR engine 510 has processed all of the digital information in the AudioSource buffers and begins to wait for new information to come in.
  • Barge-in 560 is a function which is called when ASR engine 510 encounters a barge-in token in a grammar phrase and has decoded the words to the left of that token.
  • Phrase Start 548 sets Phrase Timer 550.
  • Phrase Timer 550 marks the current time and notifies the Virtual Session via a Windows message when the time is up. Meanwhile, as illustrated in Figures 5a-5c, when ASR engine 510 reaches a result or when Phrase Timer 550 goes off, Phrase Finish 512 function is called. Phrase Finish 512 kills Phrase Timer 550 in step 514, and stops loading data in step 516, since the present recognition has been made.
  • Phrase Timer 550 is programmable via scripts and serves to speed resolution of recognition in noise. The programmable parameter in Phrase Timer 550, $Phrase_Time, is the time to wait before notifying the Virtual Session that Phrase Finish 512 should be called.
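  • A rough C++ analogue of this timer interplay, using a detached standard thread where the actual system uses Windows messages; all names here are illustrative assumptions:

    #include <atomic>
    #include <chrono>
    #include <memory>
    #include <thread>

    class PhraseTimer {
        std::shared_ptr<std::atomic<bool>> armed_ =
            std::make_shared<std::atomic<bool>>(false);
    public:
        // Armed at Phrase Start; fires after $Phrase_Time milliseconds unless killed.
        void Arm(int phrase_time_ms, void (*force_phrase_finish)()) {
            auto armed = armed_;
            *armed = true;
            std::thread([armed, phrase_time_ms, force_phrase_finish] {
                std::this_thread::sleep_for(std::chrono::milliseconds(phrase_time_ms));
                if (*armed) force_phrase_finish();  // time is up: notify the Virtual Session
            }).detach();
        }
        void Kill() { *armed_ = false; }            // a recognition result arrived first
    };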
  • Phrase Finish 512 uses several state flags to resolve its decision tree. They are:
  • In Phrase Finish 512, a determination is made whether a valid recognition exists in step 518 by checking to see that the return structures of ASR engine 510 have a valid phrase (i.e., if a DTMF tone were heard, or some non-white noise, ASR engine 510 might attempt a recognition). Failure would be flagged by not presenting the application with a resulting phrase.
  • the system first checks to see whether the noise flag, m_phid, is set in step 532. If it is not set, the system sets it to True in step 534. If it is already set, then in step 536, the system flushes the AudioSource and resets ASR engine 510 environment tracking, then resets the flag to False indicating that the system has attempted to purge the noisy buffers. This noise flag checking step helps prevent noise from corrupting subsequent attempts at valid recognitions.
  • Phrase Finish 512 distinguishes DTMF tones from bad phrases or noise in the following way:
  • the system alerts the user that a mis-recognition occurred by playing a double tone (plink) in step 542. If the system plinks, the system sets the Virtual Session flag "playingbeep", a boolean flag, to True in step 542 to prevent collisions between the TTS and plink. Processing in the loop ends at step 546.
  • Toggle-On is a method associated with the Engine Class, Sreng and will turn the AudioSource on if there are no state conflicts. Toggle-On and Toggle-Off are discussed in more detail below.
  • If the system has a valid result phrase, as determined in step 518, it calls Toggle-Off in step 520 to prevent ASR engine 510 from interrupting the present processing, and then obtains from ASR engine 510 the confidence score for the best phrase in step 522 (ASR engine 510 may have several guesses at the phrase based on its confidence). If the confidence score is below the confidence threshold, as determined in step 524, processing passes to routine 526.
  • Routine 526 is illustrated in more detail in Figure 5b.
  • the system first checks to see whether the noise flag, m_phid, is set in step 566. If it is not set, the system sets it to True in step 570. If it is already set, then in step 568, the system flushes the AudioSource and resets ASR engine 510 environment tracking, then resets the noise flag to False in step 572, indicating that the system has attempted to purge the noisy buffers.
  • the system then checks to see whether the noise was a DTMF tone in step 574 in a similar manner to step 538. If the noise was not DTMF, the system alerts the user that a mis-recognition occurred by playing a double tone (plink) in step 576. If the system plinks, the system sets the Virtual Session flag "playingbeep", a boolean flag, to True in step 576 to prevent collisions between the TTS and plink. Processing in the loop ends at step 580.
  • If, in step 574, the system determines that it has paused for DTMF, it assumes the completion of non-terminated DTMF, resets the flag, and restarts ASR engine 510 if Toggle-On state flags permit in step 578. Processing in the loop ends at step 580.
  • If the confidence level is greater than the threshold value in step 524, processing passes to the command resolver in step 528 and the loop ends at step 530.
  • the present invention also encompasses a system for tracking the density of mis-recognitions which, based on a noise density range of 0 to 1, readjusts the settings of ASR engine 510.
  • Noise density may be calculated as follows. Methods of the class Ftime count the number of mis-recognitions over total recognition attempts. Mis-recognitions and recognition attempts are counted in:
  • Noise-Floor: A noise cut made on the input signal.
  • the noise cut may be between 0 and -50 dbm.
  • the adjustment range is between -15 dbm and -35 dbm.
  • the default setting is 75, indicating a larger than 50% use of non-VQ models. In noise the setting is changed to 100, indicating no VQ models should be used.
  • Calculation of the noise density is done in PhraseFinish after each mis-recognition and in PhraseFinish after valid recognitions above threshold.
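  • A hypothetical C++ sketch of this density tracking and the resulting engine adjustments; the class and the linear noise-floor interpolation are assumptions consistent with the ranges quoted above:

    class NoiseTracker {
        int attempts_ = 0;  // total recognition attempts
        int misses_   = 0;  // mis-recognitions among them
    public:
        void RecordAttempt(bool recognized) {
            ++attempts_;
            if (!recognized) ++misses_;
        }
        double Density() const {          // 0.0 (clean) .. 1.0 (all noise)
            return attempts_ ? static_cast<double>(misses_) / attempts_ : 0.0;
        }
        int NoiseFloorDbm() const {       // slide within the -15 dbm .. -35 dbm range
            return static_cast<int>(-15.0 - 20.0 * Density());
        }
        int VqSetting() const {           // default 75; 100 (no VQ models) in noise
            return Density() > 0.5 ? 100 : 75;
        }
    };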
  • If grammar is not activated in the ASR (tracked by the grammar_activated_ flag), Toggle-On checks that none of the following states are set:
  • Each script, designated by a file with a .scp postfix contained in the Session.ini file, defines a different context according to the Context Data Structures implemented in the present invention, as described previously.
  • the Virtual Session is designed as a context-based system in order to limit the number of phrases active in the recognizer at any given time, thus enhancing recognition accuracy and speed of the system.
  • a virtual session may switch context, i.e., change which .scp file is in scope, in two ways:
  • Upon initialization, the Virtual Session must start in a predetermined context (i.e., the login context), which may be controlled by the telephony system. Telephony Monitor 240 notifies the Virtual Session via LINECALLSTATE Connected or Disconnected whenever the line serviced by the Virtual Session becomes active (caller calls the system) or becomes inactive (caller hangs up).
  • the Virtual Session executes the GetReadyForNewSession() method to reinitialize the context of the system to the login script whenever Telephony Monitor 240 notifies the Virtual Session that a new call has been connected on its line.
  • the user may issue a command in some context which directs the system to go to another context. For example, a script might contain the following exchange: When send reply, reply to message { ... }
  • From step 614, processing passes to step 618 (through block 616 as illustrated in Figure 6), where the event is caught by the RunScript thread.
  • the RunScript thread is launched at Virtual Session initialization time and runs in parallel with the Virtual Session, as illustrated in Figure 2. If RunScript Proc 618 determines that a "When" Exchange has occurred in step 620, the CmdResolver method ExCmd is called in step 624 with the Rec_Phrase. If a "When" Exchange has not occurred, RunScript Proc 618 looks for another event in step 622. Method ExCmd 624 determines an index of the exchange via the Rec_Phrase as specified above with reference to the Determination of Exchange Through Recognized Phrase.
  • the Command Resolver then calls the method CmdLoop.
  • CmdLoop (Command Loop) determines how to execute the command in accordance with the operation of the Command Resolver as described herein. If this is a simple command (i.e., not a compound nested command), CmdLoop will call the Resolver method HandleAction in step 628 for each Action in the Exchange. All Actions in an Exchange are members of a linked list. If HandleAction 628 determines the command is a Load in step 630, it captures its argument, which is the name of the script to be loaded (the new context). If HandleAction 628 determines that the command is not a Load in step 630, HandleAction 628 looks for another action in step 632.
  • In step 634, the script index is stored in the Command Resolver member "current_context_", and the script name is stored in SD "script_name" in step 636.
  • the RunScript thread then sends the Event Hevntscriptmain in step 638, and processing of this stage ends at step 640. From step 638, Event Hevntscriptmain is caught by RunScript Proc in step 710 of Figure 7. RunScript Proc will then decode Event Hevntscriptmain in step 712 and call the Command Resolver method InitNewContext in step 716. If the Event is not HevntScriptMain in step 712, RunScript Proc will look for another event in step 714.
  • InitNewContext then stops ASR engine 510 in step 710.
  • InitNewContext calls ASRLoad in step 722 with the new script index, current_context, then loads the current Exchange pointer, Pexch, with the address of the Main Exchange for the new context (script) in step 724.
  • the pointer to the Main Exchange is found in the Context Class via the member "Pexch commands" as given above in the section relating to Context Data Structures.
  • the Main Exchange is explained above in the section relating to the MultiServer Scripting Language.
  • the Main Exchange is the default Exchange which is executed whenever a new context is entered.
  • InitNewContext then calls CmdLoop in step 726 which processes each of the Actions in the Main Exchange of the new script.
  • Since "flow control" in the script interpreter permits other Actions to occur while the TTS is still speaking, a WaitForTTSStopTalking is issued in step 728, since the system might come out of CmdLoop while the TTS is still talking. WaitForTTSStopTalking step 728 will block until the TTS stops; at this point the Main Exchange will have initialized the new context and the InScript flag is set to False. In step 732, ASR engine 510 is started and processing of this routine ends at step 734.
  • the dialog system uses embedded grammar tokens in command phrases for two reasons: a) As wild cards to append special sub-grammars to script command phrases. For example, in the login script, connected digits may be used as sub-grammars introduced into the command phrases as tokens, as the exact length of a pin code may not be known before entry (pin numbers may be between 7 and 16 digits). Thus, the recursive nature of embedded sub-grammars is an efficient way to introduce variable grammars. b) As a way of introducing spontaneous, application-related data into command phrases.
  • each client of the system has a personal profile on the web server.
  • the profile holds client-specific data such as "Names to forward messages to", "message replies to send", "personal Rolodex", and the like.
  • Grammar Tokens are a mechanism to get data from the outside world into the command phrases (our recognizer is constrained by grammars) .
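  • A small C++ illustration of the idea: a token anchor embedded in a command phrase is expanded against user-profile data, so the active grammar stays small while still covering user-specific names. ExpandToken and the "<names>" anchor syntax are invented for this sketch:

    #include <string>
    #include <vector>

    std::vector<std::string> ExpandToken(const std::string& phrase,
                                         const std::string& token,
                                         const std::vector<std::string>& values) {
        std::vector<std::string> expanded;
        const std::string::size_type pos = phrase.find(token);
        if (pos == std::string::npos) {        // no token anchor: pass phrase through
            expanded.push_back(phrase);
            return expanded;
        }
        for (const std::string& v : values) {  // one grammar phrase per profile entry
            std::string p = phrase;
            p.replace(pos, token.size(), v);
            expanded.push_back(p);
        }
        return expanded;
    }
    // ExpandToken("forward to <names>", "<names>", {"John Smith", "Greg"})
    //   yields {"forward to John Smith", "forward to Greg"}.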
  • Figure 8 illustrates the flow of control which occurs when the system initializes Static Tokens. Static Tokens are always initialized in the Main Exchange of a context as illustrated in step 810. Static Tokens correspond to case (a) above. When the system Loads a new context in step 810 the flow of control is as described previously. If the Script Interpreter finds static tokens in the Main Exchange in step 812, token processing proceeds as illustrated in Figure 8.
  • HandleAction 814 finds a static token definition by parsing a regular expression like the following:
  • Every time a new context is initialized in step 810, the system searches, in steps 816, 820, and 822, the list of tokens it has compiled at initialization time from all the available contexts and fulfills the token rule definition given in the Main Exchange in step 824. From this list, it forms a new grammar rule in step 826 fulfilling the token part of the grammar.
  • This token rule is loaded into ASR engine 510 via a call to the Command Resolver method "AddToGrammar" in step 826, without releasing the main context grammar (the "main" context grammar is defined as the grammar in which the token anchors are embedded) or any other token rules that the Main Exchange might specify.
  • AddToGrammar accomplishes this by posting a windows message in step 828 to the VS.
  • "Our_grammar_token” member containing all information both on the token name and its associated grammar definition, stores all active tokens in order to facilitate matching between spoken phrases and context phrases containing tokens. If a static token definition for the token does not exist, processing passes to step 818.
  • Dynamic token definitions are updated via "our_grammar_token" in the Command Resolver method HandleAction during command resolution processing, as illustrated in step 930 in Figure 9.
  • For the second type of tokens (b), dynamic tokens, the Virtual Session Grammar Object has two template lists of Grammar Token objects:
  • stl::vector<GrammarToken*> grammar_tokens_;
    stl::vector<GrammarToken*> grammar_tokens_free_list_;
  • the first list corresponds to the active list of grammar tokens and the second corresponds to the inactive list .
  • Already existing grammar token objects are not deleted, in order not to fragment memory. Since these tokens are dynamic (i.e., their definitions change constantly), creating and deleting dynamic tokens would be a burden on the OS memory manager.
  • the Grammar Token Object has the following members:
  • the "name” member allows string manipulation on the token name so that in making comparisons to active tokens, the system may determine whether a token is already active with an obsolete definition or should be newly instated (See Figure 10) .
  • the "gram” member is a pointer to the ISGRAMCOMMON (this interface is specified in the Microsoft SAPI specification) interface to the ASR engine which allows the grammar rule corresponding to the token to be Activated, Deactivated, or Released.
  • the flow of control for spontaneous loading of applications-related grammar rules is illustrated in Figure 9.
  • Figure 10 is a flow diagram illustrating the operation of the AddToGrammar and SponLoad methods.
  • In step 1110, when ASR engine 510 produces a result (a spoken utterance), it notifies the Virtual Session via a call to the PhraseFinish member of the ASR Engine class belonging to that particular VS. If the result is greater than threshold (as described previously), the Rec_Phrase is passed to the method FeedFromASR, where the flag "Listening_To_ASR" is set False after deciding to process the result.
  • In step 1112, "Listening_to_ASR" is used by the CR, in FeedFromASR, to determine whether or not to process the result. If the flag "Listening_to_ASR" is set True, processing ends at step 1114.
  • FeedFromASR sends the windows message Hevntscriptwhen which will be caught by the RunScript thread in steps 1118 and 1120, and resolved so as to execute the Resolver method ExCmd in step 1124.
  • In step 1136, the context is examined to determine whether it includes embedded tokens. If the context does not include embedded tokens, the Rec_Phrase is compared to the Context Phrases in step 1142 to find a match.
  • If no match is found in step 1142, processing passes to step 1148.
  • In step 1148, a determination is made whether this is the last phrase. If not, processing passes to the next phrase in step 1150 and processing returns to step 1142. If it is the last phrase, the index is decremented and processing then ends at step 1154.
  • In step 1138, a token count and token names in context phrases are retrieved. Processing then passes to Figure 11b via block 1140.
  • the tokens are expanded in step 1156 according to their definition in the Resolver member our_grammar_token_list_, which is a list of all grammar tokens known in a particular script.
  • the expanded token versions of context phrases containing embedded tokens are then compared to Rec_Phrase in step 1158.
  • If a match is not found in step 1158, a search is made for more context phrases in step 1166. If more phrases are found in step 1168, processing returns to step 1158. If no further context phrases are found, the index is decremented and processing ends at step 1162.
  • a Command Resolver is associated with each VS, and the operation of this Command Resolver in the preferred embodiment will now be described in more detail.
  • the Command Resolver takes a recognized utterance and performs the following actions:
  • Virtual Session creation is governed by system startup, as illustrated by step 1310 in Figure 13a.
  • the Command Resolver initializes a Virtual Session by reading the ".ini" file, DIF (Dot Ini File), associated with the particular VS.
  • DIF's and VS's are correlated by a Master Configuration File (MCF) for Telephony Voice Server 310.
  • Telephony Voice Server 310, monitoring 24 different telephone lines, may run 24 different DIF's which are correlated to the 24 VS's monitoring those lines via the MCF in step 1312.
  • the MCF looks like this:
  • the contents of the DIF were described previously in the description of Context Data Structures.
  • when a Virtual Session initializes, it creates a Command Resolver object and a RunScript thread.
  • the Command Resolver object is initialized by parsing the script files in the DIF associated with it via the MCF.
  • the Command Resolver initialization creates and populates an array of Context objects associated with each script.
  • a map of class objects associated with the Command Resolver is illustrated in Figure 3. For example, if there are 10 script files in a DIF, then an array of 10 context objects is created for that VS. They are indexed via the Command Resolver members: int current_context_;
  • the "m_Pcmdcontext " member is an array of pointers to context objects corresponding to different scripts.
  • NT events are sent by the Virtual Session to the RunScript thread.
  • the Command Resolver runs in the context of the RunScript thread, as "flow control" is imposed on the script interpreter. Flow control means that for certain actions the script interpreter may block. These actions fall into two categories:
  • Actions which involve the TTS talking or wavefiles playing. The system speaking to the user is a sequential operation. At most, the system may be speaking and another speaking action may be queued, but they cannot occur at the same time. Thus, the script interpreter must block (before playing the next speaking action) until the first one has finished.
  • Actions which return data required to proceed. This typically involves interactions with the application. Since Telephony Voice Server 310 is linked to the application (e-mail, for instance) via a network, a finite time is required to receive a response to an application query. During this time the script interpreter must block pending receipt of the required information.
  • BlockTillReady takes an argument which is an enumerated type.
  • the argument tells the function what NT event it should receive so as to stop blocking.
  • the argument may be any of the following:
  • EVT_WAKE_UP: General stop-blocking event.
  • EVT_TTS: Sent when the TTS is finished talking.
  • EVT_DTMF: Sent when DTMF is in. Resets the "safe_to_plink" and "waiting_for_DTMF" flags.
  • EVT_ASR: Sent when the ASR has finished loading grammar. Resets the ASR_is_loading_grammar flag.
  • SYSTEM_KILL: Sent when the system is coming down.
  • EVT_TTS_ABRT: Sent so as not to block on queued system speech when a queued speaking action has been aborted.
  • BlockTillReady receives an event, checks to see if the event corresponds to the one it was programmed to expect, and stops blocking if it finds a match. Depending on the event received, BlockTillReady also sets Command Resolver state flags to the appropriate state as illustrated above.
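  • A schematic C++ version of BlockTillReady under these rules; WaitForNextEvent is a hypothetical stand-in for the NT kernel event wait:

    enum WaitEvent { EVT_WAKE_UP, EVT_TTS, EVT_DTMF, EVT_ASR, SYSTEM_KILL, EVT_TTS_ABRT };

    static bool safe_to_plink = true;
    static bool waiting_for_DTMF = false;
    static bool ASR_is_loading_grammar = false;

    WaitEvent WaitForNextEvent();                 // wraps the NT event wait (assumed helper)

    void BlockTillReady(WaitEvent expected) {
        for (;;) {
            const WaitEvent received = WaitForNextEvent();
            if (received == EVT_DTMF) {           // reset flags per the event table above
                safe_to_plink = false;
                waiting_for_DTMF = false;
            } else if (received == EVT_ASR) {
                ASR_is_loading_grammar = false;
            }
            if (received == expected || received == SYSTEM_KILL)
                return;                           // match found: stop blocking
        }
    }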
  • the basic task of the Command Resolver is to execute the Actions contained in the Exchange correlated to the spoken phrase. As illustrated in Figures 11a and 11b, the Resolver first determines the correct Exchange, then it passes control to CmdLoop. The flow of control for CmdLoop is illustrated in Figures 12a and 12b. CmdLoop 1210 gets the first Action in the Exchange from the Pexch pointer in step 1212.
  • CmdLoop 1210 checks this Action to determine whether it is a simple action or a Loop in step 1214. If the Action is a Loop in step 1216, the Action mnemonic is Loop(Argument).
  • the structure of the Loop Action is as follows:
  • CmdLoop keeps track of the Loop argument and executes each of the instructions within the scope of the Loop as many times as the argument provides in step 1220.
  • CmdLoop determines whether the action is a subaction.
  • a subaction is one which is delineated by a pair of curly brackets and whose initial action was either a Loop or a Conditional.
  • Conditionals are Actions which execute a set of Actions dependent on the result of some condition being True or False.
  • An example of a conditional is as follows:
  • CmdLoop 1210 determines the number of times the loop should be executed by loading the variable nloop in step 1216. Otherwise nloop is set to 1 in step 1218 since the current Action should be executed once.
  • CmdLoop proceeds to execute the Loop until nloop has been decremented (in step 1242) to zero (determined at step 1228), at which point CmdLoop either executes the next action in the outermost scope or the next action is a Null. If the next action is a Null, as determined in step 1232, CmdLoop terminates in step 1240.
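  • The loop logic can be summarized in a short C++ sketch; the Action structure below is an invented simplification of the linked list of Actions described above:

    struct Action {
        bool    is_loop    = false;   // a Loop(argument) action opens a sub-scope
        int     loop_count = 1;       // nloop: repetitions for a Loop action
        Action* body = nullptr;       // first action inside the curly brackets
        Action* next = nullptr;       // next action in this scope; nullptr is the Null
    };

    void HandleAction(Action* a) { /* parse and execute one simple action */ }

    void CmdLoop(Action* first) {
        for (Action* a = first; a != nullptr; a = a->next) {  // a Null terminates CmdLoop
            int nloop = a->is_loop ? a->loop_count : 1;       // simple actions execute once
            while (nloop-- > 0) {
                if (a->is_loop)
                    CmdLoop(a->body);                         // execute the nested sub-scope
                else
                    HandleAction(a);
            }
        }
    }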
  • HandleAction is the Command Resolver method which parses and executes the current Action.
  • Applications commands handled by HandleAction are those which do not require interaction with the applications interface. For example, commands such as Load, WaitForDTMF, Say, and the like do not require interaction with the application. Commands which do, such as NextMessage, PreviousMessage, and the like, are executed by ProcessAppAction.
  • ProcessAppAction communicates with the application via full-duplex, message-mode Named Pipes. There is an applications pipe for each VS. ProcessAppAction parses and formats commands for the applications interface, sends the request, and buffers the response so that it becomes available to the Virtual Session via the Session Data buffer Fdbufout.
  • the flow of control for the Command Resolver is illustrated in Figure 13b.
  • Telephony Voice Server 310 offers 11 programmable keys per context including the * key.
  • the # key is reserved for a TTS or digital file play interrupt. Examples of a keypad key being mapped for IVR functions are:
  • keypad keys may be programmed as an alternate to voice commands in performing various exchanges.
  • In the first example, keypad 3 is programmed as an alternative to the user saying "next message"; in the second example, entering a sequence of digits greater than one digit and terminated by the # sign defaults to the second exchange, for which there is no voice alternative.
  • This instance of IVR mapping is used as a way of entering a string of digits as data, such as a Pin Number.
  • the Command Resolver treats voice commands and IVR maps in the same way. It associates exchanges with DTMF keys in the same way that it associates voiced phrases with an exchange index, as explained above with reference to Context Data Structures.
  • Context of the script. For example: email2.scp, forward.scp ... This denotes being in the context of a script proper; the system is in the "listen state" and all commands associated with that script are available.
  • ReadDTMF is a multi-character input function which maps a DTMF string to the value of a variable. For example:
  • WaitForDTMF allows multiple DTMF entry terminated by the # sign.
  • the system waits for DTMF entry; a timer in this case defines an error condition, i.e., the user has failed to enter DTMF.
  • DTMF digits arrive in the Telephony Monitor 240 thread as a TAPI notification.
  • the system uses a TAPI generated event to signal a response.
  • the following TAPI messages are decoded:
  • the TAPI event interrupts Telephony Monitor 240 and resolves to LineMonitorDigits.
  • the LineMonitorDigits function records the digits in Session Data (SD); then, according to the particular LineID associated with the event, LineMonitorDigits calls the FeedFromDTMF function associated with the Command Resolver created in the context of the Virtual Session associated with the current LineID.
  • Flow of control for FeedFromDTMF is illustrated in Figure 10a.
  • FeedFromDTMF accumulates the digits, as they come in, into the SD variable dtmf_.
  • FeedFromDTMF checks two state flags:
  • the role of FeedFromDTMF is to make a determination, based on state flags, whether the system is in the context of the script or in the context of an Exchange. The biggest difference in the actions of the system, given these two different states, is that for continuous digit entry in the context of a script, a "when" Exchange must be mapped to a "##" as described above.
  • the OOE (Out Of Exchange) flag indicates that the current context does support this mapping.
  • within an Exchange, WaitForDTMF allows multiple digit entry with a terminator.
  • the function DTMFFinished may stuff a script variable with a DTMF value if ReadyForDTMF is set, or set the "Heventscriptdtmf" event, allowing the RunScript thread to catch and decode an IVR map. If the WaitingForDTMF flag is detected, the system calls DTMFDigitsAreAvailable, which prompts BlockTillReady to look for a # sign in the current DTMF string.
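  • The digit-accumulation path might look like the following C++ sketch; the class and the two helper stubs are invented, and only the '#'-terminated case is shown:

    #include <string>

    class DtmfCollector {
        std::string dtmf_;                 // Session Data digit buffer (dtmf_)
    public:
        bool ready_for_dtmf   = false;     // ReadDTMF wants the value in a script variable
        bool waiting_for_dtmf = false;     // WaitForDTMF is blocking on a '#' terminator

        void FeedFromDTMF(char digit) {    // digits accumulate as they come in
            dtmf_ += digit;
            if (digit == '#') DTMFFinished();
        }
    private:
        void DTMFFinished() {
            if (ready_for_dtmf)
                StuffScriptVariable(dtmf_);  // hand the digits to the script variable
            else
                SignalScriptDtmfEvent();     // let the RunScript thread decode an IVR map
            dtmf_.clear();
        }
        void StuffScriptVariable(const std::string&) { /* ... */ }
        void SignalScriptDtmfEvent() { /* ... */ }
    };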
  • the CriticalSection command encapsulates script actions which must be executed irrespective of the present state of the system.
  • the CriticalSection command was designed to ensure that critical code, typically initialization code, is always executed. Parts of Exchanges may be aborted via the actions of the user. For instance, the user may decide to IVR map into another Exchange before the present Exchange has had a chance to initialize all of its critical variables.
  • the CriticalSection command ensures that the system is always in a known state.
  • the GetSay(ptext) command is used to retrieve and speak text strings from the application.
  • GoTo Command Action: allows the system to GoTo an Exchange corresponding to a pre-mapped IVR "When" statement.
  • GetSay(ptext) mail.NextMessage() $remaining mail.
  • Load() Command Action
  • the Load command allows the system to load another script and jump from the present context to the context specified by the argument of the Load function.
  • the scriptname.scp argument must point to a valid script.
  • a valid script is one whose path may be resolved. In the present system all active scripts appear in specific directories given to the system via environment variables. Also, the script must appear in the ".ini" file for the system. This enables the Virtual Session parser to include the script in the system context tree.
  • An example of the Load command is:
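  • a minimal sketch, assuming forward.scp (one of the scripts named in the initialization file later in this document) as the target script:

      Load(forward.scp)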
  • Loop() Command Action
  • the Loop command executes an Exchange the number of times specified in the argument of the Loop command.
  • Loop command executes a sub-Exchange the number of times specified in the argument of the command.
  • An example of the Loop command is as follows:
  • GetSay(ptext) mail.
  • GetSay(ptext) mail.
  • ReadNewMessage($result) $continue False } } }
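  • a minimal reconstruction of such a Loop exchange (the loop count, brace placement, and the pairing of GetSay(ptext) with the mail.ReadNewMessage($result) application call are assumptions pieced together from the fragments above):

      Loop(3)
      {
          GetSay(ptext) mail.ReadNewMessage($result)
          $continue False
      }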
  • the Main Exchange is the default Exchange for a script .
  • Each script may have only one Main Exchange.
  • a script is not required to have a Main Exchange; however, if it has one, the Main Exchange is automatically executed as soon as the script is loaded by the system.
  • Nested Exchanges enable the script writer to execute a sub-Exchange within the scope of an Exchange, given some appropriate command as an anchor.
  • Nested Exchanges may be nested as deeply as the script writer wishes.
  • Anchors for Nested Exchanges may be:
  • anchors may comprise language structures which allow execution of an Exchange based on the outcome of some test.
  • This command is used in order to play a recorded file.
  • PlayWav command may be used to play recorded files.
  • An example of PlayWav is:
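  • a minimal sketch (the file name is an illustrative assumption):

      PlayWav(welcome.wav)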
  • the ReadDTMF command is used to alert the system that logic in the current Exchange requires non-blocking keypad entry. ReadDTMF is equated to a script variable which holds the value of the next keypad entry the user makes.
  • ReadDTMF function is used for any Exchange which requires non-blocking DTMF entry.
  • the advantage of this function is that it works with a script variable; thus all the operations which a script variable allows are available to it. For example:
  • the variable $result is initialized upon keypad entry.
  • the script may then manipulate the result of the keypad entry.
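  • a minimal sketch of non-blocking keypad entry into a script variable (the equate syntax and the spoken confirmation are illustrative assumptions):

      $result ReadDTMF
      Say You pressed $result.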
  • the Say command spools text to the TTS, thus enabling the system to speak whatever text is contained in the argument of the Say command.
  • the text argument above is a text string which the script writer fills in. It may either be a constant string or it may be a string with embedded script variables. For example:
  • $forcast 'bright and sunny'
a) Say the weather will be bright and sunny.
b) Say the weather will be $forcast.
  • Script Variables may be defined by the script writer within the context of a script.
  • Script Variables have a content value and a Boolean value.
  • the "name" of the script variable may be any number of alphanumeric characters up to 20 characters in length.
  • the name of the variable may contain underscores, i.e., $welcome__complete, $old_messages and the like.
  • Each script variable has the scope of script, even if a script variable is defined in the context of an Exchange its meaning is valid throughout the script .
  • Each script variable has two values : a) Content value. The value of a literal string. b) Boolean value. If the variable has been initialized via its content value, its Boolean value is True, if the content value has yet to be initialized its Boolean value is False.
  • Script variables may also be inserted into text strings, i.e.:
  • $forecast 'Todays weather will be' $weather_forcast.
  • $forecast 'Todays weather will be hot and sunny'.
  • Terminate Command Action: the Terminate command may be called in a script when the user has indicated they wish to terminate the current session. It reinitializes the Virtual Session to accept the next phone call.
  • a typical use of the Terminate command is when the user says "goodbye", i.e., the appropriate Exchange is as follows:
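  • a minimal sketch of such an Exchange (the farewell prompt and the exact form of the Terminate call are illustrative assumptions):

      When goodbye
      {
          Say Goodbye, and thank you for calling.
          Terminate()
      }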
  • Token variables are variables which may become grammar rules. Token variables may be manipulated like script variables, i.e., they may be equated to script variables, they may be inserted into a text string, and they have a Boolean value of True or False, depending on their state of initialization.
  • Token variables may be associated with script-dependent grammars or may represent additions to script grammars whose origin is the application. This is a way of incorporating into the script grammar information accessed through the application. The origin of this information may be databases, the network, PIMs, and the like. Token variables may also appear in Main Exchanges as script-specific sub-grammars:
  • GetSay(ptext) mail.SelectNewMessage($<digits>)
  • the Token variable <digits> is used in the grammar to include the user saying any combination of the constant grammar on the "When line" and an arbitrary string of connected digits specified by <digits>.
  • <digits>+ means any number of digits specified by the definition of <digits>.
  • the definition of <digits> is read; one or two or three or four ... or nine.
  • the exclamation in "<!digits>" means that in the help system the system should speak the word "digits" in the command "please read message digit" instead of inserting the definition of the Token.
  • the system inserts the Token variable $<digits> with the string corresponding to numbers which the user has spoken. For example, if the Exchange was executed because the user has said "please read message five one", the Token variable inserted into "Getting new message" is "five one"; thus the system says to the user "Getting new message five one".
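  • pieced together, a sketch of such a Token-bearing Exchange (the When-line wording is an assumption based on the description above):

      When please read message <!digits>
      {
          Say Getting new message $<digits>
          GetSay(ptext) mail.SelectNewMessage($<digits>)
      }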
  • the WaitForDTMF command is used when Exchange logic requires blocking, terminated DTMF entry. Syntax: WaitForDTMF
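  • a minimal sketch of blocking, terminated entry (the PIN prompt is an illustrative assumption):

      Say Please enter your PIN, followed by the pound key.
      WaitForDTMF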
  • WaitForTTS Command Action: the WaitForTTS command is used in order to impose flow control on the execution of the script. Since the execution of the script does not necessarily block during the time the TTS is speaking, the script writer may impose this constraint in certain instances.
  • the first instance of the When statement above associates phrases one through three with the Exchange which follows it.
  • the Exchange is defined by the group of actions following the When line and is delineated by the outermost curly brackets enveloping the actions.
  • the exclamation marks preceding phrases two and three exclude these phrases from the automated help system. All phrases without preceding exclamation marks are included in an automated help system invoked by the function "SayHelpCommands" (see SayHelpCommands).
  • the sequence "#number" in the first When statement denotes an IVR map to the keypad number "number". Numbers preceded by the # sign flag an IVR mapping between the When statement's Exchange and the keypad number following the # sign.
  • the double pound sign, ##, in the second When statement denotes an association between keypad entry of a string of numbers and the Exchange associated with the corresponding When statement. For example, the "When ##
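  • a sketch combining the forms just described (the phrases and the application call are illustrative assumptions): the exclamation mark excludes a phrase from the help system, #1 maps keypad 1 to the Exchange, and ## maps a terminated digit string:

      When read my mail, !list messages, #1
      {
          GetSay(ptext) mail.ReadNewMessage($result)
      }

      When ##
      {
          WaitForDTMF
      }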
  • FIG. 15 is a state transition diagram for the Telephony Server system process according to the preferred embodiment of the invention. Each Virtual Session in the system has voice resources, play/record facilities, and a Command Resolver. The functional interrelation between these elements is illustrated in the Virtual Session system flow diagram. Referring to Figure 15, the state changes, as denoted by numbers in the flow diagram, are defined as follows:
  • Transition 4: Recognition event; check threshold.
  • Transition 5: Recognition above threshold; go to Resolver.
  • Transition 6: Good Exchange index; go to CmdLoop.
  • Transition 7: Exchange processed; go to Listen state.
  • Transition 8: Exchange contains nested exchanges. Process nested exchange.
  • Transition 9: Action parsed in exchange. Handle Action.
  • Transition 10: Action parsed in nested exchange. Handle Action.
  • Transition 11: Action queues the system to speak.
  • Transition 12: Stop speaking notification detected by system.
  • Transition 13: After TTS has completed, go to next action in the exchange.
  • Transition 14: Handle Action has determined that this action requires communication with the application. Actions of the form "application.function" require communication with the application.
  • Transition 15: Action requires addition of new grammar via grammar tokens.
  • Transition 16: TTS has stopped and there are no more actions in the exchange. Check system state.
  • Transition 17: System state permits transition to Listen state.
  • Transition 18: Context switching complete. Context initialized; go to Listen state.
  • Transition 19: Handle Action has found a Load command within the current exchange. System transitions to new context.
  • the electronic mail services provided in the preferred embodiment will now be described with reference to Figure 19.
  • the services provided are based on Internet mail standards.
  • Simple Mail Transport Protocol (SMTP) is used to exchange messages between Internet mail servers.
  • Post Office Protocol 3 (POP3) is utilized by Internet mail clients to retrieve messages.
  • the system implements each protocol, allowing it to receive and/or retrieve Internet e-mail messages for users. Users retrieve messages through a telephone interface.
  • the e-mail system comprises five primary components.
  • Message Polling subsystem 1910 retrieves e-mail messages using POP3.
  • Message Receiving subsystem 1912 receives messages from SMTP servers.
  • Message Delivery subsystem 1914 processes and stores messages in the Mylnbox system 1916.
  • Message Sender subsystem 1918 formats and sends (via SMTP) outgoing replies and forwards.
  • Web Service 1920 provides user personal profile maintenance and system administrative tools.
  • the diagram in Figure 16 illustrates the relationships between the components of the message polling subsystem.
  • the Polling Subsystem actively retrieves messages by establishing POP3 connections to the user's electronic mail system. Available messages are checked against the list of messages retrieved during previous sessions. Those messages identified as new are copied into the system.
  • Polling subsystem 1910 may comprise two components, Account Scheduler 1610 and Message Poller 1612. Generally, processing proceeds as follows: 1. The Poller requests an account from the Account Scheduler.
  • 2. The Scheduler selects an account from the database and returns it to the Poller.
  • 3. The Poller attempts to establish a connection with the user's POP3 server. If successful, the Poller logs in, using credentials provided by the user during sign-up.
  • 4. A list of available messages is retrieved and compared with those known to have been downloaded in a previous session. New messages are downloaded and processed by the Message Delivery Agent.
  • FIG. 17 illustrates the relationships between the components of message receiving subsystem 1912.
  • Message Receiving Subsystem 1912 receives messages sent to a user's account via SMTP server 1710. Messages enter the system through a program called MetaInfo Sendmail 1712, an implementation of the industry-standard SMTP server. Sendmail in turn invokes the Message Receiver's remaining components, the Uagent program 1714 and Message Handler 1716. Generally, processing proceeds as follows:
  • An external SMTP server connects to the sendmail server and transmits a message.
  • Sendmail invokes Uagent, a specific implementation of a local delivery agent, or LDA.
  • the LDA's responsibility is to deliver messages to a local user and indicate to sendmail whether the operation completed with or without errors.
  • Uagent in turn locates a Message Handler instance, reads the message, and hands it off to the Handler for further processing.
  • Message Delivery Agent 1914 processes messages, storing summary information and text-to-speech translations in Oracle database 122. Complete message contents are inserted into a file-system-based message store.
  • Message Delivery Agent 1914 is not a free-standing program, but an object component used by both inbound message processing subsystems. Its functions include:
  • Message Sender 1810 is responsible for the preparation and delivery of user-created reply and forward messages. In rather simple fashion, Message Sender 1810 monitors a queue of outgoing messages. As outgoing messages are discovered, messages are removed from the queue, prepared for delivery by sendmail 1812, and transmitted through SMTP server 1814. Generally, the processing steps are as follows:
  • 1. The Sender monitors the outgoing message queue for new forwards and replies.
  • 2. The message is read, merged with user-specific information, and formatted for delivery.
  • 3. The sendmail server is contacted for actual message delivery.
  • the Java Web Server, while not directly involved with the receipt, processing, or delivery of messages, hosts several critical interfaces. The overwhelming majority of these interfaces are implemented with the Java Servlet API.
  • End-user functionality includes registration, POP3 account configuration, exclude and priority filters, predefined responses, and the personal directory.
  • Administrative interfaces include usage reports, corporate account management, server configuration, and service monitoring and control.
  • Figure 20 illustrates a process for creating or maintaining a user profile using a web-based interface.
  • the user accesses the server using an industry standard web browser from any Internet-connected computer.
  • the user identifies his account and enters a passcode to obtain access to his individual profile, as illustrated in Block 2004.
  • the user may then, as illustrated in Block 2006, enter personal directory information.
  • This information may include at least the first name, last name, and e-mail address of persons to whom e-mail messages may be regularly forwarded. If the name entered in the personal directory is difficult to pronounce, it is useful to spell the name phonetically or to use a nickname instead of a first and last name.
  • the personal directory may include other information such as telephone numbers.
  • the user may also, as illustrated in Block 2008, create and edit personalized, pre-set standard reply messages. Any number of these messages may be created and they may be updated at will.
  • the information entered includes a reply message name by which the reply message will be specified in the voice control mode.
  • a personalized message is entered. For example, the reply message name "Thanks" might be associated with the message "Thanks for the e-mail, I heard it while driving home and will get back to you."
  • a message priority list may also be created in the user profile, as illustrated in Block 2010.
  • the user may enter any of the following in corresponding data fields: sender name, sender e-mail address, sender domain, subject line text keywords, and message body text keywords. In operation, if any of these fields match the corresponding characteristics of an incoming e-mail, that e-mail will be designated for priority delivery and will be delivered by voice e-mail before those messages not enjoying similar priority.
  • an exclude message list may be created and edited using the web browser interface (as illustrated in Block 2012).
  • Messages may be excluded by sender name, sender e-mail address, sender domain, or subject line text.
  • account information may be reviewed, modified, and accounts cancelled if desired using the web browser personal profile interface (Block 2014).
  • the present invention may be provided in a number of different embodiments, each of which may include various modifications and additional features.
  • One particularly significant feature of the preferred embodiment is web-based user profile entry. This feature permits the user to access his or her profile from any location using Internet 130, thereby customizing the operation of the user's account at will.
  • the user profile may include personal address lists; preferences with respect to the order in which e-mail messages are read during access (such as identifying particular senders for priority handling, or messages which should not be read over the telephone, e.g., newsletters); form e-mail replies which are individualized for the particular user; and names and keywords which are likely to be spoken by the user during mail retrieval.
  • where the personal address lists include an entry and a telephone number associated with that entry, a voice dialing feature may be provided in which the voice command "dial <<name>>" causes placement of a telephone call to <<name>> from the personal address list.
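  • in the scripting language described earlier, such a feature might be sketched as follows (the <names> Token populated from the personal address list and the phone.Dial() application call are illustrative assumptions):

      When dial <names>
      {
          Say Dialing $<names>
          phone.Dial($<names>)
      }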
  • the system may conduct searches in response to a voice command, based on the stored personal profile. For example, a search-for-sender function may be provided ("read me the messages in my mailbox from Bill Clinton"). As another example, when a list of search keywords has been provided in the user profile, the system will load those keywords as vocabulary where appropriate, so that (for example) the user may request that mail including those keywords be read. For example, if "purchase order" is a keyword defined in the personal profile, the user may ask the system to "read me messages with subject: purchase order", and the system will recognize the words "purchase order" and select those messages including the keywords as specified.
  • mail preprocessing is provided.
  • the mail preprocessing feature uses a table correlating certain symbol strings with other words. When the mail is processed, predetermined symbols or series of symbols are replaced by predetermined words before the mail is "read" to the user.
  • the equivalence table provides full equivalent phrases as replacements for commonly used acronyms, and provides aurally recognizable equivalent words or phrases as replacements for "emoticons." For example, ";)" may be replaced by "wink."
  • the system applies several unique features to increase processing speed.
  • the system loads a predetermined limited vocabulary which is context-appropriate to the function being performed. In this way, the system need only compare the user's spoken command to a limited number of possible vocabulary words or phrases to identify the intended command. Then, the system compares text strings rather than comparing recorded files while processing verbal commands.
  • the system uses a prompt when it is ready to receive a voice command.
  • this prompt is a "plink" sound. Failure to recognize the user's command as one of the current vocabulary items is indicated by a different prompt, such as a double plink.
  • the system is provided with specific methods of translating visual cues into audible cues in cases where an e-mail message includes such cues.
  • HTML pages contain a variety of formatting, including positioning, graphical features, and variations in text appearance.
  • Bold text, bullets, and other formatting may also be included in any message.
  • a standardized library of sounds, tones, words, changes in voice timbre, and other audible indicators is used as the message is read, in place of the formatting, to reflect visual presentation which is important to a full appreciation of the message, yet which would not otherwise be conveyed in a purely audible transmission.
  • the system preferably incorporates special methods for relaying a threaded e-mail to a user.
  • threaded e-mail refers to a message which is a forward of or reply to one or more messages and incorporates those previous messages in its text. Threaded e-mail may be identified, and individual messages within the e-mail may be parsed, by processing of message headers included in the e-mail text, counting leading > symbols placed before the message by e-mail clients, and other methods which take into account and process the format imposed on threaded messages by various e-mail clients.
  • the messages making up the e-mail may then be read or not read, selectively, based on the stored user profile.
  • when the user provides a spoken command during reading of e-mail, the system selectively responds to the command to either stop reading or continue reading, while implementing the command.
  • the stop/no-stop operation is determined both by context and by the nature of the command. For example, the system does not stop reading on receipt of a "speak louder" or "speak faster" command, but stops in response to "send reply."
  • the noise floor on the telephone line connecting the user to the system is detected, and the recognition threshold of the voice recognition engine is changed dynamically based on the level of the noise floor. If the noise level is high, a higher level of certainty may be required before recognition of a command occurs. If the noise level is low, it may be possible to recognize a command with a lower level of certainty.
  • the voice recognition engine operates based on a hidden Markov model.
  • the depth of the Markov "tree" may be changed dynamically based on the noise level to achieve changes in the recognition threshold. In particular, the tree depth may be increased in the presence of more noise, and reduced in the presence of less noise.
  • Polling of e-mail addresses supplied by users preferably occurs adaptively. That is, users who historically receive a high volume of e-mail, or a high volume during particular periods in the day, will have their mailboxes polled more often (or more often during historical peak times) than users typically receiving a low volume of mail. For example, business users may receive little e-mail during the evening, while home users may receive more of their e-mail at those times. The time zones in which business users conduct most of their business may also impact e-mail delivery patterns. Whatever the pattern of typical e-mail delivery, it is generally desirable to poll mailboxes in proportion to the likelihood that there is actually mail to be retrieved. This feature of the invention makes it possible to efficiently allocate scarce bandwidth and computing resources directed to polling a large number of mailboxes, and contributes to the large-scale capacity of the system according to the present invention.
  • an experience level indicator is maintained for each voice command, within each user profile.
  • the experience level indicator illustrates the user's familiarity with each available voice command.
  • as the user demonstrates expertise with a command, the experience level indicator is changed to reflect that expertise. If the user has demonstrated successful use of a feature several times, then going forward, a reduced level of instruction and assistance may be provided in voice dialog scripts during use of the system when that feature is made available or is in use.
  • the present invention also provides that voice commands from a user may be received and acted upon either during silence between outputs of the text-to-speech engine or during the time that the text-to-speech engine is sending voice to the user.
  • voice commands may be received only when the text-to-speech engine (or voice prompts) is not playing.
  • a "voice-barge-in" feature may be provided, whereby a user may talk over prompts or the text-to-speech engine with commands.
  • a user may say the command "cancel” or "stop” to stop reading of a message, as opposed to a DTMF input.
  • an echo cancellation circuit (similar to that used by a speakerphone) may be used to prevent voice prompts or e-mail messages from being perceived as voice inputs.
  • One method of the present invention is to anticipate reaction to voice dialog and retrieve data in anticipation of such voice dialog.
  • a portion of such data may be initially read, such that the text-to-speech engine can read header data while other header data is being received so as to maintain a continuous speech output without interruptions or pauses which would be annoying to a user.
  • the present invention is embodied in an apparatus, system, and method employed by the assignee of the present application, CrossMedia.
  • the present invention may be demonstrated by calling 877-246-DEMO.
  • the XM Resource Manager abstracts the voice user interface application from the core speech technology engine. By leveraging this advanced feature, any application written with CrossMedia's technologies may instantly take advantage of new innovations in voice technology as soon as they are commercially available.
  • the open architecture approach of the XM Resource Manager allows users and applications to be insulated from the complexities of the underlying voice technology, simplifying programming and speeding the adoption of new technology innovations .
  • XM Dialog Manager - CrossMedia incorporates both ASR and TTS engines, and manages real time allocation of speech resources including user specific grammar and vocabularies required for the effective development of voice dialog applications.
  • XM Scripting Language - CrossMedia provides a simple Applications Programmer Interface (API) enabling new applications to be developed quickly.
  • API: Application Programmer Interface
  • GUI: Graphical User Interface
  • the processor translates content into a clear format when read by a text-to-speech engine using CrossMedia's TTS Conditioning.
  • the XM TTS Preprocessor does extensible parsing and translation to provide auditory meaning to information intended to be read.
  • the personal profiler enables users to set system preferences for use with CrossMedia's Voice Email and Voice Activated Dialing products. Examples include telephone numbers, email addresses, standard email replies and email filters. This module is written as JAVA servlets with an SQL interface to the system database for storage.
  • This software provides a mechanism for getting copies of a user's email. This is accomplished in one of two ways: email polling and forwarding.
  • the software is written in JAVA and can be easily ported to various platforms.
  • XM Email Message Classifier - CrossMedia has developed a powerful, rules-based, message management system for classifying messages for filtering and routing.
  • the filtering function enables a user to hear only those messages deemed to be important, filtering out other messages.
  • XM Applications Gateway - CrossMedia will develop a family of Applications Gateways to access various email systems and database information sources.
  • the gateways will be developed in JAVA and may be architecturally distributed.
  • XM Resource Manager - This provides expansion capability to meet the demands of large marketing partners.
  • the current system can handle over 50,000 mailboxes and can be expanded by installing additional servers to handle several hundred thousand to over one million mailboxes.
  • Context-sensitive Active Grammar - This supports the recognition of millions of words and phrases needed for conversational, voice-accessible applications.

Abstract

A system and method allows text-based e-mail messages (as well as sound files such as .WAV files) to be played to a user over an ordinary telephone. A user may respond to such messages with voice commands to generate reply messages or reply sound files, as well as input other e-mail commands (e.g., forward, delete, save, and the like). Possible voice commands or responses which may be generated in response to a played message or input prompt are determined in advance. Voice input signals are then compared to this limited list of possible responses (or 'grammar') and the system generates a list of guesses as to which response has been spoken. Confidence levels are assigned to these guesses based upon the relative match between the actual response and the possible expected response. If the confidence level is above a dynamic threshold value, a match between the spoken response and the corresponding possible response is determined.

Description

INTERACTIVE VOICE DIALOG APPLICATION PLATFORM AND METHODS FOR USING THE SAME
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims priority from Provisional U.S. Patent Application Ser. No. 60/101,930, filed September 24, 1998, the entire disclosure of which is incorporated herein by reference.
COPYRIGHT NOTICE
The present application includes material which is subject to copyright protection. The copyright owner of the material in the present application has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office files or records, but otherwise reserves all copyrights whatsoever.
FIELD OF THE INVENTION
The present invention relates generally to a user interface platform which provides interactive voice-controlled user access to a telephony or other computer-based application. A specific application of the platform provides dial-in telephone access to a user's electronic mail, with advanced operation in response to voice commands.
BACKGROUND OF THE INVENTION
Traditionally, a number of separate systems have been provided to deliver messages in different formats. Voice mail systems collect and deliver voice telephone messages. Electronic mail
(i.e., "e-mail") systems generally receive text mail messages and deliver them to the intended recipient. In a business context, these systems may be readily accessible from the user's own office via telephone or computer, respectively. Access to incoming messages from other locations, such as while the user is traveling, may be more difficult.
Remote retrieval of e-mail has been particularly difficult since retrieval typically requires access to a computer connected to the same network as the user's mail server. A typical solution to this problem has been to carry a portable computer when leaving the office, find a telephone jack at a remote location, and dial into the office network or the Internet to retrieve mail. Another solution is to use an e-mail account which may be accessed using a web browser, and to find someone at the remote location who will give the user access to an Internet-connected browser on equipment at that remote location.
Efforts have been made to establish telephone-based interfaces for various business messaging systems, including systems which respond to voice control. One such system was developed by General Magic of Sunnyvale, California, and incorporates mobile agent technology described in U.S. Patent 5,603,031, incorporated herein by reference.
None of these solutions has been entirely satisfactory from the perspective of the mobile business user, and there is a need for improved systems and operating methods for providing messaging access for such mobile users.
For example, one problem with Prior Art systems has been the difficulty of using voice recognition systems as a means of user input. Voice recognition systems may be limited by ambient background noise and the like, particularly when used with portable phones. Moreover, many systems do little more than allow substitution of a limited number of spoken phrases as substitutes for DTMF telephone keypad inputs. While DTMF tones may be useful for simple commands (e.g., play message, delete message, and the like), they do not readily allow a user to enter more complex commands (e.g., "forward message to John Smith").
A traditional voice recognition system, in order to interpret such commands, may require recognition of a complete dictionary of spoken words, and then a command interpreter to interpret the converted text. If a word is misunderstood or not recognized, the complex command may not be executed. In addition, such systems generally work best with high quality audio signals (e.g., a high-end sound card, a microphone, a quiet office environment) rather than a noisy, limited bandwidth (e.g., POTS) signal generated from a noisy environment (e.g., a pay telephone).
Moreover, interaction via user-generated voice commands can make for an awkward and difficult system to use. A user may not have patience to hear extended lists of menu items in order to respond to each message or decision. Moreover, any pauses or extended delays in processing voice data from a user may cause frustration or mis-communication on the part of the user. In addition, when a voice command entered by a user is not understood or is incorrect, a method of quickly telling the user that a mistake has been made may be required.
In addition, a system and method for sending replies to e-mail messages via voice commands remains an unmet need in the Prior Art. Since Prior Art voice recognition systems do not work well with noisy telephone connections, replying to a received e-mail message may not be possible.
SUMMARY OF THE INVENTION
The present invention provides a system and method for allowing text-based e-mail messages (as well as sound files such as .WAV files) to be played to a user over an ordinary telephone. The invention allows the user to respond to such messages with voice commands to generate reply messages or reply sound files, as well as input other e-mail commands (e.g., forward, delete, save, and the like).
The system and method of the present invention provides such voice input by determining, in advance, possible voice commands or responses which may be generated in response to a played message or input prompt. Voice input signals are then compared to this limited list of possible responses (or "grammar") and the system generates a list of guesses as to which response has been spoken. Confidence levels are assigned to these guesses based upon the relative match between the actual response and the possible expected response. If the confidence level is above a dynamic threshold value, a match between the spoken response and the corresponding possible response is determined.
Using this method, the system need not compare spoken responses with an entire English language dictionary of words in order to understand the user. However, unlike simple DTMF replacement inputs, the present invention provides for additional responses to be input, including responses which may be unique to each user. For example, a user's e-mail address list may be input as text and then converted into corresponding speech patterns or models using a text-to-speech conversion program. These patterns may be concatenated with speech models for a list of expected command responses, compiled, and loaded into the system.
Thus, the system may understand a voice command of "forward to John Smith" as two sub-commands: the term "forward", which is one of the possible expected responses to a message (along with "delete", "save", and the like), and the term "John Smith", which may be from the user's e-mail address book, which, as described above, may be converted to speech models and concatenated with the expected responses.
The system may dynamically learn, based upon frequency of use by a user, which phrases or commands are used more often. Based upon such usage, the mode of operation of the system may be dynamically tuned to minimize extraneous instructions and prompts. Thus, for example, when a user first uses the system, extensive prompts may be provided (e.g., "to save the message, say 'save' or press '1'"). Once a user has used that command several times, the prompt may be shortened or deleted entirely.
The system and method of the present invention also provides a technique whereby replies may be generated to e-mail messages and transmitted to the sender. A user may select one of a number of stored replies which the user may have previously generated. In addition, the user may generate a voice reply which may be stored and transmitted as a sound file (e.g., a .WAV file) which an e-mail recipient may play over a computer system.
In addition, the present invention provides a means of notifying a user that e-mail messages have been received. A user may selectively program the system (or the system may be pre-programmed) to notify a user via pager or telephone that a message or messages have been received. Notification may be made for some or all messages. For example, if a high priority message, or a message from a particular sender, is received, the user may be paged or otherwise notified that e-mail has been received.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block schematic diagram of system architecture in the preferred embodiment.
Figure 2 is a block diagram providing an overview of a virtual session provided to the user in the preferred embodiment.
Figure 3 is a block schematic diagram illustrating flow of control from Telephony Voice Server 310 to the applications back-end for a single virtual session.
Figure 4 is a block schematic diagram illustrating flow of control for a 24-channel (VS) front-end box from Telephony Voice Server 310 to the applications back-end. Figures 5a and 5b are flow charts illustrating flow of control for a Virtual Session voice recognition system.
Figure 6 is a flow chart for Virtual Session context switching modules according to the preferred embodiment of the present invention.
Figure 7 is a flow chart illustrating a process for a Virtual Session initialization of a new context.
Figure 8 is a flow chart for a process of initialization and use of static grammar tokens according to the preferred embodiment.
Figure 9 is a flow chart for a Virtual Session spontaneous compile and loading of application-supplied information into current grammars.
Figure 10 is a flow chart for a process of adding grammar token definitions to current grammar in a Virtual Session, according to the preferred embodiment of the present invention.
Figures 11a and 11b are flow charts illustrating a phrase comparison process between a spoken phrase and the current Virtual Session grammar. Figures 12a and 12b are flow charts illustrating an overview of a Command Resolver process for script interpretation.
Figures 13a and 13b are more detailed flow charts illustrating Command Resolver initialization and function.
Figures 14a, 14b, and 14c are flow charts illustrating a process for Virtual Session DTMF/IVR processing according to the preferred embodiment of the present invention.
Figure 15 is a state transition diagram for the telephony server process according to the present invention.
Figure 16 is a flow diagram illustrating the relationship between the processes in the message polling subsystem of the preferred embodiment of the present invention.
Figure 17 is a flow diagram illustrating the relationship between processes in the message receiving subsystem of the preferred embodiment of the present invention.
Figure 18 is a flow diagram illustrating the relationship between processes in the message sending subsystem of the preferred embodiment of the present invention. Figure 19 is a system diagram illustrating the e-mail delivery systems of the preferred embodiment of the present invention.
Figure 20 is a flowchart illustrating a process of updating a user profile through a web-based interface, according to a preferred embodiment of the present invention.
Figure 21 is a dialog topology diagram.
DETAILED DESCRIPTION OF THE INVENTION
A preferred embodiment of the present invention will be described in detail with reference to Figure 1, which is a schematic diagram of a computer network architecture useful for providing access to messages by voice control from a remote location. As illustrated in Figure 1, the invention is preferably implemented through a computer network 100 including a voice interface server 102, database subsystem 104, file server subsystem 106, polling computer 108, mail sending computer 110, and web server 112. Database subsystem 104 incorporates database server 120 and massive database 122, which may be an Oracle™ database. File server subsystem 106 incorporates file server 124 and message storage system 126 which may be a RAID disk array. Voice interface server 102 may be coupled to database subsystem 104 via hub 114, and database subsystem 104 may be coupled to file server subsystem 106, polling computer 108, mail sending and receiving computer 110, and web server 112 through hub 116. Polling computer 108, mail sending and receiving computer 110, and web server 112 may be coupled via hub 118 to Internet 130.
The remaining computers and servers may also be coupled through hub 118 to Internet 130 to provide externally accessible network connections used for system administration. Voice interface computer 102 may be coupled via conventional telephone system interface and switching equipment to the Public Switched Telephone Network (PSTN) 101 or to another telephone network (not shown).
In operation, voice interface computer 102 provides a voice interface to system users through PSTN 101. Through this interface, users may dial in to the system, retrieve messages and take actions based on the messages such as placing telephone calls or replying to the messages. Voice interface computer 102 may run Telephony Voice Server 310 software program (Figure 3) which connects a user to the system via telephone and processes user requests via speech or IVR under script control. Voice interface computer 102 also runs Application Proxy 330 and Automation Server 350 which contains the application API.
Database server 120 runs software programs implementing an e-mail message store and message delivery agent. Polling computer 108 performs a POP3 mail polling function. Mail sending and receiving computer 110 receives forwarded electronic mail for storage and delivery to users and forwards and sends e-mail in response to user commands.
Web server 112 implements a personal profiling information system which allows users to create and modify a personal profile for system operation from any Internet-connected computer using an industry-standard web browser or from the telephone using specific voice commands.
Figure 2 illustrates control flow for a preferred embodiment of Telephony Server (TS). The TS establishes as many Virtual Sessions as there are telephone lines capable of supporting digital speech. Each Virtual Session (VS) interacts with a user under control of a script which the Command Resolver is currently running. Each Virtual Session has a dedicated voice recognition, speech synthesizer, and applications interface. Figure 3 is a block diagram illustrating flow control from Telephony Voice Server 310 to the applications back-end for a single virtual session. As illustrated in Figure 3, PSTN 101 may be coupled to Telephony Voice Server 310 which may be running on voice interface server 102 of Figure 1. The system implements an application interface which enables Telephony Voice Server 310 to have a network as its point of integration with an application. Communications conduits connecting Telephony Voice Server 310 may be Local Named Pipe 320, for example, under TCP/IP. The communications interface may be implemented on the same machine as Telephony Voice Server 310 but run as a separate multi-threaded NT Applications Proxy 330 (the NT service).
Figure 4 is a flow chart illustrating the flow of control for a 24-channel (VS) front-end box from Telephony Voice Server 310 to the applications back-end.
The Interactive Telephony Dialog Interface provided in the present invention presents a user with a flexible voice dialog system, accessible over the telephone, which allows a user to navigate and retrieve information by voice phrases and voiced connected digits, as well as by DTMF keypad strokes. Among the new capabilities which the present invention offers the provider and user of the invention are:
a) The system may be configured to operate with more than one particular application, such as e-mail. The interaction between a user and the system may be completely scripted using a script interpreter and an easy-to-use language specified as part of the present invention.
b) The system provides a user with a unique person/machine dialog-based interface on half-duplex (one-at-a-time conversation) or full-duplex (the system can be interrupted) telephone connections. The rhythm of conversation between the user and system is maintained by a tight coupling between speech elements (Speech Recognizer, Text To Speech, and Digital File Playback and Record) and the Command Resolver, which implements an Object Oriented State Machine. Thus, the service provider can quickly reconfigure the system for new applications or interactions (i.e., greetings, on-line help, application process, and the like). This functionality is implemented in a unique multi-threaded Virtual Session architecture which allows multiple users to simultaneously have independent dialogs with the system.
c) The system allows the user to mirror voice commands using keypad strokes under script control. The system allows mirror image functionality at the complete discretion of the script writer. The system also allows data entry and flow control via DTMF, all internally synchronized by the Virtual Session state machine.
d) The system implements a dynamic context-based hierarchy which allows the user to jump around within the tree structure of an application either under voice or DTMF control. The result is a smaller active command phrase set which allows greater accuracy in noise and quicker response.
e) The system can spontaneously compile data from the application so as to dynamically include outside information into its active grammar. These dynamic dialog entries, called Tokens, may be placed anywhere in a command script and have the properties of script variables as well as dialog enhancements. Tokens may be parsed as regular text string expressions for content, enabling quick phrase-action resolution.
f) The system may implement fully integrated User Interface features: a double tone when commands are not understood, and a dynamic help facility which is context-dependent and script-programmable.
g) The system implements an application interface which enables Telephony Voice Server 310 to have a network as its point of integration with the application. The network may be local or it may be Internet 130. The communications conduits connecting Telephony Voice Server 310 may be Named Pipes or Sockets under TCP/IP. The communications interface may be implemented on the same machine as Telephony Voice Server 310 but run as a separate multi-threaded NT service. The present invention may also include a communications proxy (the NT service) and an applications protocol.
These advantages and others will now be explained in more detail. Initially, those familiar with the relevant technical field will recognize that the system architecture illustrated in Figure 1 may be modified by providing different types of computers, combining any of the functions on fewer computers, and/or separating functions illustrated as performed within one computer onto two or more computers. For large scale applications, a plurality of like computers may be provided, operating in parallel, to perform any of the functions illustrated.
In the preferred embodiment illustrated in Figure 1, the system may run on dual Pentium computers under Windows NT 4.0 or higher (a multi-tasking OS with threads and events). The system may use TAPI (Microsoft telephony API) as a telephony interface and SAPI (Microsoft speech API) as a speech interface. The speech engines may comprise the AT&T Watson speech engines for Automatic Speech Recognition and Synthesized speech. The voice interface computer may incorporate various telephony boards. Examples of appropriate telephone interface boards include the following: Rhetorex/Octel RDSP 432, Rhetorex/Octel RDSP 24000, Rhetorex/Octel VRS-24, Rhetorex/Octel RTNI-ATI/ASI 24 Trunk, Rhetorex/Octel RTNI-2T1, Natural Microsystems AG-24, Natural Microsystems T, Connect-24, or the like.
Operation of Telephony Voice Server 310 and its architecture will now be described in more detail with reference to Figures 5-15. In the embodiment disclosed herein, Telephony Voice Server 310 is a multi-threaded application written completely in C++. It may comprise three fundamental parts: the Dialog Thread, Telephony Monitor 240, and the Virtual Session.
The Dialog Thread is the Primary Thread in which the entire Server initializes itself, once launched. The Server may be configured to operate in two modes at startup, depending on how the systems administrator wishes the Server to run:
a) The default Server runs as an NT service.
b) The Server runs on the NT desktop when launched with the -desktop flag.
The Initialization procedure comprises the following functions:
a) Initialize the system log used to store statistical usage and runtime data.
b) Determine the number of available telephone lines having the required Media Modes and telephone control sets.
c) Based on the number of available telephone lines, create a Virtual Session Data Storage Class to accommodate thread-safe session data for each Virtual Session.
d) Launch a Virtual Session to service each system-usable telephone line.
e) Launch a Telephony Monitor 240 Thread to capture and dispatch telephone line control messages to the Virtual Session Threads.
f) Clean up system-associated threads and daughter processes at system termination time.
Telephony Monitor 240 catches and dispatches messages associated with telephone control for each individual Virtual Session. Telephone messages which the system monitors may comprise TAPI call control messages:
a) lineMonitorDigits
b) lineCallState
c) lineGenerateDigits
d) lineGenerateTones
Telephony Monitor 240 catches call control messages using Event Wait States. The Telephone Service Provider/Driver is configured to alert the application through NT Kernel Object Events. Thus, Telephony Monitor 240 is not attached to the Primary thread of the Server, freeing it from blocking if the Primary Thread is processing windows messages while communicating with the user (the Primary Thread contains the UI to the system administrator). When the Service Provider Driver, in this case the TAPI Service Provider (TSP), sends system call control messages, the system catches them in a Notification Event, decodes the message type as given above, then sends the corresponding Virtual Session a windows message. The types of messages usually processed by Telephony Monitor 240 and dispatched to the appropriate Virtual Session are Connect, Disconnect, and DTMF. Call control handshakes including lineOffering and lineAnswer may be processed in Telephony Monitor 240. Only after a call has been established does Telephony Monitor 240 send a Windows Message to the Virtual Session servicing that particular line.
More advanced call control functions such as outbound dialing and drop and insert functions, used to conference calls together, may be supported by the Virtual Session in a Telephony Class, Ccall. Therefore, in response to user commands via text-to-speech, the Virtual Session servicing the user may initiate telephony events on the line. However, response of telephony interface by means of handshake messages is processed by Telephony Monitor 240. Telephony handshakes in the TAPI model always include lineReply and lineCallState messages which are caught by Telephony Monitor 240.
The Virtual Session is the top level thread which handles all interactions with the user. There may be as many Virtual Sessions as active telephone lines. The Virtual Session may be indexed and identified by a lineID, which may correspond to the Voice Processing device associated with a telephone line. In order for an application to transfer digital speech to a physical telephony device there must be a Voice Processing system in place which performs the following functions:
a) Provide half-duplex input/output ports for each telephone line with associated Codec compression modes required by the ASR/TTS/wavefile components. Formats which may be supported include mu-law and 128 kbps PCM, 16-bit, little-endian digital format.
b) Provide a full-duplex input/output port for each telephone line with associated Codec compression as above and echo-cancel.
c) Provide an interface to switch voice ports on and off and provide switching capabilities for outbound calling and data stream switching.
The embodiment disclosed uses TAPI/Wave interfaces specified by Microsoft and implemented using the TAPI and Multi-Media interfaces native to Windows NT. Each Virtual Session has thread-safe session data, which may contain:
a) Telephone Information;
b) Multi-Media device information;
c) Virtual Session State Machine flags;
d) Virtual Session Data Store; and
e) Call Statistics information.
Each Virtual Session:
a) Creates an associated ASR engine via an ASREngineObject class;
b) Creates an associated TTS engine proxy via the TTSEngineObject class;
c) Creates a SubWorker communications thread which processes communications events from a remote TTS daughter process via bi-directional message pipes;
d) Creates a RunScript Thread which processes NT events, executing the CmdResolver to correlate speech-to-text user command phrases with associated actions embodied in dialog scripts the session is currently running;
e) Creates an associated hidden window and message pump which provides the Virtual Session with the ability to process windows messages; and
f) Sets up a bi-directional message mode pipe which serves as a communication channel from the Server to an e-mail (or any other) applications Proxy.
Context Data Structures used in the preferred embodiment will now be described in more detail. The voice dialog system is context-based. A context is defined as the set of phrases the system is configured to currently understand. All contexts available to the system at initialization time are dependent on initialization files. For example, in the preferred embodiment of the present invention, the initialization file may contain the following scripts:
LoginUser.scp
GetUserKey.scp
forward.scp
email2.scp
DeleteMessage.scp
DeleteAllMessages.scp
PrioritizeAddress.scp
PrioritizeDomain.scp
ExcludeDomain.scp
ExcludeAddress.scp
DeleteMessagesByDate.scp
DeleteMessagesByName.scp
DictatedReply.scp
DictatedMessage.scp
TranscribeReply.scp
TranscribeMessage.scp
ListMessagesByDate.scp
ListMessagesByName.scp
MoveForwardInMessage.scp
MoveBackwardInMessage.scp
GistMessage.scp
Each script may have a set of associated grammars and IVR maps which may be correlated to Exchanges which the user may have with the system. The structure of command grammar, IVR map, and the associated Exchange is the following:
When phrase1, phrase2, !3, phrase4
{
    action1
    action2
    ...
    actionN
}
The "when" line denotes the set of command phrases, the "!3" indicates that keypad 3 is associated with this Exchange, and the Exchange itself is contained between the outermost curly brackets. The Exchange correlates the command phrase and IVR map to the actions which the system will take if it decodes one of the phrases or the appropriate DTMF tone.
Any context can have many structures such as illustrated above, as well as different structures. The specification of these structures and the associated actions is the subject of the scripting language and the applications interface. The structures which contain the relationships between the command phrases and Exchanges are unique to every Virtual Session in which the dialog system is embedded. They are:
The Exchange structure; struct exch
{
    char *action;
    struct exch *subaction;
    struct exch *next;
} exch, *pexch;

The Grammar structure; struct GramCon
{
    UINT NumGram;
    char *GramCntxt[MXINX];
    char *Gramin[MXINX];
    pvoid Abuf[MXINX];
} GramCon;

The Context class; class Context
{
    // For Main
    pexch commands;

    // For when
    int numphrs;
    int numwhen;
    int index[MAX_PHRASES];

    // Context Phrases
    UTString phrases[MAX_PHRASES];

    // Exchanges to phrases: exch[index[i]], where
    // i is the phrase in that context
    pexch exchang[MAX_PHRASES];

    // Grammar Tokens
    stl::vector<UTToken *> grammar_token_list;

    // Help facility
    stl::vector<UTString *> help_cmds;
};
The contexts available to the system are stored in the array of context classes:

static Context *m_Pcmdcontext[SPTPMOD];

and are indexed via the Command Resolver member "current_context_".
The binding of the phrases to exchanges in a context is given by these structures. For a particular context available to a Virtual Session embodied in a script, the context object above contains all information necessary for the Virtual Session to conduct the scripted exchanges with the user. Once the dialog/IVR system determines what the user wants, the command resolver uses the context object to correlate the request with the appropriate action. Each script context has an associated context object. The bindings between the context phrases and exchanges are depicted in Table 1:
TABLE 1 - Binding for each when statement (the table appears as a figure in the original document). Dialog topology is as illustrated in Figure 21.
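To make the binding concrete, the following C++ sketch illustrates how a recognized phrase maps through the index array to its Exchange. It is a simplified illustration using standard containers; the container types and the stand-in structures are assumptions, while the lookup mirrors the FindCmdExchange behavior described above.

#include <string>
#include <vector>

// Simplified stand-ins for the exch and Context structures above.
struct Exch {
    std::string action;      // first action of the exchange
    Exch       *subaction;   // nested scope, if any
    Exch       *next;        // next action in the linked list
};

struct SimpleContext {
    std::vector<std::string> phrases;  // context phrases
    std::vector<int>         index;    // phrase i -> exchange index[i]
    std::vector<Exch*>       exchang;  // exchanges bound to this context
};

// Sketch of the lookup: match the recognized phrase against the
// context phrases, then map the match position through index[] to
// the bound exchange.
Exch *FindCmdExchange(const SimpleContext &ctx, const std::string &recPhrase)
{
    for (size_t i = 0; i < ctx.phrases.size(); ++i)
        if (ctx.phrases[i] == recPhrase)
            return ctx.exchang[ctx.index[i]];
    return nullptr;  // no binding: caller treats this as a mis-recognition
}

int main()
{
    Exch readMsg{"GetSay(ptext)=mail.ReadMessage()", nullptr, nullptr};
    SimpleContext ctx{{"read message"}, {0}, {&readMsg}};
    return FindCmdExchange(ctx, "read message") ? 0 : 1;
}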
The system has been configured so that IVR Mapping, i.e., telephone keypad keys, are correlated to Exchanges in a Context in exactly the same way that phrases are correlated. The only difference is that the origin of the user request via voice is through the Voice/Dialog system, while the IVR request arrives via the DTMF interface, which will be explained in more detail below.
Referring to Figures 5a and 5b, the operation of Speech Recognition Command Resolution in the system of the preferred embodiment will now be described in more detail. The source of speech to text is ASR engine 510, a commercially available speech-to-text recognition system. The interface of ASR engine 510 may be normalized into a standard interface for the system, which may be notification driven. Preferably, the notification system is modeled to be consistent with SAPI, which is the Microsoft speech standard.
The manner in which speech-to-text is integrated into a dialog system, the processing of the text hypothesis so as to maximize the accuracy and speed of the resultant phrase, and the flow of control imposed on the dialog system are novel. The four main parts of the Voice Dialog system are:
a) ASR -> Speech to Text.
b) TTS -> Speech Synthesis.
c) Digitized File Play facility.
d) Command Resolver.
Four ASR notifications are illustrated in Figures 5a-5c. Phrase Finish 512 is a function which is called when ASR engine 510 has a result to test. Phrase Start 548 is a function which is called when ASR engine 510 begins to process a digital stream to try to correlate sounds with phrases in its active context. Engine Idle 554 is a function which is called when ASR engine 510 has processed all of the digital information in the AudioSource buffers and begins to wait for new information to come in. Barge-in 560 is a function which is called when ASR engine 510 encounters a barge-in token in a grammar phrase and has decoded the words to the left of that token.
When ASR engine 510 begins working on digitized data and notifies the application via Phrase Start 548, Phrase Start 548 sets Phrase Timer 550. Phrase Timer 550 marks the current time and notifies the Virtual Session via a Windows message when the time is up. Meanwhile, as illustrated in Figures 5a-5c, when ASR engine 510 reaches a result or when Phrase Timer 550 goes off, Phrase Finish 512 function is called. Phrase Finish 512 kills Phrase Timer 550 in step 514, and stops loading data in step 516, since the present recognition has been made. Phrase Timer 550 is programmable via scripts and serves to speed resolution of recognition in noise. The programmable parameter in Phrase Timer 550, $Phrase_Time, is the time to wait before notifying the Virtual Session that Phrase Finish 512 should be called.
If ASR engine 510 works on a recognition longer than the wait time of Phrase Timer 550, a call to Phrase Finish 512 is forced. Phrase Finish 512 uses several state flags to resolve its decision tree. They are:
a) m_stpfl. Boolean flag; True = AudioSource is stopped, i.e., no longer sending digital buffers to ASR engine 510; False = AudioSource is running, and ASR engine 510 may at any time send another result.
b) Grammar_activated. Boolean flag; True = not currently loading a new context; False = currently loading a new context.
c) Disconnecting. Boolean flag; True = hangup (telephone line) in progress; False = hangup not in progress.
d) m_phid. UINT flag; True = noisy condition on the line; False = line clear.
Once Phrase Finish 512 is called, a determination is made whether a valid recognition exists in step 518 by checking to see that the return structures of ASR engine 510 have a valid phrase (i.e., if a DTMF tone were heard, or some non-white noise, ASR engine 510 might attempt a recognition). Failure would be flagged by not presenting the application with a resulting phrase.
If no valid result (phrase) exists the system first checks to see whether the noise flag, m_phid, is set in step 532. If it is not set, the system sets it to True in step 534. If it is already set, then in step 536, the system flushes the AudioSource and resets ASR engine 510 environment tracking, then resets the flag to False indicating that the system has attempted to purge the noisy buffers. This noise flag checking step helps prevent noise from corrupting subsequent attempts at valid recognitions.
The system then checks to see whether the noise was a DTMF tone in step 538. Phrase Finish 512 distinguishes DTMF tones from bad phrases or noise in the following way:
a) safe_to_plink. Boolean flag associated with the command resolver; True = not expecting further DTMF digits; False = DTMF terminator not yet received.
b) Waiting_for_DTMF. Boolean flag associated with the command resolver; True = user input of a DTMF string is in progress and the command script is paused; False = not expecting DTMF from flow control, terminated DTMF entry.
If the noise was not DTMF, the system alerts the user that a mis-recognition occurred by playing a double tone (plink) in step 542. If the system plinks, it sets the Virtual Session flag "playingbeep", a Boolean flag, to True in step 542 to prevent collisions between the TTS and the plink. Processing in the loop ends at step 546.
In step 540, the system determines whether it is not paused for DTMF but has not seen a terminator (safe_to_plink=False). If so, the system assumes the completion of non-terminated DTMF, resets the flag, and restarts ASR engine 510 if the Toggle-On state flags permit in step 544. Processing in the loop ends at step 546. Toggle-On is a method associated with the Engine Class, Sreng, and will turn the AudioSource on if there are no state conflicts. Toggle-On and Toggle-Off are discussed in more detail below.
If the system has a valid result phrase as determined in step 518, it calls Toggle-Off in step 520 to prevent ASR engine 510 from interrupting the present processing, and then obtains from ASR engine 510 the confidence score for the best phrase in step 522 (ASR engine 510 may have several guesses at the phrase based on its confidence). If the confidence score is below the confidence threshold, as determined in step 524, processing passes to routine 526.
Routine 526 is illustrated in more detail in Figure 5b. As in step 532, the system first checks to see whether the noise flag, m_phid, is set in step 566. If it is not set, the system sets it to True in step 570. If it is already set, then in step 568, the system flushes the AudioSource and resets ASR engine 510 environment tracking, then resets the noise flag to False in step 572, indicating that the system has attempted to purge the noisy buffers.
The system then checks to see whether the noise was a DTMF tone in step 574 in a similar manner to step 538. If the noise was not DTMF, the system alerts the user that a mis-recognition occurred by playing a double tone (plink) in step 576. If the system plinks, the system sets the Virtual Session flag "playingbeep", a boolean flag, to True in step 576 to prevent collisions between the TTS and plink. Processing in the loop ends at step 580. If, in step 574, the system determines that it has paused for DTMF, it assumes the completion of non-terminated DTMF and resets the flag and restarts ASR engine 510 if Toggle-On state flags permit in step 578. Processing in the loop ends at step 580.
Returning to Figure 5a, if the confidence level is greater than the threshold value in step 524, processing passes to the command resolver in step 528 and the loop ends at step 530.
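The invalid-result path of Phrase Finish 512 can be summarized in a short C++ sketch. The flag names follow the text; the engine and audio calls are stubs standing in for the real ASR interfaces, so this is an illustration of the decision tree rather than the actual implementation.

#include <iostream>

// Virtual Session state flags described in the text.
struct SessionState {
    bool m_phid           = false;  // true = noisy line condition
    bool safe_to_plink    = true;   // true = not expecting more DTMF
    bool waiting_for_DTMF = false;  // true = DTMF entry in progress
};

// Stubs standing in for engine/audio operations (assumptions).
void FlushAudioSource() { std::cout << "flush noisy buffers\n"; }
void PlayPlink()        { std::cout << "plink: mis-recognition\n"; }
void ToggleOn()         { std::cout << "restart ASR if state permits\n"; }

// Sketch of the PhraseFinish path taken when no valid phrase exists.
void OnInvalidPhrase(SessionState &s, bool wasDTMF)
{
    if (!s.m_phid) {
        s.m_phid = true;           // first failure: remember the noise
    } else {
        FlushAudioSource();        // second failure: purge buffers
        s.m_phid = false;          //   and reset the tracking flag
    }
    if (!wasDTMF) {
        PlayPlink();               // alert the user to a mis-recognition
    } else if (!s.safe_to_plink && !s.waiting_for_DTMF) {
        s.safe_to_plink = true;    // treat as non-terminated DTMF done
        ToggleOn();
    }
}

int main()
{
    SessionState s;
    OnInvalidPhrase(s, /*wasDTMF=*/false);  // first noise event
    OnInvalidPhrase(s, /*wasDTMF=*/false);  // second: buffers flushed
}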
The present invention also encompasses a system for tracking the density of mis-recognitions which, based on a noise density range of 0 to 1, readjusts the settings of ASR engine 510. Noise density may be calculated as follows. Methods of the class Ftime count the number of mis-recognitions over total recognition attempts. Mis-recognitions and recognition attempts are counted in:
a) Cengnotify::Interference. SAPI notification of conditions on the line.
b) Cengnotify::Sound. SAPI notification of noise on the line.
c) PhraseFinish for a non-recognition event.
d) PhraseFinish for a valid recognition above threshold. In this case only the denominator count is incremented, indicating good quality on the line.
1) The ratio, Density = (mis-recognitions) / (total attempts), is recalculated every 30 seconds.
2) Based on this ratio and the digital level on the line during the last recognition, a threshold decision is made as to whether to adjust parameters of ASR engine 510:
a) Noise-Floor. A noise cut made on the input signal. The noise cut may be between 0 and -50dbm. The adjustment range is between -15dbm and -35dbm.
b) An adjustment of a variable which gauges the amount of VQ telephone speech models to use. VQ models are faster, but the full-blown speech models give the best recognition in noise. The default setting is 75, indicating a larger than 50% use of non-VQ models. In noise the setting is changed to 100, indicating that no VQ models should be used.
3) Calculation of the noise density is done in PhraseFinish after each mis-recognition and in PhraseFinish after valid recognitions above threshold.
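As a rough sketch of this adjustment (the density ratio, the -15dbm to -35dbm range, and the 75/100 VQ settings come from the text; the threshold value and the function and member names are illustrative assumptions):

// Counts kept by the Ftime-style tracker described above.
struct NoiseTracker {
    unsigned misses;    // mis-recognitions (numerator)
    unsigned attempts;  // total recognition attempts (denominator)

    // Recomputed every 30 seconds in the real system.
    double Density() const {
        return attempts ? static_cast<double>(misses) / attempts : 0.0;
    }
};

// Illustrative engine parameters; the text gives the ranges.
struct AsrParams {
    int noiseFloorDbm = -15;  // adjustable between -15dbm and -35dbm
    int vqSetting     = 75;   // 75 = mostly full models, 100 = no VQ
};

// Hypothetical threshold decision: in heavy noise, deepen the noise
// cut and disable the faster VQ models for best recognition in noise.
void AdjustForNoise(const NoiseTracker &t, AsrParams &p,
                    double threshold = 0.5)
{
    if (t.Density() > threshold) {
        p.noiseFloorDbm = -35;
        p.vqSetting     = 100;
    } else {
        p.noiseFloorDbm = -15;
        p.vqSetting     = 75;
    }
}

int main()
{
    NoiseTracker t{3, 4};   // 3 mis-recognitions out of 4 attempts
    AsrParams p;
    AdjustForNoise(t, p);   // density 0.75 > 0.5: noisy settings chosen
    return p.vqSetting == 100 ? 0 : 1;
}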
As discussed above, there may be instances where the system decides to toggle the AudioSource on or off. There are two methods associated with the engine class: Toggle-On and Toggle-Off. The toggle functions track the state of the interaction between the TTS, ASR, and IVR systems. Based on these states, a decision is made to turn the AudioSource on or off. The set of states which cue Toggle-On is:
a) grammar_activated_. If grammar is not activated in the ASR, Toggle-On checks that none of the following states are set:
1) !playingmessage = not playing a digital file.
2) !IsTalking = TTS not talking.
3) !playingbeep = not playing the "didn't understand" tone.
4) !waiting_for_DTMF = not waiting for terminated DTMF.
5) !InScript = not in a dialog exchange.
6) !getting_digits = not waiting for the next (unterminated) DTMF digit.
If this state check does not inhibit activation, Toggle-On activates the grammar and turns on the AudioSource. If activation is successful, Toggle-On sets:
m_stpfl=False
grammar_active=True
Otherwise it sets:
m_stpfl=True
grammar_active=False
b) grammar_activated. If grammar is activated, Toggle-On activates the AudioSource if m_stpfl=False and the flags in (a) above do not inhibit activating the AudioSource.
In a similar way the method Toggle-Off turns the AudioSource off if the system state flags permit. Toggle-Off checks the Virtual Session flag m_stpfl=False. If that condition is met, Toggle-Off stops the AudioSource and sets m_stpfl=True.
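The toggle logic can be sketched as follows. The flag names are taken from the text; the simplified control flow is an assumption for illustration, not the actual implementation.

// State flags consulted by the toggle methods, as listed above.
struct VsFlags {
    bool playingmessage;
    bool IsTalking;
    bool playingbeep;
    bool waiting_for_DTMF;
    bool InScript;
    bool getting_digits;
    bool grammar_active;
    bool m_stpfl;   // true = AudioSource stopped
};

// True when none of the inhibiting states in (a) above are set.
bool MayActivate(const VsFlags &f)
{
    return !f.playingmessage && !f.IsTalking && !f.playingbeep &&
           !f.waiting_for_DTMF && !f.InScript && !f.getting_digits;
}

// Sketch of Toggle-On: activate the grammar if needed, then turn the
// AudioSource on when the state check does not inhibit activation.
void ToggleOn(VsFlags &f)
{
    if (!MayActivate(f)) {
        f.m_stpfl = true;          // activation inhibited
        return;
    }
    f.grammar_active = true;       // (re)activate grammar in the ASR
    f.m_stpfl = false;             // AudioSource now running
}

// Sketch of Toggle-Off: stop the AudioSource if it is running.
void ToggleOff(VsFlags &f)
{
    if (!f.m_stpfl)
        f.m_stpfl = true;          // AudioSource stopped
}

int main()
{
    VsFlags f{};                   // all flags false: nothing inhibits
    ToggleOn(f);                   // starts listening
    ToggleOff(f);                  // stops the AudioSource
    return f.m_stpfl ? 0 : 1;
}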
The method of context switching used in the preferred embodiment will now be described in more detail in connection with Figure 6. Each script, designated by a file with a .scp postfix contained in the Session.ini file, defines a different context according to the Context Data Structures implemented in the present invention, as described previously. The Virtual Session is designed as a context-based system in order to limit the number of phrases active in the recognizer at any given time, thus enhancing the recognition accuracy and speed of the system. A Virtual Session may switch context, i.e., be in the scope of a .scp file, in two ways:
a) Upon initialization, the Virtual Session must start in a predetermined context (i.e., the login context), which may be controlled by the telephony system. Telephony Monitor 240 notifies the Virtual Session via LINECALLSTATE Connected or Disconnected whenever the line serviced by the Virtual Session becomes active (a caller calls the system) or becomes inactive (the caller hangs up).
The Virtual Session executes the GetReadyForNewSession() method to reinitialize the context of the system to the login script whenever Telephony Monitor 240 notifies the Virtual Session that a new call has been connected on its line.
b) The user may issue a command in some context which directs the system to go to another context. For example, a script might contain the following exchange:

when send reply, !send reply, !reply to message
{
    Load(reply.scp)
}

If the user says "send reply", the system decodes the utterance and calls PhraseFinish with the appropriate Rec_Phrase (i.e., "send reply"), as illustrated in steps 610 and 612. The system then sets an Event, Hevntscriptwhen, in step 614. Processing then passes to step 618 (through block 616 as illustrated in Figure 6) where the event is caught by the RunScript thread.
The RunScript thread is launched at Virtual Session initialization time and runs in parallel with the Virtual Session as illustrated in Figure 2. If RunScript Proc 618 determines that a "When" Exchange has occurred in step 620, the CmdResolver method ExCmd is called in step 624 with the Rec_Phrase. If a "When" Exchange has not occurred, RunScript Proc 618 looks for another event in step 622. Method ExCmd 624 determines an index of the exchange via the Rec_Phrase, as specified above with reference to the determination of the Exchange through the recognized phrase.
Once a valid index, Pexch, to the exchange has been determined in step 626, the Command Resolver calls the method CmdLoop. CmdLoop (Command Loop) determines how to execute the command in accordance with the operation of the Command Resolver as described herein. If this is a simple command (i.e., not a compound nested command), CmdLoop will call the Resolver method HandleAction in step 628 for each Action in the Exchange. All Actions in an Exchange are members of a linked list. If HandleAction 628 determines the command is a Load in step 630, it captures its argument, which is the name of the script to be loaded (the new context). If HandleAction 628 determines that the command is not a Load in step 630, HandleAction 628 looks for another action in step 632.
In step 634, the script index is stored in the Command Resolver member "current_context_", and the script name is stored in the SD "script_name" in step 636. The RunScript thread then sends the Event Hevntscriptmain in step 638, and processing of this stage ends at step 640. From step 638, Event Hevntscriptmain is caught by RunScript Proc in step 710 of Figure 7. RunScript Proc will then decode Event Hevntscriptmain in step 712 and call the Command Resolver method InitNewContext in step 716. If the Event is not Hevntscriptmain in step 712, RunScript Proc will look for another event in step 714.
Further flow of control for InitNewContext is illustrated in Figure 7. InitNewContext resets the following Virtual Session state flags in step 718:
a) WaitForDTMF = F. This flag signals that the system is not waiting for terminated DTMF.
b) Getting_digit = F. This flag signals that the system is not waiting for a DTMF digit outside an Exchange context, i.e., in the context of the script as opposed to that of an Exchange.
c) ReadDTMF = F. This flag signals that the system is not waiting to read DTMF into a variable inside an Exchange.
d) InScript = T. This flag signals that the system is currently inside an Exchange, the Main Exchange in this case.
InitNewContext then stops ASR engine 510 in step 720. InitNewContext calls ASRLoad in step 722 with the new script index, current_context_, then loads the current Exchange pointer, Pexch, with the address of the Main Exchange for the new context (script) in step 724. The pointer to the Main Exchange is found in the Context Class via the member "Pexch commands" as given above in the section relating to Context Data Structures. The Main Exchange is explained above in the section relating to the MultiServer Scripting Language. The Main Exchange is the default Exchange which is executed whenever a new context is entered. InitNewContext then calls CmdLoop in step 726, which processes each of the Actions in the Main Exchange of the new script. Since "flow control" in the script interpreter permits other Actions to occur while the TTS is still speaking, a WaitForTTSStopTalking is issued in step 728, since the system might come out of CmdLoop while the TTS is still talking. WaitForTTSStopTalking step 728 will block until the TTS stops; at this point the Main Exchange will have initialized the new context and the InScript flag is set to False at step 730. In step 732, ASR engine 510 is started and processing of this routine ends at step 734.
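A condensed sketch of this sequence follows, with the engine, grammar, and TTS calls stubbed out; the flag and method names follow the text, while the stub bodies are placeholders rather than the actual implementation.

// Session state reset by InitNewContext, as listed above.
struct Session {
    bool WaitForDTMF     = false;
    bool Getting_digit   = false;
    bool ReadDTMF        = false;
    bool InScript        = false;
    int  current_context = 0;
};

// Stubs standing in for the real engine/TTS operations (assumptions).
void StopASR()               {}
void ASRLoad(int /*ctx*/)    {}  // load the new context grammar
void CmdLoopMain()           {}  // run the Main Exchange actions
void WaitForTTSStopTalking() {}  // block until the TTS finishes
void StartASR()              {}

void InitNewContext(Session &s, int newContext)
{
    // Reset session state: no DTMF pending, inside the Main Exchange.
    s.WaitForDTMF = s.Getting_digit = s.ReadDTMF = false;
    s.InScript = true;

    StopASR();                       // quiesce the recognizer
    s.current_context = newContext;
    ASRLoad(newContext);             // activate the new context grammar
    CmdLoopMain();                   // execute the Main Exchange
    WaitForTTSStopTalking();         // flow control: let the TTS finish
    s.InScript = false;
    StartASR();                      // resume listening in new context
}

int main()
{
    Session s;
    InitNewContext(s, 2);   // switch to the context of script index 2
    return s.current_context == 2 ? 0 : 1;
}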
The use of Embedded Grammar Tokens in the preferred embodiment is significant and will now be described in more detail. The dialog system uses embedded grammar tokens in command phrases for two reasons:
a) As wild cards to append special sub-grammars to script command phrases. For example, in the login script, connected digits may be used as sub-grammars introduced into the command phrases as tokens, as the exact length of a pin code may not be known before entry (pin numbers may be between 7 and 16 digits). Thus, the recursive nature of embedded sub-grammars is an efficient way to introduce variable grammars.
b) As a way of introducing spontaneous, application-related data into command phrases. For example, in the e-mail application each client (of the system) has a personal profile on the web server. In the personal profile resides client-specific data such as "Names to forward messages to", "message replies to send", "personal Rolodex", and the like. Grammar Tokens are a mechanism to get data from the outside world into the command phrases (our recognizer is constrained by grammars).
The architecture involved in the Virtual Session for Embedded Grammar Tokens is illustrated in Figures 8 and 9. Figure 8 illustrates the flow of control which occurs when the system initializes Static Tokens. Static Tokens are always initialized in the Main Exchange of a context as illustrated in step 810. Static Tokens correspond to case (a) above. When the system Loads a new context in step 810 the flow of control is as described previously. If the Script Interpreter finds static tokens in the Main Exchange in step 812, token processing proceeds as illustrated in Figure 8.
HandleAction 814 finds a static token definition by parsing a regular expression like the following:

<digits> = 0|1|2|3|4|5|6|7|8|9;

Static Token definitions are always encased in angle brackets, <>. At Virtual Session initialization a Token array is reserved in the Command Resolver member our_grammar_token_list_:

stl::vector<UTToken*> *our_grammar_token_list_[SPTPMOD];
Every time a new context is initialized in step 810, the system searches, in steps 816, 820, and 822, the list of tokens it has compiled at initialization time from all the available contexts and fulfills the token rule definition given in the Main Exchange in step 824. From this list, it forms a new grammar rule in step 826 fulfilling the token part of the grammar.
This token rule is loaded into ASR engine 510 via a call to the Command Resolver method "AddToGrammar" in step 826, without releasing the main context grammar (the "main" context grammar is defined as the grammar in which the token anchors are embedded) or any other token rules that the Main Exchange might specify. AddToGrammar accomplishes this by posting a Windows message in step 828 to the VS. The "our_grammar_token" member, containing all information both on the token name and its associated grammar definition, stores all active tokens in order to facilitate matching between spoken phrases and context phrases containing tokens. If a static token definition for the token does not exist, processing passes to step 818.
Dynamic token definitions are updated via "our_grammar_token" in the Command Resolver method HandleAction during command resolution processing, as illustrated in step 930 in Figure 9. For the second type of tokens (b), dynamic tokens, the Virtual Session Grammar Object has two template lists of Grammar Token objects:

stl::vector<GrammarToken*> grammar_tokens_;
stl::vector<GrammarToken*> grammar_tokens_free_list_;
The first list corresponds to the active list of grammar tokens and the second corresponds to the inactive list. Already existing grammar token objects are not deleted in order not to fragment memory. Since these tokens are dynamic (i.e., their definitions change constantly), creating and deleting dynamic tokens would be a burden on the OS memory manager. The Grammar Token Object has the following members:
class GrammarToken
{
private:
    PISRGRAMCOMMON gram_;
    UTString       name;
};

The "name" member allows string manipulation on the token name so that, in making comparisons to active tokens, the system may determine whether a token is already active with an obsolete definition or should be newly instated (see Figure 10). The "gram" member is a pointer to the ISRGRAMCOMMON interface (this interface is specified in the Microsoft SAPI specification) to the ASR engine, which allows the grammar rule corresponding to the token to be Activated, Deactivated, or Released. The flow of control for spontaneous loading of application-related grammar rules is illustrated in Figure 9. Figure 10 is a flow diagram illustrating the operation of the AddToGrammar and SponLoad methods.
The method for resolution of spoken phrases used in the preferred embodiment will now be explained in more detail in connection with Figures 11a and 11b. In step 1110, when ASR engine 510 produces a result (a spoken utterance), it notifies the Virtual Session via a call to the PhraseFinish member of the ASR Engine class belonging to that particular VS. If the result is greater than threshold (as described previously), the Rec_Phrase is passed to the method FeedFromASR, where the flag "Listening_To_ASR" is set False after deciding to process the result. In step 1112, "Listening_to_ASR" is used by the CR, in FeedFromASR, to determine whether or not to process the result. If the flag "Listening_to_ASR" is set True, processing ends at step 1114.
In step 1116, FeedFromASR sends the windows message Hevntscriptwhen which will be caught by the RunScript thread in steps 1118 and 1120, and resolved so as to execute the Resolver method ExCmd in step 1124.
In step 1136, the context is examined to determine whether it includes embedded tokens. If the context does not include embedded tokens, the Rec_Phrase is compared to the Context Phrases in step 1142 to find a match.
If the Rec_Phrase is equal to the Context_Phrase in step 1142, processing passes to step 1144. Once a match is found, the corresponding Exchange is found by taking the index of the matched phrase in step 1144, call it "i", and producing the index = index[i] in the Command Resolver method FindCmdExchange, in the manner described above in the section defining context data structures. Processing then ends at step 1146.
If the Rec_Phrase is not equal to the Context_Phrase in step 1142, processing passes to step 1148. In step 1148 a determination is made whether this is the last phrase. If not, processing passes to the next phrase in step 1150 and returns to step 1142. If it is the last phrase, the index is decremented and processing then ends at step 1154.
If there are embedded token phrases in the current context, processing proceeds to step 1138, where a token count and the token names in context phrases are retrieved. Processing then passes to Figure 11b via block 1140. The tokens are expanded in step 1156 according to their definitions in the Resolver member our_grammar_token_list_, which is a list of all grammar tokens known in a particular script. The expanded token versions of context phrases containing embedded tokens are then compared to the Rec_Phrase in step 1158.
Comparisons are made on lower-case versions of phrases. Once a match is found, the corresponding Exchange is found by taking the index of the matched phrase in step 1160, call it "i", and producing the index = index[i] in the Command Resolver method FindCmdExchange, in the manner described above in the section defining context data structures. Processing then ends at step 1162.
If a match is not found in step 1158, a search is made for more context phrases in step 1166. If more phrases are found in step 1168, processing returns to step 1158 above. If no further context phrases are found, the index is decremented and processing ends at step 1162.
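A compact C++ sketch of this resolution step is shown below. The lower-case comparison and the token expansion follow the text; the container types and the flat string replacement used for expansion are simplifying assumptions.

#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Lower-case a phrase: comparisons are made on lower-case versions.
static std::string Lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c){ return std::tolower(c); });
    return s;
}

// Hypothetical token expansion: replace each "<token>" anchor in a
// context phrase with the text actually recognized for that token.
static std::string ExpandTokens(
    std::string phrase,
    const std::map<std::string, std::string> &tokenValues)
{
    for (const auto &tv : tokenValues) {
        std::string::size_type pos;
        while ((pos = phrase.find(tv.first)) != std::string::npos)
            phrase.replace(pos, tv.first.size(), tv.second);
    }
    return phrase;
}

// Sketch of the matching loop: return the index of the matching
// context phrase, or -1 when no phrase matches (mis-recognition).
int MatchPhrase(const std::string &recPhrase,
                const std::vector<std::string> &contextPhrases,
                const std::map<std::string, std::string> &tokenValues)
{
    const std::string rec = Lower(recPhrase);
    for (size_t i = 0; i < contextPhrases.size(); ++i)
        if (Lower(ExpandTokens(contextPhrases[i], tokenValues)) == rec)
            return static_cast<int>(i);
    return -1;
}

int main()
{
    std::map<std::string, std::string> tokens{{"<digits>", "five one"}};
    std::vector<std::string> phrases{"please read message <digits>"};
    return MatchPhrase("Please read message five one", phrases, tokens);
}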
A Command Resolver, CR, is associated with each VS, and the operation of this Command Resolver in the preferred embodiment will now be described in more detail. The Command Resolver takes a recognized utterance and performs the following actions:
a) Determines the Exchange index associated with the Recognized Phrase, Rec_Phrase.
b) Executes the Actions, Subactions, and Loops which may comprise the exchange which is matched to the Rec_Phrase.
Virtual Session creation is governed by system startup, as illustrated by step 1310 in Figure 13a. The Command Resolver initializes a Virtual Session by reading the ".ini" file, or DIF (Dot Ini File), associated with the particular VS. DIFs and VSs are correlated by a Master Configuration File, MCF, for Telephony Voice Server 310. Thus, Telephony Voice Server 310, monitoring 24 different telephone lines, may run 24 different DIFs which are correlated to the 24 VSs monitoring those lines via the MCF in step 1312. The MCF looks like this:

VS1  file1.ini
VS2  file2.ini
...
VS24 file24.ini
The contents of the DIF was described previously in the description of Context Data Structures. When a Virtual Session initializes, it creates a Command Resolver object and a RunScript thread. The Command Resolver object is initialized by parsing the script files in the DIF associated with it via the MCF. The Command Resolver initialization creates and populates an array of Context objects associated with each script. A map of class objects associated with the Command Resolver is illustrated in Figure 3. For example, if there are 10 script files in a DIF, then an array of 10 context objects is created for that VS. They are indexed via the Command Resolver members:

int current_context_;
Context *m_Pcmdcontext[SPTPMOD];
UINT m_numContexts;

The "m_Pcmdcontext" member is an array of pointers to context objects corresponding to different scripts.
Once a spoken utterance is decoded by the ASR or a new context is to be initialized, NT events are sent by the Virtual Session to the RunScript thread. The Command Resolver runs in the context of the RunScript thread, as "flow control" is imposed on the script interpreter. Flow control means that for certain actions the script interpreter may block. These actions fall into two categories:
a) Actions which involve the TTS talking or wave files playing. The system speaking to the user is a sequential operation. At most, the system may be speaking and another speaking action may be queued, but they cannot occur at the same time. Thus, the script interpreter must block (before playing the next speaking action) until the first one has finished.
b) Actions which return data required to proceed. This typically involves interactions with the application. Since Telephony Voice Server 310 is linked to the application (e-mail, for instance) via a network, a finite time is required to receive a response to an application query. During this time the script interpreter must block pending the receipt of the required information.
The flow control is managed by the Command Resolver method "BlockTillReady" . BlockTillReady takes an argument which is an enumerated type. The argument tells the function what NT event it should receive so as to stop blocking. The argument may be any of the following:
EVT_WAKE_UP: General stop-blocking event.
EVT_TTS: Sent when the TTS is finished talking.
EVT_DTMF: Sent when DTMF is in. Resets the "safe_to_plink" and "waiting_for_DTMF" flags.
EVT_ASR: Sent when the ASR has finished loading a grammar. Resets the ASR_is_loading_grammar flag.
SYSTEM_KILL: Sent when the system is coming down.
EVT_TTS_ABRT: Sent so as not to block on queued system speech, since a queued system speak has been aborted.
BlockTillReady receives an event, checks to see if the event corresponds to the one it was programmed to expect, and stops blocking if it finds a match. Depending on the event received, BlockTillReady also sets Command Resolver state flags to the appropriate state as illustrated above.
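The blocking behavior can be sketched with standard C++ synchronization in place of NT events; the event names come from the text, while the condition-variable mechanics are an illustrative substitute, not the system's actual implementation.

#include <condition_variable>
#include <mutex>
#include <thread>

// Events BlockTillReady may be programmed to expect (from the text).
enum BlockEvent { EVT_WAKE_UP, EVT_TTS, EVT_DTMF, EVT_ASR,
                  SYSTEM_KILL, EVT_TTS_ABRT };

class FlowControl {
    std::mutex              m_;
    std::condition_variable cv_;
    BlockEvent              last_    = EVT_WAKE_UP;
    bool                    pending_ = false;
public:
    // Called from other threads when an event fires.
    void Signal(BlockEvent e)
    {
        { std::lock_guard<std::mutex> lk(m_); last_ = e; pending_ = true; }
        cv_.notify_all();
    }

    // Block until the expected event (or SYSTEM_KILL) arrives.
    void BlockTillReady(BlockEvent expected)
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{
            return pending_ && (last_ == expected || last_ == SYSTEM_KILL);
        });
        pending_ = false;
        // The real method also resets Command Resolver state flags
        // (e.g., safe_to_plink, waiting_for_DTMF) based on the event.
    }
};

int main()
{
    FlowControl fc;
    std::thread tts([&]{ fc.Signal(EVT_TTS); });  // TTS finishes talking
    fc.BlockTillReady(EVT_TTS);                   // interpreter resumes
    tts.join();
}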
The basic task of the Command Resolver is to execute the Actions contained in the Exchange correlated to the spoken phrase. As illustrated in Figures 11a and lib, the Resolver first determines the correct Exchange, then it passes control to CmdLoop. The flow of control for CmdLoop is illustrated in Figures 12a and 12b. CmdLoop 1210 gets the first Action in the Exchange from the Pexch pointer in step 1212.
CmdLoop 1210 checks this Action to determine whether it is a simple action or a Loop in step 1214. If the Action is a Loop in step 1216, the Action mnemonic is Loop(Argument). The structure of the Loop Action is as follows:

Loop($count)
{
    ...
}

The scope of the Loop is delineated by the curly brackets. CmdLoop keeps track of the Loop argument and executes each of the instructions within the scope of the Loop as many times as the argument provides in step 1220.
In step 1230, CmdLoop determines whether the action is a subaction. A subaction is one which is delineated by a pair of curly brackets and whose initial action was either a Loop or a Conditional. Conditionals are Actions which execute a set of Actions dependent on the result of some condition being True or False. An example of a conditional is as follows:

$result.True:
{
    Say You would like to select message $result
    GetSay(ptext)=mail.SelectNewMessage($result)
    $continue=False
}

The variable $result is tested to see whether it is true; if it is, the three actions within the scope of the conditional are executed. If the action is a Loop, CmdLoop 1210 determines the number of times the loop should be executed by loading the variable nloop in step 1216. Otherwise nloop is set to 1 in step 1218, since the current Action should be executed once. Next, CmdLoop calls HandleAction 1220. HandleAction is the Command Resolver method which parses the Action and its arguments, then executes the Action. A list of Actions and what they do is presented below with reference to Figure 15. If the Action caused HandleAction to return a change-of-context flag in step 1222 (NEWCONTEXT=T, i.e., Action = Load(newcontext)), CmdLoop terminates in step 1224.
If the command was a Loop command, HandleAction returns, since this Action is handled in CmdLoop. The loop variable "nloop" is compared to zero in step 1228 to determine whether CmdLoop should proceed to the next action. If nloop is not zero, CmdLoop determines whether there is a Subaction associated with the present Action in step 1230 (i.e., a set of curly brackets). If there is, CmdLoop 1210 becomes recursive onto itself and calls CmdLoop 1234 (within CmdLoop) to execute the scope of the Subaction, which may be nested as deeply as the script writer wishes.
CmdLoop proceeds to execute the Loop until nloop has been decremented (in step 1242) to zero (determined at step 1228), at which point CmdLoop either executes the next action in the outermost scope or the next action is a Null. If the next action is a Null, as determined in step 1232, CmdLoop terminates in step 1240.
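A recursive sketch of CmdLoop follows. The Action node is a simplified stand-in for the exch structure, and HandleAction is stubbed to flag only Load commands; both are assumptions for illustration.

#include <string>

// Simplified action node mirroring the exch linked list described in
// the Context Data Structures section.
struct Action {
    std::string name;      // e.g., "Say", "Load", "Loop"
    int loopCount;         // argument when the action is a Loop
    Action *subaction;     // nested scope in curly brackets
    Action *next;          // next action in the exchange
};

// Stub: in the real system HandleAction parses and executes the
// action; here it only reports whether a context change occurred.
bool HandleAction(const Action &a)
{
    return a.name == "Load";   // Load(newcontext) sets NEWCONTEXT=T
}

// Sketch of CmdLoop: walk the action list, run Loop scopes the
// requested number of times, and recurse into nested subactions.
bool CmdLoop(Action *a)
{
    while (a) {
        int nloop = (a->name == "Loop") ? a->loopCount : 1;
        while (nloop-- > 0) {
            if (a->name != "Loop" && HandleAction(*a))
                return true;             // context change: terminate
            if (a->subaction && CmdLoop(a->subaction))
                return true;             // propagate the termination
        }
        a = a->next;                     // next action in outer scope
    }
    return false;                        // exchange fully processed
}

int main()
{
    Action say{"Say", 0, nullptr, nullptr};
    Action loop{"Loop", 3, &say, nullptr};   // Loop(3) { Say ... }
    return CmdLoop(&loop) ? 1 : 0;
}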
HandleAction is the Command Resolver method which parses and executes the current Action. Applications commands handled by HandleAction are those which do not require interaction with the applications interface. For example, commands such as Load, WaitForDTMF, Say, and the like do not require interaction with the application. Commands which do, such as NextMessage, PreviousMessage, and the like, are executed by ProcessAppAction. ProcessAppAction communicates with the application via full duplex, message mode named pipes. There is an applications pipe for each VS. ProcessAppAction parses and formats commands for the applications interface, sends the request, and buffers the response so that it becomes available to the Virtual Session via the Session Data buffer Fdbufout. The flow of control for the Command Resolver is illustrated in Figure 13b.
The DTMF IVR system functions will now be described in more detail. Many of the functions available to users of Telephony Voice Server 310 in a particular context are mirrored by programmable DTMF keys. Telephony Voice Server 310 offers 11 programmable keys per context, including the * key. The # key is reserved for a TTS or digital file play interrupt. Examples of a keypad key being mapped for IVR functions are:
when #3, next message
{
    $direction='next'
    GetSay(ptext)=mail.NextMessage()
}

when !##
{
    $temp=$_DTMF_Result
    If ($temp='Null').true:
    {
        Say You \!- did \!- not \!- enter an account number! To use your account, say your account number, one digit at a time, or enter it from the keypad followed by the pound sign
    }
    If ($temp='Null').false:
    {
        $have_pin_number=mail.PinNumber($_DTMF_Result)
        $have_pin_number.true:
        {
            Load(GetUserKey.scp)
        }
        $have_pin_number.false: Say The account number you entered is not valid. Please reenter your account number followed by the pound sign;
    }
}

These two examples demonstrate how keypad keys may be programmed as an alternative to voice commands in performing various exchanges. In the first example, keypad 3 is programmed as an alternative to the user saying "next message"; in the second example, entering a sequence of more than one digit terminated by the # sign defaults to the second exchange, for which there is no voice alternative. This instance of IVR mapping is used as a way of entering a string of digits as data, such as a Pin Number. The Command Resolver treats voice commands and IVR maps in the same way. It associates exchanges with DTMF keys in the same way that it associates voiced phrases with an exchange index, as explained above with reference to Context Data Structures.
Definitions
1) Context of the script. For example: email2.scp, forward.scp, etc. This denotes being in the context of a script proper; the system is in the "listen state" and all commands associated with that script are available.
2) Context of the Exchange. All commands (in order to be useful) have an associated Exchange (whether they are DTMF, i.e., ##, or grammar based). The scope of the action is denoted by the enclosing curly brackets following the command definition (when word(s) or when ##). For example: when yes {do action}.
DTMF behavior implemented in the system is described below:
1) Context of Script or Exchange. Single DTMF within a 1.2sec window is correlated with an IVR map.
2) Context of Script. Multiple DTMF entries are mapped to an Exchange associated with the DTMF command line "when ##". For example, one may input a Pin Number without saying anything.
3) Context of Exchange. ReadDTMF is a multi-character input function which maps a DTMF string to the value of a variable. For example: $result=ReadDTMF.
4) Context of Exchange. WaitForDTMF allows multiple DTMF entry terminated by the # sign. The system waits for DTMF entry; a timer in this case defines an error condition, i.e., the user has failed to enter DTMF.
5) # means STOP TTS FROM TALKING.
Flow of control for IVR mapping is illustrated in Figures 14a, 14b, and 14c. Each digit, as it is entered, is recovered via the Telephony Monitor 240 thread as a TAPI notification. The system uses a TAPI-generated event to signal a response. The following TAPI messages are decoded:
LineReply
LineCallState
LineGenerateDigits
LineMonitorDigits
LineMonitorTone
When DTMF digits are entered, the TAPI event interrupts Telephony Monitor 240 and resolves to LineMonitorDigits. The LineMonitorDigits function records the digits in Session Data (SD); then, according to the particular LineID associated with the event, LineMonitorDigits calls the FeedFromDTMF function associated with the Command Resolver created in the context of the Virtual Session associated with the current LineID. Flow of control for FeedFromDTMF is illustrated in Figure 10a. FeedFromDTMF accumulates the digits, as they come in, into the SD variable dtmf_. FeedFromDTMF checks two state flags:
1) ReadyForDTMF. This flag is true if the system is currently looking for a DTMF digit to load into a variable, for instance when $variable = ReadDTMF has been encountered in the current script by HandleAction.
2) WaitForDTMF. The system is looking for a DTMF string to be input and terminated by a # sign.
If the function detects that the current entry is a # sign, or WaitForDTMF is true, or ReadyForDTMF is true, then SysInt is set true. This indicates that, if talking, the system should be forced to stop talking. If the current entry is a # sign, the safe_to_plink_ flag is set to true to signal the voice system that future mis-recognitions are not DTMF tones. The main function of FeedFromDTMF is to make a determination, based on state flags, whether the system is in the context of the script or in the context of an Exchange. The biggest difference in the actions of the system, given these two different states, is that for continuous digit entry in the context of a script a "when" Exchange must be mapped to a "##" as described above. The OOE (Out Of Exchange) flag indicates that the current context does support this mapping. In the context of an Exchange, WaitForDTMF allows multiple digit entry with a terminator. As illustrated in Figure 14b, if the system is in the context of an Exchange, it resolves digit entry for data or flow control rather than as an IVR mapping. The function DTMFFinished may stuff a script variable with a DTMF value if ReadyForDTMF is set, or set the "Heventscriptdtmf" event, allowing the RunScript thread to catch and decode an IVR map. If the WaitingForDTMF flag is detected, the system calls DTMFDigitsAreAvailable, which prompts BlockTillReady to look for a # sign in the current DTMF string. If found, the string is made available to the Command Resolver and the interpreter resumes execution of the script. As illustrated in Figure 14c, if the system is in the context of the script, the ASR is stopped and the system checks to see whether the script is mapped to a "when ##". If it is not, the single digit entry is resolved as an IVR map; otherwise the system determines whether all the digits are in, in order to satisfy the ## map.
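The digit handling can be summarized in a short sketch; the flag and variable names follow the text, while the omitted downstream calls (DTMFFinished, DTMFDigitsAreAvailable) are indicated only in comments.

#include <string>

// State flags checked by FeedFromDTMF, as described above.
struct DtmfState {
    bool ReadyForDTMF   = false;  // script wants one digit in a variable
    bool WaitForDTMF    = false;  // script wants a '#'-terminated string
    bool safe_to_plink_ = false;
    bool SysInt         = false;  // force the system to stop talking
    std::string dtmf_;            // accumulated digits (SD variable)
};

// Sketch of digit accumulation and the state decisions described in
// the text; the downstream processing is elided.
void FeedFromDTMF(DtmfState &s, char digit)
{
    s.dtmf_ += digit;                       // accumulate as digits arrive

    if (digit == '#' || s.WaitForDTMF || s.ReadyForDTMF)
        s.SysInt = true;                    // stop any speech in progress

    if (digit == '#') {
        s.safe_to_plink_ = true;            // later noise is not DTMF
        // Terminated entry: DTMFDigitsAreAvailable() would now release
        // BlockTillReady and hand dtmf_ to the Command Resolver.
    } else if (s.ReadyForDTMF) {
        // Single digit requested: DTMFFinished() would stuff the digit
        // into the waiting script variable ($var = ReadDTMF).
    }
    // Otherwise, in the context of the script, digits accumulate until
    // they satisfy a "when ##" IVR map for the current context.
}

int main()
{
    DtmfState s;
    s.WaitForDTMF = true;       // script is blocked in WaitForDTMF
    FeedFromDTMF(s, '1');
    FeedFromDTMF(s, '2');
    FeedFromDTMF(s, '#');       // terminator releases the interpreter
    return s.dtmf_ == "12#" ? 0 : 1;
}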
The telephony server scripting language provided as part of the present invention will now be described in more detail.
BeginCriticalSection/EndCriticalSection:
Action This command encapsulates script actions which must be executed irrespective of the present state of the system.
Syntax BeginCriticalSection
    Action1
    Action2
    ...
    ActionN
EndCriticalSection
Comment The CriticalSection command was designed to ensure that critical code, typically initialization code, is always executed. Parts of Exchanges may be aborted via the actions of the user. For instance, the user may decide to IVR map into another Exchange before the present Exchange has had a chance to initialize all of its critical variables. The CriticalSection command ensures that the system is always in a known state.
GetSay(ptext) Command:
Action The GetSay(ptext) command is used to retrieve and speak text strings from the application.
Syntax GetSay(ptext)=application.function(argument)
Comments The GetSay(ptext) command is used with functions provided by the applications interface. Application Interface functions return one of the following:
a) A text string.
b) A zero, i.e., 0. Zero means the function has failed.
GoTo Command:
Action Allows the system to GoTo an Exchange corresponding to a pre-mapped IVR "When statement".
Syntax GoTo #number
Comments This may be used as a facility to link together macro structures of Exchanges so as to smoothly create automatic actions. An example of the GoTo command is as follows:

when #1, next message
{
    GetSay(ptext)=mail.NextMessage()
    $remaining=mail.GetnumRemainingNew()
    WaitForTTS
    If ($remaining='0').True: Goto #8
    If ($remaining='0').False: Goto #1
}

when #3, read message
{
    $read=mail.MarkAsRead()
    GetSay(ptext)=mail.ReadMessage()
    WaitForTTS
    Goto #1
}

If() Command:
Action The If() command is used to test script variables.
Syntax If ($variable='value').True(False):
Comment The If command is used to test variable values and as an anchor for sub-Exchanges. An example of its use is the following:

If ($result='*').True:
{
    GetSay(ptext)=mail.ReadMessage()
    $result=False
    $continue=False
}
Load() Command:
Action The Load command allows the system to load another script and jump from the present context to the context specified by the argument of the Load function.
Syntax Load(scriptname.scp)
Comment The scriptname.scp argument must point to a valid script. A valid script is one whose path may be resolved. In the present system all active scripts appear in specific directories given to the system via environment variables. Also, the script must appear in the ".ini" file for the system. This enables the Virtual Session parser to include the script in the system context tree. An example of the Load command is:

when send reply, !reply to message
{
    Load(reply.scp)
}

where "reply.scp" is a valid script.
Loop() Command:
Action The Loop command executes an Exchange the number of times specified in the argument of the Loop command.
Syntax Loop($variable (constant))
Comment The Loop command executes a sub-Exchange the number of times specified in the argument of the command. An example of the Loop command is as follows:

Loop($next)
{
    $continue.True:
    {
        GetSay(ptext)=mail.ReadNewHeader(0)
        If ($result='*').True:
        {
            GetSay(ptext)=mail.ReadNewMessage(0)
            $result=False
            $continue=False
        }
        $result.True:
        {
            GetSay(ptext)=mail.ReadNewMessage($result)
            $continue=False
        }
    }
}
Main Exchange:
Action The Main Exchange is the default Exchange for a script. Each script may have only one Main Exchange. A script is not required to have a Main Exchange; however, if it has one, the Main Exchange is automatically executed as soon as the script is loaded by the system.
Syntax Main
{
    action1
    action2
    ...
    actionN
}
Comments Upon loading the script the system automatically executes the Main Exchange. Typically the script writer would use this Exchange to:
a) Inform the user of the present context.
b) Initialize local scripting variables.
c) Communicate with the application to initialize variables associated with the application.
Nested Exchanges:
Action Nested Exchanges enable the script writer to execute a sub-Exchange within the scope of an Exchange, given some appropriate command as an anchor.
Syntax
{
    action1
    ...
    anchor
    {
        action1
        ...
        anchor
        {
            action1
            ...
            actionP
        }
        actionM
    }
    actionN
}
Comments Nested Exchanges may be nested as deeply as the script writer wishes. Anchors for Nested Exchanges may be:
a) Variable test, i.e., $name.true(false):.
b) Loops, i.e., Loop(variable).
c) If statement, i.e., If ($name='value').True:.
Basically, anchors may comprise language structures which allow execution of an Exchange based on the outcome of some test.
PlayWav Command:
Action This command is used in order to play a recorded file.
Syntax PlayWav(file.wav)
Comments The PlayWav command may be used to play recorded files. An example of PlayWav is:

when play file, !play greeting
{
    PlayWav(greeting.wav)
}
ReadDTMF Command:
Action The ReadDTMF command is used to alert the system that logic in the current Exchange requires non-blocking keypad entry. ReadDTMF is equated to a script variable which holds the value of the next keypad entry the user makes.
Syntax $result=ReadDTMF
Comments The ReadDTMF function is used for any Exchange which requires non-blocking DTMF entry. The advantage of this function is that it works with a script variable; thus all the operations which a script variable allows are available to it. For example:

$result.True:
{
    Say You would like to select message $result
    GetSay(ptext)=mail.SelectNewMessage($result)
    $continue=False
}

In this example the variable $result is initialized upon keypad entry, at which point the script may manipulate the result of keypad entry to be:
a) Used as a logical variable to be tested.
b) Inserted into a text string to be spoken.
c) Used as an argument to an application function.
Say Command:
Action The Say command spools text to the TTS, thus enabling the system to speak whatever text is contained in the argument of the Say command.
Syntax Say text argument
Comments The text argument above is a text string which the script writer fills in. It may either be a constant string or it may be a string with embedded script variables. For example, given $forcast = 'bright and sunny':
a) Say the weather will be bright and sunny.
b) Say the weather will be $forcast.
In both cases above the system will speak the text string following the Say command.
Script Variables:
Action Script Variables may be defined by the script writer within the context of a script. Script Variables have a content value and a Boolean value.
Syntax $name
Comments The $ denotes a script variable. The "name" of the script variable may be any number of alphanumeric characters up to 20 characters in length. The name of the variable may contain underscores, i.e., $welcome_complete, $old_messages, and the like. Each script variable has the scope of the script; even if a script variable is defined in the context of an Exchange, its meaning is valid throughout the script. Each script variable has two values:
a) Content value. The value of a literal string.
b) Boolean value. If the variable has been initialized via its content value, its Boolean value is True; if the content value has yet to be initialized, its Boolean value is False.
Script variables may be equated to other script variables, i.e., $variable1=$variable2. Script variables may be tested to determine whether they are initialized or not:

$welcome.true:
{
    action1
    ...
    actionN
}
$welcome.false:
{
    action1
    ...
    actionN
}

In the examples above a script variable is being tested for its state of initialization, and depending on its current state a sub-Exchange of actions is being executed. The syntax for testing a variable and executing a scope of actions dependent on the results of that test is:

$name.true(false):
{
    action1
    ...
    actionN
}

Script variables may also be inserted into text strings, i.e.:

$weather_forcast = 'hot and sunny'
$forecast = 'Todays weather will be' $weather_forcast

Then $forecast = 'Todays weather will be hot and sunny'.
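The two-valued behavior of script variables can be modeled in a few lines of C++; the class below is an illustrative model of the content/Boolean pairing described above, not the system's actual parser representation.

#include <string>

class ScriptVariable {
    std::string content_;
    bool        initialized_ = false;
public:
    void Set(const std::string &value)
    {
        content_ = value;
        initialized_ = true;        // Boolean value becomes True
    }
    bool AsBool() const { return initialized_; }   // $name.true/.false
    const std::string &AsText() const { return content_; }
};

int main()
{
    ScriptVariable forecast;                   // $forecast.false here
    forecast.Set("hot and sunny");             // now $forecast.true
    // Insertion into a text string, as in the example above:
    std::string say = "Todays weather will be " + forecast.AsText();
    return forecast.AsBool() && !say.empty() ? 0 : 1;
}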
Terminate Command: Action The Terminate command may be called in a script when the user has indicated they wish to terminate the current session. It reinitializes the Virtual Session to accept the next phone call.
Syntax Terminate
Comment An example of the Terminate command is when the user says "goodbye", i.e., the appropriate Exchange is as follows:
when goodbye
{
    GetSay(ptext)=mail.GoodBye()
    Terminate
}
Token Variables:
Action Token variables are variables which may become grammar rules. Token variables may be manipulated like script variables, i.e., they may be equated to script variables, they may be inserted into a text string, and they have a Boolean value of True or False depending on their state of initialization.
Syntax <name> = grammar;
Comments Token variables may be associated with script-dependent grammars or may represent additions to script grammars whose origin is the application. This is a way of incorporating into the script grammar information accessed through the application. The origin of this information may be databases, the network, PIMs, and the like. Token variables may also appear in Main Exchanges as script-specific sub-grammars:

<digits>=1|2|3|4|5|6|7|8|9;

Token variables appear in When Statements as embedded sub-grammars:

when please read message <!digits>+, !read message <digits>+
{
    Say Getting new message $<digits>
    GetSay(ptext)=mail.SelectNewMessage($<digits>)
}

In this example the Token variable <digits> is used in the grammar to include the user saying any combination of the constant grammar on the "When line" and an arbitrary string of connected digits specified by <digits>. <digits>+ means any number of digits specified by the definition of <digits>. In this case the definition of <digits> is read: one or two or three or four ... or nine. The exclamation mark in "<!digits>" means that in the help system the system should speak the word "digits" in the command "please read message digits" instead of inserting the definition of the Token. In the Exchange, for the action "Say Getting new message $<digits>", the system replaces the Token variable $<digits> with the string corresponding to the numbers which the user has spoken. For example, if the Exchange was executed because the user said "please read message five one", the Token variable inserted into "Getting new message" is "five one"; thus the system says to the user "Getting new message five one".
WaitForDTMF Command:
Action The WaitForDTMF command is used when exchange logic requires blocking, terminated DTMF entry.
Syntax WaitForDTMF
Comments WaitForDTMF interrupts the execution of a script to wait for DTMF entry. This is typically used when the system requires keypad entry to continue, as in the case of a Pin Number. The user terminates keypad entry with the # key. Upon termination the system resumes execution of the script and makes the keypad data available to the system. If no entry is detected within a timeout time set via "System.Cmd.SetTimeout(timeout)", the system aborts the Exchange. The following Exchange is an example of WaitForDTMF:

Loop(3)
{
    WaitForDTMF
    $result=mail.PinNumber($_DTMF_Result)
    $result.true: Load(reply.scp)
    $result.False: Say Try again to enter your pin number;
}
WaitForTTS Command:
Action The WaitForTTS command is used in order to impose flow control on the execution of the script. Since the execution of the script does not necessarily block during the time the TTS is speaking, the script writer may impose this constraint in certain instances.
Syntax WaitForTTS
Comments Used for flow control of the script.
When Statement:
Action The When Statement associates spoken phrases or keypad maps or special directives with sets of actions grouped into user exchanges.
Syntax when phrase1, !phrase2, ..., #number, !phrase3
{
    action1
    action2
    ...
    actionN
}
or
when ##
{
    action1
    action2
    ...
    actionN
}
or
when #timeout#
{
    action1
    action2
    ...
    actionN
}
Comments The first instance of the When statement above associates phrases one through three with the Exchange which follows it. The Exchange is defined by the group of actions following the When line and is delineated by the outermost curly brackets enveloping the actions. The exclamation marks preceding phrases two and three exclude these phrases from the automated help system. All phrases without preceding exclamation marks are included in an automated help system invoked by the function "SayHelpCommands" (see SayHelpCommands). The sequence "#number" in the first When statement denotes an IVR map to the keypad number "number". Numbers preceded by the # sign flag an IVR mapping between the When statement's Exchange and the keypad number following the # sign. The double pound sign in the second When statement, ##, denotes an association between keypad entry of a string of numbers and the Exchange associated with the corresponding When statement. For example, the "when ## {....}" statement may be used to enter a Pin Number without the user having to say anything. The last When statement maps a special variable to an Exchange. The variable "$_Timeout_String" may be mapped to an Exchange, namely the "when #timeout# {....}" Exchange. This may be used in conjunction with a script programmable timer to take some action if a time period is exceeded without user action.
Figure 15 is a state transition diagram for the Telephony Server system process according to the preferred embodiment of the invention. Each Virtual Session in the system has voice resources, play/record facilities, and a Command Resolver. The functional interrelation between these elements is illustrated in the Virtual Session system flow diagram. Referring to Figure 15, the state changes, as denoted by numbers in the flow diagram, are defined as follows:
Transition 1: Non-recognition; go from Listen to Beep to alert the user.
Transition 2: Beep finishes playing; upon completion go to Listen.
Transition 3: Probability of Recognition less than threshold; Beep to alert user.
Transition 4: Recognition event; check threshold.
Transition 5: Recognition above threshold; go to Resolver.
Transition 6: Good Exchange index; go to CmdLoop.
Transition 7: Exchange processed; go to Listen state.
Transition 8: Exchange contains nested exchanges; process nested exchange.
Transition 9: Action parsed in exchange; Handle Action.
Transition 10: Action parsed in nested exchange; Handle Action.
Transition 11: Action queues the system to speak.
Transition 12: Stop-speaking notification detected by system.
Transition 13: After TTS has completed, go to next action in the exchange.
Transition 14: Handle Action has determined that this action requires communication with the application. Actions of the form "application.function" require communication with the application.
Transition 15: Action requires addition of new grammar via grammar tokens.
Transition 16: TTS has stopped and there are no more actions in the exchange; check system state.
Transition 17: System state permits transition to Listen state.
Transition 18: Context switching complete; context initialized, go to Listen state.
Transition 19: Handle Action has found a Load command within the current exchange; system transitions to new context.
The electronic mail services provided in the preferred embodiment will now be described with reference to Figure 19. The services provided are based on Internet mail standards. Simple Mail Transfer Protocol (SMTP) is used to exchange messages between Internet mail servers. Post Office Protocol 3 (POP3) is utilized by Internet mail clients to retrieve messages. The system implements each protocol, allowing it to receive and/or retrieve Internet e-mail messages for users. Users retrieve messages through a telephone interface.
As illustrated in Figure 19, the e-mail system comprises five primary components. Message Polling subsystem 1910 retrieves e-mail messages using POP3. Message Receiving subsystem 1912 receives messages from SMTP servers. Message Delivery subsystem 1914 processes and stores messages in the MyInbox system 1916. Message Sender subsystem 1918 formats and sends (via SMTP) outgoing replies and forwards. Web Service 1920 provides user personal profile maintenance and system administrative tools.
The diagram in Figure 16 illustrates the relationships between the components of the message polling subsystem. The Polling Subsystem actively retrieves messages by establishing POP3 connections to the user's electronic mail system. Available messages are checked against the list of messages retrieved during previous sessions. Those messages identified as new are copied into the system.
Polling subsystem 1910 may comprise two components, Account Scheduler 1610 and Message Poller 1612. Generally, processing proceeds as follows:
1. The Poller requests an account from the Account Scheduler.
2. The Scheduler selects an account from the database and returns it to the Poller.
3. The Poller attempts to establish a connection with the user's POP3 server. If successful, the Poller logs in, using credentials provided by the user during sign-up.
4. A list of available messages is retrieved and compared with those known to have been downloaded in a previous session. New messages are downloaded and processed by the Message Delivery Agent.
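The cycle above can be summarized with the following C++ skeleton. The scheduler, POP3, and delivery calls are hypothetical stubs; a real implementation would issue POP3 commands (UIDL/LIST, RETR) against the user's server using the stored credentials.

#include <set>
#include <string>
#include <vector>

struct Account { std::string host, user, password; };

// Hypothetical stubs standing in for the subsystems described above.
Account NextAccount()                                // Account Scheduler
{
    return {"pop.example.com", "user", "secret"};
}
std::vector<std::string> ListMessageIds(const Account &)
{
    return {"uid-1", "uid-2"};                       // POP3 UIDL listing
}
std::string Retrieve(const Account &, const std::string &id)
{
    return "message body for " + id;                 // POP3 RETR
}
void Deliver(const std::string &)                    // Message Delivery Agent
{
}

// One polling cycle: fetch an account, compare available messages with
// those seen in previous sessions, and deliver only the new ones.
void PollOnce(std::set<std::string> &seen)
{
    Account acct = NextAccount();                    // steps 1-3
    for (const std::string &id : ListMessageIds(acct)) {   // step 4
        if (seen.count(id))
            continue;                                // already downloaded
        Deliver(Retrieve(acct, id));                 // new message: deliver
        seen.insert(id);                             // remember for next time
    }
}

int main()
{
    std::set<std::string> seen;
    PollOnce(seen);     // downloads uid-1 and uid-2
    PollOnce(seen);     // second pass finds nothing new
    return seen.size() == 2 ? 0 : 1;
}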
Figure 17 illustrates the relationships between the components of message receiving subsystem 1912. Message Receiving Subsystem 1912 receives messages sent to a user's account via SMTP server 1710. Messages enter the system through a program called MetaInfo Sendmail 1712, an implementation of the industry standard SMTP server. Sendmail in turn invokes the Message Receiver's remaining components, the Uagent program 1714 and Message Handler 1716. Generally, processing proceeds as follows:
1. An external SMTP server connects to the sendmail server and transmits a message.
2. Sendmail invokes Uagent, a specific implementation of a local delivery agent, or LDA. The LDA's responsibility is to deliver messages to a local user and indicate to sendmail whether the operation completed with or without errors. 3. Uagent in turn locates a Message Handler instance, reads the message, and hands it off to the Handler for further processing.
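A minimal sketch of the Uagent role follows, assuming the conventional LDA contract in which sendmail pipes the raw message to the agent's standard input and reads its exit status; the MessageHandler hand-off shown is an illustrative placeholder.

```java
import java.io.ByteArrayOutputStream;

public class Uagent {
    public static void main(String[] args) {
        try {
            String recipient = args.length > 0 ? args[0] : "unknown";
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            System.in.transferTo(raw);                                  // read the full message from sendmail
            new MessageHandler().handle(recipient, raw.toByteArray());  // hand off for further processing
            System.exit(0);                                             // delivered without errors
        } catch (Exception e) {
            System.exit(75);                                            // EX_TEMPFAIL: sendmail will retry later
        }
    }
}

class MessageHandler {
    void handle(String recipient, byte[] rawMessage) { /* further processing */ }
}
```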
Message Delivery Agent 1914 processes messages, storing summary information and text-to-speech translations in Oracle database 122. Complete message contents are inserted into a file-system-based message store. Message Delivery Agent 1914 is not a free-standing program, but an object component used by both inbound message processing subsystems. Its functions, sketched in condensed form after the list, include:
1. Authenticating users
2. Reading user specific database information
3. Parsing message headers and bodies
4. Applying exclusion and priority filters
5. Message analysis
6. Performing text-to-speech translations of the primary message body
7. Storing message data into the database
8. Storing complete message texts into a message store
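The eight functions above might be composed as in the following condensed Java sketch; every method name is an illustrative placeholder for the corresponding numbered step rather than the actual object's interface.

```java
import java.util.Properties;

public class MessageDeliveryAgent {
    public void deliver(String user, byte[] rawMessage) {
        if (!authenticate(user)) return;              // 1. authenticate the user
        Properties profile = loadProfile(user);       // 2. read user-specific database information
        String body = parseBody(rawMessage);          // 3. parse message headers and bodies
        if (isExcluded(profile, body)) return;        // 4. apply exclusion and priority filters
        int priority = priorityOf(profile, body);
        analyze(body);                                // 5. message analysis
        String speakable = ttsTranslate(body);        // 6. TTS translation of the primary body
        storeSummary(user, priority, speakable);      // 7. store message data in the database
        storeFullText(user, rawMessage);              // 8. store complete text in the message store
    }

    private boolean authenticate(String user) { return true; }
    private Properties loadProfile(String user) { return new Properties(); }
    private String parseBody(byte[] raw) { return new String(raw); }
    private boolean isExcluded(Properties p, String body) { return false; }
    private int priorityOf(Properties p, String body) { return 0; }
    private void analyze(String body) { }
    private String ttsTranslate(String body) { return body; }
    private void storeSummary(String user, int priority, String speakable) { }
    private void storeFullText(String user, byte[] raw) { }
}
```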
Referring now to Figure 18, Message Sender 1810 is responsible for the preparation and delivery of user-created reply and forward messages. In rather simple fashion, Message Sender 1810 monitors a queue of outgoing messages. As outgoing messages are discovered, they are removed from the queue, prepared for delivery by sendmail 1812, and transmitted through SMTP server 1814. Generally, the processing steps are as follows (sketched after the list):
1. A Sender monitors the outgoing message queue for new forwards and replies
2. The message is read, merged with user specific information, and formatted for delivery
3. The sendmail server is contacted for actual message delivery.
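A minimal sketch of the Sender loop is given below, assuming a simple in-memory queue; the OutgoingMessage type and the merge/format helpers are assumptions for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MessageSender implements Runnable {
    private final BlockingQueue<OutgoingMessage> queue = new LinkedBlockingQueue<>();

    @Override
    public void run() {
        while (true) {
            try {
                OutgoingMessage m = queue.take();       // 1. wait for a new forward or reply
                String rfc822 = format(merge(m));       // 2. merge user data and format for delivery
                submitToSendmail(rfc822, m.recipient);  // 3. contact sendmail for actual delivery
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private OutgoingMessage merge(OutgoingMessage m) { return m; }      // add user-specific data
    private String format(OutgoingMessage m) { return m.body; }         // build the message text
    private void submitToSendmail(String message, String recipient) { } // SMTP submission
}

class OutgoingMessage { String recipient; String body; }
```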
The Java Web Server, while not directly involved with the receipt, processing or delivery of messages, hosts several critical interfaces. The overwhelming majority of these interfaces are implemented with the Java Servlet API.
End-user functionality includes registration, POP3 account configuration, exclude and priority filters, predefined responses, and the personal directory. Administrative interfaces include usage reports, corporate account management, server configuration, and service monitoring and control.
Figure 20 illustrates a process for creating or maintaining a user profile using a web-based interface. In the first step of this process, in Block 2002, the user accesses the server using an industry standard web browser from any Internet-connected computer. The user identifies his account and enters a passcode to obtain access to his individual profile, as illustrated in Block 2004. The user may then, as illustrated in Block 2006, enter personal directory information. This information may include at least the first name, last name, and e-mail address of persons to whom e-mail messages may be regularly forwarded. If the name entered in the personal directory is difficult to pronounce, it is useful to spell the name phonetically or to use a nickname instead of a first and last name. In enhanced embodiments of the invention, the personal directory may include other information such as telephone numbers.
The user may also, as illustrated in Block 2008, create and edit personalized, pre-set standard reply messages. Any number of these messages may be created and they may be updated at will. The information entered includes a reply message name by which the reply message will be specified in the voice control mode. In addition, a personalized message is entered. For example, the reply message name "Thanks" might be associated with the message "Thanks for the e-mail, I heard it while driving home and will get back to you."
A message priority list may also be created in the user profile, as illustrated in Block 2010. The user may enter any of the following in corresponding data fields: sender name, sender e-mail address, sender domain, subject line text keywords, and message body text keywords. In operation, if any of these fields match the corresponding characteristics of an incoming e-mail, that e-mail will be designated for priority delivery and will be delivered by voice e-mail before those messages not enjoying similar priority.
Similarly, an exclude message list may be created and edited using the web browser interface (as illustrated in Block 2012). Messages may be excluded by sender name, sender e-mail address, sender domain, or subject line text. Finally, account information may be reviewed, modified, and accounts cancelled if desired using the web browser personal profile interface (Block 2014).
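One way the priority (Block 2010) and exclude (Block 2012) lists might be applied to an incoming message is sketched below; the Rule record mirrors the data fields named above, and the simple substring matching is an assumption of this sketch.

```java
import java.util.List;

public class MessageFilters {
    // One entry of the priority or exclude list; null fields are not used for matching.
    record Rule(String senderName, String senderAddress, String senderDomain,
                String subjectKeyword, String bodyKeyword) { }

    /** Returns true if any populated field of any rule matches the incoming message. */
    static boolean matches(List<Rule> rules, String from, String subject, String body) {
        for (Rule r : rules) {
            if (contains(from, r.senderName()) || contains(from, r.senderAddress())
                    || contains(from, r.senderDomain())
                    || contains(subject, r.subjectKeyword())
                    || contains(body, r.bodyKeyword())) {
                return true;
            }
        }
        return false;
    }

    private static boolean contains(String text, String needle) {
        return needle != null && text != null
                && text.toLowerCase().contains(needle.toLowerCase());
    }
}
```

A match against the priority list moves the message ahead in the reading order, while a match against the exclude list suppresses telephone delivery entirely.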
The present invention may be provided in a number of different embodiments, each of which may include various modifications and additional features. One particularly significant feature of the preferred embodiment is web-based user profile entry. This feature permits the user to access his or her profile from any location using Internet 130, thereby customizing the operation of the user's account at will. The user profile may include personal address lists, preferences with respect to the order in which e-mail messages are read during access (such as identifying particular senders for priority handling, or senders whose messages should not be read over the telephone, e.g., newsletters), form e-mail replies which are individualized for the particular user, and names and keywords which are likely to be spoken by the user during mail retrieval. Where the personal address lists include an entry and a telephone number associated with that entry, a voice dialing feature may be provided in which the voice command "dial <<name>>" causes placement of a telephone call to <<name>> from the personal address list.
The system may conduct searches in response to a voice command, based on the stored personal profile. For example, a search-for-sender function may be provided ("read me the messages in my mailbox from Bill Clinton"). As another example, when a list of search keywords has been provided in the user profile, the system will load those keywords as vocabulary where appropriate, so that (for example) the user may request that mail including those keywords be read. For example, if "purchase order" is a keyword defined in the personal profile, the user may ask the system to "read me messages with subject: purchase order," and the system will recognize the words "purchase order" and select those messages including the keywords as specified.
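Once the spoken keyword has been recognized, the subject-keyword selection might reduce to a filter of the following form; MessageSummary and its fields are assumptions of this sketch, not the system's actual data model.

```java
import java.util.List;
import java.util.stream.Collectors;

public class KeywordSearch {
    record MessageSummary(String sender, String subject) { }

    /** Selects the messages whose subject contains the recognized profile keyword. */
    static List<MessageSummary> withSubjectKeyword(List<MessageSummary> inbox, String keyword) {
        String k = keyword.toLowerCase();
        return inbox.stream()
                .filter(m -> m.subject().toLowerCase().contains(k))
                .collect(Collectors.toList());
    }
}
```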
In one preferred embodiment, mail preprocessing is provided. The mail preprocessing feature uses a table correlating certain symbol strings with other words. When the mail is processed, predetermined symbols or series of symbols are replaced by predetermined words before the mail is "read" to the user. The equivalence table provides full equivalent phrases as replacements for commonly used acronyms, and provides aurally recognizable equivalent words or phrases as replacements for "emoticons." For example, ";)" may be replaced by "wink."
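A minimal sketch of such an equivalence table follows; the entries shown are examples in the spirit of the text, not the system's actual table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MailPreprocessor {
    private static final Map<String, String> EQUIVALENTS = new LinkedHashMap<>();
    static {
        EQUIVALENTS.put(";)", "wink");                    // emoticon -> aurally recognizable word
        EQUIVALENTS.put(":-)", "smile");
        EQUIVALENTS.put("BTW", "by the way");             // acronym -> full equivalent phrase
        EQUIVALENTS.put("FYI", "for your information");
    }

    /** Replaces each known symbol string before the message is "read" to the user. */
    public static String preprocess(String text) {
        for (Map.Entry<String, String> e : EQUIVALENTS.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }
}
```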
The system applies several unique features to increase processing speed. When processing verbal commands, the system loads a predetermined limited vocabulary which is context-appropriate to the function being performed. In this way, the system need only compare the user's spoken command to a limited number of possible vocabulary words or phrases to identify the intended command. Then, the system compares text strings rather than comparing recorded files while processing verbal commands. The use of dynamically loaded grammars and the preprocessing of records while other records are streaming each increase response speed.
The system uses a prompt when it is ready to receive a voice command. In a preferred embodiment, this prompt is a "plink" sound. Failure to recognize the user's command as one of the current vocabulary items is indicated by a different prompt, such as a double plink.
In another preferred embodiment, the system is provided with specific methods of translating visual cues into audible cues in cases where an e-mail message includes such cues. For example, HTML pages contain a variety of formatting, including positioning, graphical features, and variations in text appearance. Bold text, bullets, and other formatting may also be included in any message. In the preferred embodiment, a standardized library of sounds, tones, words, changes in voice timbre, and other audible indicators is used as the message is read, in place of the formatting, to reflect visual presentation which is important to a full appreciation of the message, yet which would not otherwise be conveyed in a purely audible transmission.
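Such a standardized cue library might map visual markup to audible indicators as in the sketch below; the specific mappings are illustrative assumptions rather than the disclosed library.

```java
import java.util.Map;

public class FormattingCues {
    // Visual formatting encountered while reading is replaced by an audible indicator.
    private static final Map<String, String> CUES = Map.of(
            "<b>", "[tone: emphasis on]",
            "</b>", "[tone: emphasis off]",
            "<li>", "[sound: bullet]",
            "<h1>", "[voice: heading timbre]");

    static String toAudibleCue(String markup) {
        return CUES.getOrDefault(markup.toLowerCase(), "");
    }
}
```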
The system preferably incorporates special methods for relaying a threaded e-mail to a user. The term threaded e-mail refers to a message which is a forward of or reply to one or more messages and incorporates those previous messages in its text. Threaded e-mail may be identified, and individual messages within the e-mail may be parsed, by processing of message headers included in the e-mail text, counting leading > symbols placed before the message by e-mail clients, and other methods which take into account and process the format imposed on threaded messages by various e-mail clients.
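The counting of leading > symbols mentioned above reduces to a small routine such as the following sketch.

```java
public class ThreadParser {
    /** Depth 0 is the newest text; each leading '>' marks one older message in the thread. */
    static int quoteDepth(String line) {
        int depth = 0, i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '>') { depth++; i++; }
            else if (c == ' ' || c == '\t') { i++; }  // tolerate "> > >" style quoting
            else break;
        }
        return depth;
    }
}
```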
When a threaded e-mail is identified, aurally recognizable clues to the organization of the messages may be supplied as the e-mail is read, such as "Bob said" ... "you replied with" ... "Jim said"... and the like. Alternatively, other auditory-based differentiation may be used, such as using different speaking voices or pitches for different writers. In addition to identifying the individual responsible for each part of the e-mail, the messages making up the e-mail may then be read or not read, selectively, based on the stored user profile. In cases where a reply is interleaved with an original message, it may be desirable to read the entire message, with some identification for different sections, such as "you said" and "Bob said." As another option, the user may be dynamically prompted to decide whether he or she wishes to hear an original or forwarded message which is part of an incoming e-mail. Electronic "gisting" technology, in which an expert system is applied to a passage to automatically summarize it, may be applied to e-mails exceeding a predetermined length. Then, only a summarized or "bullet point" version of the message is provided in the first read. The user is preferably informed that the message has been gisted, and provided with the option of hearing the entire message.
When the user provides a spoken command during reading of e-mail, the system selectively responds to the command to either stop reading or continue reading, while implementing the command. The stop/no stop operation is determined both by context and by the nature of the command. For example, the system does not stop reading on receipt of a "speak louder" or "speak faster" command, but stops in response to "send reply."
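A minimal sketch of the stop/no-stop decision follows; the command lists are assumptions drawn from the examples in the text.

```java
import java.util.Set;

public class BargeInPolicy {
    // Commands that merely adjust playback do not interrupt reading.
    private static final Set<String> NON_STOPPING =
            Set.of("speak louder", "speak softer", "speak faster", "speak slower");

    /** Returns true if reading should stop before the recognized command is executed. */
    static boolean stopsReading(String command) {
        return !NON_STOPPING.contains(command.toLowerCase());
    }
}
```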
Preferably, the noise floor on the telephone line connecting the user to the system is detected, and the recognition threshold of the voice recognition engine is changed dynamically based on the level of the noise floor. If the noise level is high, a higher level of certainty may be required before recognition of a command occurs. If the noise level is low, it may be possible to recognize a command with a lower level of certainty. Where the voice recognition engine operates based on a hidden Markov model, the depth of the Markov "tree" may be changed dynamically based on the noise level to achieve changes in the recognition threshold. In particular, the tree depth may be increased in the presence of more noise, and reduced in the presence of less noise.
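The adaptive-threshold behavior might be sketched as follows; the numeric ranges and the linear mapping are assumptions chosen only to illustrate the direction of the adjustment.

```java
public class AdaptiveThreshold {
    /**
     * Maps a measured noise floor (0.0 = quiet line, 1.0 = very noisy) to a
     * recognition confidence threshold: more noise demands more certainty.
     */
    static double recognitionThreshold(double noiseFloor) {
        double min = 0.50, max = 0.85;                        // assumed bounds
        double n = Math.max(0.0, Math.min(1.0, noiseFloor));  // clamp to [0, 1]
        return min + n * (max - min);
    }

    /** Analogous mapping for the depth of the hidden Markov search tree. */
    static int markovTreeDepth(double noiseFloor) {
        double n = Math.max(0.0, Math.min(1.0, noiseFloor));
        return 3 + (int) Math.round(n * 4);                   // deeper tree when noisier
    }
}
```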
Polling of e-mail addresses supplied by users preferably occurs adaptively. That is, users who historically receive a high volume of e-mail, or a high volume during particular periods in the day, will have their mailboxes polled more often (or more often during historical peak times) than users typically receiving a low volume of mail. For example, business users may receive little e-mail during the evening, while home users may receive more of their e-mail at those times. The time zones in which business users conduct most of their business may also impact e-mail delivery patterns. Whatever the pattern of typical e-mail delivery, it is generally desirable to poll mailboxes in proportion to the likelihood that there is actually mail to be retrieved. This feature of the invention makes it possible to efficiently allocate scarce bandwidth and computing resources directed to polling a large number of mailboxes, and contributes to the large scale capacity of the system according to the present invention.
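Proportional scheduling of this kind might be sketched as below; the interval bounds and the inverse-volume formula are illustrative assumptions.

```java
public class AdaptivePollScheduler {
    /**
     * @param hourlyVolume average number of messages this account historically
     *                     receives during the current hour of the day
     * @return seconds until the mailbox should next be polled
     */
    static long nextPollDelaySeconds(double hourlyVolume) {
        long maxDelay = 3600;   // quiet mailboxes: poll at most hourly (assumed)
        long minDelay = 120;    // busy mailboxes: poll every two minutes (assumed)
        if (hourlyVolume <= 0) return maxDelay;
        long delay = (long) (maxDelay / (1.0 + hourlyVolume)); // poll in proportion to volume
        return Math.max(minDelay, delay);
    }
}
```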
As another feature of the invention, an experience level indicator is maintained for each voice command, within each user profile. The experience level indicator illustrates the user's familiarity with each available voice command. When the user has successfully used a voice command or other system feature, the experience level indicator is changed to reflect that expertise. If the user has demonstrated successful use of a feature several times, then going forward, a reduced level of instruction and assistance may be provided in voice dialog scripts during use of the system when that feature is made available or is in use.
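The experience level indicator might be kept as a per-command counter, as in the sketch below; the threshold of three successful uses is an assumption for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class ExperienceTracker {
    private final Map<String, Integer> successCount = new HashMap<>();

    /** Called when the user successfully completes a voice command. */
    void recordSuccess(String voiceCommand) {
        successCount.merge(voiceCommand, 1, Integer::sum);
    }

    /** Experienced users hear the terse prompt; novices hear the full instruction. */
    String promptFor(String voiceCommand, String fullPrompt, String tersePrompt) {
        return successCount.getOrDefault(voiceCommand, 0) >= 3 ? tersePrompt : fullPrompt;
    }
}
```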
It should be noted that the present invention also provides that voice commands from a user may be received and acted upon either during silence between outputs of the text-to-speech engine or during the time that the text-to-speech engine is sending voice to the user. Thus, in a first embodiment (half duplex), voice commands may be received only when the text-to-speech engine (or voice prompts) are not playing. In a second embodiment (full duplex), a "voice barge-in" feature may be provided, whereby a user may talk over prompts or the text-to-speech engine with commands. Thus, for example, during reading of an e-mail message in a full duplex embodiment, a user may say the command "cancel" or "stop" to stop reading of a message, as opposed to a DTMF input. In a full-duplex embodiment, an echo cancellation circuit (similar to that used by a speakerphone) may be used to prevent voice prompts or e-mail messages from being perceived as voice inputs. One method of the present invention is to anticipate reaction to voice dialog and retrieve data in anticipation of such voice dialog. For example, when gathering up the header information for the e-mail messages to be read, only a portion of such data (e.g., the first five messages) may be initially retrieved, such that the text-to-speech engine can read header data while other header data is being received, so as to maintain a continuous speech output without interruptions or pauses which would be annoying to a user.
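The anticipatory retrieval just described might look like the following sketch: the first batch of headers is spoken while the remainder downloads in the background. The batch size of five and the fetchHeaders call are assumptions of this sketch.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HeaderPrefetcher {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    void readHeaders(List<String> messageIds) throws Exception {
        int batch = Math.min(5, messageIds.size());           // e.g., the first five messages
        Future<List<String>> rest = pool.submit(
                () -> fetchHeaders(messageIds.subList(batch, messageIds.size())));
        speak(fetchHeaders(messageIds.subList(0, batch)));    // speak while the rest downloads
        speak(rest.get());                                    // remainder is ready without a pause
        pool.shutdown();
    }

    private List<String> fetchHeaders(List<String> ids) { return ids; }   // placeholder fetch
    private void speak(List<String> headers) { /* hand the text to the TTS engine */ }
}
```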
The present invention is embodied in an apparatus, system, and method employed by the assignee of the present application, CrossMedia. The present invention may be demonstrated by calling 877-246-DEMO.
The following is a list of components of the present embodiment of the present invention:
XM Resource Manager - The XM Resource Manager abstracts the voice user interface application from the core speech technology engine. By leveraging this advanced feature, any application written with CrossMedia's technologies may instantly take advantage of new innovations in voice technology as soon as they are commercially available. The open architecture approach of the XM Resource Manager allows users and applications to be insulated from the complexities of the underlying voice technology, simplifying programming and speeding the adoption of new technology innovations.
XM Dialog Manager - CrossMedia incorporates both ASR and TTS engines, and manages real time allocation of speech resources including user specific grammar and vocabularies required for the effective development of voice dialog applications.
XM Scripting Language - CrossMedia provides a simple Applications Programmer Interface (API) enabling new applications to be developed quickly. In addition to using the formal scripting language, an optional Graphical User Interface (GUI) tool can be used.
XM TTS Preprocessor - The preprocessor translates content into a format that is clear when read by a text-to-speech engine, using CrossMedia's TTS Conditioning. The XM TTS Preprocessor does extensible parsing and translation to provide auditory meaning to information intended to be read.
XM Personal Profiler - The personal profiler enables users to set system preferences for use with CrossMedia's Voice Email and Voice Activated Dialing products. Examples include telephone numbers, email addresses, standard email replies and email filters. This module is written as JAVA servlets with an SQL interface to the system database for storage.
XM Email Polling & Management System - This software provides a mechanism for getting copies of a user's email. This is accomplished in one of two ways: email polling and forwarding. The software is written in JAVA and can be easily ported to various platforms.
XM Email Message Classifier - CrossMedia has developed a powerful, rules-based, message management system for classifying messages for filtering and routing. The filtering function enables a user to hear only those messages deemed to be important, filtering out other messages.
XM Applications Gateway - CrossMedia will develop a family of Applications Gateways to access various email systems and database information sources. The gateways will be developed in JAVA and may be architecturally distributed.
XM Resource Manager - This provides expansion capability to meet the demands of large marketing partners. The current system can handle over 50,000 mailboxes and can be expanded by installing additional servers to handle several hundred thousand to over one million mailboxes.
Context-sensitive Active Grammar - This supports the recognition of millions of words and phrases needed for conversational voice accessible applications.
Thus, new and improved systems and methods have been provided to facilitate the retrieval and processing of messages under voice control from any desired location.
While the preferred embodiment and various alternative embodiments of the invention have been disclosed and described in detail herein, it may be apparent to those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope thereof.

Claims

CLAIMS

We claim:
1. A method for playing e-mail messages over a telephone system, comprising the steps of: receiving an e-mail message through a computer network, storing the e-mail message in a database, calling a voice interface computer system via telephone, prompting, with the voice interface computer system, a user to enter voice commands in response to a prompt, speaking a voice command over the telephone to the voice interface computer system, receiving, in the voice interface computer system, the voice commands, retrieving, in the voice interface computer system, from the database, a list of possible standard voice command responses corresponding to the prompt, retrieving, in the voice interface computer system, from a user database, a list of possible user voice command responses corresponding to the prompt, comparing the voice command to the list of possible standard voice command responses and the list of possible user command responses, generating, from the list of possible standard voice command responses and the list of possible user voice command responses, a list of guesses of the voice command, generating, for the list of guesses of the voice command, at least one confidence value, each representing correlation of a guess from the list of guesses of the voice command to a corresponding element of the list of possible standard voice command responses and the list of possible user command responses, and identifying the voice command from the list of guesses and the at least one confidence value.
2. The method of claim 1, wherein said step of comparing the voice command to the list of possible standard voice command responses and the list of possible user command responses further comprises the steps of: determining, from the prompt, possible valid responses to the prompt from the list of possible standard voice command responses and the list of possible user command responses, concatenating selected elements from the list of possible standard voice command responses and the list of possible user command responses to produce a group of expected responses, and comparing the voice command with the expected responses.
3. The method of claim 2, wherein said step of generating, from the list of possible standard voice command responses and the list of possible user voice command responses, a list of guesses of the voice command comprises the step of: generating, from the expected responses, a list of guesses of the voice command.
4. The method of claim 3, wherein said step of retrieving, in the voice interface computer system, from a user database, a list of possible user voice command responses corresponding to the prompt comprises the steps of: retrieving, from the user database, a text list of user data, converting, using a text to speech converter, at least a portion of the text list of user data into speech data, and generating, from the speech data, and the prompt, a list of possible voice command responses corresponding to the prompt.
5. The method of claim 1, further comprising the steps of: comparing the at least one confidence value to a predetermined threshold, and generating a response tone indicating a valid voice command has not been received if the at least one confidence value does not exceed the predetermined threshold.
6. The method of claim 1, wherein the e-mail message is a text message, said method further comprising the steps of: converting the stored e-mail text message to sound using a text to speech converter to produce an audio e-mail message, and playing the audio e-mail message over a telephone to a user.
7. The method of claim 1, wherein the e-mail message contains sound data, said method further comprising the steps of: playing the sound data over a telephone to a user.
8. The method of claim 6, further comprising the steps of: generating a response to an audio e-mail message in response to an identified voice command.
9. The method of claim 8, wherein the response to the audio e-mail message comprises one of a plurality of pre-stored e-mail text message responses.
10. The method of claim 8, wherein the response to the audio e-mail message comprises an audio e-mail response recorded as sound data.
11. The method of claim 1, wherein the list of possible user voice command responses corresponding to the prompt includes a list of user e-mail correspondents.
12. The method of claim 1, further comprising the step of: learning, in the voice interface system, frequency of use of voice commands by a user, wherein said step of prompting, with the voice interface computer system, a user to enter voice commands in response to a prompt further comprises the step of: selectively prompting a user based upon frequency of use of voice commands by a user.
13. The method of claim 1, further comprising the step of: notifying a user that an e-mail message has been received in the database.
14. The method of claim 13, wherein said step of notifying comprises the step of notifying a user that an e-mail message having a predetermined priority has been received in the database.
15. An apparatus for playing e-mail messages over a telephone system, the apparatus comprising: a database server, including: means for receiving an e-mail message through a computer network; and means for storing the e-mail message in a database; and a voice interface server, coupled to the database server, said voice interface server including: interface means for receiving telephone calls to the voice interface server; means for prompting a user to enter voice commands in response to a prompt; means for receiving, from the user, spoken voice commands over the telephone;
means for retrieving, from the database, a list of possible standard voice command responses corresponding to the prompt; means for retrieving, from a user database, a list of possible user voice command responses corresponding to the prompt; means for comparing the voice command to the list of possible standard voice command responses and the list of possible user command responses; means for generating, from the list of possible standard voice command responses and the list of possible user voice command responses, a list of guesses of the voice command; means for generating, for the list of guesses of the voice command, at least one confidence value, each representing correlation of a guess from the list of guesses of the voice command to a corresponding element of the list of possible standard voice command responses and the list of possible user command responses; and means for identifying the voice command from the list of guesses and the at least one confidence value.
16. The apparatus of claim 15, wherein said means for comparing the voice command to the list of possible standard voice command responses and the list of possible user command responses further comprises: means for determining, from the prompt, possible valid responses to the prompt from the list of possible standard voice command responses and the list of possible user command responses; means for concatenating selected elements from the list of possible standard voice command responses and the list of possible user command responses to produce a group of expected responses; and means for comparing the voice command with the expected responses.
17. The apparatus of claim 16, wherein said means for generating, from the list of possible standard voice command responses and the list of possible user voice command responses, a list of guesses of the voice command, comprises: means for generating, from the expected responses, a list of guesses of the voice command.
18. The apparatus of claim 17, wherein said means for retrieving, in the voice interface computer system, from a user database, a list of possible user voice command responses corresponding to the prompt comprises: means for retrieving, from the user database, a text list of user data; means for converting, using a text to speech converter, at least a portion of the text list of user data into speech data; and means for generating, from the speech data, and the prompt, a list of possible voice command responses corresponding to the prompt.
19. The apparatus of claim 15, further comprising: means for comparing the at least one confidence value to a predetermined threshold; and means for generating a response tone indicating a valid voice command has not been received if the at least one confidence value does not exceed the predetermined threshold.
20. The apparatus of claim 15, wherein said voice interface server further comprises: means for converting a stored e-mail text message to sound using a text to speech converter to produce an audio e-mail message, and means for playing the audio e-mail message over a telephone to a user.
21. The apparatus of claim 15, wherein the e-mail message contains sound data, said voice interface server further comprises: means for playing the sound data over a telephone to a user.
22. The apparatus of claim 21, wherein said voice interface server further comprises: means for generating a response to an audio e-mail message in response to an identified voice command.
23. The apparatus of claim 22, further comprising: a web server, coupled to the voice interface server and the database server, for receiving, via a network, user input commands including e-mail text message responses, wherein the response to the audio e-mail message comprises one of a plurality of user input e-mail text message responses.
24. The apparatus of claim 22, wherein the response to the audio e-mail message comprises one of a plurality of predetermined e-mail text message responses.
25. The apparatus of claim 22, wherein the response to the audio e-mail message comprises an audio e-mail response recorded as sound data.
26. The apparatus of claim 15, wherein the list of possible user voice command responses corresponding to the prompt includes a list of user e-mail correspondents.
27. The apparatus of claim 15, wherein said voice interface server further comprises: means for learning frequency of use of voice commands by a user; wherein said means for prompting a user to enter voice commands in response to a prompt further comprises: means for selectively prompting a user based upon frequency of use of voice commands by a user.
28. The apparatus of claim 15, further comprising: means for notifying a user that an e-mail message has been received in the database.
29. The apparatus of claim 28, wherein said means for notifying comprises means for notifying a user that an e-mail message having a predetermined priority has been received in the database.
30. A method for playing e-mail messages over a telephone system, comprising the steps of: receiving an e-mail message through a computer network, storing the e-mail message in a database, calling a voice interface computer system via telephone, prompting, with the voice interface computer system, a user to enter voice commands in response to a prompt, speaking a voice command over the telephone to the voice interface computer system, receiving, in the voice interface computer system, the voice commands, retrieving, in the voice interface computer system, from the database, a list of possible voice command responses corresponding to the prompt, comparing the voice command to the list of possible voice command responses, generating, from the list of possible voice command responses, a list of guesses of the voice command, generating, for the list of guesses of the voice command, at least one confidence value, each representing correlation of a guess from the list of guesses of the voice command to a corresponding element of the list of possible voice command responses, and identifying the voice command from the list of guesses and the at least one confidence value.
31. The method of claim 30, wherein said step of comparing the voice command to the list of possible voice command responses further comprises the steps of: determining, from the prompt, possible valid responses to the prompt from the list of possible voice command responses, concatenating selected elements from the list of standard voice command responses to produce a group of expected responses, and comparing the voice command with the expected responses.
32. The method of claim 31, wherein said step of generating, from the list of possible voice command responses, a list of guesses of the voice command comprises the step of: generating, from the expected responses, a list of guesses of the voice command.
33. The method of claim 32, wherein said step of retrieving, in the voice interface computer system, from a user database, a list of possible user voice command responses corresponding to the prompt comprises the steps of: retrieving, from the user database, a text list of user data, converting, using a text to speech converter, at least a portion of the text list of user data into speech data, and generating, from the speech data, and the prompt, a list of possible voice command responses corresponding to the prompt.
34. The method of claim 30, further comprising: comparing the at least one confidence value to a predetermined threshold, and generating a response tone indicating a valid voice command has not been received if the at least one confidence value does not exceed the predetermined threshold.
35. The method of claim 30, wherein the list of possible voice command responses corresponding to the prompt comprises a list of possible standard voice command responses corresponding to the prompt and a list of possible user voice command responses corresponding to the prompt.
36. An apparatus for playing e-mail messages over a telephone system, comprising: means for receiving an e-mail message through a computer network; means for storing the e-mail message in a database; means for calling a voice interface computer system via telephone; means for prompting, with the voice interface computer system, a user to enter voice commands in response to a prompt; means for speaking a voice command over the telephone to the voice interface computer system; means for receiving, in the voice interface computer system, the voice commands; means for retrieving, in the voice interface computer system, from the database, a list of possible voice command responses corresponding to the prompt; means for comparing the voice command to the list of possible voice command responses; means for generating, from the list of possible voice command responses, a list of guesses of the voice command; means for generating, for the list of guesses of the voice command, at least one confidence value, each representing correlation of a guess from the list of guesses of the voice command to a corresponding element of the list of possible voice command responses; and means for identifying the voice command from the list of guesses and the at least one confidence value.
37. The apparatus of claim 36, wherein said means for comparing the voice command to the list of possible voice command responses further comprises: means for determining, from the prompt, possible valid responses to the prompt from the list of possible voice command responses; means for concatenating selected elements from the list of standard voice command responses to produce a group of expected responses; and means for comparing the voice command with the expected responses.
38. The apparatus of claim 37, wherein said means for generating, from the list of possible voice command responses, a list of guesses of the voice command comprises: means for generating, from the expected responses, a list of guesses of the voice command.
39. The apparatus of claim 38, wherein said means for retrieving, in the voice interface computer system, from a user database, a list of possible user voice command responses corresponding to the prompt comprises: means for retrieving, from the user database, a text list of user data; means for converting, using a text to speech converter, at least a portion of the text list of user data into speech data; and means for generating, from the speech data, and the prompt, a list of possible voice command responses corresponding to the prompt.
40. The apparatus of claim 36, further comprising: means for comparing the at least one confidence value to a predetermined threshold; and means for generating a response tone indicating a valid voice command has not been received if the at least one confidence value does not exceed the predetermined threshold.
41. The apparatus of claim 36, wherein the list of possible voice command responses corresponding to the prompt comprises a list of possible standard voice command responses corresponding to the prompt and a list of possible user voice command responses corresponding to the prompt.
42. A method of converting a text message to speech, comprising the steps of: retrieving and storing the text message in a database, modifying, in a message preprocessor, the stored text message to a revised text message by replacing a sequence of input characters with a corresponding sequence of output characters, converting, using a text-to-speech conversion engine, the revised text message into speech data, and playing the speech data as speech.
43. The method of claim 42, further comprising the step of: storing, in a table, a list of predetermined sequences of input characters and corresponding predetermined sequences of output characters, wherein said step of modifying the stored text message to a revised text message by replacing a sequence of input characters with a corresponding sequence of output characters comprises the step of modifying the stored text message to a revised text message by replacing a predetermined sequence of input characters with a corresponding predetermined sequence of output characters.
44. The method of claim 43, further comprising the step of: inputting, into a user defined table, a list of user defined sequences of input characters and corresponding user defined sequences of output characters, wherein said step of modifying the stored text message to a revised text message by replacing a sequence of input characters with a corresponding sequence of output characters comprises the step of modifying the stored text message to a revised text message by replacing a user defined sequence of input characters with a corresponding user defined sequence of output characters.
45. The method of claim 42, wherein the text message comprises an e-mail message and the sequence of input characters comprises at least a portion of an e-mail formatting sequence of characters.
46. The method of claim 42, wherein the text message comprises an e-mail message and the sequence of input characters comprises an e-mail emoticon and the sequence of output characters comprises an emoticon description.
PCT/US1999/022145 1998-09-24 1999-09-24 Interactive voice dialog application platform and methods for using the same WO2000018100A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU64997/99A AU6499799A (en) 1998-09-24 1999-09-24 Interactive voice dialog application platform and methods for using the same

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10193098P 1998-09-24 1998-09-24
US60/101,930 1998-09-24

Publications (3)

Publication Number Publication Date
WO2000018100A2 true WO2000018100A2 (en) 2000-03-30
WO2000018100A3 WO2000018100A3 (en) 2000-09-08
WO2000018100A9 WO2000018100A9 (en) 2002-04-11

Family

ID=22287229

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/022145 WO2000018100A2 (en) 1998-09-24 1999-09-24 Interactive voice dialog application platform and methods for using the same

Country Status (2)

Country Link
AU (1) AU6499799A (en)
WO (1) WO2000018100A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8379830B1 (en) 2006-05-22 2013-02-19 Convergys Customer Management Delaware Llc System and method for automated customer service with contingent live interaction

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4837798A (en) * 1986-06-02 1989-06-06 American Telephone And Telegraph Company Communication system having unified messaging
US4932021A (en) * 1989-04-03 1990-06-05 At&T Bell Laboratories Path learning feature for an automated telemarketing system
US5566272A (en) * 1993-10-27 1996-10-15 Lucent Technologies Inc. Automatic speech recognition (ASR) processing using confidence measures
US5594784A (en) * 1993-04-27 1997-01-14 Southwestern Bell Technology Resources, Inc. Apparatus and method for transparent telephony utilizing speech-based signaling for initiating and handling calls
US5608786A (en) * 1994-12-23 1997-03-04 Alphanet Telecom Inc. Unified messaging system and method
US5652789A (en) * 1994-09-30 1997-07-29 Wildfire Communications, Inc. Network based knowledgeable assistant
US5675507A (en) * 1995-04-28 1997-10-07 Bobo, Ii; Charles R. Message storage and delivery system
US5715466A (en) * 1995-02-14 1998-02-03 Compuserve Incorporated System for parallel foreign language communication over a computer network
US5740231A (en) * 1994-09-16 1998-04-14 Octel Communications Corporation Network-based multimedia communications and directory system and method of operation
US5825854A (en) * 1993-10-12 1998-10-20 Intel Corporation Telephone access system for accessing a computer through a telephone handset

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1320963A1 (en) * 2000-09-06 2003-06-25 Xanboo, Inc. Adaptive method for polling
EP1320963A4 (en) * 2000-09-06 2007-05-16 Xanboo Inc Adaptive method for polling
GB2380379A (en) * 2001-06-04 2003-04-02 Hewlett Packard Co Speech system barge in control
GB2380379B (en) * 2001-06-04 2005-10-12 Hewlett Packard Co Speech system barge-in control
US7062440B2 (en) 2001-06-04 2006-06-13 Hewlett-Packard Development Company, L.P. Monitoring text to speech output to effect control of barge-in
GB2377119A (en) * 2001-06-27 2002-12-31 365 Plc Interactive voice response system
US7912186B2 (en) 2004-10-20 2011-03-22 Microsoft Corporation Selectable state machine user interface system
US8090083B2 (en) 2004-10-20 2012-01-03 Microsoft Corporation Unified messaging architecture
EP1705886A1 (en) * 2005-03-22 2006-09-27 Microsoft Corporation Selectable state machine user interface system
DE102006058552B4 (en) * 2005-12-12 2010-09-23 Honda Motor Co., Ltd. reception system
US8074199B2 (en) 2007-09-24 2011-12-06 Microsoft Corporation Unified messaging state machine
CN112230878A (en) * 2013-03-15 2021-01-15 苹果公司 Context-sensitive handling of interrupts

Also Published As

Publication number Publication date
AU6499799A (en) 2000-04-10
WO2000018100A3 (en) 2000-09-08
WO2000018100A9 (en) 2002-04-11

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref country code: AU

Ref document number: 1999 64997

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

AK Designated states

Kind code of ref document: C2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1-105, DESCRIPTION, REPLACED BY NEW PAGES 1-100; PAGES 106-125, CLAIMS, REPLACED BY NEW PAGES 101-119; PAGES 1/25-25/25, DRAWINGS, REPLACED BY NEW PAGES 1/25-25/25; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

122 Ep: pct application non-entry in european phase