WO2016016863A1 - Speech recognition using models associated with a geographic location - Google Patents


Info

Publication number
WO2016016863A1
Authority
WO
WIPO (PCT)
Prior art keywords
geographic
query
language
general
entities
Prior art date
Application number
PCT/IB2015/055825
Other languages
French (fr)
Inventor
Andrew Mcnamara
Sam Pasupalak
Eric WASYLISHEN
Wilson Hsu
Original Assignee
Maluuba Inc.
Priority date
Filing date
Publication date
Application filed by Maluuba Inc.
Priority to EP15827991.9A (published as EP3198593A4)
Publication of WO2016016863A1


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/18 — Speech classification or search using natural language modelling

Definitions

  • the central server 50 includes a network interface 60, a memory storage unit 64, and a processor 68.
  • the network interface 60 is not particularly limited and can include various network interface devices such as a network interface controller (NIC) capable of communicating with the computing devices 102 across the network 106.
  • the network interface 60 is generally configured to connect to the network 106 via a standard Ethernet connection.
  • the memory storage unit 64 can be of any type such as non-volatile memory (e.g. Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory, hard disk, floppy disk, optical disk, solid state drive, or tape drive) or volatile memory (e.g. random access memory (RAM)).
  • the memory storage unit 64 is generally configured to temporarily store digital audio queries 202 received from the computing device 102 for processing.
  • the memory storage unit 64 is configured to store codes for directing the processor 68 for carrying out the operations associated with the intelligent services engine 110.
  • the processor 68 is not particularly limited and is generally configured to execute programming instructions to operate the intelligent services engine 110 for converting a digital audio query 202 into a text query.
  • the manner by which the conversion is carried out is not particularly limited and will be discussed in greater detail below.
  • an audio input query 120 is received by an application 101 on the computing device 102 which directs the digital audio query 202 to an intelligent services engine 110 for processing.
  • the intelligent services engine 110 may include an automatic speech recognition system 200 (ASR system 200) configured to produce a text representation of the digital audio query 202, and direct the text representation to other components of the intelligent services engine 110 for processing.
  • the intelligent services engine 110 may include a natural language processing system (not shown) configured to derive the intent of the input query and extract relevant entities from the audio input query 120.
  • While ASR system 200 is shown as a component of intelligent services engine 110 running on the central server 50 in figure 1, it should be appreciated that ASR system 200 may be a component of another module or may be a standalone module, may reside on computing device 102 or any other computing device, such as an additional server, or may have its constituent components distributed in a cloud-based services infrastructure.
  • an audio input query 120 is received at a microphone 336 on the computing device 102 that is running an application 101.
  • the microphone converts the input query (i.e. sound waves) into a digital audio format (such as a raw audio format, for example, a WAV file) and the application 101 directs the digital audio query 202 to the intelligent services engine 110 over network 106.
  • ASR system 200 converts the digital audio query 202 into a text representation thereof by employing the geographic language functionality of the invention where necessary, and the text representation (also referred to herein as the text query) is processed by a natural language processing engine.
  • the natural language processing engine derives the intention of the audio input query 120 and extracts the relevant entities from the text query.
  • the natural language processing engine uses a conditional random field model to derive the intention.
  • the method used by the natural language processing engine to derive an intention is not particularly limited, and other methods can be used.
  • other embodiments can use a random forest model, hidden Markov model, neural networks, recurrent neural networks, recursive neural networks, deep neural networks, naive Bayes, logistic regression techniques, decision tree learning, rule based extraction methods, etc.
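  • By way of illustration only, a conditional random field tagger for marking entity words in a recognized text query might be sketched as follows with the sklearn-crfsuite library; the features, labels and the tiny training set are hypothetical:

```python
# Sketch: tagging entity words in a text query with a linear-chain CRF.
# Requires sklearn-crfsuite (pip install sklearn-crfsuite). The features
# and the two training sentences below are illustrative assumptions.
import sklearn_crfsuite

def word_features(tokens, i):
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

def featurize(sentence):
    tokens = sentence.split()
    return [word_features(tokens, i) for i in range(len(tokens))]

# Hypothetical training data: B-/I- tags mark place and street entities.
train = [
    ("how do i get to La Paloma",
     ["O", "O", "O", "O", "O", "B-PLACE", "I-PLACE"]),
    ("directions to Breithaupt Street",
     ["O", "O", "B-STREET", "I-STREET"]),
]
X = [featurize(s) for s, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict([featurize("how do i get to La Paloma")]))
```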
  • a services manager may interface with an internal or external service to accomplish the intention of the audio input query 120, and the result may be formatted and displayed by application 101 running on computing device 102.
  • application 101 and intelligent services engine 110 may support navigation related queries.
  • An audio input query 120 such as "How do I get to La Paloma" can be received as sound waves at the microphone 336 of the computing device 102 running the application 101.
  • the computing device 102 converts the sound waves into a digital audio query 202, and the application 101 directs the digital audio query to the intelligent services engine 110.
  • the ASR system 200 of the intelligent services engine 110 may employ functionality of the invention to correctly produce the text query of "How do I get to La Paloma" which matches the input query.
  • Intelligent services engine 110 may process the text query to derive the intention of the audio input query 120 (the intention being to find directions to a particular place) and further extract the entity "La Paloma".
  • Intelligent services engine 110 may then use the current GPS coordinates of the computing device 102 (which are passed to intelligent services engine 110 by the application 101) and the extracted entity "La Paloma" to interface with a third-party navigation system such as MapQuest, Google Navigation, etc. to retrieve the directions.
  • the results provided by the third-party navigation system may then be formatted and presented by application 101 on the computing device 102.
  • FIG. 4 illustrates a schematic representation of one embodiment of the ASR system 200.
  • ASR system 200 receives digital audio query 202 which is a digital audio file provided by the application 101 after converting the audio input query 120 into a digital format.
  • ASR system 200 includes a general speech module 204 for producing a general language text representation of the digital audio query 202, and a geographic speech module 208 for producing a geographic specific text representation of the digital audio query 202.
  • General speech module 204 includes a general language model 206 that may include common words and phrases uttered in a particular natural language such as English.
  • the query produced by the general speech module 204 may be referred to herein as the general text query.
  • a digital audio query 202 that contains only words found in the general language model 206 may be properly recognized by the general speech module 204, but the general language model 206 will not properly recognize words that are not included in it, such as uncommon words, words from another language or words that are made up for a particular purpose.
  • Last names, street names, and business (e.g. restaurant) names are examples of language that may not properly be recognized by the general speech module 204 since such words and phrases are often not contained in the general language model 206.
  • Although the general speech module 204 is illustrated as being part of the ASR system 200, it can be stored and applied external to the central server 50.
  • For example, the functions of the general speech module 204 can be carried out by an external third-party service specializing in general speech recognition.
  • ASR system 200 includes a geographic speech module 208 for producing a text representation of the digital audio query 202 that may include geographic specific language (also referred to herein as the geographic text query).
  • Geographic speech module 208 includes geographic language models 210a-n. (Generically, these geographic language models are referred to as “geographic language model 210" and collectively they are referred to as “geographic language models 210")
  • Each geographic language model may contain geographic language associated with a particular geographic area. Each particular geographic area may be defined by a square, rectangle, circle, and so forth, and in some cases geographic areas may overlap (i.e. a specific geographic location may be associated with more than one language model 210, and overlapping models may contain some of the same geographic language).
  • a language model selector 212 may be provided to determine which language model 210 is to be applied to the query 202 depending on the current location of the computing device 102. In one embodiment, the current GPS position of the computing device 102 can be sent to the intelligent services engine 110 together with the digital audio query 202.
  • Language model selector 212 may include a predetermined grid of a broad geographic region (such as the United States) and determine in which grid cell the computing device is located. Once the particular geographic grid cell (area) is determined, language model selector 212 may select the geographic language model 210 associated with the identified cell, and the geographic text representation is produced using the selected geographic language model 210.
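  • By way of illustration only, a selector might map the device's GPS coordinates to a fixed grid cell and look up the model trained for that cell; the one-degree cell size, the identifier format and the model registry below are assumptions:

```python
# Sketch: selecting a geographic language model from GPS coordinates.
# The 1-degree cell size and the identifier/registry format are assumptions.
import math
from typing import Optional

CELL_DEG = 1.0  # assumed size of each geographic area in degrees

def cell_id(lat: float, lon: float) -> str:
    """Identifier of the grid cell containing (lat, lon)."""
    row = math.floor((lat + 90.0) / CELL_DEG)
    col = math.floor((lon + 180.0) / CELL_DEG)
    return f"cell_{row}_{col}"

# Hypothetical registry mapping cell identifiers to trained model files.
MODEL_REGISTRY = {"cell_115_99": "models/cell_115_99.lm"}

def select_model(lat: float, lon: float) -> Optional[str]:
    """Return the geographic language model for the device's location, if any."""
    return MODEL_REGISTRY.get(cell_id(lat, lon))

print(select_model(25.76, -80.19))  # Miami -> "models/cell_115_99.lm"
```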
  • ASR system 200 includes an acoustic model 220 for creating statistical representations of the sounds that make up each word in the digital audio query 202.
  • General speech module 204 and geographic speech module 208 may use the same acoustic model 220, and in some cases, different acoustic models 220 may be applied depending on the geographic region in which the computing device 102 is located, an accent in the audio input query 120, a particular environment (i.e. noise), and so forth.
  • ASR system 200 may include a language comparator 214 for determining differences between the general text representation produced by the general speech module 204 and the geographic text representation produced by the geographic speech module 208.
  • language comparator 214 includes a phoneme generator component and a phoneme alignment component.
  • the phoneme generator component may produce a phonetic representation of the general text query and the geographic text query respectively.
  • the phoneme alignment component determines which phonemes of the general text query correspond to phonemes of the geographic text query.
  • Language comparator 214 may employ suitable search algorithms alone or in combination, such as dynamic programming techniques, A* search algorithms, the Viterbi algorithm and so forth, to align phonemes of the general text query and the geographic text query. Language comparator 214 may also determine the likelihood that parts of the geographic text query better represent the audio input query 120 than corresponding parts of the general text query.
  • a fusion module 216 may be provided for producing a fused text representation (i.e. fused text query, third text query, or final text query) which may combine some text from the geographic text query with the general text query. Fusion module 216 may interface with language comparator 214 to decide the components of the final text query. In various embodiments, the language comparator 214 determines the likelihood that each word (or phoneme) of the geographic text query is present in the audio input query 120, and if the likelihood of a word is greater than a predetermined threshold, the fusion module 216 formulates the final text query by substituting that portion of the geographic text query into the general text query.
  • an audio input query 120 representing the phrase "How do I get to La Paloma” can be received by the application 101 running on the computing device 102.
  • the microphone 336 of the computing device 102 receives and converts the sound waves into digital audio which is sent to the ASR system 200.
  • ASR system 200 employs a general speech module 204 to produce a general text query and a geographic speech module 208 to recognize geographic language associated with the particular geographic area of the computing device 102.
  • the language model selector 212 uses the GPS coordinates (which are sent by application 101 together with the digital audio) of the computing device 102 to determine the language model 210 that is associated with the particular geographic area of the computing device 102.
  • the geographic speech module 208 produces a geographic text query. It is to be appreciated by a person of skill in the art that the manner by which the geographic area is determined is not particularly limited. For example, other embodiments can use an IP address of the computing device 102 when it is connected to a known Wi-Fi network to locate the computing device 102.
  • the general text query produced by the general speech module 204 (also referred to as general language module) is "How do I get to lap mole" and the geographic text query produced by the geographic speech module is "Hausa Igeto La Paloma”.
  • the general text query contains common words from the primary language of the computing device (i.e. English) and the geographic text query contains non-English words or uncommon English words that are associated with entities (e.g. restaurants) that are within the particular geographic area of the computing device 102.
  • Language comparator 214 may review the general text query and the geographic text query and determine the likelihood that each word in the general text query and the geographic text query were present in the input query. Fusion module 216 may use the likelihoods to construct a final text query that may contain some words from the general text query and some words from the geographic text query. In the example, fusion module 216 constructs a final text query of "How do I get to La Paloma" using the likelihoods determined by comparator 214 and predetermined thresholds on whether to include a particular word in the final text query. As can be seen, the final text query in this example contains "How do I get to" from the general text query and "La Paloma" from the geographic text query. The final text query correctly matches the audio input query 120 received at the computing device 102.
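  • The patent aligns hypotheses at the phoneme level; by way of illustration only, the following simplified sketch instead aligns the two hypotheses at the word level with a Needleman-Wunsch dynamic program, using character similarity as a crude stand-in for phonetic similarity, and substitutes geographic words whose (hypothetical) model confidences exceed a threshold:

```python
# Sketch: aligning a general and a geographic hypothesis and fusing them.
# Word-level alignment with character similarity is a simplified stand-in
# for the phoneme-level techniques described above; the per-word
# confidences are hypothetical values a geographic model might assign.
from difflib import SequenceMatcher

GAP = -0.2          # penalty for leaving a word unaligned
CONF_FLOOR = 0.8    # minimum geographic-model confidence to substitute

def sim(a: str, b: str) -> float:
    """Character-level similarity as a crude proxy for phonetic similarity."""
    return SequenceMatcher(a=a.lower(), b=b.lower()).ratio()

def align(gen, geo):
    """Needleman-Wunsch alignment; returns (gen_index, geo_index) pairs."""
    n, m = len(gen), len(geo)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * GAP, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * GAP, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (score[i - 1][j - 1] + sim(gen[i - 1], geo[j - 1]), "diag"),
                (score[i - 1][j] + GAP, "up"),
                (score[i][j - 1] + GAP, "left"),
            ]
            score[i][j], back[i][j] = max(choices)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

def fuse(general: str, geographic: str, geo_conf: dict) -> str:
    gen, geo = general.split(), geographic.split()
    aligned = dict(align(gen, geo))  # general word index -> geo word index
    out = []
    for i, word in enumerate(gen):
        j = aligned.get(i)
        if j is not None and geo_conf.get(geo[j], 0.0) >= CONF_FLOOR:
            out.append(geo[j])   # geographic model is confident: substitute
        else:
            out.append(word)     # otherwise keep the general hypothesis
    return " ".join(out)

conf = {"Hausa": 0.15, "Igeto": 0.20, "La": 0.93, "Paloma": 0.96}
print(fuse("How do I get to lap mole", "Hausa Igeto La Paloma", conf))
# -> "How do I get to La Paloma"
```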
  • ASR system 200 includes a language model updater 218 configured to update general speech module 204 and geographic speech module 208 and the models 206 and 210 associated thereto.
  • the language model updater 218 interfaces with content providers configured to provide entity information for particular geographic areas, and updates the language model associated with each particular geographic area accordingly at certain times such as periodic intervals.
  • Referring to figure 3, a block diagram of certain components of a computing device in accordance with an embodiment is indicated generally by the numeral 102.
  • Computing device 102 is based on a computer that includes a microprocessor 338 (also referred to herein as a processor) connected to a random access memory unit (RAM) 340 and a persistent storage device 342 that is responsible for various non-volatile storage functions of the computing device 102.
  • Operating system software executable by the microprocessor 338 is stored in the persistent storage device 342, which in various embodiments is flash memory. It will be appreciated, however, that the operating system software can be stored in other types of memory such as read-only memory (ROM).
  • the microprocessor 338 receives input from various input devices including the touchscreen 330, communications device 346, and microphone 336, and outputs to various output devices including the display 324, the speaker 326 and the LED indicator(s) 328.
  • the microprocessor 338 is also connected to an internal clock 344.
  • the computing device 102 is a two-way RF communication device having voice and data communication capabilities.
  • Computing device 102 also includes Internet communication capabilities via one or more networks such as cellular networks, satellite networks, Wi-Fi networks and so forth.
  • Two-way RF communication is facilitated by a communications device 346 that is used to connect to and operate with a data-only network or a complex voice and data network (for example GSM/GPRS, CDMA, EDGE, UMTS or CDMA2000 network, fourth generation technologies, etc.), via the antenna 348.
  • a battery provides power to the active elements of the computing device 102.
  • the persistent storage device 342 also stores a plurality of applications (such as application 101) executable by the microprocessor 338 that enable the computing device to perform certain operations including the communication operations referred to above.
  • applications software may be provided including, for example, an email application, a Web browser application, an address book application, a calendar application, a profiles application, and others that may employ the functionality of the invention.
  • Various applications and services on computing device 102 may provide application programming interfaces for allowing other software modules to access the functionality and/or information available by interfaces.
  • Referring to figure 5, a flow diagram is shown that illustrates a possible lifecycle (i.e. method 400) of an audio input query 120.
  • an audio input query 120 is received by the application 101 on the computing device 102.
  • the microphone produces a digital audio file which is sent at step 404 to the delegate service together with the GPS coordinates of the computing device 102.
  • the digital audio query may be streamed as it is being received at the microphone 336 to optimize processing.
  • the application 101 on the computing device 102 may call the ASR service when an audio input query 120 is received at the computing device 102 so that the ASR service (which implements ASR system 200) can download the language model 210 that is associated with the particular geographic area of the computing device 102.
  • the delegate service may be the entry point for queries into the intelligent services engine 110 which coordinates communication between constituent services and communicates results to the computing device 102.
  • delegate service, NLP service, ASR service, navigation service, and so forth may include load balancers to scale each service as required depending on load requirements.
  • delegate service may call a natural language processing (NLP) service to process the input query.
  • the general language module may be employed initially so that NLP service can classify the query into a particular domain.
  • domain refers to a particular field of thought, information and action, such as weather, calendar, restaurants, email, etc. associated with a particular input query.
  • NLP service may then call ASR service at step 408 which is responsible for recognizing geographic language that may be embodied in the input query.
  • the ASR service downloads the geographic language model, generates a geographic text query and, if necessary, creates a final text query containing parts of the general text query and the geographic text query using the language comparator and fusion module as described with reference to figure 4.
  • Entity extraction is a process for identifying particular entities in a query that may be necessary to perform the command intended by the input query.
  • the restaurant "La Paloma” may be extracted and provided to a navigation service at step 412 together with the current location of the computing device 102 so that directions to La Paloma can be provided to the computing device 102 as intended.
  • the navigation results of the query are communicated to the computing device 102 and are displayed (or otherwise outputted) on the display 324 of the computing device 102 by the application 101.
  • intelligent services engine 110 can be configured to select and interface with a variety of internal and external services, each of which may be adapted for a particular purpose. For example, the intelligent services engine 110 may select a weather service for weather-related queries, a news service for news-related queries, a calendar service to perform calendar-related actions and so forth.
  • FIG. 6 shows a map of the world divided into grid cells known as World Meteorological Organization squares or WMO squares.
  • WMO squares is a system of geocodes that divides a chart of the world with latitude-longitude gridlines (e.g. plate carree projection, Mercator or other) into grid cells of 10° latitude by 10° longitude, each with a unique, 4-digit numeric identifier.
  • the speech recognition functionality of the invention may further divide all or particular WMO squares (covering for example, a specific country like the United States) into additional squares or other shapes.
  • square 502 (also designated 7307) covers a large section of the southeast coast of the United States. It may be desirable to further divide such squares into smaller units as shown in figure 7. Square 502 has been subdivided into 16 smaller geographic areas, each of which has the same area. An implementation of the invention may choose subdivisions of any desired area and shape. Square 610 is one of the smaller geographical units that lies entirely within square 502. The speech recognition functionality of the invention creates a geographic language model for each subdivision, and any query received when the computing device is in a particular geographic area (such as square 610) may be applied to the geographic language model 210 associated with the particular geographic area.
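  • By way of illustration only, the WMO square containing a given coordinate, and a sub-square within it, might be computed as follows. The WMO encoding (a quadrant digit, the tens digit of latitude, and two digits for the tens of longitude) is standard; the 16-way sub-square labelling scheme is an assumption:

```python
# Sketch: computing a WMO square identifier and subdividing it into 16
# equal sub-squares, as in figures 6 and 7.
def wmo_square(lat: float, lon: float) -> str:
    """4-digit WMO square: quadrant digit, tens of latitude, tens of longitude."""
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5
    lat_t = min(int(abs(lat) // 10), 8)    # 0..8 (10-degree latitude bands)
    lon_t = min(int(abs(lon) // 10), 17)   # 0..17 (10-degree longitude bands)
    return f"{quadrant}{lat_t}{lon_t:02d}"

def sub_square(lat: float, lon: float, divisions: int = 4) -> str:
    """Label one of divisions x divisions sub-areas inside the WMO square."""
    row = int((abs(lat) % 10) / (10 / divisions))
    col = int((abs(lon) % 10) / (10 / divisions))
    return f"{wmo_square(lat, lon)}-{row}{col}"

print(wmo_square(32.0, -79.0))   # -> "7307" (southeast coast of the US)
print(sub_square(32.0, -79.0))   # -> "7307-03"
```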
  • Circles A, B, C are centered on the center of square 610 and define predetermined geographic entity boundaries that delimit the extent of geographic language coverage for certain types of entities.
  • circle A may be an entity boundary for places such as businesses and restaurants
  • circle B may be an entity boundary for street names
  • circle C may be an entity boundary for cities and towns.
  • An implementation of the invention may use any number of entity types and any desired boundary for each entity.
  • the language model updater 218 may access a data source for each entity and populate the language model for a particular geographic area with entities that are within the entity boundary for each entity type.
  • a language model for a particular geographic area may include places language within a 10 mile radius of the area's center, street language within a 20 mile radius of the area's center, and city/town language within a 50 mile radius of the area's center.
  • the updater 218 may access a different data source for each entity type that is configured to provide data for each particular entity type. As mentioned previously, updater 218 may periodically update each language model to capture changes in entity data.
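  • By way of illustration only, the boundary test described above might be sketched as follows; the radii echo the example above, while the haversine distance and the entity records are illustrative assumptions:

```python
# Sketch: populating a geographic area's language model with entities that
# fall inside per-type boundary radii (figure 7's circles A, B and C).
import math

# Assumed boundary radius in miles for each entity type, per the example above.
ENTITY_RADIUS_MILES = {"place": 10, "street": 20, "city": 50}

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def entities_in_boundary(center, entities):
    """Keep each entity whose distance from the area's center is within
    the boundary radius configured for its entity type."""
    kept = []
    for name, etype, lat, lon in entities:
        radius = ENTITY_RADIUS_MILES[etype]
        if haversine_miles(center[0], center[1], lat, lon) <= radius:
            kept.append((name, etype))
    return kept

center = (32.5, -77.5)  # hypothetical center of square 610
entities = [("La Paloma", "place", 32.55, -77.45),
            ("Breithaupt Street", "street", 32.7, -77.3),
            ("Charleston", "city", 32.78, -79.93)]
print(entities_in_boundary(center, entities))
```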
  • an exemplary flow of operations is shown for creating a language model 210 associated with a particular geographic area.
  • the method 700 is carried out on a training server (discussed further below). It is to be appreciated by a person skilled in the art that in other embodiments, the method 700 can be carried out on the central server 50 as well.
  • the method 700 shows the general steps in the method with more detail of each step shown in the outlined circles.
  • a gazetteer is built for each entity type, for example for city entities, places entities and streets entities.
  • a gazetteer is a directory (list) of entities.
  • the city gazetteer may contain a number of city and town names within a predefined distance (e.g. radius) of the center of a given geographic area such as square 610.
  • a relative scoring may be used for each entity in the gazetteers to determine whether a particular entity should be included in the gazetteer for that entity type. For example, cities/towns must have a minimum threshold population, streets must have a minimum threshold length, and businesses must have a minimum rating on a content provider website in order to be included in the associated gazetteer.
  • the particular scoring thresholds, or whether thresholds are used at all, may be tailored for a particular implementation.
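  • By way of illustration only, a minimal sketch of score-based gazetteer filtering, assuming hypothetical threshold values:

```python
# Sketch: score-based filtering when building gazetteers. The thresholds
# (population, street length, rating) are illustrative assumptions; as
# noted above, an implementation may tune or omit them.
THRESHOLDS = {
    "city": lambda e: e["population"] >= 5000,
    "street": lambda e: e["length_miles"] >= 0.25,
    "place": lambda e: e["rating"] >= 3.0,
}

def build_gazetteer(etype, candidates):
    """Return the names of entities of one type that pass its threshold."""
    keep = THRESHOLDS[etype]
    return [e["name"] for e in candidates if keep(e)]

places = [{"name": "La Paloma", "rating": 4.5},
          {"name": "Joe's Shack", "rating": 2.1}]
print(build_gazetteer("place", places))  # -> ['La Paloma']
```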
  • the gazetteers may be saved to a server, and at step 740, a training sentence corpus (also referred to herein as sentence templates) is built from existing data.
  • the training sentences may be generated by recognizing common audio input queries 120 and replacing entities with entity variables. A large number of training sentences can then be generated by substituting actual entities from the gazetteer into the entity variables. For example, it may be found that in a given implementation, a common input query 120 is of the form "get me directions from New York to Cincinnati". The entities may be extracted from this sentence and replaced by entity variables leading to the sentence pattern "get me directions from X to Y". Using this sentence pattern, and perhaps other sentence patterns, a large number of training sentences may be generated by substituting various entities into the entity variables.
  • Such training sentences may be generated to create a language model configured to recognize predetermined queries in a particular geographic area.
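  • By way of illustration only, template expansion might be sketched as follows; the uppercase placeholder syntax and the gazetteer contents are assumptions:

```python
# Sketch: expanding sentence templates into training sentences by
# substituting gazetteer entities into typed entity variables.
from itertools import product

templates = [
    "get me directions from FROM_LOCATION to TO_LOCATION",
    "how do I get to PLACE",
]
gazetteers = {
    "FROM_LOCATION": ["New York", "Cincinnati"],
    "TO_LOCATION": ["New York", "Cincinnati"],
    "PLACE": ["La Paloma", "Breithaupt Street"],
}

def expand(template, gazetteers):
    """Yield one sentence per combination of entities for the template's variables."""
    variables = [v for v in template.split() if v in gazetteers]
    for combo in product(*(gazetteers[v] for v in variables)):
        sentence = template
        for var, entity in zip(variables, combo):
            sentence = sentence.replace(var, entity, 1)
        yield sentence

training = [s for t in templates for s in expand(t, gazetteers)]
print(training)  # 4 directions sentences + 2 "how do I get to ..." sentences
```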
  • a language model is created for each particular geographic area (i.e. square) which may then be deployed for use.
  • Entity information may be acquired from a variety of sources such as Open Street Map.
  • Each gazetteer (i.e. directory of information for a certain entity type) may be limited in size; for example, places information may be restricted to a certain number (for example, 1000) so that only that number of businesses are used to create training sentences.
  • the entity boundary (i.e. radius) for a particular entity type may be adjusted so that the number of entities listed in the gazetteer for that particular entity type is limited to a certain number of entities.
  • the score 732, 734 for each entity may be stored together with the entity in a server for use by the ASR system or another software service.
  • queries already uttered, collected or generated are grouped together for a particular domain (e.g. navigation).
  • the entities may be removed as described above and replaced with entity variables (such as X and Y, FROM_LOCATION and TO_LOCATION, etc.) to create sentence templates.
  • the manner by which sentence templates are created is not particularly limited.
  • some embodiments can include sentence templates obtained from an external source.
  • the sentence templates may be obtained from actual queries received by the intelligent services engine 110, and grouped together so that the most popular queries may be found at step 744.
  • queries of the same form will look alike.
  • a predetermined number of sentence templates, for example the most commonly uttered queries (e.g. the 20 most common queries), may then be identified and used to generate training sentences by substituting entities from a gazetteer into the associated entity variables.
  • the entities are downloaded from each entity gazetteer, and a dictionary is built at step 754. Building the dictionary may involve organizing the entities in alphabetical order.
  • a plurality of training sentences are generated using the common queries collected in 744.
  • a language model is built for each particular geographic area using the generated sentences for the particular area.
  • Referring to figure 9, an exemplary flow of operations is shown at 800 for creating a plurality of particular language models associated with particular geographic areas that cover a larger area such as a country, continent, region and so forth.
  • the larger area is divided into particular geographic areas such as the WMO squares, as shown in figures 6 and 7.
  • Each particular geographic area is given an identifier and the geographic demarcation of each area is stored in association with its identifier.
  • gazetteers are built that contain entity information for each entity type. Creating entity gazetteers may involve establishing an entity boundary for each entity type (e.g. a radius from the center of the area, as described with reference to figure 7).
  • a third-party data source or content provider such as Open Street Map, Yelp, Google Places, and the like may be used to provide entity information.
  • common queries for a particular domain that requires geographic ASR are identified, and the entities are replaced with entity variables to create sentence patterns.
  • the common sentences may be used for each language model since a query of the form "How do I get from A to B", in which "A" and "B" are variables, is likely to be useful.
  • a predetermined number of training sentences are generated using the common queries by substituting entities from the gazetteers into the entity variables.
  • the generated sentences are unique for each geographic area since they are formed by substituting entities for a particular area into the sentence patterns. In this way a large number of training sentences may be generated using a relatively small number of common queries.
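  • By way of illustration only, a toy sketch of this per-area generation loop; the gazetteers are hypothetical and a unigram counter stands in for real language model training:

```python
# Sketch of method 800's outer loop: one language model per geographic
# area, trained on sentences unique to that area because the substituted
# entities are. The per-area gazetteers and the unigram-count "model" are
# illustrative stand-ins for the components described above.
from collections import Counter

AREAS = {"7307-00": ["La Paloma", "Main Street"],
         "7307-01": ["Blue Heron Cafe", "Ocean Drive"]}  # hypothetical gazetteers
TEMPLATES = ["how do I get to PLACE", "navigate to PLACE"]

def train_models():
    models = {}
    for area_id, places in AREAS.items():
        sentences = [t.replace("PLACE", p) for t in TEMPLATES for p in places]
        # Toy stand-in for language model training: unigram counts.
        models[area_id] = Counter(w for s in sentences for w in s.split())
    return models

models = train_models()
print(models["7307-00"]["Paloma"])  # the entity appears only in its own area's model
```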
  • a language model may be built for each particular geographic area. It is to be appreciated that the method 800 is not particularly limiting and that optional improvements can also be implemented. For example, each language model can be adapted to accommodate environmental background noises associated with a geographic area as well as for accents.
  • Reference is made to figure 10 to describe exemplary operations at 900 for recognizing geographic language that may be embodied in an input query.
  • an input query is received as a digital audio file at an intelligent services engine 110 for processing.
  • a general text query is created using a general language ASR module, so that preliminary classification may be performed on the general query.
  • classification (statistical, rules-based, ontology, etc.) is performed on the general text query to determine if the input query requires geographic speech recognition, such as by determining the domain associated with the input query.
  • a query such as "Get me directions to Breithaupt Street” may require geographic speech recognition (since navigation queries may include geographic language not found in a general language model) while a query such as “Wake me up at 6 am” may not require geographic speech recognition (since the query involves setting an alarm and the language in the query is likely found in a general language model).
  • the query may be further processed by a natural language processing module at 910 and the requested task may be performed according to the intention of the input query. Additional natural language processing at 910 may involve additional rounds of classification to further identify the intention of the query or entity extraction to extract relevant entities from the query.
  • if geographic speech recognition is required, the flow of operations continues via the 'yes' branch to 912, in which the particular geographic language model associated with the location of the computing device 102 is downloaded.
  • a geographic text query is generated from the digital audio using the downloaded geographic language model.
  • a fused (i.e. final) text query is created that may contain a portion of the general text query and geographic text query, and the final text query is directed to 910 for additional natural language processing if required.
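  • By way of illustration only, the control flow of method 900 can be summarized in the following skeleton; every component function is an injected placeholder and the set of domains requiring geographic recognition is an assumption:

```python
# Sketch of method 900's control flow: run general recognition first,
# classify the domain, and only invoke geographic recognition when the
# domain calls for it. All component functions are placeholders for the
# modules described above.
GEO_DOMAINS = {"navigation", "restaurants", "local search"}  # assumed list

def recognize(audio, location, general_asr, classify_domain,
              load_geo_model, geo_asr, fuse_queries):
    general_text = general_asr(audio)            # create the general text query
    domain = classify_domain(general_text)       # preliminary classification
    if domain not in GEO_DOMAINS:
        return general_text                      # 'no' branch: straight to NLP (910)
    geo_model = load_geo_model(location)         # 912: download the area's model
    geo_text = geo_asr(audio, geo_model)         # generate the geographic text query
    return fuse_queries(general_text, geo_text)  # fused/final query, then NLP (910)
```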
  • Referring to figure 11, another embodiment of a computer network system is shown generally at 100a. Like components of the system 100a bear like reference to their counterparts in the system 100, except followed by the suffix "a".
  • the system 100a includes a central server 50a and a plurality of computing devices 102a-1, 102a-2, ..., 102a-n.
  • the system 100a also includes a training server 54a.
  • the central server 50a, the computing devices 102a, the training server 54a are connected by a network 106a.
  • the network 106a is not particularly limited and can include any type of network such as those discussed above in connection with the network 106.
  • the central server 50a can be any type of computing device generally used to receive input, process the input and provide output.
  • the central server 50a is not particularly limited and can include a variety of different devices such as those described above in connection with the central server 50.
  • the system 100a further includes a training server 54a.
  • the training server 54a can be any type of computing device capable of communicating with the central server 50a as well as receiving training input. It is to be appreciated that the training server 54a includes programming instructions in the form of codes stored on a computer readable medium for training a geographic language model.
  • the training server 54a can be any one of a personal computer, a laptop computer, a portable electronic device, a gaming device, a mobile computing device, a portable computing device, a tablet computing device, a personal digital assistant, a cell phone, a smart phone or the like.
  • the training server 54a is generally configured to create a geographic language model by carrying out a method such as the method 700 previously described.
  • system 100a described above is a non-limiting representation only.
  • the network 106a of the present embodiment shown in figure 11 connects the central server 50a, the training server 54a, and the computing devices 102a
  • other embodiments can include separate networks for connecting the central server 50a, the training server 54a, and the computing devices 102a separately.
  • the computing device 102a includes a microprocessor 338a connected to a random access memory unit (RAM) 340a and a persistent storage device 342a. Operating system software executable by the microprocessor 338a is stored in the persistent storage device 342a.
  • the microprocessor 338a receives input from various input devices including the touchscreen 330a, communications device 346a having an antenna 348a, microphone 336a, and outputs to various output devices including the display 324a, the speaker 326a and the LED indicator(s) 328a.
  • the microprocessor 338a is also connected to an internal clock 344a.
  • the computing device 102a includes a global positioning system unit 350a having a global positioning receiver to determine the geographic coordinates of the computing device 102a.
  • the training server 54a includes a processor 360a, a network interface 320a, a memory storage unit 370a, and an optional input device 380a.
  • the network interface 320a is not particularly limited and can include various network interface devices such as a network interface controller (NIC) capable of communicating with the central server 50a across the network 106a.
  • the network interface 320a is generally configured to connect to the network 106a via a standard Ethernet connection.
  • the memory storage unit 370a can be of any type such as non-volatile memory (e.g. Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory, hard disk, floppy disk, optical disk, solid state drive, or tape drive) or volatile memory (e.g. random access memory (RAM)).
  • the memory storage unit 370a is generally configured to store sentence templates, gazetteers, and the results of training.
  • the memory storage unit 370a is configured to store codes for directing the processor 360a for carrying out computer implemented methods.
  • the codes can include the programming instructions 365a further described below.
  • the optional input device 380a is generally configured to receive training input from a trainer. It is to be appreciated that the input device 380a is not particularly limited and can include a keyboard, microphone, a pointer device, a touch sensitive device, or any other device configured to generate signals in response to the external environment. For example, in the present embodiment, the input device 380a is a microphone configured to receive audio input. In other embodiments, the input device can be omitted and training input can be received via the network interface 320a, such as when the training is carried out remotely (e.g. using a computing device 102a, etc.) or crowdsourced over a network.
  • the processor 360a is not particularly limited and is generally configured to execute programming instructions 365a for creating a geographic language model.
  • the manner by which the geographic language model is created is not particularly limited and can include the steps described in the method 700.
  • any of the elements associated with intelligent services engine 110 and ASR system 200 may employ any of the desired functionality set forth hereinabove.
  • the intelligent services engine 110, ASR system 200, and all other systems, modules, models, etc. may have more components or less components than described herein to employ the desired functionality set forth herein.
  • the breadth and scope of a preferred embodiment should not be limited by any of the above-described embodiments.
  • Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise.
  • devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.

Abstract

A natural language system for recognizing geographic specific language embodied within a query received at a computing device is disclosed. A given territory such as a country may be divided into sub-territories. The data source content may be limited to a predetermined number of each type of entity, determined by establishing a radius for each type of entity from the center of the particular sub-territory and only including each entity within the distance of the radius. One or more sentence templates may be gathered from common queries, and training sentences may be created by substituting entities into the sentence patterns. When the natural language system receives a query, the system may apply a speech recognition model associated with the geographic location of the computing device so that geographic specific language such as businesses, streets and cities may be recognized by the particular speech recognition model.

Description

SPEECH RECOGNITION USING MODELS ASSOCIATED WITH A GEOGRAPHIC LOCATION
FIELD
[0001] The present specification relates to automatic speech recognition, and particularly to speech recognition for particular geographic locations.
BACKGROUND
[0002] Certain speech recognition systems can achieve high levels of accuracy when a domain is well defined and/or specialized. For example, a speech recognition system designed for medical practitioners may achieve a high level of accuracy because the language model used by the speech recognition system contains specific words commonly expressed by a medical practitioner. The speech recognition system optimized for the medical field may perform very poorly, however, if the terms from another profession, for example, law, are received.
[0003] General language speech recognition systems employ general language models and may also achieve acceptable levels of accuracy for some applications. General systems, however, suffer from low accuracy when certain words and phrases not contained in the language model of the speech recognition system are received as input. For example, general language models may not contain specialist jargon (such as medical terms), words from a different language and certain nouns such as street, place and city names. When a word or phrase that is not provided in the language model is received, the system will attempt to find the best match, which can be incorrect.
SUMMARY
[0004] This summary is provided to introduce a selection of representative concepts in a simplified form to illustrate various aspects of the invention that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
[0005] Broadly speaking, the invention relates to recognizing language that may be associated with a particular geographic area embodied in a query received by a computing device. The query may be an audio file on a computer and/or a command received by one or more microphones of the computing device. In accordance with one aspect of the invention, a software application (also referred to herein as an app), running on a computing device, presents a user interface to a user for receiving uttered (voiced) audio queries.
[0006] An aspect of the invention provides a computer-implemented method for recognizing geographic specific language associated with a particular geographic area. The method involves identifying the geographic coordinates associated with the location of the computing device. The coordinates may be obtained in a variety of ways, for example, by retrieving the GPS coordinates from the computing device. A data source associated with the particular geographic area may be accessed that contains entity information. Each entity may be a word or phrase that represents elements such as business, roads, cities, etc. that are within a particular geographic area. At least one sentence template (pattern) may be obtained which includes a phrase and one or more entity variables. A plurality of training sentences are created by substituting entities into entity variables of the corresponding types. A language model is trained using the training sentences and given an identifier that may include the geographic coordinates or another geographic identifier.
[0007] In accordance with another aspect, the entities may be provided by interfacing with a content provider configured to provide entities of certain types, for example, Yelp for restaurant and business entities, Open Street Maps for cities and roads, and so forth.
[0008] An aspect of the invention provides a system for creating geographic specific language models configured to recognize geographic specific language that may be embodied in a query received at a computing device such as a smartphone. The system includes a language model generator that is configured to create a plurality of geographic language models, each of which is associated with a particular geographic area. The language model generator interfaces with a data source containing entities associated with a particular geographic area and at least one sentence template that includes text and one or more entity variables. A sentence generator may be provided for creating a plurality of training sentences for each geographic language model, the training sentences created by substituting entities from the entity source into the sentence templates. A language model trainer is provided for creating the language model for each particular geographic area.
[0009] In operation, the invention provides a networked system for recognizing geographic specific language that may be embodied in a query received at a computing device. The system includes an input device for receiving the query and for converting the uttered query into a digital audio format such as raw audio. A speech recognition module is provided for converting the digital audio file into a text format which may be processed to determine the intention of the query or for another purpose. The speech recognition module may include a general language module, configured to recognize general language embodied within the input query, and a geographic language module, configured to recognize geographic specific language embodied within the input query. A language model selector is provided for determining which particular geographic language model is to be applied to the input query based on the location of the computing device. Typically, an input query is received by the computing device and is directed to the general language module which produces a general text representation of the query. The input query is also directed to the geographic language module which produces a geographic specific text representation of the query. A fusion module may be provided for creating a third text representation of the query that incorporates some text produced by the general language module, and some text produced by the geographic language module.
[0010] Other aspects and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Reference will now be made, by way of example only, to the accompanying drawings in which:
[0012] Figure 1 is a block diagram of an exemplary networked environment of a system for providing information and tasks that incorporates the geographic language recognizer of the invention, according to one embodiment;
[0013] Figure 2 is a schematic representation of a central server in accordance with the embodiment shown in figure 1;
[0014] Figure 3 is a block diagram of some components of an exemplary computing device that may be used with the embodiment shown in figure 1 ;
[0015] Figure 4 is a block diagram of some components of an exemplary speech recognizer of the invention, in accordance with one embodiment;
[0016] Figure 5 is a flow diagram showing the lifecycle of an input query in accordance with an embodiment;
[0017] Figure 6 is a map of the world divided up into exemplary regions (squares) covering a predetermined area;
[0018] Figure 7 is an illustration of one of the squares of figure 6 further sub-divided into particular geographic areas, each of which may be associated with a particular geographic language model;
[0019] Figure 8 is a flow diagram showing a method for creating a language model for a particular geographic area, according to one embodiment;
[0020] Figure 9 is a flow diagram showing a method for creating a plurality of language models, each associated with a particular geographic area within a larger territory, according to one embodiment;
[0021] Figure 10 is a flow diagram for recognizing geographic language embodied within an input query, according to an embodiment;
[0022] Figure 11 is a block diagram of an exemplary networked environment of a system for providing information and tasks that incorporates the geographic language recognizer of the invention, according to another embodiment;
[0023] Figure 12 is a block diagram of some components of an exemplary computing device that may be used with the embodiment shown in figure 11; and
[0024] Figure 13 is a block diagram of some components of an exemplary computing device that may be used with the embodiment shown in figure 11.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0025] In this document, the terms geographic language, geographic-specific language, and the like refer to language that may be associated with a geographic area, such as city and town names, business names and street names.
[0026] By way of example only, figure 1 is a schematic representation of a networked system, shown generally at 100. The system 100 may be configured to provide services and/or information to computing devices 102-1 - 102-n (generically, these computing devices are referred to as "computing device 102" and collectively they are referred to as "computing devices 102"). It is to be appreciated by a person of skill in the art, with the benefit of this description, that a smartphone is one example of a computing device. In general, the computing device 102 can be any type of computing device used to communicate with a central server 50 having an intelligent services engine 110 over the network 106. It is to be appreciated that, in general, the computing device 102 includes programming instructions in the form of codes stored on a computer readable medium for performing functions. For example, the computing device 102 can be any one of a personal computer, a laptop computer, a server, a portable electronic device, a gaming device, a mobile computing device, a portable computing device, a tablet computing device, a personal digital assistant, a cell phone, a smartphone, a set-top box, an electronic voice assistant in a vehicle, or the like. In the present embodiment, the computing device 102-1 is configured to receive an audio input query 120 from a user, to send the audio input query 120 to the intelligent services engine 110, and to provide information in response to the audio input query 120.

[0027] In the present embodiment, the central server 50 can be any type of computing device generally used to receive input, process the input and provide output. The central server 50 is not particularly limited and can include a variety of different devices depending on the specific application of the central server 50. For example, the central server 50 can be optimized for its specific role in the system 100, such as for receiving a digital audio query 202 and converting it to a text query. Suitable devices for the central server 50 include high performance blade server systems running UNIX operating systems and having multiple processors. Alternatively, the central server 50 can include devices such as a personal computer, a personal digital assistant, a tablet computing device, a cellular phone, or a laptop computer configured to carry out similar functions for systems not requiring a server with significant processing power. In other embodiments, the central server 50 can also be implemented as a virtual server, a rented server session in the cloud, or any combination of the above.
[0028] Referring to figure 2, a schematic block diagram showing various components of the central server 50 is provided. It should be emphasized that the structure in figure 2 is purely exemplary and several different implementations and configurations for the central server 50 are contemplated. The central server 50 includes a network interface 60, a memory storage unit 64, and a processor 68.
[0029] The network interface 60 is not particularly limited and can include various network interface devices such as a network interface controller (NIC) capable of communicating with the computing devices 102 across the network 106. In the present embodiment, the network interface 60 is generally configured to connect to the network 106 via a standard Ethernet connection.
[0030] The memory storage unit 64 can be of any type such as non-volatile memory (e.g. Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory, hard disk, floppy disk, optical disk, solid state drive, or tape drive) or volatile memory (e.g. random access memory (RAM)). In the present embodiment, the memory storage unit 64 is generally configured to temporarily store digital audio queries 202 received from the computing device 102 for processing. In addition, the memory storage unit 64 is configured to store codes for directing the processor 68 for carrying out the operations associated with the intelligent services engine 110.
[0031] The processor 68 is not particularly limited and is generally configured to execute programming instructions to operate the intelligent services engine 110 for converting a digital audio query 202 into a text query. The manner by which the conversion is carried out is not particularly limited and will be discussed in greater detail below.
[0032] An embodiment of a typical computing device 102 that may be used with the invention is described with respect to figure 3. In one embodiment, an audio input query 120 is received by an application 101 on the computing device 102 which directs the digital audio query 202 to an intelligent services engine 110 for processing. The intelligent services engine 110 may include an automatic speech recognition system 200 (ASR system 200) configured to produce a text representation of the digital audio query 202, and direct the text representation to other components of the intelligent services engine 110 for processing. The intelligent services engine 110 may include a natural language processing system (not shown) configured to derive the intent of the input query and extract relevant entities from the audio input query 120. As will be appreciated, many computing devices 102-1, 102-2, ... 102-n may simultaneously access the intelligent services engine 110 over a network 106 such as the Internet. Although ASR system 200 is shown as a component of intelligent services engine 110 running on the central server 50 in figure 1, it should be appreciated that ASR system 200 may be a component of another module or may be a standalone module, may reside on computing device 102 or any other computing device, such as an additional server, or may have its constituent components distributed in a cloud-based services infrastructure.
[0033] In a typical interaction, an audio input query 120 is received at a microphone 336 on the computing device 102 that is running an application 101. The microphone converts the input query (i.e. sound waves) into a digital audio format (such as a raw audio format, for example, a WAV file) and the application 101 directs the digital audio query 202 to the intelligent services engine 110 over network 106. ASR system 200 converts the digital audio query 202 into a text representation thereof by employing the geographic language functionality of the invention where necessary, and the text representation (also referred to herein as the text query) is processed by a natural language processing engine. The natural language processing engine derives the intention of the audio input query 120 and extracts the relevant entities from the text query. In the present embodiment, the natural language processing engine uses a conditional random field model to derive the intention. However, it is to be appreciated, with the benefit of this description, that the method used by the natural language processing engine to derive an intention is not particularly limited and that other methods can be used. For example, other embodiments can use a random forest model, hidden Markov model, neural networks, recurrent neural networks, recursive neural networks, deep neural networks, naive Bayes, logistic regression techniques, decision tree learning, rule based extraction methods, etc. A services manager may interface with an internal or external service to accomplish the intention of the audio input query 120, and the result may be formatted and displayed by application 101 running on computing device 102.
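By way of example only, the following Python sketch shows how such a request might be serialized by the application before being sent over the network; the field names and JSON structure are assumptions made for illustration and are not prescribed by this specification.

```python
# Hypothetical request payload: raw audio plus the device's GPS
# coordinates, so the server can select a geographic language model.
# Field names are illustrative assumptions, not part of the invention.
import base64
import json


def build_query_payload(wav_bytes: bytes, lat: float, lon: float) -> str:
    return json.dumps({
        "audio": base64.b64encode(wav_bytes).decode("ascii"),  # digital audio query
        "format": "wav",                                       # raw audio container
        "location": {"lat": lat, "lon": lon},                  # device coordinates
    })


payload = build_query_payload(b"RIFF...", 33.9, -81.0)
```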
[0034] As an example, application 101 and intelligent services engine 110 may support navigation related queries. An audio input query 120 such as "How do I get to La Paloma" can be received as sound waves at the microphone 336 of the computing device 102 running the application 101. The computing device 102 converts the sound waves into a digital audio query 202, and the application 101 directs the digital audio query to the intelligent services engine 110. The ASR system 200 of the intelligent services engine 110 may employ functionality of the invention to correctly produce the text query of "How do I get to La Paloma" which matches the input query. Intelligent services engine 110 may process the text query to derive the intention of the audio input query 120 (the intention being to find directions to a particular place) and further extract the entity "La Paloma". Intelligent services engine 110 may then use the current GPS coordinates of the computing device 102 (which are passed to intelligent services engine 110 by the application 101) and the extracted entity "La Paloma" to interface with a third-party navigation system such as MapQuest, Google Navigation, etc. to retrieve the directions. The results provided by the third-party navigation system may then be formatted and presented by application 101 on the computing device 102.
[0035] Figure 4 illustrates a schematic representation of one embodiment of the ASR system 200. ASR system 200 receives digital audio query 202 which is a digital audio file provided by the application 101 after converting the audio input query 120 into a digital format. ASR system 200 includes a general speech module 204 for producing a general language text representation of the digital audio query 202, and a geographic speech module 208 for producing a geographic specific text representation of the digital audio query 202. General speech module 204 includes a general language model 206 that may include common words and phrases uttered in a particular natural language such as English. The query produced by the general speech module 204 may be referred to herein as the general text query. A digital audio query 202 that contains only words found in the general language model 206 may be properly recognized by the general speech module 204, but the general speech module 204 will not properly recognize words that are not included in the general language model 206 such as uncommon words, words from another language or words that are made up for a particular purpose. Last names, street names, and business (e.g. restaurant) names are examples of language that may not properly be recognized by the general speech module 204 since such words and phrases are often not contained in the general language model 206. It is to be appreciated by a person skilled in the art that although the general speech module 204 is illustrated as being part of the ASR system 200, the general speech module 204 can be stored and applied external to the central server 50. For example, the functions of the general speech module 204 can be carried out by an external third party source specializing in general speech recognition.
[0036] ASR system 200 includes a geographic speech module 208 for producing a text representation of the digital audio query 202 that may include geographic specific language (also referred to herein as the geographic text query). Geographic speech module 208 includes geographic language models 210a-n (generically, these geographic language models are referred to as "geographic language model 210" and collectively they are referred to as "geographic language models 210"). Each geographic language model may contain geographic language associated with a particular geographic area. Each particular geographic area may be defined by a square, rectangle, circle, and so forth, and in some cases geographic areas may overlap (i.e. a specific geographic location may be associated with more than one language model 210 and may contain some of the same geographic language). A language model selector 212 may be provided to determine which language model 210 is to be applied to the query 202 depending on the current location of the computing device 102. In one embodiment, the current GPS position of the computing device 102 can be sent to the intelligent services engine 110 together with the digital audio query 202. Language model selector 212 may include a predetermined grid of a broad geographic region (such as the United States) and determine in which grid cell the computing device is located. Once the particular geographic grid (area) is determined, language model selector 212 may select the geographic language model 210 associated with the identified grid, and the geographic text representation is produced using the selected geographic language model 210.
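By way of example only, one way the language model selector 212 might map device coordinates onto a predetermined grid is sketched below; the cell size, identifier scheme and model table are assumptions for illustration only.

```python
# Minimal grid lookup: floor-divide the coordinates by an assumed cell
# size to obtain a cell index, then use the index to select a model.
from dataclasses import dataclass

CELL_SIZE_DEG = 2.5  # assumed cell size in degrees; the patent leaves this open


@dataclass(frozen=True)
class GridCell:
    lat_index: int
    lon_index: int

    @property
    def identifier(self) -> str:
        return f"cell_{self.lat_index}_{self.lon_index}"


def select_grid_cell(lat: float, lon: float) -> GridCell:
    """Map the device's GPS coordinates onto the predetermined grid."""
    return GridCell(int(lat // CELL_SIZE_DEG), int(lon // CELL_SIZE_DEG))


# A selector would look the cell identifier up in a table of trained
# geographic language models (the model path here is invented).
models = {"cell_13_-33": "lm_southeast_us.bin"}
cell = select_grid_cell(33.7, -81.0)
print(cell.identifier, models.get(cell.identifier))  # cell_13_-33 lm_southeast_us.bin
```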
[0037] ASR system 200 includes an acoustic model 220 for creating statistical representations of the sounds that make up each word in the digital audio query 202. General speech module 204 and geographic speech module 208 may use the same acoustic model 220, and in some cases, different acoustic models 220 may be applied depending on the geographic region in which the computing device 102 is located, an accent in the audio input query 120, a particular environment (i.e. noise), and so forth.
[0038] ASR system 200 may include a language comparator 214 for determining differences between the general text representation produced by the general speech module 204 and the geographic text representation produced by the geographic speech module 208. In some embodiments, language comparator 214 includes a phoneme generator component and a phoneme alignment component. The phoneme generator component may produce a phonetic representation of the general text query and the geographic text query respectively. In various embodiments, the phoneme alignment component determines which phonemes of the general text query correspond to phonemes of the geographic text query.
[0039] Language comparator 214 may employ suitable search algorithms, alone or in combination, such as dynamic programming techniques, A* (star) search algorithms, the Viterbi algorithm and so forth to align phonemes of the general text query and the geographic text query. Language comparator 214 may also determine the likelihood that parts of the geographic text query better represent the audio input query 120 than corresponding parts of the general text query.
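By way of example only, a minimal dynamic-programming alignment of two phoneme sequences, in the spirit of the comparator described above, might look as follows; the unit edit costs and gap symbol are simplifying assumptions.

```python
# Needleman-Wunsch-style alignment with unit costs: build an edit-
# distance table, then trace back to pair corresponding phonemes.
def align(a: list[str], b: list[str], gap: str = "-") -> list[tuple[str, str]]:
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back to recover which phonemes correspond.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, b[j - 1]))
            j -= 1
    return list(reversed(pairs))
```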
[0040] A fusion module 216 may be provided for producing a fused text representation (i.e. fused text query, third text query, or final text query) which may combine some text from the geographic text query with the general text query. Fusion module 216 may interface with language comparator 214 to decide the components of the final text query. In various embodiments, the language comparator 214 determines the likelihood that each word (or phoneme) of the geographic text query is present in the audio input query 120, and if the likelihood for a given word is greater than a predetermined threshold, the fusion module 216 formulates the final text query by substituting that portion of the geographic text query into the general text query.
[0041] As an example, an audio input query 120 representing the phrase "How do I get to La Paloma" can be received by the application 101 running on the computing device 102. The microphone 336 of the computing device 102 receives and converts the sound waves into digital audio which is sent to the ASR system 200. ASR system 200 employs a general speech module 204 to produce a general text query and a geographic speech module 208 to recognize geographic language associated with the particular geographic area of the computing device 102. The language model selector 212 uses the GPS coordinates (which are sent by application 101 together with the digital audio) of the computing device 102 to determine the language model 210 that is associated with the particular geographic area of the computing device 102. Using the selected language model 210, the geographic speech module 208 produces a geographic text query. It is to be appreciated by a person of skill in the art that the manner by which the geographic area is determined is not particularly limited. For example, other embodiments can use an IP address of the computing device 102 when it is connected to a known Wi-Fi network to locate the computing device 102.
[0042] Continuing with the example, assume that the general text query produced by the general speech module 204 (also referred to as general language module) is "How do I get to lap mole" and the geographic text query produced by the geographic speech module is "Hausa Igeto La Paloma". The general text query contains common words from the primary language of the computing device (i.e. English) and the geographic text query contains non-English words or uncommon English words that are associated with entities (e.g. restaurants) that are within the particular geographic area of the computing device 102.
[0043] Language comparator 214 may review the general text query and the geographic text query and determine the likelihood that each word in the general text query and the geographic text query was present in the input query. Fusion module 216 may use the likelihoods to construct a final text query that may contain some words from the general text query and some words from the geographic text query. In the example, fusion module 216 constructs a final text query of "How do I get to La Paloma" using the likelihoods determined by comparator 214 and predetermined thresholds to decide whether to include a particular word in the final text query. As can be seen, the final text query in this example contains "How do I get to" from the general text query and "La Paloma" from the geographic text query. The final text query correctly matches the audio input query 120 received at the computing device 102.
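By way of example only, the threshold-based fusion described above might be sketched as follows; the word alignment, per-word likelihoods and threshold value are invented for illustration.

```python
# Fuse two aligned hypotheses: where the geographic hypothesis scores
# above the threshold, its word replaces the aligned general word.
THRESHOLD = 0.7  # assumed; a real system would tune this


def fuse(aligned_words, geo_scores):
    """aligned_words: (general_word, geographic_word) pairs; empty
    strings mark positions with no geographic counterpart."""
    final = []
    for (gen_w, geo_w), score in zip(aligned_words, geo_scores):
        final.append(geo_w if score > THRESHOLD else gen_w)
    return " ".join(w for w in final if w)


pairs = [("how", "hausa"), ("do", ""), ("i", ""), ("get", "igeto"),
         ("to", ""), ("lap", "la"), ("mole", "paloma")]
scores = [0.2, 0.0, 0.0, 0.3, 0.0, 0.9, 0.95]  # invented likelihoods
print(fuse(pairs, scores))  # -> "how do i get to la paloma"
```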
[0044] In various embodiments, ASR system 200 includes a language model updater 218 configured to update general speech module 204 and geographic speech module 208 and the models 206 and 210 associated thereto. In some embodiments (as will be described later), the language model updater 218 interfaces with content providers configured to provide entity information for particular geographic areas, and updates the language model associated with each particular geographic area accordingly at certain times such as periodic intervals.
[0045] Referring to figure 3, a block diagram of certain components of a computing device in accordance with an embodiment is indicated generally by the numeral 102.
[0046] Computing device 102 is based on a computer that includes a microprocessor 338 (also referred to herein as a processor) connected to a random access memory unit (RAM) 340 and a persistent storage device 342 that is responsible for various non-volatile storage functions of the computing device 102. Operating system software executable by the microprocessor 338 is stored in the persistent storage device 342, which in various embodiments is flash memory. It will be appreciated, however, that the operating system software can be stored in other types of memory such as read-only memory (ROM). The microprocessor 338 receives input from various input devices including the touchscreen 330, communications device 346, and microphone 336, and outputs to various output devices including the display 324, the speaker 326 and the LED indicator(s) 328. The microprocessor 338 is also connected to an internal clock 344.
[0047] In various embodiments, the computing device 102 is a two-way RF communication device having voice and data communication capabilities. Computing device 102 also includes Internet communication capabilities via one or more networks such as cellular networks, satellite networks, Wi-Fi networks and so forth. Two-way RF communication is facilitated by a communications device 346 that is used to connect to and operate with a data-only network or a complex voice and data network (for example GSM/GPRS, CDMA, EDGE, UMTS or CDMA2000 network, fourth generation technologies, etc.), via the antenna 348.
[0048] Although not shown, a battery provides power to the active elements of the computing device 102.
[0049] The persistent storage device 342 also stores a plurality of applications (such as application 101) executable by the microprocessor 338 that enable the computing device to perform certain operations including the communication operations referred to above. Other application software may be provided including, for example, an email application, a Web browser application, an address book application, a calendar application, a profiles application, and others that may employ the functionality of the invention. Various applications and services on computing device 102 may provide application programming interfaces for allowing other software modules to access the functionality and/or information made available through those interfaces.
[0050] Referring to figure 5, a flow diagram is shown that illustrates a possible lifecycle (i.e. method 400) of an audio input query 120. At 402, an audio input query 120 is received by the application 101 on the computing device 102. The microphone produces a digital audio file which is sent at step 404 to the delegate service together with the GPS coordinates of the computing device 102. The digital audio query may be streamed as it is being received at the microphone 336 to optimize processing. The application 101 on the computing device 102 may call the ASR service when an audio input query 120 is received at the computing device 102 so that the ASR service (which implements ASR system 200) can download the language model 210 that is associated with the particular geographic area of the computing device 102. The delegate service may be the entry point for queries into the intelligent services engine 110 which coordinates communication between constituent services and communicates results to the computing device 102. As will be appreciated, the delegate service, NLP service, ASR service, navigation service, and so forth may include load balancers to scale each service as required depending on load requirements.
[0051] At step 406, the delegate service may call a natural language processing (NLP) service to process the input query. The general language module may be employed initially so that the NLP service can classify the query into a particular domain. The term "domain" as used herein refers to a particular field of thought, information and action, such as weather, calendar, restaurants, email, etc. associated with a particular input query. The NLP service may then call the ASR service at step 408, which is responsible for recognizing geographic language that may be embodied in the input query. At step 410, the ASR service downloads the geographic language model, generates a geographic text query, and creates a final text query containing parts of the general text query and the geographic text query if necessary, using the language comparator and fusion module as described with reference to figure 4.
[0052] Although not shown, once the final text query is created by the ASR service and provided to the NLP service, additional rounds of statistical classification may be performed on the final text query in order to identify the intention of the audio input query 120 with increasing accuracy and to perform entity extraction. Entity extraction is a process for identifying particular entities in a query that may be necessary to perform the command intended by the input query. In the previous example, the restaurant "La Paloma" may be extracted and provided to a navigation service at step 412 together with the current location of the computing device 102 so that directions to La Paloma can be provided to the computing device 102 as intended. At step 414, the navigation results of the query are communicated to the computing device 102 and are displayed (or otherwise outputted) on the display 324 of the computing device 102 by the application 101.
[0053] The navigation service shown in figure 5 is merely exemplary given that the query "How do I get to La Paloma" is a navigation-related query. In some embodiments, intelligent services engine 110 can be configured to select and interface with a variety of internal and external services, each of which may be adapted for a particular purpose. For example, the intelligent services engine 110 may select a weather service for weather-related queries, a news service for news-related queries, a calendar service to perform calendar-related actions and so forth.
[0054] Reference is next made to figures 6-9 to describe the process and functionality for creating geographic language models associated with different geographic areas. Figure 6 shows a map of the world divided into grid cells known as World Meteorological Organization squares or WMO squares. WMO squares are a system of geocodes that divides a chart of the world with latitude-longitude gridlines (e.g. plate carrée projection, Mercator or other) into grid cells of 10° latitude by 10° longitude, each with a unique, 4-digit numeric identifier. The speech recognition functionality of the invention may further divide all or particular WMO squares (covering, for example, a specific country like the United States) into additional squares or other shapes. For example, square 502 (also designated 7307) covers a large section of the southeast coast of the United States. It may be desirable to further divide such squares into smaller units as shown in figure 7. Square 502 has been subdivided into 16 smaller geographic areas, each of which has the same area. An implementation of the invention may choose subdivisions of any desired area and shape. Square 610 is one of the smaller geographical units that lies entirely within square 502. The speech recognition functionality of the invention creates a geographic language model for each subdivision, and any query received when the computing device is in a particular geographic area (such as square 610) may be applied to the geographic language model 210 associated with the particular geographic area.
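By way of example only, the 4-digit WMO square identifier can be computed directly from latitude and longitude, as the following sketch shows; square 7307 from the example above (roughly 30-40°N, 70-80°W) is reproduced.

```python
# WMO square geocode: a quadrant digit, the tens digit of latitude,
# and the tens of longitude as two digits.
def wmo_square(lat: float, lon: float) -> str:
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7  # NE / NW
    else:
        quadrant = 3 if lon >= 0 else 5  # SE / SW
    lat_digit = min(int(abs(lat) // 10), 8)    # 90° folds into the top band
    lon_digits = min(int(abs(lon) // 10), 17)  # 180° folds into the last band
    return f"{quadrant}{lat_digit}{lon_digits:02d}"


print(wmo_square(33.0, -75.0))  # -> "7307", the southeast US coast square
```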
[0055] Circles A, B, C are centered on the center of square 610 and define predetermined geographic entity boundaries that delimit the extent of geographic language coverage for certain types of entities. For example, in one implementation, circle A may be an entity boundary for places such as businesses and restaurants, circle B may be an entity boundary for street names, and circle C may be an entity boundary for cities and towns. An implementation of the invention may use any number of entity types and any desired boundary for each entity type. When creating a geographic language model for a particular geographic area such as square 610, the language model updater 218 may access a data source for each entity type and populate the language model for a particular geographic area with entities that are within the entity boundary for each entity type. For example, in one implementation a language model for a particular geographic area (such as square 610) may include places language within a 10-mile radius of the area's center, street language within a 20-mile radius of the area's center, and city/town language within a 50-mile radius of the area's center. The updater 218 may access a different data source for each entity type that is configured to provide data for each particular entity type. As mentioned previously, updater 218 may periodically update each language model to capture changes in entity data.
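By way of example only, populating a language model's entity set using per-entity-type boundaries might proceed as sketched below; the radii mirror the 10/20/50 mile example above, while the entity records and field names are assumptions.

```python
# Keep only entities whose great-circle distance from the area's center
# is within the boundary radius assigned to their entity type.
from math import asin, cos, radians, sin, sqrt

RADII_MILES = {"place": 10, "street": 20, "city": 50}  # example boundaries


def haversine_miles(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(h))  # Earth radius of ~3959 miles


def within_boundary(center, entity):
    """entity: dict with 'name', 'type', 'lat' and 'lon' keys (assumed)."""
    distance = haversine_miles(center[0], center[1], entity["lat"], entity["lon"])
    return distance <= RADII_MILES[entity["type"]]


center = (33.9, -81.0)  # hypothetical center of square 610
entities = [{"name": "La Paloma", "type": "place", "lat": 33.95, "lon": -81.02}]
gazetteer = [e for e in entities if within_boundary(center, e)]
```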
[0056] Referring to figure 8, an exemplary flow of operations (methods) is shown for creating a language model 210 associated with a particular geographic area. In the present embodiment, the method 700 is carried out on a training server (discussed further below). It is to be appreciated by a person skilled in the art that in other embodiments, the method 700 can be carried out on the central server 50 as well. In general, the method 700 shows the general steps in the method with more detail of each step shown in the outlined circles. At step 710, a gazetteer is built for each entity type, for example for city entities, place entities and street entities. In this specification, a gazetteer is a directory (list) of entities. The city gazetteer, for example, may contain a number of city and town names within a predefined distance (e.g. radius) of the center of a given geographic area such as square 610. At step 720, a relative scoring may be used for each entity in the gazetteers to determine whether a particular entity should be included in the gazetteer for that entity type. For example, cities/towns must have a minimum threshold population, streets must have a minimum threshold length, and businesses must have a minimum rating on a content provider website in order to be included in the associated gazetteer. The particular scoring thresholds, or whether thresholds are used at all, may be tailored for a particular implementation.
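By way of example only, the scoring tests at step 720 might be expressed as follows; every cutoff value is an assumption for illustration, as the thresholds are left implementation-specific.

```python
# Relative-scoring gate for gazetteer inclusion: each entity type has
# its own test, mirroring the population/length/rating examples above.
MIN_POPULATION = 5000           # assumed city/town cutoff
MIN_STREET_LENGTH_MILES = 0.25  # assumed street cutoff
MIN_BUSINESS_RATING = 3.0       # assumed content-provider rating cutoff


def include_in_gazetteer(entity: dict) -> bool:
    if entity["type"] == "city":
        return entity["population"] >= MIN_POPULATION
    if entity["type"] == "street":
        return entity["length_miles"] >= MIN_STREET_LENGTH_MILES
    if entity["type"] == "place":
        return entity["rating"] >= MIN_BUSINESS_RATING
    return False
```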
[0057] At step 730 the gazetteers may be saved to a server, and at step 740, a training sentence corpus (also referred to as a sentence template herein) is built from existing data. The training sentences may be generated by recognizing common audio input queries 120 and replacing their entities with entity variables. A large number of training sentences can then be generated by substituting actual entities from the gazetteer into the entity variables. For example, it may be found that in a given implementation, a common input query 120 is of the form "get me directions from New York to Cincinnati". The entities may be extracted from this sentence and replaced by entity variables, leading to the sentence pattern "get me directions from X to Y". Using this sentence pattern, and perhaps other sentence patterns, a large number of training sentences may be generated by substituting various entities into the entity variables. For example, using the above sentence pattern, additional sentences such as "get me directions from Montreal to Laval", "get me directions from San Francisco to San Mateo", "get me directions from San Jose to Sunnyvale", "get me directions from Brooklyn to Yonkers", and so forth may be generated.
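By way of example only, the substitution step can be sketched in a few lines; the sentence pattern and city list echo the example above.

```python
# Expand one sentence pattern into training sentences by substituting
# every ordered pair of gazetteer cities into the entity variables.
from itertools import permutations

pattern = "get me directions from {X} to {Y}"
cities = ["New York", "Cincinnati", "Montreal", "Laval"]

training_sentences = [pattern.format(X=x, Y=y) for x, y in permutations(cities, 2)]
# 4 cities yield 12 sentences, e.g. "get me directions from Montreal to Laval"
```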
[0058] Such training sentences may be generated to create a language model configured to recognize predetermined queries in a particular geographic area. At step 750, a language model is created for each particular geographic area (i.e. square) which may then be deployed for use.
[0059] Entity information may be acquired from a variety of sources such as Open Street Map. Each gazetteer (i.e. directory of information for a certain entity type) may be limited to a particular number of results 724, 726, 728 depending on the type of entity. For example, places information may be restricted to a number (for example, 1000) so that only that number of businesses is used to create training sentences. In various embodiments, the entity boundary (i.e. radius) for a particular entity type may be adjusted so that the number of entities listed in the gazetteer for that particular entity type is limited to a certain number of entities. In the present embodiment, it can be assumed that the result 728 is not added to the gazetteer. The score 732, 734 for each entity (city, street, etc.) may be stored together with the entity in a server for use by the ASR system or another software service.
[0060] At 742, queries already uttered, collected or generated are grouped together for a particular domain (e.g. navigation). The entities may be removed as described above and replaced with entity variables (such as X and Y, FROM_LOCATION and TO_LOCATION, etc.) to create sentence templates. The manner by which sentence templates are created is not particularly limited. For example, some embodiments can include sentence templates obtained from an external source. In the present embodiment, the sentence templates may be obtained from actual queries received by the intelligent services engine 110, and grouped together so that the most popular queries may be found at step 744. Once the entities are replaced by entity variables, queries of the same form will look alike. A predetermined number of sentence templates, for example, the most commonly uttered queries (e.g. the 20 most common queries), may then be identified and used to generate training sentences by substituting entities from a gazetteer into the associated entity variables.
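By way of example only, turning collected queries into sentence templates might be sketched as follows; the slot naming and substring matching are simplifying assumptions.

```python
# Replace known entities with numbered slots (in order of appearance),
# then count identical templates to surface the most popular patterns.
from collections import Counter


def to_template(query: str, known_entities: list[str]) -> str:
    found = sorted((query.find(e), e) for e in known_entities if e in query)
    for slot, (_, entity) in enumerate(found):
        query = query.replace(entity, f"{{ENTITY_{slot}}}")
    return query


queries = [
    "get me directions from New York to Cincinnati",
    "get me directions from Montreal to Laval",
]
entities = ["New York", "Cincinnati", "Montreal", "Laval"]
counts = Counter(to_template(q, entities) for q in queries)
# Both queries collapse to "get me directions from {ENTITY_0} to {ENTITY_1}".
```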
[0061] At 752, the entities are downloaded from each entity gazetteer, and a dictionary is built at step 754. Building the dictionary may involve organizing the entities in alphabetical order. At 756, a plurality of training sentences are generated using the common queries collected at 744. Finally, at 758, a language model is built for each particular geographic area using the generated sentences for the particular area.
[0062] Turning to figure 9, an exemplary flow of operations (methods) is shown at 800 for creating a plurality of particular language models associated with particular geographic areas that cover a larger area such as a country, continent, region and so forth. At 802, the larger area is divided into particular geographic areas such as the WMO squares shown in figures 6 and 7. Each particular geographic area is given an identifier and the geographic demarcation of each area is stored in association with its identifier. At 804, for each geographic area, gazetteers are built that contain entity information for each entity type. Creating entity gazetteers may involve establishing an entity boundary for each entity type (e.g. the radii shown in figure 7 that are centered on the center of the particular area) and populating the gazetteers with a predetermined number of entities that are within each entity boundary. A third-party data source or content provider such as Open Street Map, Yelp, Google Places, and the like may be used to provide entity information. At 806, common queries for a particular domain that requires geographic ASR are identified, and the entities are replaced with entity variables to create sentence patterns. The common sentences may be used for each language model since a query of the form "How do I get from A to B", in which "A" and "B" are variables, is likely to be useful. At 808, a predetermined number of training sentences are generated using the common queries by substituting entities from the gazetteers into the entity variables. The generated sentences are unique for each geographic area since they are formed by substituting entities for a particular area into the sentence patterns. In this way a large number of training sentences may be generated using a relatively small number of common queries. At 810, a language model may be built for each particular geographic area. It is to be appreciated that the method 800 is not particularly limiting and that optional improvements can also be implemented. For example, each language model can be adapted to accommodate environmental background noises associated with a geographic area, as well as accents.
[0063] Reference is made to figure 10 to describe exemplary operations at 900 for recognizing geographic language that may be embodied in an input query. At 902, an input query is received as a digital audio file at an intelligent services engine 110 for processing. At 904, a general text query is created using a general language ASR module, so that preliminary classification may be performed on the general query. At 906, classification (statistical, rules-based, ontology, etc.) is performed on the general text query to determine if the input query requires geographic speech recognition, such as by determining the domain associated with the input query. For example, a query such as "Get me directions to Breithaupt Street" may require geographic speech recognition (since navigation queries may include geographic language not found in a general language model) while a query such as "Wake me up at 6 am" may not require geographic speech recognition (since the query involves setting an alarm and the language in the query is likely found in a general language model). If the query does not require geographic speech recognition (as determined at 908), the query may be further processed by a natural language processing module at 910 and the requested task may be performed according to the intention of the input query. Additional natural language processing at 910 may involve additional rounds of classification to further identify the intention of the query or entity extraction to extract relevant entities from the query.
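By way of example only, the decision at 908 might reduce to a simple domain gate; the set of domains that require geographic recognition is an assumption for illustration.

```python
# Domains whose queries tend to contain geographic language not found
# in a general language model (assumed list).
GEO_DOMAINS = {"navigation", "restaurants", "local_search"}


def needs_geographic_asr(domain: str) -> bool:
    return domain in GEO_DOMAINS


print(needs_geographic_asr("navigation"))  # True  ("Get me directions to ...")
print(needs_geographic_asr("alarm"))       # False ("Wake me up at 6 am")
```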
[0064] If geographic speech recognition is required (as decided at 908), the flow of operations continues via the 'yes' branch to 912, at which the geographic language model associated with the location of the computing device 102 is downloaded. At 914, a geographic text query is generated from the digital audio using the downloaded geographic language model. At 916, a fused (i.e. final) text query is created that may contain a portion of the general text query and the geographic text query, and the final text query is directed to 910 for additional natural language processing if required.
[0065] Referring to Figure 11, another embodiment of a computer network system is shown generally at 100a. Like components of the system 100a bear like reference to their counterparts in the system 100, except followed by the suffix "a". The system 100a includes a central server 50a and a plurality of computing devices 102a-1, 102a-2, ... 102a-n. In addition, the system 100a also includes a training server 54a. The central server 50a, the computing devices 102a, and the training server 54a are connected by a network 106a. The network 106a is not particularly limited and can include any type of network such as those discussed above in connection with the network 106.
[0066] In the present embodiment, the central server 50a can be any type of computing device generally used to receive input, process the input and provide output. The central server 50a is not particularly limited and can include a variety of different devices such as those described above in connection with the central server 50.
[0067] In the present embodiment, the system 100a further includes a training server 54a. The training server 54a can be any type of computing device capable of communicating with the central server 50a as well as receiving training input. It is to be appreciated that the training server 54a includes programming instructions in the form of codes stored on a computer readable medium for training a geographic language model. For example, the training server 54a can be any one of a personal computer, a laptop computer, a portable electronic device, a gaming device, a mobile computing device, a portable computing device, a tablet computing device, a personal digital assistant, a cell phone, a smart phone or the like. In the present embodiment, the training server 54a is generally configured to create a geographic language model by carrying out a method such as the method 700 previously described.
[0068] It is to be re-emphasized that the system 100a described above is a non-limiting representation only. For example, although the network 106a of the present embodiment shown in figure 11 connects the central server 50a, the training server 54a, and the computing devices 102a, other embodiments can include separate networks for connecting the central server 50a, the training server 54a, and the computing devices 102a separately.
[0069] Referring to figure 12, a schematic block diagram showing various components of the computing device 102a is provided. Like components of the computing device 102a bear like reference to their counterparts in the computing device 102, except followed by the suffix "a". It should be emphasized that the structure in figure 12 is purely exemplary and several different implementations and configurations for the computing device 102a are contemplated. The computing device 102a includes a microprocessor 338a connected to a random access memory unit (RAM) 340a and a persistent storage device 342a. Operating system software executable by the microprocessor 338a is stored in the persistent storage device 342a. The microprocessor 338a receives input from various input devices including the touchscreen 330a, communications device 346a having an antenna 348a, and microphone 336a, and outputs to various output devices including the display 324a, the speaker 326a and the LED indicator(s) 328a. The microprocessor 338a is also connected to an internal clock 344a. In addition, the computing device 102a includes a global positioning system unit 350a having a global positioning receiver to determine the geographic coordinates of the computing device 102a.
[0070] Referring to figure 13, a schematic block diagram showing various components of the training server 54a is provided. It should be emphasized that the structure in figure 13 is purely exemplary and several different implementations and configurations for the training server 54a are contemplated. The training server 54a includes a processor 360a, a network interface 320a, a memory storage unit 370a, and an optional input device 380a.
[0071] The network interface 320a is not particularly limited and can include various network interface devices such as a network interface controller (NIC) capable of communicating with the central server 50a across the network 106a. In the present embodiment, the network interface 320a is generally configured to connect to the network 106a via a standard Ethernet connection.
[0072] The memory storage unit 370a can be of any type such as non-volatile memory (e.g. Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory, hard disk, floppy disk, optical disk, solid state drive, or tape drive) or volatile memory (e.g. random access memory (RAM)). In the present embodiment, the memory storage unit 370a is generally configured to store sentence templates, gazetteers, and results of training. In addition, the memory storage unit 370a is configured to store codes for directing the processor 360a for carrying out computer implemented methods. For example, the codes can include the programming instructions 365a further described below.
[0073] The optional input device 380a is generally configured to receive training input from a trainer. It is to be appreciated that the input device 380a is not particularly limited and can include a keyboard, microphone, a pointer device, a touch sensitive device, or any other device configured to generate signals in response from the external environment. For example, in the present embodiment, the input device 380a is a microphone configured to receive audio input. In other embodiments, the input device can be omitted and training input can be received via the network interface 320a, such as when the training is carried out remotely (e.g. using a computing device 102a, etc.) or crowdsourced over a network.
[0074] The processor 360a is not particularly limited and is generally configured to execute programming instructions 365a for creating a geographic language model. The manner by which the geographic language model is created is not particularly limited and can include the steps described in the method 700.
[0075] While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. For example, any of the elements associated with intelligent services engine 110 and ASR system 200 may employ any of the desired functionality set forth hereinabove. Furthermore, in various embodiments the intelligent services engine 110, ASR system 200, and all other systems, modules, models, etc. may have more or fewer components than described herein to employ the desired functionality set forth herein. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described embodiments.
[0076] Headings of sections provided in this patent application and the title of this patent application are for convenience only, and are not to be taken as limiting the disclosure in any way.
[0077] Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.

Claims

What is claimed is:
1. A server for converting an audio input query to a text query, the server comprising: a network interface for receiving the audio input query from a computing device; a memory storage unit for storing programming instructions, the audio input query, and a geographic language model associated with a geographic area; and a processor in communication with the network interface and the memory storage unit, wherein the programming instructions direct the processor to receive a general text representation of the audio input query and to access the geographic language model and apply the audio input query to a geographic speech recognition system, the geographic speech recognition system using the geographic language model to recognize geographic specific language embodied within the audio input query, the geographic speech recognition system producing a geographic specific text representation of the audio input query, wherein the programming instructions further direct the processor to create the text query by substituting at least a portion of the geographic specific text representation into the general text representation.
2. The server of claim 1, wherein the general text representation is generated by a third party source.
3. The server of claim 1, wherein the memory is further configured to store a general language model.
4. The server of claim 3, wherein the programming instructions further direct the processor to access the general language model, to apply the audio input query to a general speech recognition system, the general speech recognition system comprising the general language model configured to recognize general language embodied within the audio input query, the general speech recognition system producing the general text representation of the audio input query.
5. The server of any one of claims 1 to 4, wherein the programming instructions further direct the processor to receive geographic coordinates associated with the geographic area.
6. The server of claim 5, wherein the geographic coordinates are based on data from a global positioning receiver of the computing device.
7. The server of claim 5, wherein the geographic coordinates are based on an IP address of the computing device.
8. A method of converting an audio input query to a text query, the method comprising: receiving the audio input query from a computing device; receiving a general text representation of the audio input query; applying the audio input query to a geographic speech recognition system using a geographic language model to recognize geographic specific language embodied within the audio input query, wherein the geographic speech recognition system produces a geographic specific text representation of the audio input query; and creating the text query by substituting at least a portion of the geographic specific text representation into the general text representation.
9. The method of claim 8, further comprising generating the general text representation using a third party source.
10. The method of claim 8, wherein receiving the general text representation comprises receiving the general text representation from a general speech recognition system.
11. The method of claim 10, further comprising applying the audio input query to the general speech recognition system using a general language model, wherein the general speech recognition system produces the general text representation of the audio input query.
12. The method of any one of claims 8 to 11, further comprising receiving geographic coordinates associated with a geographic area.
13. The method of claim 12, wherein receiving the geographic coordinates comprises receiving data from a global positioning receiver of the computing device.
14. The method of claim 12, wherein receiving the geographic coordinates comprises receiving a location based on an IP address of the computing device.
15. A training server for generating a geographic language model, the training server comprising: a network interface configured to access a data source comprising entity language associated with a geographic area; a memory storage unit for storing programming instructions and data received from the data source; and a processor in communication with the network interface and the memory storage unit, wherein the programming instructions direct the processor to identify geographic coordinates associated with the geographic area, to generate a plurality of training sentences by substituting geographical language into a sentence template, and to receive training data associated with the plurality of training sentences to generate the geographic language model.
16. The training server of claim 15, wherein the data within the data source comprise street entities, place entities and city entities.
17. The training server of claim 16, wherein the place entities comprise business information.
18. The training server of claim 17, wherein the data source is populated with entities located within a predefined distance of the geographic area.
19. The training server of claim 18, wherein the predefined distance is determined by establishing a particular radius away from a center determined for the geographic area.
20. The training server of claim 19, wherein the street entities, place entities and city entities are associated with their own particular radius, and the data source is populated with street entities, place entities and city entities that are within the predetermined radius respective to each entity type.
21. The training server of claim 20, wherein the particular radii are determined so that a predetermined number of entities of each type are included in the data source.
22. The training server of any one of claims 15 to 21, wherein the sentence template is identified from common sentence patterns uttered by users to a natural language system.
23. A computer-implemented method of creating a language model associated with a particular geographic area, the language model for deploying with a natural language system for accepting queries uttered by users, the method comprising: identifying geographic coordinates associated with the particular geographic area; accessing a data source comprising entity language associated with the particular geographic area; obtaining a sentence template; generating a plurality of training sentences by substituting geographical language into the sentence template; and training a language model using said plurality of training sentences.
24. The method of claim 23, wherein the entities within the data source comprise street entities, place entities and city entities.
25. The method of claim 24, wherein the place entities comprise business information.
26. The method of claim 25, wherein the data source is populated with entities located within a predefined distance of the particular geographic area.
27. The method of claim 26, wherein the predefined distance is determined by establishing a particular radius away from a center determined for the particular geographic area.
28. The method of claim 27, wherein the street entities, place entities and city entities are associated with their own particular radius, and the data source is populated with street entities, place entities and city entities that are within the predetermined radius respective to each entity type.
29. The method of claim 28, wherein the particular radii are determined so that a predetermined number of entities of each type are included in the data source.
30. The method of any one of claims 23 to 29, wherein the sentence template is identified from common sentence patterns uttered by users to the natural language system.
31. A computer-implemented method of creating location specific language models for a territory, the method comprising: dividing the territory into a predetermined collection of sub-territories and assigning an identifier to each sub-territory; accessing a data source, for each territory, containing language associated with each sub-territory; obtaining at least one sentence template; creating a plurality of training sentences for each sub-territory by substituting language in the data source associated with the sub-territory into the at least one sentence template; and training a language model for each sub-territory using said plurality of training sentences for the sub-territory.
32. A computer-implemented method of recognizing language embodied within a query uttered by a user associated with a geographic area, the method comprising: retrieving geographic coordinates associated with the geographic area; accessing a general language model; applying the query to a general speech recognition system, the general speech recognition system comprising the general language model configured to recognize general language embodied within the query, the general speech recognition system producing a general text representation of the query; accessing a geographic language model associated with the geographic coordinates; applying the query to a geographic speech recognition system, the geographic speech recognition system comprising the geographic language model, and configured to recognize geographic specific language embodied within the query, the geographic speech recognition system producing a geographic specific text representation of the query; and creating a third text representation of the query by substituting at least a portion of the geographic specific text into the general text representation.
33. A computer program, residing on a computer-readable medium, for recognizing geographic specific language embodied in a query associated with a geographic area uttered by a user of a natural language system, and comprising instructions for causing a computer to: associate the geographic area with an identifier, the identifier associated with geographic coordinates of the geographic area; access a data source comprising entity language associated with the geographic area; obtain at least one sentence template; generate a plurality of training sentences by substituting geographical language into the at least one sentence template; and train a geographic language model using said plurality of training sentences to allow geographic specific language embodied within the query to be recognized by a speech recognition module employing the geographic language model.
PCT/IB2015/055825 2014-08-01 2015-07-31 Speech recognition using models associated with a geographic location WO2016016863A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP15827991.9A EP3198593A4 (en) 2014-08-01 2015-07-31 Speech recognition using models associated with a geographic location

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462032160P 2014-08-01 2014-08-01
US62/032,160 2014-08-01

Publications (1)

Publication Number Publication Date
WO2016016863A1 (en) 2016-02-04

Family

ID=55216838

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2015/055825 WO2016016863A1 (en) 2014-08-01 2015-07-31 Speech recognition using models associated with a geographic location

Country Status (2)

Country Link
EP (1) EP3198593A4 (en)
WO (1) WO2016016863A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5524169A (en) * 1993-12-30 1996-06-04 International Business Machines Incorporated Method and system for location-specific speech recognition
US20110144973A1 (en) * 2009-12-15 2011-06-16 At&T Intellectual Property I, L.P. System and method for combining geographic metadata in automatic speech recognition language and acoustic models
US8219384B2 (en) * 2010-05-26 2012-07-10 Google Inc. Acoustic model adaptation using geographic information
US20130346077A1 (en) * 2012-06-21 2013-12-26 Google Inc. Dynamic language model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of EP3198593A4 *
Stent et al.: "Geo-Centric Language Models for Local Business Voice Search", Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), June 2009, pages 389-396, XP055069543 *

Also Published As

Publication number Publication date
EP3198593A4 (en) 2019-02-20
EP3198593A1 (en) 2017-08-02

Similar Documents

Publication Publication Date Title
US11817101B2 (en) Speech recognition using phoneme matching
EP3032532B1 (en) Disambiguating heteronyms in speech synthesis
US10719507B2 (en) System and method for natural language processing
US11069336B2 (en) Systems and methods for name pronunciation
US10431204B2 (en) Method and apparatus for discovering trending terms in speech requests
US20190370398A1 (en) Method and apparatus for searching historical data
US10388269B2 (en) System and method for intelligent language switching in automated text-to-speech systems
US8666743B2 (en) Speech recognition method for selecting a combination of list elements via a speech input
EP2460155B1 (en) Method for improving speech recognition accuracy by use of geographic information
JP5334178B2 (en) Speech recognition apparatus and data update method
US20150081294A1 (en) Speech recognition for user specific language
CN109686362B (en) Voice broadcasting method and device and computer readable storage medium
Withanage et al. Voice-based road navigation system using natural language processing (NLP)
JP6477648B2 (en) Keyword generating apparatus and keyword generating method
US11430434B1 (en) Intelligent privacy protection mediation
CN100517463C (en) Speech synthesis system and method
WO2016016863A1 (en) Speech recognition using models associated with a geographic location
KR102369923B1 (en) Speech synthesis system and method thereof
US10304454B2 (en) Persistent training and pronunciation improvements through radio broadcast
Subashini A smart and elegant technology for interaction between driver and 4 wheeler with voice commands and route guidance
Brutti et al. Use of Multiple Speech Recognition Units in an In-car Assistance System

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 15827991
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

REEP Request for entry into the european phase
    Ref document number: 2015827991
    Country of ref document: EP