WO2007082536A1 - Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech - Google Patents

Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech

Info

Publication number
WO2007082536A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
mobile unit
image
database
Application number
PCT/DK2007/000020
Other languages
French (fr)
Inventor
Lars Ballieu Christensen
Flemming Ast
John Kristensen
Original Assignee
Motto S.A.
Priority date: 2006-01-17
Filing date: 2007-01-17
Priority claimed from LU91213B1
Application filed by Motto S.A.
Priority to EP07700159A (EP1979858A1)
Publication of WO2007082536A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/12: Detection or correction of errors, e.g. by rescanning the pattern
    • G06V30/127: Detection or correction of errors, e.g. by rescanning the pattern with the intervention of an operator
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/146: Aligning or centring of the image pick-up or image-field
    • G06V30/1463: Orientation detection or correction, e.g. rotation of multiples of 90 degrees
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/26: Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262: Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/268: Lexical context
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Abstract

A mobile unit with a computer and a camera, the computer being configured to receive captured images as digital data from the camera and to extract text in the captured images by an optical character recognition (OCR) routine and to convert the text from an image format into a text format. The mobile unit further comprises a text database with text words, and the computer is configured to compare the converted text with words in the text database and only to accept the converted text as resembling the imaged text in case of agreement with words in the database. The converted text may then be transformed into synthetic speech.

Description

Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech
FIELD OF THE INVENTION
The present invention relates to a mobile unit with a computer and a camera, the computer being configured to receive captured images as digital data from the camera, to extract text in the captured images by an optical character recognition (OCR) routine, and to convert the text from an image format into a text format, for example for subsequent conversion into comprehensible speech.
BACKGROUND OF THE INVENTION
Dyslexia can occur in different degrees of impaired ability to read and write. In the mild form, dyslexic persons may be able to read and even write, though having difficulty with correct spelling. Modern aids, such as spell checking in computers, have helped many dyslexic people to live without severe difficulties. However, in more pronounced cases, where the people are not able to read or understand written text in public space, the result may be an inability to move around and travel without the assistance of a person who can read. This lack of ability to read and write often creates frustration and reduced self-esteem, with aggressive or insecure behaviour in daily life.
Mobile telephones have achieved wide acceptance among dyslexic people, not only for speaking but also in connection with text messages, because the word processor has a built-in spelling aid. In patent application US 2001/0056342 by Piehn et al., a camera is disclosed that is able to transform the text in a taken image into synthetic speech. The camera also has software routines that translate text from one language to another.
Often, images taken with text are difficult to transform correctly into text files that can be converted into synthetic speech, because poor focus, image distortion or objects obscuring the text reduce the chances for correct character recognition. As a result, the synthetic speech may differ from the text in the image. Thus, there is still a need for improvements in connection with reading aids for dyslexic people. In particular, there is a need for technical improvements in the recognition of text from a captured image into text format.
DESCRIPTION / SUMMARY OF THE INVENTION
It is therefore the object of the invention to provide a mobile unit with a camera, sufficient image optimisation capabilities and optical character recognition (OCR), where the probability for a correct recognition of the text is increased, for example in order to increase the correctness of a subsequent speech generated from the image.
This object is achieved with a mobile unit with a computer and a camera, the computer being configured to receive captured images as digital data from the camera and to extract text in the captured images by an optical character recognition (OCR) routine and to convert the text from an image format into a text format. The mobile unit further comprises a text database with text words, and the computer is configured to compare the converted text with words in the text database and only to accept the converted text as resembling the imaged text in case of agreement with words in the database. As an option, the text is translated into synthetic speech using a text-to-speech engine.
The invention is preferably implemented in a mobile telephone having a camera and a generator of synthetic speech. However, the invention is of more general character and can be implemented in other mobile units, such as a PDA without mobile telephone.
The advantage of the mobile unit according to the invention is the additional routine of checking potential text extracted from images against words in a database. The term words also includes one-letter words and parts of longer words, such as syllables. When text has been extracted from an image, it may be that some of the characters in the text have been recognised erroneously, rendering the final text meaningless. This may happen in case of poor focus, in case of partly obscured text, in case the text is in a different language, or if the extraction routine misinterprets the image and considers part of the image as text even though there is no text in this part of the image, or at least no recognisable text. In case the converted text is accepted as resembling the text in the image, this may be indicated by the mobile unit.
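A minimal sketch of this acceptance check follows, assuming a simple set-based word database; the function names, the tokenisation and the all-words-must-match policy are illustrative assumptions, not taken verbatim from the patent.

```python
def accept_converted_text(converted: str, word_db: set[str]) -> bool:
    """Accept OCR output only if every token agrees with the database."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in converted.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False  # nothing recognisable: reject rather than accept noise
    return all(t in word_db for t in tokens)

word_db = {"north", "exit", "street", "station"}  # stands in for the on-device text database
print(accept_converted_text("North Exit", word_db))  # True: accepted as resembling the imaged text
print(accept_converted_text("N0rth 3x1t", word_db))  # False: recognition is not accepted
```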
Thus, with the invention, the possibility that a user is confronted with meaningless text is drastically reduced.
In modern life in the industrialised world, it is customary to possess and use a mobile telephone on a daily basis. Modern mobile telephones comprise built-in cameras with a rather high optical resolution and zooming properties. Therefore, in a preferred embodiment of the invention, the camera according to the invention is implemented in a mobile telephone. This implies that a dyslexic person does not need to carry additional equipment apart from the mobile phone, which is carried along most of the time anyway. Also, use of a mobile phone for photographing text or signs would not be recognised as something remarkable. In addition, the mobile unit according to the invention may comprise a synthetic speech generator for submitting the extracted text to the user by a synthetic voice. The synthetic speech can be listened to through earphones, which already are widely used in connection with mobile phones. For convenience, the earphones may be wireless, for example by utilising Bluetooth technology. Thus, the dyslexic may use an important aid without the risk of being revealed as being disabled.
In addition, the computer may comprise routines that check whether the converted text as a whole makes sense, for example, whether the grammar is correct and whether the words are related to each other.
In certain cases, names of products or companies may imply words that are not found in the database, but with which the user nevertheless is familiar. In cases of missing acceptance, the mobile unit may be configured to request the user to indicate whether a phrase shall be accepted nevertheless, despite its missing counterpart in the database.
Furthermore, the mobile unit according to the invention may be configured, in case of missing acceptance, to amend the initially converted text slightly in order to make it fit existing words, letters, sequences of words, combined words, and/or parts of sentences in the database. The initially converted text and the amended converted text may be presented to the user as options among which the user may choose the apparently most correct version. For example, the mobile unit may present several possibilities and request the user to indicate whether there is a version which seems to resemble the text in the image. The selection is subsequently stored in the database.
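One way such amendment proposals could be generated is sketched below: for a token missing from the database, the closest database entries are offered so the user can choose among versions. Using difflib with a 0.7 similarity cutoff is an assumed matching policy; the patent does not prescribe a particular algorithm.

```python
import difflib

def propose_amendments(token: str, word_db: list[str], n: int = 3) -> list[str]:
    # return up to n database words close to the raw OCR token
    return difflib.get_close_matches(token.lower(), word_db, n=n, cutoff=0.7)

options = propose_amendments("stat1on", ["station", "nation", "caution"])
print(options)  # ['station'] would be offered alongside the raw OCR token
```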
In addition, the mobile unit may be configured to base such proposals or to base the amendment or acceptance of the text on earlier choices by the user.
In certain situations, the photographed text may contain special words such as technical terms. This may cause problems if these terms are specialised in certain fields where the meaning of the terms differs from the normal meaning of the word. For example, a word such as "leg" means a different thing in medical treatment than in mechanical fittings. In this case, in a further embodiment of the invention, the database may contain special technical dictionaries with such terms. The user may indicate whether the special dictionaries shall be used during the comparison with the converted text.
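A sketch of this option, assuming the comparison database is assembled from a general word list plus whichever special dictionaries the user has switched on; the dictionary names and contents are invented for illustration.

```python
GENERAL = {"leg", "table", "street"}
SPECIAL = {
    "medical": {"femur", "tibia"},
    "mechanical": {"flange", "bracket"},
}

def build_word_db(enabled: list[str]) -> set[str]:
    # merge the general database with the user-enabled technical dictionaries
    db = set(GENERAL)
    for name in enabled:
        db |= SPECIAL.get(name, set())
    return db

print("femur" in build_word_db(["medical"]))  # True only when the medical dictionary is enabled
```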
Special words or names may, among others, appear on traffic signs, which are among the most important items for the dyslexic to read. Therefore, in a further embodiment, the unit also comprises a database with traffic sign text, and the mobile unit is configured, upon specific request from a user of the mobile unit, to compare the extracted text from a captured image with the traffic sign text in the database.
Once the extracted text is converted, the user may want to store it for later use, for example to send it as an SMS or IP message to another user. Therefore, the mobile unit is configured upon conversion of the extracted text to request an action from a user of the mobile unit for storing the converted text in a database or data memory in text format.
If the extraction of text has worked in principle by using the OCR routine, but the converted text does not make sense according to the corresponding control routines in the computer, the user may wish to store the result in order to control the meaning at a later stage, possibly with help from others. Therefore, in a further embodiment, the mobile unit is configured in case of missing accept of the converted text to request an indication from a user as to whether the converted text is to be stored in the database or data memory.
In cameras of this kind, the optics are typically of a quality with image distortions near the image edge. This implies that the photographed text may be curved, making the recognition by the software more difficult. Therefore, in a further embodiment, the camera is configured with an image distortion correction routine to correct the image in such a way that distortions are reduced; especially, curved parts of the images are straightened out. For example, the camera may be configured to perform the distortion correction with an algorithm applied to every taken image.
The algorithm may be constructed such that the correction is performed in dependence on the performance of the optics. If the optics are known, the algorithm can be adjusted to perform the correction in a way specific to that type of optics.
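A sketch of such an optics-dependent correction: when the lens type is known, a fixed camera matrix and radial distortion coefficients can be applied to every captured frame. All numbers below are placeholders for an assumed lens, not calibration data from the patent.

```python
import cv2
import numpy as np

def undistort_frame(img: np.ndarray) -> np.ndarray:
    h, w = img.shape[:2]
    camera_matrix = np.array([[w, 0, w / 2],
                              [0, w, h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.array([-0.25, 0.08, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3 for the assumed optics
    return cv2.undistort(img, camera_matrix, dist_coeffs)  # straightens curved text near the edges
```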
Another difficulty with cheap mass-produced cameras, such as those in mobile telephones, is that the image resolution is not very high; this is partly due to the number of pixels in the CCD chip of the camera and partly due to the limited amount of memory available, which causes the software of the camera to store images in a lower-resolution format. If necessary, the number of pixels may be increased artificially in a software routine.
In order to increase the applicability of the apparatus according to the invention in cases where images of text are taken slightly out of focus, in a further embodiment, the camera according to the invention is configured, by suitable software routines, for example using Fourier analysis and/or high-pass filtering, to compensate for low resolution in the image due to defocusing. If application of the corresponding software routine does not result in a satisfactory image, a new image has to be taken.
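A sketch of the two software compensations just described: an artificial pixel increase by simple replication, and a Fourier-domain high-pass boost that sharpens edges in a greyscale image. The radius and boost factor are assumed tuning values, not figures from the patent.

```python
import numpy as np

def upscale2x(gray: np.ndarray) -> np.ndarray:
    # artificial increase of the number of pixels (nearest-neighbour replication)
    return np.repeat(np.repeat(gray, 2, axis=0), 2, axis=1)

def sharpen_via_fft(gray: np.ndarray, radius: int = 20, boost: float = 1.5) -> np.ndarray:
    f = np.fft.fftshift(np.fft.fft2(gray.astype(float)))
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    high = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 > radius ** 2
    f[high] *= boost  # amplifying high frequencies emphasises edges blurred by defocus
    out = np.real(np.fft.ifft2(np.fft.ifftshift(f)))
    return np.clip(out, 0, 255).astype(np.uint8)
```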
In order for the camera according to the invention and the text-to-speech translation program according to the invention to be as user friendly as possible, the invention may be based on a Windows® CE platform. This widely used platform for handheld units, for example mobile telephones or PDAs (personal digital assistants), is in structure very similar to the Windows® programs on stationary computers, which in turn are widely used as well. As dyslexic people often are familiar with the Windows® programs on stationary computers, an implementation of the OCR program in a mobile unit according to the invention is a further help for dyslexic people, in as much as they are not forced to learn a new platform, which is a much more tedious task for dyslexic people than for others.
In some cases, images may contain text passages that are partly obscured by objects such as dirt or rain on text boards or on traffic signs containing the text. In a further embodiment, the camera according to the invention includes a routine that corrects image obstructions due to such objects.
Sometimes text is written not horizontally on a text board or the like but vertically. This is especially true for many advertisements, for example names of shops and hotels on building walls. Commercially available OCR programs are able to recognise this, and such recognition is also implemented in the invention. This, however, requires that the letters are correctly oriented and placed one letter below the other. In case a text is at a large angle, for example 90 degrees, relative to the camera, commercially available OCR routines are typically programmed to try to recognise the letters as a vertical text, where one letter is placed below the other; in this case, the recognition will fail. According to the invention, the camera is therefore programmed to rotate the entire captured image successively by 90 degrees, 180 degrees and 270 degrees if a proper reading fails. Images may also be captured where the text is not truly horizontal but deviates by a certain angle from the horizontal. Commercially available software programs are configured to recognise letters nevertheless, and this is also implemented in the invention. However, when letters in the image deviate by angles of more than 40 degrees from the horizontal, proper letter recognition often fails, because in this case the software is configured to assume a vertical text instead. In order to solve this problem, the camera may be programmed to rotate the entire captured image successively by a certain angle, for example 30 degrees or 45 degrees, if a proper extraction fails. After each rotation about this predetermined angle, a new attempt at extraction is performed, until the image has been rotated by 360 degrees. Alternatively, the image is rotated in one, two or three 90 degree steps.
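A sketch of this rotation retry loop: attempt extraction, rotate the whole image by a fixed step, and retry until text is found or a full turn has been tried. A step of 90 degrees is shown; 30 or 45 degrees works the same way. pytesseract is an assumed OCR backend, as the invention only requires some OCR routine.

```python
from PIL import Image
import pytesseract  # assumed OCR backend

def extract_with_rotation(img: Image.Image, step: int = 90) -> str | None:
    for angle in range(0, 360, step):
        text = pytesseract.image_to_string(img.rotate(angle, expand=True)).strip()
        if text:  # a full system would also apply the database acceptance check here
            return text
    return None  # extraction failed at every tried orientation
```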
In a further embodiment, the mobile unit according to the invention is combined with route planner and/or navigation software, for example a commercial product such as TomTom®. The implemented program comprises routines that use the recognised text in a captured image, such as text indicative of a location, for example a road sign and a house number, in combination with the route planner and/or navigation software. For example, the dyslexic person may image a road sign and a building number and as a result receive a synthetic voice message explaining the way from the actual location to a certain other location, for example the home of the person. Alternatively or in addition, the route planner in a mobile unit according to the invention may be configured to show the location and the route on a map, or even to explain the route by means of buildings which the dyslexic finds along the route.
In addition, if the user images a name of a location, for example a street name, a GPS (Global Positioning System) signal receiver and location routine may be used for finding possible location names at the actual GPS location, for example in a digital name database or in a digital map. By comparison of the imaged name converted into text with the possible location names, the correct location name may be found quickly. In order to follow the dyslexic through the town, the mobile unit according to the invention may comprise a GPS receiver such that the dyslexic person can be guided to the desired location. For example, when visiting a location, the dyslexic may have photographed - for example from a separate tourist brochure - a number of names of locations to visit on a tour. The text recognition routine stores the location names in a memory, after which, on request, the location names are matched by a built-in route planner in the mobile unit such that a route is planned automatically and presented to the dyslexic on a map or by synthetic speech. The GPS system in the mobile unit keeps track of the actual location of the dyslexic, and the route planner guides the dyslexic along the planned route and back to the point of origin of the tour or to another final point of interest.
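A sketch of the GPS-assisted name lookup: candidate location names near the current GPS fix are fetched from a digital map or name database (stubbed here with invented street names) and matched against the imaged name converted by OCR. get_nearby_names is a hypothetical helper, and the fuzzy-matching policy is an assumption.

```python
import difflib

def get_nearby_names(lat: float, lon: float) -> list[str]:
    # hypothetical stand-in for a query against a digital name database or map
    return ["Rosenborggade", "Gothersgade", "Landemaerket"]

def resolve_location_name(ocr_text: str, lat: float, lon: float) -> str | None:
    candidates = get_nearby_names(lat, lon)
    hits = difflib.get_close_matches(ocr_text.lower(),
                                     [c.lower() for c in candidates], n=1, cutoff=0.6)
    return hits[0] if hits else None

print(resolve_location_name("Gothersgabe", 55.68, 12.58))  # 'gothersgade' despite one OCR error
```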
In a further embodiment, the mobile unit according to the invention may in the database comprise a number of dictionaries with different languages. One additional function may be the translation of imaged text, for example as disclosed in US patent application No. 2001/0056342.
Though the mobile unit is of high advantage for dyslexic people, it may also be of interest for non-disabled people. For example, the mobile unit may comprise a route planner but no synthetic speech. This would still be of high interest in certain cases, as illustrated in the following. A street sign, for example one with Chinese characters, may be imaged and the text extracted by using a Chinese character setup and a Chinese dictionary. In combination with the route planner, the location may be indicated in the display and a route proposed. A synthetic speech generator may be an additional convenience, but is not an absolutely necessary feature. The advantage would be that a person not familiar with Chinese characters would still be able to find his way through a town in China.
In a further embodiment, the mobile unit according to the invention comprises a microphone to record voice messages from the user. In addition, the mobile unit may comprise a routine for phonetic translation. Words and phrases are stored as audio files in a database and are translated into other languages. This means that the person using the apparatus according to the invention may speak into the microphone and have this speech translated into another language, either as an audio data file with the message spoken in another language or as a text file. The phonetic translation may be performed simultaneously with the speaking of the person.
In a further embodiment, the apparatus according to the invention can be used to simplify daily arrangements. For example, in connection with information providers, a system may be arranged where brochures and other information may be ordered by sending an SMS (Short Message Service) with a certain code from a mobile telephone to a preselected telephone number. For the dyslexic, this may be simplified by imaging the code, for example from an advertisement, and sending the converted code as characters/digits by SMS to the preselected telephone number.
SHORT DESCRIPTION OF THE DRAWINGS
The invention will be explained in more detail with reference to the drawing, where
FIG. 1 is a flow diagram describing the overall functioning of the mobile unit in a concrete embodiment according to the invention, FIG. 2 is an illustration of compensation for blurred images,
FIG. 3 is an illustration of image rotation,
FIG. 4 is an illustration of the cleanup effect,
FIG. 5 is an illustration of correction of curvature,
FIG. 6 is an illustration of angular compensation.
DETAILED DESCRIPTION / PREFERRED EMBODIMENT
The mobile unit according to the invention may be configured to have several modes. One of the modes is a combined image capture, text conversion and speech mode, in the following called the ATR/TTS mode, where the abbreviations refer to automated text recognition (ATR) and the text-to-speech (TTS) process. FIG. 1 is a flow diagram illustrating the overall functioning of the mobile unit according to the invention in a concrete embodiment of the invention within the ATR/TTS mode.
The ATR/TTS process is divided into three major phases:
- The Launch Phase,
- The Recognition Phase, and
- The Clean-up and Text-To-Speech Phase
Steps 1A, 1B, 1C and 1D, as illustrated in FIG. 1, belong to the Launch Phase. The ATR/TTS process is launched by one of four potential user events:
- The user has clicked on the camera release whilst the device is in ATR/TTS-mode (step 1A);
- the user has clicked the scalable on-screen release whilst the device is in ATR/TTS-mode (step 1B);
- the user has activated the on-screen release using a voice command whilst the device is in ATR/TTS-mode (step 1C); or
- the user has opened an image from the device store using the image browser (step 1D).
In either case, the ATR/TTS process is launched and will complete when one of the following states is reached:
- The ATR/TTS process has successfully recognised text within the image, has produced a speech file and has played the speech file (step 18); or
- The ATR/TTS process has failed to recognise text within the image and has played a pre-recorded error message back to the user (step 19).
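A minimal control-flow sketch of the whole process follows, with hypothetical stand-ins for the modules described below; only the phase ordering and the two completion states (steps 18 and 19) are taken from the text.

```python
def recognise(image: dict) -> str | None:
    """Recognition Phase stub (steps 2-15): return recognised text or None."""
    return image.get("text")  # pretend the image dict carries its recognised text

def clean_up(text: str) -> str:
    """Clean-up stub (step 16): drop non-printable characters."""
    return "".join(ch for ch in text if ch.isprintable()).strip()

def atr_tts(image: dict) -> None:
    text = recognise(image)
    if text is None:
        print("pre-recorded error message played back (step 19)")
        return
    speech = f"<synthetic speech for {clean_up(text)!r}>"  # text-to-speech (step 17)
    print(speech, "played back, control returned to the user (step 18)")

atr_tts({"text": "Exit 12"})
```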
The user interface used to control the ATR/TTS application is menu-driven. In order to facilitate the use of the menu by dyslexic persons, the menu may be based on images and/or sound indications such that each function in the menu has its own sound, for example a voice message reading the name of function. The user interface furthermore, may comprise icon-based menu items that can be activated by clicking on the display or by using a corresponding set of voice-commands entered into the mobile unit by the user through a microphone. The icon-based menu interface is scalable and may be resized to accommodate user preferences.
The Recognition Phase comprises steps 2-15 and step 19.
Steps 2 → 3: The ATR/TTS application will immediately attempt to recognise text within the image using the OCR (Optical Character Recognition) module (step 2). On success (step 3), the ATR/TTS process will resume at the Clean-up and Text-to-Speech Phase (steps 16 and 17). Otherwise (step 3), the ATR/TTS process will continue at step 4 to improve the image.
Step 4 → 5 → 6: Once the initial recognition has failed, the ATR/TTS process will attempt to improve the image quality using a variety of image manipulation techniques and technologies. First, the image is processed by the simulated autofocus module (step 4), resulting in a clearer image, as also illustrated in FIG. 2. This software routine may, for instance, use high-pass filtering and Fourier transformation in order to make optical edges sharper. In case the number of pixels in the image does not fulfil the requirements for handling by the OCR module, the number of pixels is subsequently increased artificially in order to match the requirements of the OCR module (steps 5 and 6).
Step 7 → 8 → 9 → 10: Once the image quality has been improved, the ATR/TTS process will make another attempt to recognise text within the image (step 7). On success (step 8), the ATR/TTS process will resume at the Clean-up and Text-to-Speech Phase (steps 16 and 17). Otherwise (step 8), the ATR/TTS process will continue through a succession of 90° image rotations (step 9), each time attempting to recognise text within the image (step 7), until the image has been rotated by a total of 90°, 180° and 270° (step 10). This is illustrated in more detail in FIG. 3, where the original image is not correctly oriented and the first rotation results in an image that is upside down.
Step 11: If the simulated autofocus and the increase in image resolution fail to render an image that can be successfully processed, the ATR/TTS process will attempt to increase the contrast between the text and the background, to the extent of dividing the image into two parts using a scaled binary threshold value: (1) text; and (2) everything else. This is illustrated further in FIG. 4, where spots in the image are removed to make the image clearer and increase the contrast. Furthermore, the ATR/TTS process will attempt to compensate for (a) any optical curving, which is illustrated in FIG. 5, and/or (b) any other optical distortions caused as a result of the image not being captured at an angle of 90°, which is illustrated in FIG. 6.
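A sketch of this contrast step: a binary threshold splits the image into text and everything else, and a median filter removes isolated spots as in FIG. 4. Otsu's method is an assumed way of choosing the scaled threshold; the text only specifies "a scaled binary threshold value".

```python
import cv2
import numpy as np

def binarise(gray: np.ndarray) -> np.ndarray:
    # split into text and background using an automatically scaled threshold
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.medianBlur(binary, 3)  # despeckle: small spots are cleaned away
```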
Step 12 → 13 → 14 → 15: Once the image quality has been improved, the ATR/TTS process will make another attempt to recognise text within the image (step 12). On success (step 13), the ATR/TTS process will resume at the Clean-up and Text-to-Speech Phase. Otherwise (step 13), the ATR/TTS process will continue through a succession of 90° image rotations (step 14), each time attempting to recognise text within the image (step 12), until the image has been rotated by a total of 90°, 180° and 270° (step 15).
Step 19: If this does not result in successful recognition, the ATR/TTS process is terminated with an error message (step 19).
The Clean-up and Text-To-Speech Phase comprises steps 16-18.
Step 16: Once text has been recognised in one of the recognition attempts, the text is passed on for clean-up (step 16). The clean-up task will remove non-printable characters and other characters not part of the current language setting; furthermore, the text will be matched against a database with common words and parts of words, and names of locations to increase the quality of the text.
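A sketch of this clean-up task: drop non-printable characters and characters outside the current language setting, then keep the tokens confirmed by the common-word database. The allowed alphabet and the database contents here are assumptions for illustration.

```python
import string

ALLOWED = set(string.ascii_letters + string.digits + " .,-")

def clean_up_text(raw: str, word_db: set[str]) -> str:
    filtered = "".join(ch for ch in raw if ch in ALLOWED)  # strip non-printables etc.
    kept = [w for w in filtered.split() if w.lower().strip(".,-") in word_db]
    return " ".join(kept)

print(clean_up_text("Ex\x07it 12\u00ad", {"exit", "12"}))  # 'Exit 12'
```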
After step 16, the text has been converted from the image format into a text format which can be used in other applications, for example as shown in step 17, where the cleaned-up text is passed on to the text-to-speech engine and the resulting synthetic speech is stored in an audio file.
Step 18: Finally, the audio file is played back and control passed back to the user (step 18). After step 16, the text file may be used in other applications as well, for example for translation into other languages, or for interaction with a route planner in order to display locations on a map, to show routes on a map, or to guide a person around the environment by synthetic voice.

Claims

1. A mobile unit with a computer and a camera, the computer being configured to receive captured images as digital data from the camera and to extract text in the captured images by an optical character recognition (OCR) routine and to convert the text from an image format into a text format, characterised in that the mobile unit further comprises a text database with text words, and wherein the computer is configured to compare the converted text with words in the text database and only to accept the converted text as resembling the imaged text in case of agreement with words in the database.
2. A mobile unit according to claim 1, wherein the mobile unit is configured to check in accordance with pre-programmed rules in computer routines whether the converted text as a whole implies a correct grammar according to these rules and whether the words are correctly related to each other according to these rules.
3. A mobile unit according to any preceding claim, wherein the mobile unit is configured in case of missing acceptance to amend the initially converted text to such a degree that there is achieved congruence to existing words, letters, sequences of words, combined words, and/or parts of sentences from the database in accordance with the pre-programmed rules.
4. A mobile unit according to claim 3, wherein the mobile unit is configured to present the initially converted text and the amended converted text to the user as options among which the user may choose the apparently most correct version.
5. A mobile unit according to any preceding claim, wherein the mobile unit also comprises a database with traffic sign text and geographical names, and where the mobile unit is configured upon specific request from a user of the mobile unit to compare the extracted text from a captured image with traffic sign text or geographical names in the database.
6. A mobile unit according to any preceding claim, wherein the mobile unit is configured upon conversion of the extracted text to request an action from a user of the mobile unit for storing the converted text in a database or data memory in text format.
7. A mobile unit according to claim 6, wherein the mobile unit is configured in case of missing accept of the converted text to request an indication from a user as to whether the converted text is to be stored in the database or data memory.
8. A mobile unit according to any preceding claim, wherein the mobile unit is configured in the case of missing recognition of text characters or in the case that the converted text does not correspond to any text in the text database, to request the capture of another image.
9. A mobile unit according to any preceding claim, wherein the mobile unit is configured to transform the converted text into synthetic speech.
10. A mobile unit according to any preceding claim, wherein the computer comprises a routine for correcting bending distortions of the captured image before text extraction.
11. A mobile unit according to any preceding claim, wherein the computer comprises a routine for checking whether objects obscure text in the captured image, and in the affirmative to replace the obscuring objects in the image by second objects resembling parts of text characters in order to restore the partly obscured text.
12. A mobile unit according to any preceding claim, wherein the computer comprises a routine for automatically rotating the image by a certain angle, for example 30, 45, or 90 degrees, if a first attempt of text extraction fails, and where the routine is configured, if a second attempt after the first rotation fails, to initiate successive rotation of the image until the text is extracted or until a preset maximum rotation has been performed.
13. A mobile unit according to any preceding claim, wherein the mobile unit comprises a route planner program and is configured for using the extracted and converted text in combination with the route planner for finding locations or routes to locations, and wherein the mobile unit is configured for indicating the location or route to the location.
14. A mobile unit according to claim 13, wherein the mobile unit is configured for indicating the location or route to the location by synthetic speech.
15. A mobile unit according to any preceding claim, wherein the mobile unit comprises a route planner program and a GPS signal receiver functionally connected to the route planner program.
16. A mobile unit according to any preceding claim, wherein the mobile unit comprises a telephone.
17. A mobile unit according to claim 16, wherein the mobile unit is configured to send converted text as SMS or IP messages upon indication by a user of the mobile unit.
PCT/DK2007/000020 2006-01-17 2007-01-17 Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech WO2007082536A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07700159A EP1979858A1 (en) 2006-01-17 2007-01-17 Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US75938706P 2006-01-17 2006-01-17
LU91213A LU91213B1 (en) 2006-01-17 2006-01-17 Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech
LU91213 2006-01-17
US60/759,387 2006-01-17

Publications (1)

Publication Number Publication Date
WO2007082536A1 (en)

Family

ID=39739724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DK2007/000020 WO2007082536A1 (en) 2006-01-17 2007-01-17 Mobile unit with camera and optical character recognition, optionally for conversion of imaged text into comprehensible speech

Country Status (2)

Country Link
EP (1) EP1979858A1 (en)
WO (1) WO2007082536A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995015535A1 (en) * 1993-12-01 1995-06-08 Motorola Inc. Combined dictionary based and likely character string method of handwriting recognition
US20020156816A1 (en) * 2001-02-13 2002-10-24 Mark Kantrowitz Method and apparatus for learning from user self-corrections, revisions and modifications

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5235651A (en) * 1991-08-06 1993-08-10 Caere Corporation Rotation of images for optical character recognition
US5859929A (en) * 1995-12-01 1999-01-12 United Parcel Service Of America, Inc. System for character preserving guidelines removal in optically scanned text
US6219453B1 (en) * 1997-08-11 2001-04-17 At&T Corp. Method and apparatus for performing an automatic correction of misrecognized words produced by an optical character recognition technique by using a Hidden Markov Model based algorithm
US20040017482A1 (en) * 2000-11-17 2004-01-29 Jacob Weitman Application for a mobile digital camera, that distinguish between text-, and image-information in an image
US20030164819A1 (en) * 2002-03-04 2003-09-04 Alex Waibel Portable object identification and translation system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FUJISAWA H ET AL: "Information capturing camera and developmental issues", DOCUMENT ANALYSIS AND RECOGNITION, 1999. ICDAR '99. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON BANGALORE, INDIA 20-22 SEPT. 1999, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 20 September 1999 (1999-09-20), pages 205 - 208, XP010351192, ISBN: 0-7695-0318-7 *
See also references of EP1979858A1 *
SOLON A ET AL: "Design of a tourist driven bandwidth determined multimodal mobile presentation system", MOBILITY AWARE TECHNOLOGIES AND APPLICATIONS. FIRST INTERNATIONAL WORKSHOP, MATA 2004. PROCEEDINGS (LECTURE NOTES IN COMPUTER SCIENCE VOL.3284) SPRINGER-VERLAG BERLIN, GERMANY, 2004, pages 331 - 338, XP002374595, ISBN: 3-540-23423-3 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2189926A1 (en) * 2008-11-21 2010-05-26 beyo GmbH Method for providing camera-based services using a portable communication device of a user and portable communication device of a user
US8988543B2 (en) 2010-04-30 2015-03-24 Nuance Communications, Inc. Camera based method for text input and keyword detection
US9589198B2 (en) 2010-04-30 2017-03-07 Nuance Communications, Inc. Camera based method for text input and keyword detection
EP2410465A1 (en) 2010-07-21 2012-01-25 beyo GmbH Camera based method for mobile communication devices for text detection, recognition and further processing
US9811171B2 (en) 2012-03-06 2017-11-07 Nuance Communications, Inc. Multimodal text input by a keyboard/camera text input module replacing a conventional keyboard text input module on a mobile device
US10078376B2 (en) 2012-03-06 2018-09-18 Cüneyt Göktekin Multimodel text input by a keyboard/camera text input module replacing a conventional keyboard text input module on a mobile device

Also Published As

Publication number Publication date
EP1979858A1 (en) 2008-10-15

Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application.
NENP: Non-entry into the national phase. Ref country code: DE.
WWE: WIPO information: entry into national phase. Ref document number: 2007700159. Country of ref document: EP.