US20070053611A1 - Method and system for extracting information from a document - Google Patents
- Publication number
- US20070053611A1 (U.S. application Ser. No. 11/544,693)
- Authority
- US
- United States
- Prior art keywords
- document
- type
- data element
- record
- data
- Prior art date
- Legal status
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
Definitions
- the present invention relates to methods and systems for extracting information (e.g., data with context) from a document. More specifically, preferred embodiments of the present invention relate to extraction of information from printed and imaged documents including a regular table structure.
- documents may pose additional challenges for automating the data extraction process.
- the challenges include sparse tables, tables with rows spanning a varied number of lines, parts of a row not present (missing data elements, lines), extraneous text (special printed notes or handwritten annotations), varied number of records per document page, and records broken by the end of a page.
- irregularities related to record structure such as those above, common problems related to scanning (e.g., skewed and rotated images), and OCR errors should all be anticipated.
- FIG. 1 illustrates a multi-page “claim detail section” 100 of a document broken by the end 102 of page 101.
- the break 102 occurs in the middle of a table 104 .
- totals 106 for the page are included.
- the table is continued on the next page 103 after page header information 108 and an abbreviated identification 110 of the continued record.
- Some of the image analysis methods focus on low-level graphical features to determine table segmentation. Some methods employ a line-oriented approach to table extraction. In those methods, lines or other graphical landmarks are identified to determine table cells. Other methods employ a connected component analysis approach.
- a box-driven reasoning method was introduced to analyze the structure of a table that may contain noise in the form of touching characters and broken lines. See Hori, O., and Doermann, D. S., “Robust Table-form Structure Analysis Based on Box-Driven Reasoning,” ICDAR-95 Proceedings, pp. 218-221, 1995.
- the contours of objects are identified from original and reduced resolution images and contour bounding boxes are determined. These primary boxes and other graphical features are further analyzed to form table cells.
- table structure recognition is based on textual block segmentation. Kieninger, T. G., “Table Structure Recognition Based on Robust Block Segmentation,” Proceedings of SPIE, Vol. 3305, Document Recognition V, pp. 22-32, 1998.
- One facet of that approach is to identify words that belong to the same logical unit. It focuses on features that help cluster words into textual units. After block segmentation, row and column structure is determined by traversing the margin structure. The method works well on some isolated tables; however, it may also erroneously extract “table structures” from non-table regions.
- Tables have many different layouts and styles. Lopresti, D., and Nagy, G., “A Tabular Survey of Automated Table Processing,” in Graphics Recognition: Recent Advances, vol. 1941 of Lecture Notes in Computer Science, pp. 93-120, Springer-Verlag, Berlin, 2000. Even tables representing the same information can be arranged in many different ways. It seems that the complexity of possible table forms multiplied by the complexity of image analysis methods has worked against the production of satisfactory and practical results.
- Hurst provides a thorough review of the current state-of-the-art in table-related research. Hurst, M. F., “The Interpretation of Tables in Texts,” PhD Thesis, 301 pages, The University of Edinburgh, 2000. Hurst notes that table extraction “has not received much attention from either the information extraction or the information retrieval communities, despite a considerable body of work in the image analysis field, psychological and educational research, and document markup and formatting research.” As possible reasons, viewed from an information extraction perspective, Hurst identifies lack of current art and model, no training corpora, and confusing markup standards. Moreover, “through the various niches of table-related research there is a lack of evolved or complex representations which are capable of relating high- and low-level aspects of tables.”
- Table understanding typically involves detection of the table logic contained in the logical relationships between the cells and meta descriptors. Meta descriptors are often explicitly enclosed in columns and stub headers or implicitly expressed elsewhere in the document. The opposite approach requires little or no understanding of the logic but focuses on the table layout and its segmentation. This dual approach to table processing is also reflected in patent descriptions.
- a second group of patents concentrates on retrieving tabular data from textual sources.
- the graphical representation of the document is ignored and what counts is mainly the text, including the blanks between text runs.
- table components such as table lines, caption lines, row headings, and column headings are identified and extracted from textual sources.
- Pyreddy, P., and Croft, B. “Systems and Methods for Retrieving Tabular Data from Textual Sources,” U.S. Pat. No. 5,950,196, September 1999.
- the system may produce satisfactory results with regard to the data granularity required for human queries and interpretation. However, it would not likely be applicable for database upload applications.
- the invention includes a computer-implemented method for extracting information from a population of subject documents.
- the method includes modeling a document structure.
- the modeled document structure includes at least a document component hierarchy with at least one record type.
- Each record type includes at least one record part type, at least one of which comprises at least one data element type.
- preferred embodiments of the invention identify data of a type corresponding to at least one modeled data element type. Identified subject document data is then associated with the corresponding modeled data element type.
- the invention includes a method for horizontally aligning a first region of a document with a second region of a document where each region is characterized by a plurality of sub-regions.
- This embodiment includes determining a type for each of a plurality of sub-regions in each region and then determining an edit distance for each (typed first-region sub-region, typed second-region sub-region) pair.
- a first sub-region offset is calculated for those pairs characterized by an edit distance not greater than a threshold.
- a first region offset is determined as a function of the individual first region sub-region offsets.
- regions correspond to pages and sub-regions correspond to lines.
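The alignment steps above can be sketched as follows. The patent leaves the combining function unspecified, so taking the median of the per-line offsets is an illustrative assumption, as are all names in this sketch:

```python
# Sketch: pages are regions, lines are sub-regions. Matched line pairs with
# an edit-distance ratio under a threshold contribute an X-offset; the page
# offset is then derived from those per-line offsets (median assumed here).
from statistics import median

def page_offset(line_pairs, max_ratio=0.2):
    """line_pairs: iterable of (edit_ratio, x_offset) for paired lines."""
    offsets = [off for ratio, off in line_pairs if ratio <= max_ratio]
    if not offsets:
        return 0  # no reliable pairs; leave the page unshifted
    return median(offsets)
```

The median makes the page offset robust to a few spurious line matches, which fits the goal of ignoring unreliable pairs.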
- FIG. 1 is an example of a long record, in accordance with a preferred embodiment of the present invention, broken by the end of a page;
- FIG. 2 is an example of a record consisting of a header, a table, and a footer, in accordance with a preferred embodiment of the present invention;
- FIG. 3 is an example of a document page with structural patterns decomposed into three different records, in accordance with a preferred embodiment of the present invention;
- FIG. 4 is an example of data element selection in accordance with a preferred embodiment of the present invention.
- FIG. 5 is another example of data element selection in accordance with a preferred embodiment of the present invention, including meta-data indicated for extraction;
- FIG. 6 illustrates variations of generalizing a line pattern in accordance with preferred embodiments of the present invention;
- FIG. 7 illustrates a line pattern data structure in accordance with preferred embodiments of the present invention.
- FIG. 8 illustrates a record data structure in accordance with preferred embodiments of the present invention;
- FIG. 9 illustrates a data element data structure in accordance with preferred embodiments of the present invention.
- FIG. 10 illustrates horizontal offset between two aligning line patterns, in accordance with preferred embodiments of the present invention.
- FIG. 11 illustrates steps in data element capture in accordance with preferred embodiments of the present invention.
- a wide variety of printed documents exhibit complex data structures: textual or numerical data organized in rows and columns, i.e., tables, along with more general structures, one of which may be broadly characterized as consisting of one or more contextually related elements, possibly including tables, i.e., records.
- Contextual relationship typically binds the data elements together in such structures. The relationship can be communicated by characteristics, such as sequence of occurrence, spatial distribution, and landmark (e.g., keyword, graphical reference) association.
- data organized in tables, and more generally in records, presents itself as structural patterns recognizable by detecting the above characteristics, among others.
- the Flexrecord type definition includes up to three component types: header, table, and footer.
- the header includes the set of data elements within a record that are in a specified proximity to the beginning of the record.
- the table includes the repetitive, variable-length parts of the record. If a record contains a table, a footer may follow as the third component part type of a Flexrecord.
- the footer includes the set of data between the end of the table and the end of the record.
- FIG. 2 ( 1 ) gives an example of a record 200 .
- Lines 1-5 contain a header 202;
- lines 6-8 contain a table 204;
- lines 9-11 contain a footer 206. Given the three components of the record, five variants of records are noted: (header), (header+table), (header+table+footer), (table), and (table+footer).
- Since a footer must appear after the varying part of the record (the table), the two combinations (header+footer) and (footer) are not considered; they are simply treated as a header.
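The five variants follow from the rule just stated: a footer may only appear after a table. A short illustrative sketch (not from the patent) enumerating the admissible combinations:

```python
# Enumerate non-empty combinations of Flexrecord parts and keep those in
# which a footer, if present, follows a table; exactly five remain.
from itertools import combinations

PARTS = ("header", "table", "footer")

def is_valid(variant):
    return "footer" not in variant or "table" in variant

variants = [c for n in (1, 2, 3) for c in combinations(PARTS, n) if is_valid(c)]
```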
- FIG. 3 shows an example of three different records describing various parts of a document page 300 .
- the first record 310 captures information from the page header. In fact, a Flexrecord header part is sufficient to describe all data elements contained there.
- Two records 322 , 324 of the same type capture the middle part of the page 320 . These records 322 , 324 involve all three record parts: header 325 , table 327 , and footer 329 .
- the bottom part of the page 330 can be considered a header part of another record. This combination of records decomposes the problem of data extraction from the document into extraction from three structural patterns.
- Preferred embodiments of the present invention implement data extraction from complex document structures in two broad steps: training and extraction.
- a model of the paradigmatic document structure for a population of subject documents is constructed using three inputs: a document component hierarchy (DCH) in the manner of declarations of one or more Flexrecords, spatial and character data from a training document, and information supplied by a user relating the spatial and character data to the DCH hierarchy.
- the DCH is built based on a record declaration provided by a user.
- a user decides the number of records, makes a selection of elementary data elements in each part of the record, assigns a name to each data element, outlines one or more whole records and the table part of each, and saves descriptions in an external file.
- a record template that reflects the requirement for defining various structural and elementary data elements might be encoded in the following way:
- R is an arbitrary letter indicating the record definition;
- {number} is a number identifying the record; “.” is a separator;
- {name} is a specific name assigned to the data element.
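A hypothetical parser for labels of this form, consistent with the names used later in the description (e.g. “R1.Provider”, “R1.T.Deductible”, “R1.F.Balance”); the part letters T (table) and F (footer), and the header-by-default rule, are inferred from those examples rather than stated by the patent:

```python
import re

# R{number}[.T|.F].{name}: an optional part letter selects table or footer;
# its absence places the data element in the record header (assumption).
NAME_RE = re.compile(r"^R(?P<num>\d+)(?:\.(?P<part>[TF]))?\.(?P<name>\w+)$")

def parse_element(label):
    m = NAME_RE.match(label)
    if m is None:
        raise ValueError(f"not a record element label: {label!r}")
    part = {"T": "table", "F": "footer", None: "header"}[m.group("part")]
    return int(m.group("num")), part, m.group("name")
```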
- the spatial distribution of characters and text content of the training document includes character and positional data obtained from Optical Character Recognition (OCR).
- Each recognizable character in the document is described by its ASCII code, XY position, width, and height.
- a document is characterized at several levels of abstraction, e.g., page, line, word, and character levels using data beyond that gathered by conventional OCR.
- the page level includes descriptions of all pages included in the document. These include page dimensions and margin offsets.
- the line level includes total number of document lines; vertical offsets to line bases from the beginning of the document, and numbers of words in each line. An OCR engine predetermines division of a document into lines.
- the word level includes the total number of document words, horizontal offsets to words' start and end as measured from the line beginning, and number of characters in each word.
- the character level includes the total number of document characters, horizontal offsets to characters, their widths and heights, as well as the complete document text.
- the text includes words separated by single spaces; multiple spaces are compressed.
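The four levels of description above might be held in structures like the following; the attribute names are assumptions for illustration, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Char:
    code: str    # recognized ASCII character
    x: int       # horizontal offset within the line
    width: int
    height: int

@dataclass
class Word:
    start: int   # horizontal offset of the word start from the line beginning
    end: int     # horizontal offset of the word end
    chars: list = field(default_factory=list)

@dataclass
class Line:
    base_y: int  # vertical offset to the line base from the document start
    words: list = field(default_factory=list)

@dataclass
class Page:
    width: int
    height: int
    margin_offset: int
    lines: list = field(default_factory=list)
```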
- One operation performed in preferred embodiments of the present invention, in order to facilitate accurate measurements, is data horizontal alignment.
- One source of misaligned data is misaligned pages affected by the scanning or imaging processes.
- Data alignment also may be needed because data on printed (or electronic) copies may be laid out differently in the original document production system. For instance, odd and even pages may have different margins on the left side.
- Data alignment is used in both the training and extraction process of preferred embodiments of the present invention.
- several benchmark statistics are gathered over various document and record parts, which can involve different pages. For example, when calculating deviation in distances between similar or dissimilar line patterns, the characters in the evaluated lines should be properly aligned. Otherwise, misalignment by the width of one character may produce erroneous data and lead to incorrect decisions.
- the data alignment algorithm is based on the same assumption as is the source of the structural patterns: the repeatability of the data and meta-data in the documents of the same type. In other words, if the given document type contains structural data then it should be possible to align data using the algorithm presented below.
- Data alignment is different from typical page or form alignment that is based on the assumption that there exist so-called registration marks on pages or forms. Such marks may be preprinted on the page or a fragment of a form, or in some instances, graphics (logos, form lines) can be used for registration.
- Finding the most similar line patterns involves three processing steps. First, line patterns are generated using generalization variation 2 (numeric and alphabetic characters). Generalization is described in detail below. Next, similarity between pairs of lines is found using an algorithm measuring distance between two strings. The algorithm penalizes each character insertion, deletion, and substitution with an equal cost. Similar line patterns have a low ratio (close to 0) of the distance to the length of the shorter line pattern. Finally, the most similar pairs of line patterns are selected as those with the lowest ratio that is below some dissimilarity threshold.
- the algorithm used in the process of determining which line patterns are similar does not involve character positional data because this data is not completely reliable due to the misaligned pages.
- the algorithm calculates edit distance by dynamic programming techniques known to those skilled in the art. Given two strings, a source string s1s2 . . . sn and a target string t1t2 . . . tm, the algorithm finds the distance dn,m. Intermediate distances di,j are defined as follows: di,0 = i; d0,j = j; and di,j = min(di-1,j + 1, di,j-1 + 1, di-1,j-1 + cost), where cost = 0 if si = tj and 1 otherwise.
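A straightforward implementation of this recurrence, together with the ratio used in the matching step (distance divided by the length of the shorter pattern):

```python
# Unit-cost edit (Levenshtein) distance by dynamic programming: each
# insertion, deletion, and substitution is penalized equally.
def edit_distance(s, t):
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                          # i deletions
    for j in range(m + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
    return d[n][m]

def similarity_ratio(p1, p2):
    # close to 0 for similar line patterns, as described above
    return edit_distance(p1, p2) / min(len(p1), len(p2))
```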
- Preferred embodiments of the invention leverage a user's ability to distinguish relevant parts of a document from irrelevant parts.
- the training-step information provided by the user involves indicating the structure of one or more records and the actual type of contents for extraction, as well as naming the particular data element types that have been indicated. This step links the spatial distribution of characters and the text content to the document component hierarchy (DCH).
- the structure and the scope of the record, its various parts, and specific data elements selected for extraction are indicated by drawing bounding boxes around the components.
- the contents of the record are indicated by drawing a box around the whole record.
- the table part is indicated by drawing a box around the table contents. This action also implicitly separates the header and the footer parts.
- Each particular data element or structure is indicated by a separate bounding box and a name is selected from a predefined set of names.
- Data elements selected for extraction from the header and footer are outlined and named. In the table part, data elements of only one row need to be defined. In case of a missing column entry, the user may draw an empty box as a column data placeholder. Among the many rows in the given table, any row can be selected for defining table data elements, as long as the row is representative of the row type. Any subset of columns can be selected for extraction.
- an advantage of other embodiments in which the user specifies record parts in the DCH and indicates at least the table (the line-to-line repeating part of a record) as a record part is that the user can typically recognize the table more reliably than can a program executing on a computer. This is especially true where not every entry in the table contains data.
- Two structural parts are outlined: the complete record (R1) 402 and the table inside the record (R1.T) 404.
- Four (4) data elements 406 are identified within the header 408.
- the data element ANX Center is shown labeled as R 1 .Provider 410 .
- the remaining definitions pertain to specific data elements selected for extraction. All data elements selected in the header and footer are outlined and given names. (For picture clarity, only a few names are shown).
- Within the table (R1.T) 404, data elements for service dates 412, procedure 414, deductible 416, and payment 418 are indicated.
- FIG. 4 shows a deductible 416 data element labeled as R 1 .T.Deductible 420 .
- the figure shows data element 2010.00 430 selected and identified as R1.F.Balance 432 in footer R1.F (unlabeled) 440.
- Additional footer 440 elements are indicated for deductible total (i.e., 167.50) 434, payment total (i.e., 3010.00) 436, and total paid (e.g., 1000.00) 438.
- “Group” 440 and “inventory” 442 fields were not selected for extraction at this time, but they can be added to the model at any later time.
- FIG. 5 presents some alternative ways of defining data elements for extraction.
- the underlying record 500 in FIG. 5 is identical to the record 400 illustrated in FIG. 4 .
- the data elements and corresponding descriptors for Name and Account # are identified as a single data element 510 .
- the two data elements in the header first line are combined and extracted as one line with their descriptors as part of one data element.
- Data elements 522, 524 from the header second line are extracted separately. They are also preceded by their descriptors 526, 528, i.e., ID and Group#.
- the descriptors Provider 532 and Inventory # 534 from the third line are not extracted, but the data elements 536 , 538 corresponding to those descriptors are indicated for extraction.
- the table header information is defined for extraction.
- the table header is extracted before the table.
- preferred embodiments of the present invention round out the model of the document structure by deriving type definitions for the whole document, each line pattern, each record, and each data element.
- the document definition contains descriptions that enable recognition of document and record parts and navigation through the document.
- Lines exhibit easily recognizable visual patterns. Spacing between the columns of a table along with column data types clearly identify the table part of a record. Indentation, spacing, meta-descriptors, sequence, and content itself identify other parts of the record or the document.
- the document model stores generalized line descriptions as line patterns (LP). Generalization of a line description corresponds to identifying the type of character at each position in the line.
- the order of lines in the document forms a line sequence pattern (LSP).
- This pattern is maintained by storing the order of line patterns (in preferred embodiments, line patterns are typically stored in the record type definition they are associated with).
- LPs and LSPs are useful in recognizing and processing various parts of the document, especially enclosed records. There may be gaps between line patterns in the sequence. The gap extent can be estimated based on the training document(s).
- a line pattern may be created in at least three variations: variation 1 —numeric; variation 2 —numeric & alphabetic; variation 3 —numeric & alphabetic data.
- spacing between words is captured by filling in with blank characters.
- numeric data is generalized by turning every digit into a designated character, e.g. ‘9’. Other methods for indicating the type of a character or character position will be apparent to those skilled in the art.
- strings are generalized by turning upper case and lower case letters into designated characters, e.g. “X” and ‘x’, respectively.
- Non-alphanumeric characters are left unchanged.
- a sample of similar lines is analyzed and a determination is made as to which words are meta-descriptors and which are data.
- the distinction is made by comparing similar lines and identifying words that are not identical. Identical words most likely carry meta-descriptors, whereas changing words carry data. The distinction could also be made based on similar words, instead of identical words, in order to neutralize errors introduced by the OCR process. A flexible matching of words would then be applied in place of the strict match. If it is possible to distinguish meta-descriptors from data, then the data characters can be turned into the designated characters (X, x); otherwise the line pattern may remain at the second variation of generalization.
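The generalization rules above can be sketched as a single function; treating the variation as a parameter is an illustrative choice (variation 3, which additionally requires the meta-descriptor analysis just described, is omitted here):

```python
# Variation 1 generalizes digits to '9'; variation 2 additionally maps
# letters to 'X'/'x' by case. Non-alphanumeric characters, including the
# blanks that capture inter-word spacing, are left unchanged.
def generalize(line, variation=2):
    out = []
    for ch in line:
        if ch.isdigit():
            out.append("9")
        elif variation >= 2 and ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)
```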
- FIG. 6 gives an example of a line generalized to line patterns at each of the three variations.
- Generalization at variation 1 can typically be applied with no loss of useful information because numeric strings are rarely used as meta-descriptors.
- numeric data is indicated by the numeric ‘9.’
- Generalization at variation 2 is also easily applicable; however, loss of some useful information may occur. For example, such information could be in the form of meta-descriptors that best identify data elements.
- Generalization at variation 3 can be applied only when similar lines are found in the training document and distinction between data and meta-data can be done with some certainty.
- a set of useful statistics is stored, as defined in the type definition shown in FIG. 7 .
- the statistics are calculated by comparing the source lines with the line pattern. They serve as benchmark figures in the process of recognizing document lines.
- the line pattern type definition stores information about the source of the line pattern 702, identifies its generalization variation 704, and stores a string 706 representing the line pattern as well as character position 708 and width 710. Attributes such as number of alpha characters 712, numeric characters 714, other characters 716, number of meta-words 718, number of aligned words 720, and matching ratios are used in the classification process.
- the character pattern distance between two lines is measured as a ratio of the number of misaligned (substituted, inserted, or deleted) characters and the number of characters in the shorter line. Character alignment is determined based on their X-positions in the lines. A character pattern distance close to zero indicates similar lines, whereas a character pattern distance above one indicates dissimilar lines. A character pattern distance between 0.1 and 0.9 represents a gray area for classification.
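An illustrative computation of this measure, under the simplifying assumption that two characters are aligned only when their X positions match exactly (the patent does not spell out the alignment tolerance):

```python
# Characters are keyed by X position; any position where the two lines
# disagree counts as misaligned (substituted, inserted, or deleted), and
# the count is divided by the number of characters in the shorter line.
def char_pattern_distance(a, b):
    """a, b: dicts mapping X position -> character."""
    misaligned = sum(1 for x in set(a) | set(b) if a.get(x) != b.get(x))
    return misaligned / min(len(a), len(b))
```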
- the first statistic is obtained by calculating the character pattern distance to a representative sample of similar lines in the training document and selecting the maximum character pattern distance. It is a measure of how dissimilar the lines represented by the line pattern can be.
- the second statistic, dminAlignDistRatioNeg, is obtained by calculating the character pattern distance to a representative sample of dissimilar lines in the training document and selecting the minimal character pattern distance. This is a measure of how similar a line that is not like the line pattern might be.
- edit distance is calculated (dEditDistRatio) using a known algorithm (see the following discussion on horizontal alignment).
- the benchmark character pattern distance is measured between the source line and the line pattern.
- the sampling of the similar and dissimilar lines can be conducted in the context of a particular part of the document or a record, knowing that some lines will not be classified in certain cases. For example, page header lines do not need to be sampled when classifying lines in the middle of the page.
- the record type definition consists of descriptions of the record, its layout, and links to its components (e.g., the elementary data elements). Both layout and data elements are described using versatile references to lines and line sequences. For example, each record is explicitly defined by the sequence of lines containing record data elements. In addition, lines that precede or follow the record or its parts (i.e., LSPs) are predictable due to the line order.
- a data element may be located relative to a specific line: it may start in that line, or after or before a line; however, it may never intersect some lines.
- the order of data elements in the record is dictated by the order of document lines. Two data elements may be in the same line, separated by some lines, in the previous or following line(s), not in a certain line, before or after a specific line, etc.
- LP and LSP are two features of document structure used in describing record layout and record components.
- Each part of a record is characterized by the LPs within the scope of that part, as well as lists of representative line patterns of preceding and following lines (stop lists).
- the stop lists of line patterns outside the record part are meant to provide additional criteria for terminating processing of the given part when there is uncertainty about that part's continuation.
- tables are often followed by a line with totals.
- the structures of the table and totals lines are usually very similar.
- a table stop list becomes useful in order to prevent extraction of totals data as table data.
- broken records introduce some uncertainty.
- a stop list that contains potential lines from the bottom of the page prevents mistakenly extracting data from those lines and directs processing to the next page.
- Preceding and following lines are grouped in two different stop lists for moving up or down the document.
- the lines can be ordered according to the most plausible encounter. For example, when moving down the document, the first line after the given part is placed at the head of the list. In case of moving up to preceding lines, the first line before the given part is placed at the head of the list.
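Stop-list use while moving down a record part might look like the following sketch; the matching predicate and all names are assumptions:

```python
# Scan lines downward: a hit in the (ordered) stop list terminates the
# part, e.g. preventing a totals line from being extracted as table data;
# lines matching the part's own patterns are collected.
def scan_part(lines, part_patterns, stop_list, matches):
    collected = []
    for line in lines:
        if any(matches(line, p) for p in stop_list):
            break
        if any(matches(line, p) for p in part_patterns):
            collected.append(line)
    return collected
```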
- FIG. 8 shows an example type definition of a record structure 800.
- Name 802 , position, and size 804 , along with record and table scope 806 provide source information about the record location and size in the training document.
- four pairs of stop lists 808 for each part of the record as well as all records on the page are declared.
- Data element type specifications describe specific fields declared in a record. Like the record type specifications, data element type specifications also rely on the line patterns and their sequence in a document. As illustrated in FIG. 9 , at a minimum a data element type definition 900 stores identification information such as its index 902 , source document name 904 , position and size 906 , and the range of lines it spans 908 . The remaining information can be derived from this data and the document and record models. For example, the document model supplies line patterns for the lines occupied by the data element, and for the lines that precede and follow the element. From the locations of the record and the data element, their relative placement can also be inferred.
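The minimal stored fields of the FIG. 9 data element type definition can be sketched as a simple data structure. The following Python sketch is illustrative only and not part of the patent; the field and file names are assumptions, and only the stored minimum (index, source document name, position/size, line range) comes from the description above:

```python
from dataclasses import dataclass

@dataclass
class DataElementType:
    index: int            # identification index (902)
    source_document: str  # training document name (904)
    x: int                # position and size in the training document (906)
    y: int
    width: int
    height: int
    first_line: int       # range of lines the element spans (908)
    last_line: int

    def line_range(self):
        """Lines occupied by the element; the document model can supply
        the line patterns for these lines and their neighbors."""
        return range(self.first_line, self.last_line + 1)

# Hypothetical instance: a one-line element on line 6 of "train.tif".
elem = DataElementType(index=1, source_document="train.tif",
                       x=120, y=340, width=80, height=14,
                       first_line=6, last_line=6)
```

As the patent notes, everything else (relative placement within the record, surrounding line patterns) is derivable from these fields plus the document and record models, so it need not be stored.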
- the data extraction process is driven in reference to the model developed in the training stage.
- Data extraction from a multi-page subject document is broken down to data extraction from single pages of the document.
- Each page is processed independently of the previous and following pages.
- the data extracted from separate pages can be reassembled into complete structures based on the order of extraction and meta descriptors associated with the extracted data.
- the independent page processing assumption is reasonable in view of the discontinuities introduced by page footer and header information, and it also simplifies extraction from broken records.
- each page is preprocessed by image processing (deskew, despeckle), OCR, and data horizontal alignment.
- OCR generates information about the subject document pages including character ASCII codes, XY positions, widths, and heights.
- each subject document description is enhanced by several means of indexing data at the page, line, and character level as noted earlier for the training document.
- Page processing starts with the search for the beginning of the record. If the beginning of the record is not found on the page, or some number of lines were skipped in the process, then the skipped section is analyzed to find possible ending parts of a record that might have started on the previous page. If any parts of the previous record are found, then they are extracted (as described below). If the beginning of a new record is found, then its extraction is initialized.
- the record data extraction process works in cycles predetermined by the order of record parts (header, table, footer), and the order of data elements. In preferred embodiments, this order is the order in the DCH.
- the order of individual data elements within a record part is determined based on the Y (vertical) position of the beginning of given data elements.
- Each part is processed until all data elements from that part are extracted (or determined not to be present) or a break in the part is encountered, such as the start of the following part or the end of the page.
- Search for the data elements is conducted either in reference to the beginning of the record (data in the header part and the first row of the table) or in reference to already captured data elements (table data in the rows following the captured ones, footer data in reference to the last row of a table).
- Searching for a record involves obtaining from the record type definition direct and indirect information characterizing the beginning of the record. This information is provided in the form of line patterns and their relations to the beginning line.
- Line patterns may directly describe the beginning of the record or may only help in searching for the record. For example, a direct line pattern may simply correspond to the first line of the record and finding a good match guarantees successful completion of the search. Indirect line patterns may provide information about lines that are not part of the record, so lines matching such patterns should be omitted.
- the relationship between the line pattern and the searched-for element is utilized to find actual location.
- the relationship dictates the next action: whether to stay in the same line or to move a number of lines forward or backward.
- the moves occur with the support of the indirect information. Specifically, if the searched-for data element is located a number of lines below the reference line, according to the document model, then the current line is not only advanced to that line, but also the move is monitored by testing skipped and target lines for any irregularities.
- the tests include checking that the current record part is not broken, and determining if additional lines were inserted (this includes extra lines resulting from noisy OCR). Depending on the recognized irregularity, a proper action is taken.
- the initial width of the data element assumed from the model is subject to adjustments on both sides by expansion and contraction. Before the expansion adjustment, a measure is taken of the available space between the subject document data element and the characters immediately preceding and following the corresponding element in the model line. If there is sufficient space, then characters are added until the captured word is completed or there is no more space to expand. In case the width of the data element is too large, e.g., when there are fewer characters in the current document than in the model, the final width is determined based on the size of the actual characters enclosed in the approximate bounding box.
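The contraction half of this adjustment can be illustrated with a short sketch: the final width is taken from the actual characters enclosed in the approximate bounding box. This Python fragment is not from the patent; the `(x, width)` character representation and the `TOLERANCE` value are assumptions for illustration:

```python
TOLERANCE = 2  # assumed slack, in pixels, at the box edges

def contract_box(left, right, char_boxes):
    """Shrink the horizontal span [left, right] to tightly enclose the
    OCR characters inside it (the contraction adjustment).

    char_boxes: list of (x, width) pairs in reading order.
    Returns the contracted (left, right), or None if no characters fit.
    """
    inside = [(x, w) for (x, w) in char_boxes
              if x >= left - TOLERANCE and x + w <= right + TOLERANCE]
    if not inside:
        return None
    new_left = min(x for x, _ in inside)
    new_right = max(x + w for x, w in inside)
    return new_left, new_right
```

A symmetric expansion step would instead test the free space up to the neighboring model-line characters before admitting additional characters into the box.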
- FIG. 11 illustrates the steps during the capture process involving a data element within one line.
- Step 2 is followed by vertical expansion.
- the lines following the data element top line are verified to carry the remainder of the data element, and the bounding box is expanded accordingly.
- Step 3, contracting the size of the data element, performs both horizontal and vertical contraction.
- verification is performed after each component has been extracted from the document. In alternate embodiments, verification is performed at the end of the process. Verification involves both record data elements and the structures they are part of. Data element verification involves testing whether the element's content matches the general description of that data element acquired from the training document and from the user, or inferred from separate or combined inputs, and stored in the document, record, and data element models. The general description may include data types and the scope of valid values. The data element is assigned a confidence based on the degree of match.
- That part of the record is also verified.
- the verification involves testing for completeness of the part and the integrity of mutual relationships between involved components.
- One of the tests may involve testing geometric or spatial relationships between components. More specifically, the tests may involve comparing both vertical and horizontal alignment of the extracted data elements. Two measures are produced to reflect confidence in the extracted substructure: the number of extracted components out of the number of expected components in the part, and the number of correct relationships out of the total number of tested relationships.
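The two substructure confidence measures described above are simple ratios. A minimal Python sketch (illustrative only; the function and parameter names are not from the patent):

```python
def part_confidence(extracted, expected, correct_rel, tested_rel):
    """Two confidence measures for an extracted record part:
    - completeness: extracted components out of expected components
    - integrity: correct relationships out of tested relationships
    Empty denominators are treated as fully confident (an assumption)."""
    completeness = extracted / expected if expected else 1.0
    integrity = correct_rel / tested_rel if tested_rel else 1.0
    return completeness, integrity
```

For example, a table part with 3 of 4 expected columns captured and all 5 tested alignment relationships holding would score (0.75, 1.0).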
- Another verification involves testing for completeness of the record.
- Final verification involves testing whether the page or the document contains any unprocessed areas that could fit a record structure but from which, for some reason, no data was extracted. In case such areas exist, preferred embodiments of the invention report a list of suspected regions with a confidence level that reflects the degree of fit between the record model and the region.
Abstract
A computer-implemented method for extracting information from a population of subject documents. The method includes modeling a document structure. The modeled document structure includes at least a document component hierarchy with at least one record type. Each record type includes at least one record part type, with at least one record part type comprising at least one data element type. For a subject document exhibiting at least a portion of the modeled document structure, preferred embodiments of the invention identify data of a type corresponding to at least one modeled data element type. Identified subject document data is then associated with the corresponding modeled data element type.
Description
- The present application is a continuation of, claims priority to and incorporates by reference U.S. patent application Ser. No. 10/146,959, entitled “METHOD AND SYSTEM FOR EXTRACTING INFORMATION FROM A DOCUMENT,” filed May 17, 2002.
- The present application hereby incorporates by reference in its entirety U.S. patent application Ser. No. 09/518,176, entitled “Machine Learning of Document Templates for Data Extraction,” filed Mar. 2, 2000.
- The present invention relates to methods and systems for extracting information (e.g., data with context) from a document. More specifically, preferred embodiments of the present invention relate to extraction of information from printed and imaged documents including a regular table structure.
- From credit card statements, to hospital bills, to auto repair invoices, most of us encounter printed documents containing complex, but mostly regular, data structures on a daily basis. For organizations such as businesses, the federal government, research organizations, and the like, processing data obtained in printed form from various sources and in various formats consumes substantial resources. Both manual and custom/automated solutions have been practiced. Manual solutions are highly resource-intensive and well known to be susceptible to error. Automated solutions are typically customized to a particular form and require source code changes when the subject form changes.
- The structural patterns present in such documents are naturally detectable by most of us after a brief examination. Repetitive blocks of data elements typically have a distinct appearance thought through by those who designed the document with the ostensible objective of readability. In addition, just as we can make up for broken characters and sentences, we can also adjust for small irregularities in the layout of the data, for example, a long table spread over several pages with the sequence of table rows split by page footers and headers.
- Our understanding of language helps us in interpreting the content of such documents. For example, most of us have little trouble in distinguishing table header information from table body data. A message such as “continued on reverse side” is readily interpreted to indicate that more data is to be expected on the following page. Also, a reader would not likely confuse “71560” with a date or zip code if it is preceded by “PO BOX.”
- Our common knowledge of table structure aids us in distinguishing meta-data from data. We expect to find header information at the top of a column in cases where data descriptors do not appear immediately to the left of the data. Small print, special fonts, italics, and boldface type also make a difference in readability of documents containing tabular information. Knowledge of data formats, postal addresses, variations in date forms, meaning of names and abbreviations, spatial clues, and the combinations of these and other features help us in manual processing of documents exhibiting regular structure.
- Besides the regular and expected complexity of document and table structures, documents may pose additional challenges for automating the data extraction process. The challenges include sparse tables, tables with rows spanning a varied number of lines, parts of a row not present (missing data elements, lines), extraneous text (special printed notes or handwritten annotations), varied number of records per document page, and records broken by the end of a page. In addition to irregularities related to record structure, such as the previous ones, common problems related to scanning (e.g., skewed and rotated images), as well as OCR errors should be anticipated.
- In an illustrative example,
FIG. 1 illustrates a multi-page "claim detail section" 100 of a document broken by the end 102 of page 101 . The break 102 occurs in the middle of a table 104 . After the unfinished table, on each page, totals 106 for the page are included. The table is continued on the next page 103 after page header information 108 and an abbreviated identification 110 of the continued record.
- Among various research fields that deal with tables are the image analysis and information extraction fields.
- Most of the image analysis methods focus on low-level graphical features to determine table segmentation. Some methods employ a line-oriented approach to table extraction. In those methods, lines or other graphical landmarks are identified to determine table cells. Other methods employ a connected component analysis approach.
- For example, in the image analysis field, a box-driven reasoning method was introduced to analyze the structure of a table that may contain noise in the form of touching characters and broken lines. See Hori, O., and Doermann, D. S., “Robust Table-form Structure Analysis Based on Box-Driven Reasoning,” ICDAR-95 Proceedings, pp. 218-221, 1995. In that method, the contours of objects are identified from original and reduced resolution images and contour bounding boxes are determined. These primary boxes and other graphical features are further analyzed to form table cells.
- Another category of image analysis approaches accepts input from optical character recognition. In one example, table structure recognition is based on textual block segmentation. Kieninger, T. G., "Table Structure Recognition Based on Robust Block Segmentation," Proceedings of SPIE, Vol. 3305, Document Recognition V, pp. 22-32, 1998. One facet of that approach is to identify words that belong to the same logical unit. It focuses on features that help word clustering into textual units. After block segmentation, row and column structure is determined by traversing margin structure. The method works well on some isolated tables; however, it may also erroneously extract "table structures" from non-table regions.
- Despite many years of research toward automated information extraction from tables (and the initial step of recognizing a table in the first place), the problems have still not been solved. The automatic extraction of information is difficult for several reasons.
- Tables have many different layouts and styles. Lopresti, D., and Nagy, G., “A Tabular Survey of Automated Table Processing,” in Graphics recognition. Recent Advances, vol. 1941 of Lecture Notes in Computer Science, pp. 93-120, Springer-Verlag, Berlin, 2000. Even tables representing the same information can be arranged in many different ways. It seems that the complexity of possible table forms multiplied by the complexity of image analysis methods has worked against the production of satisfactory and practical results.
- Even though image analysis methods identify table structures and perform their segmentation, they typically do not rely on understanding about the logic of the table. This part is left to the information extraction field. In his dissertation, Hurst provides a thorough review of the current state-of-the-art in table-related research. Hurst, M. F., “The Interpretation of Tables in Texts,” PhD Thesis, 301 pages, The University of Edinburgh, 2000. Hurst notes that table extraction “has not received much attention from either the information extraction or the information retrieval communities, despite a considerable body of work in the image analysis field, psychological and educational research, and document markup and formatting research.” As possible reasons, viewed from an information extraction perspective, Hurst identifies lack of current art and model, no training corpora, and confusing markup standards. Moreover, “through the various niches of table-related research there is a lack of evolved or complex representations which are capable of relating high- and low-level aspects of tables.”
- The problem of table analysis has been approached from two extremely different directions: one that requires table understanding and another that does not require table understanding. Table understanding typically involves detection of the table logic contained in the logical relationships between the cells and meta descriptors. Meta descriptors are often explicitly enclosed in columns and stub headers or implicitly expressed elsewhere in the document. The opposite approach requires little or no understanding of the logic but focuses on the table layout and its segmentation. This dual approach to table processing is also reflected in patent descriptions.
- One group of patents concentrates on the image processing side. For example, Wang et al. in U.S. Pat. No. 5,848,186 analyzes an image to build a hierarchical tree structure for a table. The table structure is constructed as text in the table is detected and arranged in groups reflecting column and row organization. The table structure emerges to some degree but there is no effort to attach any functionality to the extracted groups of texts. Wang, S-Y., and Yagasaki, T., “Feature Extraction System for Identifying Text Within a Table Image,” U.S. Pat. No. 5,848,186, Dec. 8, 1998.
- Another example of a patent with the focus on image processing is one by Mahoney in U.S. Pat. No. 6,009,196. Mahoney, J. V., “Method for Classifying Non-running Text in an Image,” U.S. Pat. No. 6,009,196, December 1999. A stated objective of that patent is to provide classification of document regions “as text, a horizontal sequence, a vertical sequence, or a table.” The method does not appear to perform any data extraction.
- A second group of patents concentrates on retrieving tabular data from textual sources. In general, the graphical representation of the document is ignored and what counts is mainly the text, including blanks between texts. For example, in U.S. Pat. No. 5,950,196 by Pyreddy, table components, such as table lines, caption lines, row headings, and column headings, are identified and extracted from textual sources. Pyreddy, P., and Croft, B., "Systems and Methods for Retrieving Tabular Data from Textual Sources," U.S. Pat. No. 5,950,196, September 1999. The system may produce satisfactory results with regard to the data granularity required for human queries and interpretation. However, it would not likely be applicable for database upload applications.
- One approach that appears to be missing from the references is to exploit the synergy between our intuitive understanding of documents and advances in image processing and information retrieval. Using a user's input to indicate structural features and a computer's processing power to search out and extract data from such structures offers a promising approach to information extraction from documents exhibiting regular data structures.
- In a preferred embodiment, the invention includes a computer-implemented method for extracting information from a population of subject documents. In that embodiment, the method includes modeling a document structure. The modeled document structure includes at least a document component hierarchy with at least one record type. Each record type includes at least one record part type and at least one record part type comprising at least one data element type. For a subject document exhibiting at least a portion of the modeled document structure, preferred embodiments of the invention identify data of a type corresponding to at least one modeled data element type. Identified subject document data is then associated with the corresponding modeled data element type.
- In another preferred embodiment, the invention includes a method for horizontally aligning a first region of a document with a second region of a document where each region is characterized by a plurality of sub-regions. This embodiment includes determining a type for each of a plurality of sub-regions in each region and then determining an edit distance for each typed first region sub-region, typed second region sub-region pair. A first sub-region offset is calculated for those pairs characterized by an edit distance not greater than a threshold. A first region offset is determined as a function of the individual first region sub-region offsets. In a particular embodiment, regions correspond to pages and sub-regions correspond to lines.
- Preferred embodiments of the present invention are shown by example and not limitation in the accompanying Figures, in which:
-
FIG. 1 is an example of a long record, in accordance with a preferred embodiment of the present invention, broken by the end of a page; -
FIG. 2 is an example of a record consisting of a header, a table, and a footer, in accordance with a preferred embodiment of the present invention; -
FIG. 3 is an example of a document page with structural patterns decomposed into three different records in accordance with a preferred embodiment of the present invention; -
FIG. 4 is an example of data element selection in accordance with a preferred embodiment of the present invention; -
FIG. 5 is another example of data element selection in accordance with a preferred embodiment of the present invention, including meta-data indicated for extraction; -
FIG. 6 illustrates variations of generalizing a line pattern in accordance with preferred embodiments of the present invention; -
FIG. 7 illustrates a line pattern data structure in accordance with preferred embodiments of the present invention; -
FIG. 8 illustrates a record data structure in accordance with preferred embodiments of the present invention; -
FIG. 9 illustrates a data element data structure in accordance with preferred embodiments of the present invention; -
FIG. 10 illustrates horizontal offset between two aligning line patterns, in accordance with preferred embodiments of the present invention; and -
FIG. 11 illustrates steps in data element capture in accordance with preferred embodiments of the present invention.
- As required, detailed embodiments of the present invention are disclosed herein. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale, and some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention.
- A wide variety of printed documents exhibit complex data structures including textual or numerical data organized in rows and columns, i.e., tables, along with more general structures, one of which may be broadly characterized as consisting of one or more contextually-related elements including possibly tables, i.e., records. Contextual relationship typically binds the data elements together in such structures. The relationship can be communicated by characteristics, such as sequence of occurrence, spatial distribution, and landmark (e.g., keyword, graphical reference) association. In other words, data organized in tables, and more generally in records, presents itself as structural patterns recognizable by detecting the above characteristics, among others.
- Preferred embodiments of the present invention employ a paradigmatic structure type, i.e., the Flexrecord, to represent structural patterns typically found, for example, in documents such as credit card bills and insurance payment summaries. The Flexrecord type definition includes up to three component types: header, table, and footer. The header includes the set of data elements within a record that are in a specified proximity to the beginning of the record. The table includes the repetitive, variable-length parts of the record. If a record contains a table, a footer may follow as the third component part type of a Flexrecord. The footer includes the set of data between the end of the table and the end of the record.
FIG. 2 gives an example of a record 200 . Lines 1-5 contain a header 202 , lines 6-8 contain a table 204 , and lines 9-11 contain a footer 206 . Given the three components of the record, five variants of records are noted:
- Header+table+footer
- Header+table
- Header
- Table+footer
- Table
- Since the footer must appear after the varying part of the record (the table), the two combinations (header+footer) and (footer) are not considered; they are simply treated as a header.
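The Flexrecord component structure and its five permitted variants can be sketched as follows. This Python fragment is illustrative only and not part of the patent; the part values are placeholders for the header, table, and footer part types described above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Flexrecord:
    """A record with up to three component parts; any part may be absent."""
    header: Optional[object] = None
    table: Optional[object] = None
    footer: Optional[object] = None

    def is_valid(self):
        """Only five variants are permitted. A footer may appear only
        after a table, so (header+footer) and (footer) alone are excluded;
        per the description, those cases are treated as a header."""
        if self.footer is not None and self.table is None:
            return False
        return any([self.header, self.table, self.footer])
```

For example, `Flexrecord(header="h", table="t")` is a valid (header+table) variant, while `Flexrecord(header="h", footer="f")` is not.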
- The complexity of some documents can be addressed by defining more than one record for that document.
FIG. 3 shows an example of three different records describing various parts of a document page 300 . The first record 310 captures information from the page header. In fact, a Flexrecord header part is sufficient to describe all data elements contained there. Two records capture data from the body of the page 320 . These records include a header 325 , table 327 , and footer 329 . The bottom part of the page 330 can be considered a header part of another record. This combination of records decomposes the problem of data extraction from a document consisting of three structural patterns.
- Preferred embodiments of the present invention implement data extraction from complex document structures in two broad steps: training and extraction.
- During the training step, a model of the paradigmatic document structure for a population of subject documents is constructed using three inputs: a document component hierarchy (DCH) in the manner of declarations of one or more Flexrecords, spatial and character data from a training document, and information supplied by a user relating the spatial and character data to the DCH hierarchy.
- The DCH is built based on a record declaration provided by a user. In preferred embodiments, a user decides the number of records, makes a selection of elementary data elements in each part of the record, assigns a name to each data element, outlines one or more whole records and the table part of each, and saves descriptions in an external file. For example, a record template that reflects the requirement for defining various structural and elementary data elements might be encoded in the following way:
- R{number}.{part} {name}
- where, “R” is an arbitrary letter indicating the record definition, {number} is a number identifying the record,“.” is a separator, {part}={{“H”, “ ”}, “T”, “F”} is record part indicator, and {name} is a specific name assigned to the data element.
- A specific definition of record R1 could consist of the following descriptors:
- R1—declaring the whole record structure
- R1.T—declaring the whole table part of the flex-record
- R1.provider, R1.account, R1.group—declaring single data elements in the header part
- R1.Tdeductible, R1.Tservice_dates—declaring single data elements in the table part
- R1.Ftotal_deductible, R1.Fbalance_due—declaring single data elements in the footer part
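The descriptor encoding above can be parsed mechanically. The following Python sketch is illustrative only, not part of the patent; note one inherent ambiguity of this simple encoding, which the sketch inherits: a data element name beginning with H, T, or F would be read as a part indicator.

```python
import re

# R{number}.{part}{name}: part is "H" (or empty) for header, "T" for
# table, "F" for footer; the dot and everything after it are optional.
DESCRIPTOR = re.compile(r"^R(?P<number>\d+)(?:\.(?P<part>[HTF])?(?P<name>\w+)?)?$")

def parse_descriptor(text):
    """Return (record number, part, data element name).

    "R1"            -> whole record:        (1, None, None)
    "R1.T"          -> whole table part:    (1, "T", None)
    "R1.provider"   -> header element:      (1, "H", "provider")
    "R1.Tdeductible"-> table element:       (1, "T", "deductible")
    """
    m = DESCRIPTOR.match(text)
    if not m:
        raise ValueError(f"not a record descriptor: {text!r}")
    # An empty part indicator with a name present means a header element.
    part = m.group("part") or ("H" if m.group("name") else None)
    return int(m.group("number")), part, m.group("name")
```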
- The spatial distribution of characters and text content of the training document includes character and positional data obtained from Optical Character Recognition (OCR). Each recognizable character in the document is described by its ASCII code, XY position, width, and height. In preferred embodiments of the invention, a document is characterized at several levels of abstraction, e.g., page, line, word, and character levels using data beyond that gathered by conventional OCR.
- The page level includes descriptions of all pages included in the document. These include page dimensions and margin offsets. The line level includes the total number of document lines, vertical offsets to line bases from the beginning of the document, and the number of words in each line. An OCR engine predetermines the division of a document into lines. The word level includes the total number of document words, horizontal offsets to each word's start and end as measured from the line beginning, and the number of characters in each word. The character level includes the total number of document characters, horizontal offsets to characters, their widths and heights, as well as the complete document text. The text includes words separated by single spaces; multiple spaces are compressed.
- One operation performed in preferred embodiments of the present invention, in order to facilitate accurate measurements, is data horizontal alignment. One source of misaligned data is misaligned pages affected by the scanning or imaging processes. Data alignment also may be needed because data on printed (or electronic) copies may be laid out differently in the original document production system. For instance, odd and even pages may have different margins on the left side.
- Data alignment is used in both the training and extraction processes of preferred embodiments of the present invention. During training, several benchmark statistics are gathered over various document and record parts, which can involve different pages. For example, when calculating deviations in distances between similar or dissimilar line patterns, the characters in the evaluated lines should be properly aligned. Otherwise, misalignment by the width of one character may produce erroneous data and lead to incorrect decisions.
- The data alignment algorithm is based on the same assumption as is the source of the structural patterns: the repeatability of the data and meta-data in the documents of the same type. In other words, if the given document type contains structural data then it should be possible to align data using the algorithm presented below. Data alignment is different from typical page or form alignment that is based on the assumption that there exist so-called registration marks on pages or forms. Such marks may be preprinted on the page or a fragment of a form, or in some instances, graphics (logos, form lines) can be used for registration.
- In order to align any two pages of a document, a fixed number of the most similar line patterns from both pages are collected. For each pair, a horizontal offset is calculated. The offset is measured between correlated character positions that may involve original or generalized characters (
FIG. 10 ). The final offset between the two pages is calculated as the average offset within the largest cluster of similar offsets.
- Finding the most similar line patterns involves three processing steps. First, line patterns are generated using generalization variation 2 (numeric and alphabetic characters). Generalization is described in detail below. Next, the similarity between pairs of lines is found using an algorithm measuring the distance between two strings. The algorithm penalizes each character insertion, deletion, and substitution with an equal cost. Similar line patterns have a low ratio (close to 0) of the distance to the length of the shorter line pattern. Finally, the most similar pairs of line patterns are selected as those with the lowest ratio below some dissimilarity threshold.
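A minimal sketch of generalization over numeric and alphabetic character classes (the patent's "generalization variation 2") follows. The class symbols "9" and "A" are assumptions for illustration; the patent names the character classes but not the symbols used to represent them:

```python
def generalize(line):
    """Map each digit to "9" and each letter to "A", leaving other
    characters unchanged, producing a line pattern."""
    out = []
    for ch in line:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)
```

Under this scheme, two table rows with different contents but the same layout, e.g., "Claim 123-A" and "Totls 456-B", generalize to the same pattern "AAAAA 999-A", which is what makes pattern matching across pages possible.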
- The algorithm used in the process of determining which line patterns are similar does not involve character positional data because this data is not completely reliable due to misaligned pages. The algorithm calculates edit distance by dynamic programming techniques known to those skilled in the art. Given two strings, source string s1s2 . . . sn and target string t1t2 . . . tm, the algorithm finds distance dn,m. Intermediate distances di,j are defined as follows:
- d0,0=0
- di,0=di−1,0+delete cost
- d0,j=d0,j−1+insert cost
- di,j=min {(di−1,j+delete cost), (di,j−1+insert cost), (di−1,j−1+(if (si==tj) then 0 else substitution cost))}
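The recurrence above translates directly into a dynamic programming routine. This Python sketch is illustrative, not from the patent; it uses unit costs for insertion, deletion, and substitution, as the description specifies equal penalties:

```python
def edit_distance(source, target, cost=1):
    """Edit distance d(n,m) between two strings via the recurrence:
    d(i,0) = d(i-1,0) + delete cost; d(0,j) = d(0,j-1) + insert cost;
    d(i,j) = min(delete, insert, diagonal + substitution-if-different)."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost              # delete
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost              # insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else cost
            d[i][j] = min(d[i - 1][j] + cost,     # delete
                          d[i][j - 1] + cost,     # insert
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[n][m]

def dissimilarity(p1, p2):
    """Ratio of edit distance to the length of the shorter line pattern;
    similar line patterns have a ratio close to 0."""
    return edit_distance(p1, p2) / min(len(p1), len(p2))
```

Pairs of line patterns whose dissimilarity falls below the chosen threshold are the candidates used for page alignment voting.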
- It has been empirically determined that the number of offsets (votes) in the largest cluster should be at least half of all the similar pairs. Due to the complexity of the string alignment algorithm, it is practical to stop searching for the best aligning pairs of line patterns once about fifteen of them have been collected. The larger the overall number of votes, and the larger the ratio between the number of votes in the best cluster and all the votes, the greater the confidence in the data alignment.
- Preferred embodiments of the invention leverage a user's ability to distinguish relevant parts of a document from irrelevant parts. The training step performed by the user involves indicating the structure of one or more records and the actual type of contents for extraction, as well as naming the particular data element types that have been indicated. This step links the spatial distribution of characters and the text content to the document component hierarchy (DCH).
- The structure and the scope of the record, its various parts, and specific data elements selected for extraction are indicated by drawing bounding boxes around the components. The contents of the record are indicated by drawing a box around the whole record. The table part is indicated by drawing a box around the table contents. This action also implicitly separates the header and the footer parts. Each particular data element or structure is indicated by a separate bounding box and a name is selected from a predefined set of names.
- Data elements selected for extraction from the header and footer are outlined and named. In the table part, data elements of only one row need to be defined. In case of a missing column entry, the user may draw an empty box as a column data placeholder. Among the many rows in a given table, any row can be selected for defining table data elements, as long as the row is representative of the row type. Any subset of columns can be selected for extraction.
- While some embodiments of the invention employ only record and data element specification (foregoing user-specified or -indicated record parts), an advantage of other embodiments in which the user specifies record parts in the DCH and indicates at least the table (the line-to-line repeating part of a record) as a record part is that the user can typically recognize the table more reliably than can a program executing on a computer. This is especially true where not every entry in the table contains data.
- Referring to the record 400 illustrated in FIG. 4, the process of a record definition will be described. Two structural parts are outlined: the complete record (R1) 402 and the table inside the record (R1.T) 404. Four (4) data elements 406 are identified within the header 408. For illustrative purposes, the data element ANX Center is shown labeled as R1.Provider 410. The remaining definitions pertain to specific data elements selected for extraction. All data elements selected in the header and footer are outlined and given names. (For picture clarity, only a few names are shown.) In the table R1.T 404, data elements for service dates 412, procedure 414, deductible 416, and payment 418 are indicated. As an example, FIG. 4 shows a deductible 416 data element labeled as R1.T.Deductible 420. Likewise, the figure shows data element 2010.00 430 selected and identified as R1.F.Balance 432 in footer R1.F (unlabeled) 440. Additional footer 440 elements are indicated for deductible total (i.e., 167.50) 434, payment total (i.e., 3010.00) 436, and total paid (e.g., 1000.00) 438. “Group” 440 and “inventory” 442 fields were not selected for extraction at this time, but they can be added to the model at any time later.
- There are a huge number of possible selections for data extraction. For example, a record with 20 data elements can be defined in over a million ways! This number is even larger because preferred embodiments of the present invention allow for grouping of data elements, as well as extracting data with or without descriptors. Provision for meta-data extraction gives additional flexibility in post-processing of extracted data. However, there is some risk involved in relying on extracted descriptors/metadata: the extracted descriptors may include OCR errors, whereas the names assigned to data elements will not be as susceptible to that source of error.
- FIG. 5 presents some alternative ways of defining data elements for extraction. The underlying record 500 in FIG. 5 is identical to the record 400 illustrated in FIG. 4. The data elements and corresponding descriptors for Name and Account # are identified as a single data element 510. The two data elements in the first line of the header are combined and extracted as one line, with their descriptors, as part of one data element. Other data elements are extracted together with their descriptors, while the descriptors Provider 532 and Inventory # 534 from the third line are not extracted, although the corresponding data elements are.
- Given the three inputs discussed above, preferred embodiments of the present invention round out the model of the document structure by deriving type definitions for the whole document, each line pattern, each record, and each data element. The document definition contains descriptions that enable recognition of document and record parts and navigation through the document.
- In the domain of documents that contain flex-records, we find a “line” of characters to be a particularly useful unit in document processing. Lines exhibit easily recognizable visual patterns. Spacing between the columns of a table, along with the column data types, clearly identifies the table part of a record. Indentation, spacing, meta-descriptors, sequence, and the content itself identify other parts of the record or the document. The document model stores generalized line descriptions as line patterns (LP). Generalization of a line description corresponds to identifying the type of character at each position in the line.
- Another pattern exploited in the data extraction of structural patterns is the line sequence pattern (LSP). This pattern is maintained by storing the order of line patterns (in preferred embodiments, line patterns are typically stored in the record type definition they are associated with). LPs and LSPs are useful in recognizing and processing various parts of the document, especially enclosed records. There may be gaps between line patterns in the sequence. The gap extent can be estimated based on the training document(s).
- Depending on the richness and availability of the source data, a line pattern may be created in at least three variations: variation 1—numeric; variation 2—numeric & alphabetic; variation 3—numeric, alphabetic & meta-data. In each variation, spacing between words is captured by filling in with blank characters. In addition, at each variation, numeric data is generalized by turning every digit into a designated character, e.g. ‘9’. Other methods for indicating the type of a character or character position will be apparent to those skilled in the art.
- At the second variation, strings are generalized by turning upper case and lower case letters into designated characters, e.g. ‘X’ and ‘x’, respectively. Non-alphanumeric characters are left unchanged.
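The character-type generalization of variations 1 and 2 may be sketched as follows (the function name and the use of Python's built-in character classification are illustrative):

```python
def generalize(line, variation):
    """Generalize a line to a line pattern.
    Variation 1: every digit becomes '9'.
    Variation 2: additionally, upper case letters become 'X' and lower case 'x'.
    Non-alphanumeric characters (including spacing) are left unchanged."""
    out = []
    for ch in line:
        if ch.isdigit():
            out.append('9')
        elif variation >= 2 and ch.isupper():
            out.append('X')
        elif variation >= 2 and ch.islower():
            out.append('x')
        else:
            out.append(ch)
    return ''.join(out)
```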
- In order to generate a line pattern at the third variation, a sample of similar lines is analyzed and a determination is made as to which words are meta-descriptors and which are data. The distinction is made by comparing similar lines and identifying words that are not identical. Identical words most likely carry meta-descriptors, whereas changing words carry data. The distinction could also be made based on similar words, instead of identical words, in order to neutralize errors introduced by the OCR process; a flexible matching of words would then be applied in place of the strict match. If it is possible to distinguish meta-descriptors from data, then the data characters can be turned into the designated characters (X, x); otherwise, the line pattern may remain at the second variation of generalization.
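A simplified sketch of this meta-descriptor/data distinction follows. It assumes strict word matching and whitespace tokenization, whereas the embodiments described above may compare character positions and apply flexible matching to absorb OCR errors:

```python
def classify_words(similar_lines):
    """Label each word position 'meta' (word identical across every sample
    line) or 'data' (word varies).  Assumes the lines split into the same
    number of whitespace-separated words -- a simplification of the
    position-based comparison described in the text."""
    rows = [line.split() for line in similar_lines]
    labels = []
    for column in zip(*rows):  # walk word positions across the sample lines
        labels.append('meta' if len(set(column)) == 1 else 'data')
    return labels
```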
- FIG. 6 gives an example of a line generalized to line patterns at each of the three variations. In order to generalize at variations 1 and 2, no sample of similar lines is needed. Generalization at variation 1 can typically be applied with no loss of useful information because numeric strings are rarely used as meta-descriptors. In FIG. 6, for variation 1 601, numeric data is indicated by the numeral ‘9’. Generalization at variation 2 is also easily applicable; however, loss of some useful information may occur. For example, such information could be in the form of meta-descriptors that best identify data elements. Generalization at variation 3 can be applied only when similar lines are found in the training document and the distinction between data and meta-data can be made with some certainty.
- For each line pattern, a set of useful statistics is stored, as defined in the type definition shown in
FIG. 7. The statistics are calculated by comparing the source lines with the line pattern. They serve as benchmark figures in the process of recognizing document lines. The line pattern type definition stores information about the source of the line pattern 702, identifies its generalization variation 704, and stores a string 706 representing the line pattern as well as character position 708 and width 710. Attributes like the number of alpha characters 712, numeric characters 714, other characters 716, the number of meta-words 718, the number of aligned words 720, and matching ratios are used in the classification process.
- In preferred embodiments of the present invention, the character pattern distance between two lines is measured as the ratio of the number of misaligned (substituted, inserted, or deleted) characters to the number of characters in the shorter line. Character alignment is determined based on their X-positions in the lines. A character pattern distance close to zero indicates similar lines, whereas a character pattern distance above one indicates dissimilar lines. A character pattern distance between 0.1 and 0.9 represents a gray area for classification.
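For illustration, the character pattern distance may be sketched with each line represented as a list of (character, X-position) pairs sorted by X; the alignment tolerance is an assumed parameter, not a value from the described embodiments:

```python
def char_pattern_distance(line_a, line_b, x_tolerance=3):
    """Ratio of misaligned characters to the length of the shorter line.
    Characters are aligned when their X positions fall within `x_tolerance`;
    aligned-but-different characters count as substitutions, unaligned ones
    as insertions/deletions.  Both lines must be sorted by X position."""
    misaligned = 0
    i = j = 0
    while i < len(line_a) and j < len(line_b):
        ca, xa = line_a[i]
        cb, xb = line_b[j]
        if abs(xa - xb) <= x_tolerance:
            if ca != cb:
                misaligned += 1  # substitution
            i += 1
            j += 1
        elif xa < xb:
            misaligned += 1      # character only in line_a (deletion)
            i += 1
        else:
            misaligned += 1      # character only in line_b (insertion)
            j += 1
    misaligned += (len(line_a) - i) + (len(line_b) - j)  # trailing leftovers
    shorter = min(len(line_a), len(line_b))
    return misaligned / shorter if shorter else float('inf')
```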
- In this case, context and two reference statistics from the line pattern structure become more useful. The first statistic, dmaxAlignDistRatioPos, is obtained by calculating the character pattern distance to a representative sample of similar lines in the training document and selecting the maximum character pattern distance. It is a measure of how dissimilar the lines represented by the line pattern may be. The second statistic, dminAlignDistRatioNeg, is obtained by calculating the character pattern distance to a representative sample of dissimilar lines in the training document and selecting the minimal character pattern distance. It is a measure of how similar a line that is not like the line pattern might be. In cases when it is difficult to establish alignment between lines, or the alignment is not reliable, edit distance is calculated (dEditDistRatio) using a known algorithm (see the discussion of horizontal alignment above). The benchmark character pattern distance is measured between the source line and the line pattern.
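A sketch of how the two benchmark statistics can resolve the gray area (the thresholding logic and names here are illustrative, not the claimed classification procedure):

```python
def classify_line(distance, max_pos_dist, min_neg_dist):
    """Resolve the gray area using the two benchmark statistics stored with a
    line pattern (dmaxAlignDistRatioPos / dminAlignDistRatioNeg above)."""
    if distance <= max_pos_dist:
        return 'match'      # no worse than the worst known positive sample
    if distance >= min_neg_dist:
        return 'no-match'   # at least as far as the closest known negative
    return 'uncertain'      # fall back to context, e.g. the line sequence
```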
- The sampling of the similar and dissimilar lines can be conducted in the context of a particular part of the document or a record, knowing that some lines will not be classified in certain cases. For example, page header lines do not need to be sampled when classifying lines in the middle of the page.
- The record type definition consists of descriptions of the record, its layout, and links to its components (e.g., the elementary data elements). Both layout and data elements are described using versatile references to lines and line sequences. For example, each record is explicitly defined by the sequence of lines containing record data elements. In addition, lines that precede or follow the record or its parts (i.e., LSPs) are predictable due to the line order.
- A data element may be located relative to a specific line: it may start in that line, or after or before a line; however, it may never intersect some lines. The order of data elements in the record is dictated by the order of document lines. Two data elements may be in the same line, separated by some lines, in the previous or following line(s), not in a certain line, before or after a specific line, etc.
- These concepts are reflected in each record type definition. LP and LSP are two features of document structure used in describing record layout and record components. Each part of a record is characterized by the LPs within the scope of that part, as well as by lists of representative line patterns of preceding and following lines (stop lists). The stop lists of line patterns outside the record part are meant to provide additional criteria for terminating processing of the given part when there is uncertainty about that part's continuation.
- For example, tables are often followed by a line with totals. The structures of the table and totals lines are usually very similar. A table stop list becomes useful in order to prevent extraction of totals data as table data. In a different case, broken records introduce some uncertainty. A stop list that contains potential lines from the bottom of the page prevents mistakenly extracting data from those lines and directs processing to the next page.
- Preceding and following lines are grouped in two different stop lists for moving up or down the document. The lines can be ordered according to the most plausible encounter. For example, when moving down the document, the first line after the given part is placed at the head of the list. In case of moving up to preceding lines, the first line before the given part is placed at the head of the list.
- FIG. 8 shows an example type definition of a record structure 800. Name 802, position and size 804, along with record and table scope 806, provide source information about the record location and size in the training document. Next, four pairs of stop lists 808, for each part of the record as well as for all records on the page, are declared.
- Data element type specifications describe specific fields declared in a record. Similarly to the record type specifications, data element type specifications also rely on the line patterns and their sequence in a document. As illustrated in
FIG. 9, at a minimum a data element type definition 900 stores some identification information, like its index 902, source document name 904, position and size 906, and the range of lines which it spans 908. The remaining information can be derived from this data and the document and record models. For example, the document model supplies line patterns for the lines occupied by the data element, and for the lines that precede and follow the element. From the locations of the record and the data element, their relative placement can also be inferred.
- The data extraction process is driven in reference to the model developed in the training stage. Data extraction from a multi-page subject document is broken down into data extraction from single pages of the document. Each page is processed independently of the previous and following pages. The data extracted from separate pages can be reassembled into complete structures based on the order of extraction and the meta-descriptors associated with the extracted data. The independent page processing assumption makes sense in view of the discontinuities introduced by page footer and header information, and it also simplifies extraction from broken records.
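The minimal identification information described above for a data element type definition (FIG. 9) might be represented, for illustration only, as follows (field names are assumptions, not the claimed structure):

```python
from dataclasses import dataclass

@dataclass
class DataElementType:
    """Minimal data element type definition; mirrors the identification
    information described for FIG. 9 with illustrative field names."""
    index: int            # element index within its record type (cf. 902)
    source_document: str  # name of the training document (cf. 904)
    x: int                # position and size in the source page (cf. 906)
    y: int
    w: int
    h: int
    first_line: int       # range of lines the element spans (cf. 908)
    last_line: int

    def line_span(self):
        """Number of document lines the element occupies."""
        return self.last_line - self.first_line + 1
```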
- Before the search and capture of data elements from a subject document is performed, each page is preprocessed by image processing (deskew, despeckle), OCR, and horizontal data alignment. OCR generates information about the subject document pages, including character ASCII codes, X-Y positions, widths, and heights. In addition to OCR, each subject document description is enhanced by several means of indexing data at the page, line, and character level, as noted earlier for the training document.
- Page processing, in preferred embodiments, starts with the search for the beginning of the record. If the beginning of the record is not found on the page or some number of lines were skipped in the process, then the skipped section is analyzed to find possible ending parts of the record that might have started on the previous page. If any parts of the previous record are found, then they are extracted (as described below). If there is the beginning of the new record, then its extraction is initialized.
- The record data extraction process works in cycles predetermined by the order of record parts (header, table, footer), and the order of data elements. In preferred embodiments, this order is the order in the DCH. The order of individual data elements within a record part is determined based on the Y (vertical) position of the beginning of given data elements.
- Each part is processed until all data elements from that part are extracted (or determined not to be present) or a break in the part is encountered, such as the start of the following part or the end of the page. There are two elementary phases involved in the data element extraction process: searching for the data element and its capture.
- Search for the data elements is conducted either in reference to the beginning of the record (data in the header part and the first row of the table) or in reference to already captured data elements (table data in the rows following the captured ones, footer data in reference to the last row of a table). Searching for a record involves obtaining from the record type definition direct and indirect information characterizing the beginning of the record. This information is provided in the form of line patterns and their relations to the beginning line.
- Line patterns may directly describe the beginning of the record or may only help in searching for the record. For example, a direct line pattern may simply correspond to the first line of the record and finding a good match guarantees successful completion of the search. Indirect line patterns may provide information about lines that are not part of the record, so lines matching such patterns should be omitted.
- Once a reference line is determined, the relationship between the line pattern and the searched-for element is utilized to find the element's actual location. In particular, the relationship dictates the next action: whether to stay in the same line or move a number of lines forward or backward. The moves occur with the support of the indirect information. Specifically, if the searched-for data element is located a number of lines below the reference line, according to the document model, then the current line is not only advanced to that line, but the move is also monitored by testing skipped and target lines for any irregularities.
- The tests include checking that the current record part is not broken, and determining if additional lines were inserted (this includes extra lines resulting from noisy OCR). Depending on the recognized irregularity, a proper action is taken.
- In most cases, the line including the searched-for subject document data element will be found, and the data capture may be initialized. Note that finding a data element does not require an exact match to the criteria; typically, a strong correspondence will suffice. Data capture starts with determining the horizontal (x) position of the beginning of the data element. Horizontal position data is readily available from the data element type specification's (x, y, w, h) description of the data element, adjusted for the horizontal offset of the current page.
- The initial width of the data element assumed from the model is subject to adjustment on both sides, by expansion and contraction. Before the expansion adjustment, a measure is taken of the available space between the subject document data element and the characters immediately preceding and following it in the model line. If there is sufficient space, then characters are added until the captured word is completed or there is no more space to expand. In case the width of the data element is too large, e.g. there are fewer characters in the current document than in the model, the final width will be determined based on the size of the actual characters enclosed in the approximate bounding box.
- FIG. 11 illustrates the steps during the capture process involving a data element within one line. In the case of data elements spanning multiple lines, Step 2 is followed by vertical expansion. The lines following the data element's top line are verified to carry the remainder of the data element, and the bounding box is expanded accordingly. Step 3, contracting the size of the data element, performs both horizontal and vertical contraction.
- In preferred embodiments of the invention, verification is performed after each component has been extracted from the document. In alternate embodiments, verification is performed at the end of the process. Verification involves both record data elements and the structures they are part of. Data element verification involves testing whether the element's content matches the general description of that data element acquired from the training document and from the user, or inferred from separate or combined inputs, and stored in the document, record, and data element models. The general description may include data types and the scope of valid values. The data element is assigned a confidence based on the degree of match.
- In preferred embodiments, after extracting a part of the record, such as header, footer, or a row from the table, that part of the record is also verified. The verification involves testing for completeness of the part and the integrity of mutual relationships between involved components. One of the tests may involve testing geometric or spatial relationships between components. More specifically, the tests may involve comparing both vertical and horizontal alignment of the extracted data elements. Two measures are produced to reflect confidence in the extracted substructure: the number of extracted components out of the number of expected components in the part, and the number of correct relationships out of the total number of tested relationships.
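The two confidence measures described above for an extracted record part can be sketched as simple ratios (the function and parameter names are illustrative):

```python
def part_confidence(extracted, expected, correct_rel, tested_rel):
    """Confidence measures for an extracted record part: component
    completeness (extracted vs. expected components) and relationship
    integrity (correct vs. tested component relationships)."""
    completeness = extracted / expected if expected else 1.0
    integrity = correct_rel / tested_rel if tested_rel else 1.0
    return completeness, integrity
```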
- Another verification involves testing for completeness of the record. Final verification involves testing if the page or the document contains any unprocessed areas that could fit a record structure but for some reason no data was extracted. In case such areas exist, preferred embodiments of the invention report a list of suspected regions with the confidence level that reflects degree of fit between the record model and the region.
Claims (17)
1. A computer-implemented method for extracting information from a population of one or more subject documents, the method comprising:
modeling a document structure representative of the population,
the modeled document structure comprising a document component hierarchy,
the document component hierarchy comprising
at least one record type,
each record type comprising at least one record part type, and
at least one record part type comprising at least one data element type;
for a subject document exhibiting at least a portion of the modeled document structure,
identifying subject document data of a type corresponding to at least one modeled data element type and
associating the identified data with the corresponding modeled data element type;
wherein modeling a document structure further comprises, in a computer:
obtaining an imaged and recognized training document;
accepting from a user: information regarding the document component hierarchy, and information regarding the relationship between at least one element of the document component hierarchy and a corresponding portion of the imaged and recognized training document;
specifying at least one data element type of the data component hierarchy, and its corresponding record type and line pattern, based on at least the imaged and recognized training document, the document component hierarchy, and the association information, the line pattern including the identification of the type of character at each position in the line;
wherein the modeled document structure further comprises the imaged and recognized training document, each specified data element type, each specified line pattern type, and each specified record type; and
wherein identifying data of a type corresponding to at least one specified data element type comprises, in a computer:
imaging the subject document;
recognizing characters of the imaged subject document;
locating, within the imaged and recognized subject document, at least one data element of a type specified in the document model; and
capturing at least one data element found to be of a type specified in the document model.
2. A computer-implemented method for extracting information from a population of one or more subject documents, the method comprising:
modeling a document structure representative of the population,
the modeled document structure comprising a document component hierarchy,
the document component hierarchy comprising
at least one record type,
each record type comprising at least one record part type, and
at least one record part type comprising at least one data element type;
for a subject document exhibiting at least a portion of the modeled document structure,
identifying subject document data of a type corresponding to at least one modeled data element type and
associating the identified data with the corresponding modeled data element type;
wherein modeling a document structure further comprises, in a computer:
imaging a training document;
recognizing characters of the imaged training document;
accepting, from a user: information regarding the document component hierarchy, and information regarding the relationship between at least one element of the document component hierarchy and a corresponding portion of the imaged and recognized training document;
specifying at least one data element type of the data component hierarchy, and its corresponding record type and line pattern, based on at least the imaged and recognized training document, the document component hierarchy, and the association information, the line pattern including the identification of the type of character at each position in the line;
wherein the modeled document structure further comprises the imaged and recognized training document, each specified data element type, each specified line pattern type, and each specified record type; and
wherein identifying data of a type corresponding to at least one specified data element type comprises, in a computer:
imaging the subject document;
recognizing characters of the imaged subject document;
locating, within the imaged and recognized subject document, at least one data element of a type specified in the document model; and
capturing at least one data element found to be of a type specified in the document model.
3. The method as in claim 2:
wherein identifying data of a type corresponding to at least one specified data element type further comprises, in a computer determining a subject document line pattern for each subject document line;
wherein a line sequence pattern comprises a plurality of ordered line patterns;
wherein locating, within the imaged and recognized subject document, at least one data element of a type specified in the document model comprises:
for each record type in the document model containing a data element of interest,
locating, within the subject document, at least one of a line pattern and a line sequence pattern that corresponds to the record type,
locating the data element of interest based on its specified relationship to the structure of the record type.
4. The method as in claim 2 wherein recognizing characters of an imaged training document comprises:
performing optical character recognition (OCR) on the training document.
5. The method as in claim 2 wherein imaging a training document comprises at least one of the following:
deskewing the training document image,
despeckling the training document image,
horizontally aligning a target page of the training document with reference to a base page of the training document.
6. The method as in claim 5 wherein horizontally aligning one region of the training document with reference to a base region of the training document comprises:
generalizing a plurality of lines in the target page and the base page;
determining an edit distance for each generalized target page line, generalized base page line pair;
determining a target page line offset for those pairs characterized by an edit distance not greater than a threshold;
determining a target page offset as a function of the target page line offsets, and
offsetting the target page by the offset.
7. The method as in claim 2 further comprising:
assessing the degree to which the captured data element corresponds to the data element type.
8. The method as in claim 2 further comprising:
assessing the degree to which the subject document structure surrounding each identified subject document data element corresponds to the modeled document structure.
9. The method as in claim 8 wherein assessing the degree to which the subject document structure surrounding each identified subject document data element corresponds to the modeled document structure comprises:
determining a ratio of the number of data elements extracted from a subject document record and the number of data elements specified for a record of that type.
10. The method as in claim 8 wherein assessing the degree to which the subject document structure surrounding each identified subject document data element corresponds to the modeled document structure comprises:
comparing the horizontal and vertical alignment of record, record parts, and data elements of the subject document against the horizontal and vertical alignment of the corresponding types in the modeled document.
11. The method as in claim 8 wherein assessing the degree to which the subject document structure surrounding each identified subject document data element corresponds to the modeled document structure comprises:
determining a ratio of the number of data elements extracted from a subject document record part and the number of data elements specified for a record part of that type.
12. The method as in claim 2:
wherein capturing at least one data element found to be of a type specified in the document model comprises:
adjusting a capture window boundary to the dimensions of the actual data element in the subject document.
13. A computer program product for extracting information from a population of subject documents, the computer program product comprising:
a computer-readable medium;
a modeling module stored on the medium and operative to model a document structure,
the modeled document structure comprising a document component hierarchy,
the document component hierarchy comprising
at least one record type,
each record type comprising at least one record part type and
at least one record part type comprising at least one data element type;
an identification module, operative to identify subject document data of a type corresponding to at least one modeled data element type, and
an association module, operative to associate the identified data with the corresponding modeled data element type;
wherein the modeling module further comprises:
an imaging and character recognition module, operative to obtain the image and characters of a training document;
a user interface module, operative to prompt for and accept information from a user regarding the document component hierarchy and regarding the relationship between at least one element of the document component hierarchy and a corresponding portion of the imaged and recognized training document;
a specification module, operative to specify at least one data element type of the data component hierarchy, and its corresponding record type and line pattern, based on at least the imaged and recognized training document, the document component hierarchy, and the association information, the line pattern including the identification of the type of character at each position in the line;
wherein the modeled document structure further comprises the imaged and recognized training document, each specified data element type, each specified line pattern type, and each specified record type; and
wherein the identification module further comprises:
an imaging and character recognition module, operative to image and recognize the characters of the subject document;
a locating module, operative to locate, within the imaged and recognized subject document, at least one data element of a type specified in the document model; and
a capture module, operative to capture at least one data element found to be of a type specified in the document model.
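The document component hierarchy of claim 13 (record types containing record part types, which in turn contain data element types) can be sketched as a set of data classes. This is an illustrative assumption of how such a model might be represented, not the patent's own implementation; all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataElementType:
    """A leaf of the hierarchy: a typed field to extract."""
    name: str
    line_pattern: str  # assumed notation: character-type code per position, e.g. "AAAA 9999"

@dataclass
class RecordPartType:
    """A part of a record, grouping one or more data element types."""
    name: str
    data_element_types: List[DataElementType] = field(default_factory=list)

@dataclass
class RecordType:
    """A record, grouping one or more record part types."""
    name: str
    record_part_types: List[RecordPartType] = field(default_factory=list)

@dataclass
class DocumentModel:
    """The modeled document structure: one or more record types."""
    record_types: List[RecordType] = field(default_factory=list)
```

A model instance would then be built from user input and a recognized training document, as the claim's user interface and specification modules describe.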
14. The computer program product as in claim 13:
wherein the subject document imaging and character recognition module is further operative to determine a subject document line pattern for each subject document line;
wherein a line sequence pattern comprises a plurality of ordered line patterns; and
wherein the locating module is further operative to:
for each record type in the document model containing a data element of interest,
locate, within the subject document, at least one of a line pattern and a line sequence pattern that corresponds to the record type, and
locate the data element of interest based on its specified relationship to the structure of the record type.
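The locating steps of claim 14 can be sketched as follows: derive a line pattern for each document line, then search for an ordered sequence of line patterns that marks a record. The function names and the character-type codes ('A' for letters, '9' for digits) are illustrative assumptions, not the patent's notation.

```python
def line_pattern(line: str) -> str:
    """Map each character position to a type code:
    'A' for letters, '9' for digits, ' ' for whitespace,
    and the character itself for punctuation (assumed encoding)."""
    out = []
    for ch in line:
        if ch.isalpha():
            out.append('A')
        elif ch.isdigit():
            out.append('9')
        elif ch.isspace():
            out.append(' ')
        else:
            out.append(ch)
    return ''.join(out)

def find_record(doc_lines, sequence_pattern):
    """Return the index of the first line at which the document's
    consecutive line patterns match the ordered sequence pattern,
    or -1 if no record of that shape is present."""
    patterns = [line_pattern(l) for l in doc_lines]
    n = len(sequence_pattern)
    for i in range(len(patterns) - n + 1):
        if patterns[i:i + n] == sequence_pattern:
            return i
    return -1
```

Once a record is located this way, a data element of interest would be read off by its position relative to the matched record structure.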
15. A computer-implemented method for modeling a document structure, the method comprising:
imaging a training document;
recognizing characters of the imaged training document;
accepting, from a user: information regarding the document component hierarchy, and information regarding the relationship between at least one element of the document component hierarchy and a corresponding portion of the imaged and recognized training document;
specifying at least one data element type of the document component hierarchy, and its corresponding record type and line pattern, based on at least the imaged and recognized training document, the document component hierarchy, and the association information, the line pattern including the identification of the type of character at each position in the line;
wherein the modeled document structure further comprises the imaged and recognized training document, each specified data element type, each specified line pattern type, and each specified record type.
16. A computer-implemented method for aligning at least two pages of a document, the method comprising:
collecting a predetermined number of similar line pattern pairs, one line pattern of each pair being from each of the at least two pages, wherein a line pattern includes the identification of the type of character at each position in the line;
calculating a horizontal offset for each pair by measuring between correlated character positions;
determining similar offsets among multiple pairs to form a cluster of similar offsets;
calculating an average offset for the similar offsets in the cluster.
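The alignment steps of claim 16 amount to: compute a horizontal offset per pair of correlated positions, cluster the similar offsets, and average the dominant cluster. A minimal sketch, assuming pairs are given as (position on page A, position on page B) and that "similar" means within a small integer tolerance; the function name and tolerance are hypothetical.

```python
def align_pages(pattern_pairs, tolerance=2):
    """Estimate the horizontal offset between two pages from pairs of
    correlated character positions in similar line patterns.
    Offsets within `tolerance` of each other form a cluster; the
    largest cluster's average is returned as the page offset."""
    offsets = sorted(b - a for a, b in pattern_pairs)
    if not offsets:
        return 0.0
    best = []
    i = 0
    for j in range(len(offsets)):
        # shrink window until all offsets inside it are similar
        while offsets[j] - offsets[i] > tolerance:
            i += 1
        if j - i + 1 > len(best):
            best = offsets[i:j + 1]
    return sum(best) / len(best)
```

Clustering before averaging keeps a few mismatched line-pattern pairs (outlier offsets) from skewing the result.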
17. The method of claim 16, wherein determining the similarity between line patterns includes:
generating line patterns;
identifying strings within each of the line patterns;
determining the length of the strings within the line patterns;
calculating the distance between the strings in each of the line patterns;
comparing the calculated distances among line patterns to determine similar line patterns.
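The similarity test of claim 17 compares the strings within two line patterns by their lengths and the distances between them. A sketch under stated assumptions: "strings" are taken to be maximal runs of non-blank pattern characters, "distance" the gap between consecutive strings, and the tolerance and function names are illustrative.

```python
import re

def pattern_features(pattern):
    """Extract the non-blank strings of a line pattern together with
    their lengths and the gaps (distances) between consecutive strings."""
    spans = [(m.start(), m.group()) for m in re.finditer(r'\S+', pattern)]
    lengths = [len(s) for _, s in spans]
    gaps = [spans[k + 1][0] - (spans[k][0] + len(spans[k][1]))
            for k in range(len(spans) - 1)]
    return lengths, gaps

def similar(p1, p2, tol=1):
    """Two line patterns are deemed similar when they contain the same
    number of strings and their string lengths and inter-string
    distances agree within a small tolerance."""
    l1, g1 = pattern_features(p1)
    l2, g2 = pattern_features(p2)
    if len(l1) != len(l2):
        return False
    return all(abs(a - b) <= tol for a, b in zip(l1 + g1, l2 + g2))
```

Comparing lengths and gaps rather than exact characters makes the test robust to OCR variation within a string while still capturing the line's layout.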
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/544,693 US20070053611A1 (en) | 2002-05-17 | 2006-10-10 | Method and system for extracting information from a document |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/146,959 US7142728B2 (en) | 2002-05-17 | 2002-05-17 | Method and system for extracting information from a document |
US11/544,693 US20070053611A1 (en) | 2002-05-17 | 2006-10-10 | Method and system for extracting information from a document |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/146,959 Continuation US7142728B2 (en) | 2002-05-17 | 2002-05-17 | Method and system for extracting information from a document |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/822,697 Continuation US20100280053A1 (en) | 2003-02-07 | 2010-06-24 | CyclopropylmethYl-[7-(5,7-dimethyl-benzo[1,2,5]thiodiazol-4-yl)-2,5,6-trimethyl-7H-pyrrolo[2,3-d]pyrimidin-4-yl]-4-propyl-amine as a CRF Antagonist |
Publications (1)
Publication Number | Publication Date |
---|---|
US20070053611A1 true US20070053611A1 (en) | 2007-03-08 |
Family
ID=29418922
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/146,959 Expired - Fee Related US7142728B2 (en) | 2002-05-17 | 2002-05-17 | Method and system for extracting information from a document |
US11/544,693 Abandoned US20070053611A1 (en) | 2002-05-17 | 2006-10-10 | Method and system for extracting information from a document |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/146,959 Expired - Fee Related US7142728B2 (en) | 2002-05-17 | 2002-05-17 | Method and system for extracting information from a document |
Country Status (1)
Country | Link |
---|---|
US (2) | US7142728B2 (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070168382A1 (en) * | 2006-01-03 | 2007-07-19 | Michael Tillberg | Document analysis system for integration of paper records into a searchable electronic database |
US20080212845A1 (en) * | 2007-02-26 | 2008-09-04 | Emc Corporation | Automatic form generation |
US20100238474A1 (en) * | 2009-03-17 | 2010-09-23 | Konica Minolta Business Technologies, Inc. | Document image processing apparatus, document image processing method, and computer-readable recording medium having recorded document image processing program |
US20110010348A1 (en) * | 2009-07-09 | 2011-01-13 | International Business Machines Corporation | Rule-based record profiles to automate record declaration of electronic documents |
US20130167018A1 (en) * | 2011-12-21 | 2013-06-27 | Beijing Founder Apabi Technology Ltd. | Methods and Devices for Extracting Document Structure |
US20140369602A1 (en) * | 2013-06-14 | 2014-12-18 | Lexmark International Technology S.A. | Methods for Automatic Structured Extraction of Data in OCR Documents Having Tabular Data |
FR3012062A1 (en) * | 2013-10-18 | 2015-04-24 | Gerlon Sa | SANDING EQUIPMENT |
US9639900B2 (en) | 2013-02-28 | 2017-05-02 | Intuit Inc. | Systems and methods for tax data capture and use |
US9727804B1 (en) * | 2005-04-15 | 2017-08-08 | Matrox Electronic Systems, Ltd. | Method of correcting strings |
AU2013379776B2 (en) * | 2013-02-28 | 2017-08-24 | Intuit Inc. | Presentation of image of source of tax data through tax preparation application |
US20190034399A1 (en) * | 2015-06-30 | 2019-01-31 | Datawatch Corporation | Systems and methods for automatically creating tables using auto-generated templates |
US10204095B1 (en) * | 2015-02-10 | 2019-02-12 | West Corporation | Processing and delivery of private electronic documents |
US10438083B1 (en) | 2016-09-27 | 2019-10-08 | Matrox Electronic Systems Ltd. | Method and system for processing candidate strings generated by an optical character recognition process |
US10713524B2 (en) * | 2018-10-10 | 2020-07-14 | Microsoft Technology Licensing, Llc | Key value extraction from documents |
US10878516B2 (en) | 2013-02-28 | 2020-12-29 | Intuit Inc. | Tax document imaging and processing |
US20220245377A1 (en) * | 2021-01-29 | 2022-08-04 | Intuit Inc. | Automated text information extraction from electronic documents |
Families Citing this family (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030233619A1 (en) * | 2002-05-30 | 2003-12-18 | Fast Bruce Brian | Process for locating data fields on electronic images of complex-structured forms or documents |
WO2004030532A1 (en) * | 2002-10-03 | 2004-04-15 | The University Of Queensland | Method and apparatus for assessing psychiatric or physical disorders |
JP3888306B2 (en) * | 2002-12-27 | 2007-02-28 | ブラザー工業株式会社 | Data processing device |
US7751624B2 (en) * | 2004-08-19 | 2010-07-06 | Nextace Corporation | System and method for automating document search and report generation |
US7421651B2 (en) * | 2004-12-30 | 2008-09-02 | Google Inc. | Document segmentation based on visual gaps |
US7545981B2 (en) | 2005-11-04 | 2009-06-09 | Xerox Corporation | Document image re-ordering systems and methods |
WO2007070010A1 (en) * | 2005-12-16 | 2007-06-21 | Agency For Science, Technology And Research | Improvements in electronic document analysis |
US20070300295A1 (en) * | 2006-06-22 | 2007-12-27 | Thomas Yu-Kiu Kwok | Systems and methods to extract data automatically from a composite electronic document |
US20080065671A1 (en) * | 2006-09-07 | 2008-03-13 | Xerox Corporation | Methods and apparatuses for detecting and labeling organizational tables in a document |
US7899837B2 (en) | 2006-09-29 | 2011-03-01 | Business Objects Software Ltd. | Apparatus and method for generating queries and reports |
US8126887B2 (en) * | 2006-09-29 | 2012-02-28 | Business Objects Software Ltd. | Apparatus and method for searching reports |
US8204895B2 (en) * | 2006-09-29 | 2012-06-19 | Business Objects Software Ltd. | Apparatus and method for receiving a report |
US8108764B2 (en) * | 2007-10-03 | 2012-01-31 | Esker, Inc. | Document recognition using static and variable strings to create a document signature |
JP5338063B2 (en) * | 2007-10-31 | 2013-11-13 | 富士通株式会社 | Image recognition program, image recognition apparatus, and image recognition method |
US8358852B2 (en) * | 2008-03-31 | 2013-01-22 | Lexmark International, Inc. | Automatic forms identification systems and methods |
US8473467B2 (en) * | 2009-01-02 | 2013-06-25 | Apple Inc. | Content profiling to dynamically configure content processing |
US20110255794A1 (en) * | 2010-01-15 | 2011-10-20 | Copanion, Inc. | Systems and methods for automatically extracting data by narrowing data search scope using contour matching |
US8170372B2 (en) | 2010-08-06 | 2012-05-01 | Kennedy Michael B | System and method to find the precise location of objects of interest in digital images |
US8380753B2 (en) * | 2011-01-18 | 2013-02-19 | Apple Inc. | Reconstruction of lists in a document |
US8543911B2 (en) | 2011-01-18 | 2013-09-24 | Apple Inc. | Ordering document content based on reading flow |
US8996350B1 (en) | 2011-11-02 | 2015-03-31 | Dub Software Group, Inc. | System and method for automatic document management |
CN103164388B (en) * | 2011-12-09 | 2016-07-06 | 北大方正集团有限公司 | In a kind of layout files structured message obtain method and device |
US9064191B2 (en) | 2012-01-26 | 2015-06-23 | Qualcomm Incorporated | Lower modifier detection and extraction from devanagari text images to improve OCR performance |
US8831381B2 (en) | 2012-01-26 | 2014-09-09 | Qualcomm Incorporated | Detecting and correcting skew in regions of text in natural images |
US11631265B2 (en) * | 2012-05-24 | 2023-04-18 | Esker, Inc. | Automated learning of document data fields |
US9014480B2 (en) | 2012-07-19 | 2015-04-21 | Qualcomm Incorporated | Identifying a maximally stable extremal region (MSER) in an image by skipping comparison of pixels in the region |
US9141874B2 (en) | 2012-07-19 | 2015-09-22 | Qualcomm Incorporated | Feature extraction and use with a probability density function (PDF) divergence metric |
US9076242B2 (en) | 2012-07-19 | 2015-07-07 | Qualcomm Incorporated | Automatic correction of skew in natural images and video |
US9047540B2 (en) | 2012-07-19 | 2015-06-02 | Qualcomm Incorporated | Trellis based word decoder with reverse pass |
US20140023275A1 (en) * | 2012-07-19 | 2014-01-23 | Qualcomm Incorporated | Redundant aspect ratio decoding of devanagari characters |
US9262699B2 (en) | 2012-07-19 | 2016-02-16 | Qualcomm Incorporated | Method of handling complex variants of words through prefix-tree based decoding for Devanagiri OCR |
EP2988259A1 (en) * | 2014-08-22 | 2016-02-24 | Accenture Global Services Limited | Intelligent receipt scanning and analysis |
US20170116194A1 (en) * | 2015-10-23 | 2017-04-27 | International Business Machines Corporation | Ingestion planning for complex tables |
CN110472209B (en) * | 2019-07-04 | 2024-02-06 | 深圳同奈信息科技有限公司 | Deep learning-based table generation method and device and computer equipment |
US11380116B2 (en) * | 2019-10-22 | 2022-07-05 | International Business Machines Corporation | Automatic delineation and extraction of tabular data using machine learning |
US20220171922A1 (en) * | 2020-12-01 | 2022-06-02 | Jpmorgan Chase Bank, N.A. | Method and system for conditioned generation of descriptive commentary for quantitative data |
US20220374791A1 (en) * | 2021-05-19 | 2022-11-24 | Kpmg Llp | System and method for implementing a commercial leakage platform |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5140650A (en) * | 1989-02-02 | 1992-08-18 | International Business Machines Corporation | Computer-implemented method for automatic extraction of data from printed forms |
US5258855A (en) * | 1991-03-20 | 1993-11-02 | System X, L. P. | Information processing methodology |
US5293429A (en) * | 1991-08-06 | 1994-03-08 | Ricoh Company, Ltd. | System and method for automatically classifying heterogeneous business forms |
US5416849A (en) * | 1992-10-21 | 1995-05-16 | International Business Machines Corporation | Data processing system and method for field extraction of scanned images of document forms |
US5692073A (en) * | 1996-05-03 | 1997-11-25 | Xerox Corporation | Formless forms and paper web using a reference-based mark extraction technique |
US5721940A (en) * | 1993-11-24 | 1998-02-24 | Canon Information Systems, Inc. | Form identification and processing system using hierarchical form profiles |
US5748809A (en) * | 1995-04-21 | 1998-05-05 | Xerox Corporation | Active area identification on a machine readable form using form landmarks |
US5848186A (en) * | 1995-08-11 | 1998-12-08 | Canon Kabushiki Kaisha | Feature extraction system for identifying text within a table image |
US5950196A (en) * | 1997-07-25 | 1999-09-07 | Sovereign Hill Software, Inc. | Systems and methods for retrieving tabular data from textual sources |
US6009196A (en) * | 1995-11-28 | 1999-12-28 | Xerox Corporation | Method for classifying non-running text in an image |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002511257A (en) * | 1998-04-14 | 2002-04-16 | カイロン コーポレイション | Non-cloning technology for expressing the gene of interest |
2002
- 2002-05-17 US US10/146,959 patent/US7142728B2/en not_active Expired - Fee Related

2006
- 2006-10-10 US US11/544,693 patent/US20070053611A1/en not_active Abandoned
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5140650A (en) * | 1989-02-02 | 1992-08-18 | International Business Machines Corporation | Computer-implemented method for automatic extraction of data from printed forms |
US5768416A (en) * | 1991-03-20 | 1998-06-16 | Millennium L.P. | Information processing methodology |
US5369508A (en) * | 1991-03-20 | 1994-11-29 | System X, L. P. | Information processing methodology |
US5625465A (en) * | 1991-03-20 | 1997-04-29 | International Patent Holdings Ltd. | Information processing methodology |
US5258855A (en) * | 1991-03-20 | 1993-11-02 | System X, L. P. | Information processing methodology |
US6094505A (en) * | 1991-03-20 | 2000-07-25 | Millennium L.P. | Information processing methodology |
US5293429A (en) * | 1991-08-06 | 1994-03-08 | Ricoh Company, Ltd. | System and method for automatically classifying heterogeneous business forms |
US5416849A (en) * | 1992-10-21 | 1995-05-16 | International Business Machines Corporation | Data processing system and method for field extraction of scanned images of document forms |
US5721940A (en) * | 1993-11-24 | 1998-02-24 | Canon Information Systems, Inc. | Form identification and processing system using hierarchical form profiles |
US5748809A (en) * | 1995-04-21 | 1998-05-05 | Xerox Corporation | Active area identification on a machine readable form using form landmarks |
US5848186A (en) * | 1995-08-11 | 1998-12-08 | Canon Kabushiki Kaisha | Feature extraction system for identifying text within a table image |
US6009196A (en) * | 1995-11-28 | 1999-12-28 | Xerox Corporation | Method for classifying non-running text in an image |
US5692073A (en) * | 1996-05-03 | 1997-11-25 | Xerox Corporation | Formless forms and paper web using a reference-based mark extraction technique |
US5950196A (en) * | 1997-07-25 | 1999-09-07 | Sovereign Hill Software, Inc. | Systems and methods for retrieving tabular data from textual sources |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9727804B1 (en) * | 2005-04-15 | 2017-08-08 | Matrox Electronic Systems, Ltd. | Method of correcting strings |
US20070168382A1 (en) * | 2006-01-03 | 2007-07-19 | Michael Tillberg | Document analysis system for integration of paper records into a searchable electronic database |
US20080212845A1 (en) * | 2007-02-26 | 2008-09-04 | Emc Corporation | Automatic form generation |
US7886219B2 (en) * | 2007-02-26 | 2011-02-08 | Emc Corporation | Automatic form generation |
US20100238474A1 (en) * | 2009-03-17 | 2010-09-23 | Konica Minolta Business Technologies, Inc. | Document image processing apparatus, document image processing method, and computer-readable recording medium having recorded document image processing program |
US8837818B2 (en) * | 2009-03-17 | 2014-09-16 | Konica Minolta Business Technologies, Inc. | Document image processing apparatus, document image processing method, and computer-readable recording medium having recorded document image processing program |
US20110010348A1 (en) * | 2009-07-09 | 2011-01-13 | International Business Machines Corporation | Rule-based record profiles to automate record declaration of electronic documents |
US8290916B2 (en) | 2009-07-09 | 2012-10-16 | International Business Machines Corporation | Rule-based record profiles to automate record declaration of electronic documents |
US9418051B2 (en) * | 2011-12-21 | 2016-08-16 | Peking University Founder Group Co., Ltd. | Methods and devices for extracting document structure |
US20130167018A1 (en) * | 2011-12-21 | 2013-06-27 | Beijing Founder Apabi Technology Ltd. | Methods and Devices for Extracting Document Structure |
US9639900B2 (en) | 2013-02-28 | 2017-05-02 | Intuit Inc. | Systems and methods for tax data capture and use |
US10878516B2 (en) | 2013-02-28 | 2020-12-29 | Intuit Inc. | Tax document imaging and processing |
AU2013379776B2 (en) * | 2013-02-28 | 2017-08-24 | Intuit Inc. | Presentation of image of source of tax data through tax preparation application |
US9916626B2 (en) | 2013-02-28 | 2018-03-13 | Intuit Inc. | Presentation of image of source of tax data through tax preparation application |
US20140369602A1 (en) * | 2013-06-14 | 2014-12-18 | Lexmark International Technology S.A. | Methods for Automatic Structured Extraction of Data in OCR Documents Having Tabular Data |
US9785830B2 (en) * | 2013-06-14 | 2017-10-10 | Kofax International Switzerland Sarl | Methods for automatic structured extraction of data in OCR documents having tabular data |
US9251413B2 (en) * | 2013-06-14 | 2016-02-02 | Lexmark International Technology, SA | Methods for automatic structured extraction of data in OCR documents having tabular data |
FR3012062A1 (en) * | 2013-10-18 | 2015-04-24 | Gerlon Sa | SANDING EQUIPMENT |
US10204095B1 (en) * | 2015-02-10 | 2019-02-12 | West Corporation | Processing and delivery of private electronic documents |
US20190034399A1 (en) * | 2015-06-30 | 2019-01-31 | Datawatch Corporation | Systems and methods for automatically creating tables using auto-generated templates |
US10853566B2 (en) * | 2015-06-30 | 2020-12-01 | Datawatch Corporation | Systems and methods for automatically creating tables using auto-generated templates |
US11281852B2 (en) | 2015-06-30 | 2022-03-22 | Datawatch Corporation | Systems and methods for automatically creating tables using auto-generated templates |
US10438083B1 (en) | 2016-09-27 | 2019-10-08 | Matrox Electronic Systems Ltd. | Method and system for processing candidate strings generated by an optical character recognition process |
US10713524B2 (en) * | 2018-10-10 | 2020-07-14 | Microsoft Technology Licensing, Llc | Key value extraction from documents |
US11348330B2 (en) * | 2018-10-10 | 2022-05-31 | Microsoft Technology Licensing, Llc | Key value extraction from documents |
US20220245377A1 (en) * | 2021-01-29 | 2022-08-04 | Intuit Inc. | Automated text information extraction from electronic documents |
Also Published As
Publication number | Publication date |
---|---|
US7142728B2 (en) | 2006-11-28 |
US20030215137A1 (en) | 2003-11-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7142728B2 (en) | Method and system for extracting information from a document | |
Lu et al. | Document image retrieval through word shape coding | |
US5164899A (en) | Method and apparatus for computer understanding and manipulation of minimally formatted text documents | |
US6178417B1 (en) | Method and means of matching documents based on text genre | |
US7764830B1 (en) | Machine learning of document templates for data extraction | |
US7561734B1 (en) | Machine learning of document templates for data extraction | |
US6996295B2 (en) | Automatic document reading system for technical drawings | |
US6909805B2 (en) | Detecting and utilizing add-on information from a scanned document image | |
US6044375A (en) | Automatic extraction of metadata using a neural network | |
CN101366020B (en) | Table detection in ink notes | |
US20070168382A1 (en) | Document analysis system for integration of paper records into a searchable electronic database | |
US7668372B2 (en) | Method and system for collecting data from a plurality of machine readable documents | |
US6621941B1 (en) | System of indexing a two dimensional pattern in a document drawing | |
KR100487386B1 (en) | Retrieval of cursive chinese handwritten annotations based on radical model | |
US20090144277A1 (en) | Electronic table of contents entry classification and labeling scheme | |
Hu et al. | Comparison and classification of documents based on layout similarity | |
US6321232B1 (en) | Method for creating a geometric hash tree in a document processing system | |
US20040078755A1 (en) | System and method for processing forms | |
US20070140566A1 (en) | Framework for detecting a structured handwritten object | |
US20140307959A1 (en) | Method and system of pre-analysis and automated classification of documents | |
US20110188759A1 (en) | Method and System of Pre-Analysis and Automated Classification of Documents | |
JP4649512B2 (en) | Character string search method and apparatus | |
EP0923044A2 (en) | Method and means of matching documents based on spatial region layout | |
Chen et al. | Summarization of imaged documents without OCR | |
Liu et al. | Improving the table boundary detection in pdfs by fixing the sequence error of the sparse lines |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SCIENCE APPLICATIONS INTERNATIONAL CORP., CALIFORN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WNEK, JANUSZ;REEL/FRAME:018400/0091 Effective date: 20020516 |
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |