US20080228469A1 - Rollup functions for efficient storage, presentation, and analysis of data - Google Patents

Rollup functions for efficient storage, presentation, and analysis of data

Info

Publication number
US20080228469A1
Authority
US
United States
Prior art keywords
parent
rollup
child
matrix
sibling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/106,779
Inventor
David Justin Ross
Stephen E.M. Billester
Brent R. Smith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Matthews International Corp
Original Assignee
Raf Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Raf Technology Inc
Priority to US12/106,779
Assigned to RAF TECHNOLOGY, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BILLESTER, STEPHEN E.M., ROSS, DAVID J., SMITH, BRENT R.
Publication of US20080228469A1
Assigned to MATTHEWS INTERNATIONAL CORPORATION. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: RAF TECHNOLOGY, INC.
Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/268 Lexical context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • the present invention relates to computer-implemented methods and data structures for producing candidate parent entities that are ranked in accordance with ranking information associated with given child entities and, in particular, to such methods for use with software parsers and data dictionaries, for example, of the kind utilized in a system for automated reading, validation, and interpretation of hand print, machine print, and electronic data streams.
  • Optical character recognition (OCR) systems and digital image processing systems are known for use in automatic forms processing. These systems deal with three kinds of data: physical data, textual data, and logical data.
  • Physical data may be pixels on a page or positional information related to those pixels. In general, physical data is not in a form to be effectively used by a computer for information processing. Physical data by itself has neither useable content nor meaning.
  • Textual data is data in textual form. It may have a physical location associated with it. It occurs in, for example, ASCII strings. It has content but no meaning. We know what textual data says, but not what it means.
  • Logical data has both content and meaning. It often has a name for what it is.
  • For example, there may be a region of black pixels in a certain location on an image. Both the value of the pixels and their location are physical data. It may be determined that those pixels, when properly passed through a recognizer, say: “(425) 867-0700.” Content has been derived from the physical data to generate textual data. If we now know that text of this format (or possibly at this location on a preprinted form) is a telephone number, the textual data becomes logical data.
  • each recognized element of textual data may be represented by a ranked group of unique candidates called a “possibility set.”
  • a possibility set includes one or more candidate information pairs, each including a “possibility” and an associated confidence.
  • the confidence is typically assigned as part of the recognition process.
  • The confidences may be assigned within an appropriate base-2 range, e.g., 0 to 255, or a more compact range, such as 0 to 7.
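  • A minimal sketch of how a possibility set might be held in memory, assuming a plain list of (possibility, confidence) pairs. The function name `make_possibility_set`, the `max_confidence` parameter, and the sample values are hypothetical and are not taken from the source tables.

```python
def make_possibility_set(pairs, max_confidence=255):
    """Validate candidate pairs and rank them by decreasing confidence."""
    seen = set()
    for possibility, confidence in pairs:
        if possibility in seen:
            # A possibility set holds unique candidates.
            raise ValueError(f"duplicate possibility: {possibility!r}")
        if not 0 <= confidence <= max_confidence:
            raise ValueError(f"confidence out of range: {confidence}")
        seen.add(possibility)
    # sorted() is stable, so equal-confidence candidates keep their input order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Illustrative values only, using the compact 0-to-7 confidence range.
pos_set = make_possibility_set([("o", 3), ("c", 7), ("e", 5)], max_confidence=7)
print(pos_set)
```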
  • FIG. 1 shows an enlarged view of an individual glyph 20 that may be physically embodied as a handwritten character or as a digital pixel image of the handwritten character. From glyph 20, an optical character recognition process may generate the possibility set shown in TABLE 1 by assigning possibilities and associated confidences:
  • FIG. 2 shows a series of sibling glyphs 22, which are known as “siblings” because they share the same parent word 24.
  • the sibling glyphs 22 can be represented by the four possibility sets as shown in the following TABLE 2:
  • the unique “candidate” strings may be processed by a “dictionary” of valid outcomes.
  • a dictionary is a filter. It has content and rules. Each candidate string processed by the dictionary is subject to one of three possible outcomes: it is passed, it is rejected, or it is modified into a similar string that passes.
  • One example of a dictionary is based on the English language. For parent word 24 of FIG. 2, the candidate strings “chor” and “ehar” would be rejected by such a dictionary, while “char” would be passed.
  • Because dictionaries often have a very large amount of content against which a candidate string is compared, it may be unduly time-consuming to apply the dictionary to all possible strings.
  • a convenient way to rank candidate strings is to calculate string confidences based on the confidences of the component character possibilities that make up each candidate string. A set of candidate strings and their associated string confidences is referred to as an “alt-set.”
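  • For comparison with the rollup approach described later, the following sketch builds an alt-set by brute force: it enumerates every candidate string and ranks it by the sum of its character confidences. Summation is an assumption here (the text only says the string confidence is "based on" the component confidences), and the function name and confidence values are hypothetical.

```python
from itertools import product


def brute_force_alt_set(pos_sets):
    """Enumerate every candidate string with its summed confidence."""
    alt_set = []
    for combo in product(*pos_sets):
        text = "".join(element for element, _ in combo)
        confidence = sum(conf for _, conf in combo)
        alt_set.append((text, confidence))
    # Highest string confidence first; ties keep enumeration order.
    alt_set.sort(key=lambda item: item[1], reverse=True)
    return alt_set


# Illustrative possibility sets for a four-glyph word.
pos_sets = [
    [("c", 3), ("e", 1)],
    [("h", 3)],
    [("a", 2), ("o", 1)],
    [("r", 3)],
]
alt_set = brute_force_alt_set(pos_sets)
print(alt_set)
```

  • Every string is materialized before ranking, which is the memory cost that the rollup matrix is meant to avoid.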
  • An n-gram dictionary includes information about the frequency with which certain character sequences (e.g., two-letter, three-letter, etc.) occur in the English language. For example, the two-letter combination “Qu” (a 2-gram) occurs in English words much more frequently than “Qo.”
  • the confidence assigned to an n-gram is some combination of (1) the aggregate character confidences and (2) the n-gram frequency provided by the n-gram dictionary.
  • methods of organizing a series of sibling data entities for preserving sibling ranking information associated with the sibling data entities and for attaching the sibling ranking information to a joint parent of the sibling data entities to facilitate on-demand generation of ranked parent candidates.
  • a rollup function of the present invention builds a rollup matrix containing information about the sibling entities and the sibling ranking information and provides a method for reading out the ranked parent candidates from the rollup matrix in order of their parent confidences, which are based on the sibling ranking information. Parent confidences may also be based, in part, on n-gram ranking or other ranking information.
  • sibling entities are generated and passed to the rollup function for processing.
  • Generation of a series of sibling entities may, in the context of OCR, involve optical scanning, recognition processing, and parsing.
  • Each sibling entity comprises one or more ranked child possibilities, each having an associated child confidence.
  • the number of child possibilities in a sibling entity is referred to as the “child population” of the sibling entity.
  • Each sibling entity may include a range of child confidences, one of which is the maximum child confidence.
  • the rollup function is implemented in computer software operable on a digital computer.
  • the rollup matrix is modeled as a three-dimensional data array called a rollup table.
  • The rollup table serves as a convenient visual aid to understanding the nature of the rollup matrix and operation of the rollup function. It should be understood that nothing in the foregoing description of the rollup table should be construed as limiting the scope of the invention to implementation of the rollup matrix in data arrays. Other data structures, such as linked lists, are also suitable for implementing the rollup function of the present invention.
  • The term “rollup matrix” shall mean data tables, linked lists, and any other device for defining relationships between nodes in a data structure, where such nodes include one or more elements of data and one or more relationships to other nodes, procedures, or nested rollup functions.
  • non-OCR applications of the invention involving resolution of empirical uncertainty may include, for example, bioinformatics systems for analyzing gene sequencing information.
  • a matrix initialization routine of the rollup function establishes a rollup table and sizes it based on properties of the sibling entities.
  • the rollup table is sized to include a series of “columns” equal in number to the number of sibling entities received.
  • the dimension of the rollup table spanned by the columns is referred to as the “width” of the table.
  • the rollup table is sized in a “height” dimension based on a number of “rows,” with each having a row position indicating its position along the height dimension of the data table. The number of rows, and consequently the height of the table, is based on the sum of the maximum child confidences of the sibling entities.
  • the rollup table is sized in a “depth” dimension based on the largest of the child populations of the sibling entities.
  • the rollup table is a collection of “nodes,” each located in the rollup table at a position defined by column, row position, and a depth position in the depth dimension.
  • a loading routine of the rollup function then loads the sibling entities into the rollup table in a predetermined loading sequence beginning with loading a first sibling entity in a first column of the series of columns. Each sibling entity is loaded in sequence, from the first sibling entity to the last sibling entity in the series. If the sibling entities have no serial relationship, then an arbitrary, but ordered sequence of loading is chosen. Each child possibility of the first sibling entity is loaded into a node of the rollup table located at the first column and at the row having a row position corresponding to the child confidence of the child possibility being loaded. The rollup function then proceeds to load the second sibling entity in the series in a second column.
  • the rollup function loads each child possibility in one row of the current column for each row of the immediately preceding column having a filled node.
  • the child possibilities of the second sibling entity are loaded in rows of the second column that have row positions offset from the row positions of filled nodes of the immediately preceding column (i.e., the first column) by an offset amount corresponding to the child confidence of the child possibility being loaded in the second column.
  • the child possibilities of the third sibling entity are loaded in rows of the third column having row positions offset from the row positions of filled nodes of the second column by an offset amount corresponding to the child confidence of the child possibility being loaded in the third column, and so on, until the last sibling entity has been loaded in the last column of the rollup table.
  • Each entry in the last column of the rollup table is a terminal element. Due to different confidence values that may be associated with multiple child possibilities of each of the sibling entities, the loading sequence may result in the loading of multiple elements in a particular column and row position. During loading, if a node has already been filled with a child possibility, the loading routine offsets in the depth of the rollup table until it reaches an unoccupied node, then fills that node.
  • another aspect of the invention involves a roll-out routine of the rollup function, which may be used to read parent candidates from the rollup table according to their parent confidences.
  • The reading of parent candidates, known as “roll-out,” begins with a terminal element known as an entry point.
  • Each parent candidate is assembled in a sequence opposite the sequence in which the rollup table was loaded, as follows: After reading a terminal element from the last column, the roll-out routine then reads a next-to-last element from the node located at a next-to-last column immediately preceding the last column and at a row position less than the row position of the entry point by an amount equal to the child confidence associated with the terminal element.
  • The next-to-last element is then prepended to the terminal element to form a string tail.
  • a prefix element is read from a node located in the column immediately preceding the next-to-last column and at a row position less than the node of the next-to-last element by an amount equal to the confidence of the next-to-last element.
  • the prefix element is then prepended to the string tail. If the sibling entities forming the rollup table have no serial relationship, then prepending involves combining the elements in reverse order of their loading in the rollup table. This reading process is repeated until the roll-out routine reaches the first column, completing roll-out of the parent candidate.
  • the roll-out routine will continue reading parent candidates beginning from the same entry point until elements at all occupied nodes at all depths in the appropriate columns and rows have been read and all parent candidates having the same parent confidence have been rolled out, or until the desired number of parent candidates have been rolled out. The roll-out process is merely repeated for further parent candidates.
  • the method of loading the data table dictates that each row position corresponds to the parent rank of each parent candidate assembled from a terminal element located at that row position.
  • the parent candidate (or candidates) with the greatest parent confidence may be read from the rollup matrix by beginning at a maximal node located at the last column and at the row of greatest row position. Consequently, parent candidates may be read in decreasing order of parent rank by merely assembling parent candidates in sequence, beginning with terminal element(s) located at the maximal node and continuing to read from the rollup table at entry points of decreasing row position until all parent candidates have been assembled.
  • the process of building a rollup matrix and rolling-out parent candidates to form alt-sets can be repeated at each level in the data hierarchy. If desired, rollup functions can be nested by storing a nested “child” rollup function pointer at a node of a parent roll-up table.
  • the rollup matrix is established in a computer memory using a plurality of memory pointers in place of the 3-dimensional data array of the rollup table.
  • the terms “rows” and “columns” are arbitrary but are used herein to denote memory locations within the rollup matrix.
  • each node of the rollup matrix includes a pointer to other nodes which contain a child possibility of an adjacent sibling entity. If a node must point to more than one child possibility, as in the case of multiple child possibilities at a particular column and row position, the node will include multiple pointers. When these multi-pointer nodes are encountered by the roll-out routine, a branch is indicated so that all pointers of each node are followed before moving to the next entry point.
  • Entry nodes further include a parent confidence which the roll-out routine recognizes as assigned to the parent candidate assembled beginning with the entry node. Entry nodes may also include a pointer to the next entry node in the matrix, which may have the same parent confidence or a lesser parent confidence. Nodes in the “first column,” loaded with a child possibility of the first possibility set, may include a return pointer that may direct the roll-out routine to output the completed parent candidate for verification (e.g., using a dictionary) or to proceed to the next entry node for generation of the next parent candidate. Nodes at any location in the rollup matrix may also include a pointer to an entry node of a nested rollup matrix.
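  • The pointer-based embodiment can be sketched as follows. This is one possible reading, not the patented implementation: the `Node` class, `build_linked_matrix`, and `follow` are hypothetical names, a Python list of references stands in for "multiple pointers", an empty pointer list on a first-column node stands in for the return pointer, and the sample confidences are illustrative.

```python
class Node:
    """One filled node; `prev` holds pointers into the preceding column."""

    def __init__(self, element, prev):
        self.element = element
        self.prev = prev                # empty list marks a first-column node
        self.parent_confidence = None   # set only on entry (last-column) nodes


def build_linked_matrix(pos_sets):
    """Return entry nodes ordered by decreasing parent confidence."""
    rows = {}  # row position -> nodes of the most recently loaded column
    for column, pos_set in enumerate(pos_sets):
        new_rows = {}
        for element, confidence in pos_set:
            if column == 0:
                new_rows.setdefault(confidence, []).append(Node(element, []))
                continue
            for row, nodes in rows.items():
                # One node per filled row; it points at every node in that row.
                new_rows.setdefault(row + confidence, []).append(Node(element, nodes))
        rows = new_rows
    entries = []
    for row in sorted(rows, reverse=True):
        for node in rows[row]:
            node.parent_confidence = row
            entries.append(node)
    return entries


def follow(node, tail=""):
    """Follow every pointer from `node`, prepending elements to the tail."""
    tail = node.element + tail
    if not node.prev:  # first column reached: the candidate is complete
        yield tail
    for previous in node.prev:
        yield from follow(previous, tail)


pos_sets = [[("a", 2), ("o", 1)], [("n", 1), ("u", 0)]]
entries = build_linked_matrix(pos_sets)
rolled = [(text, entry.parent_confidence) for entry in entries for text in follow(entry)]
print(rolled)
```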
  • N-gram possibility sets are generated using an n-gram rollup function in accordance with the present invention.
  • Comparison of parent candidate n-grams against an n-gram dictionary allows n-gram candidates to be weighted in accordance with their relative frequencies of occurrence in the context of, for example, the English language.
  • Possibility sets including n-grams are readily accommodated in establishing the rollup matrix.
  • the nodes are loaded with the 3-grams at a row position which is the aggregate of the confidence of the central character (of the 3-gram) and the dictionary-provided frequency of the 3-gram.
  • child possibilities in the first and last columns of the rollup matrix must be prepended and appended, respectively, with nulls (or spaces) so that all child possibilities are 3-grams.
  • the 3-gram child possibilities must be loaded in the rollup matrix so that when the parent candidates are rolled-out, all adjacent 3-grams assembled in a parent candidate share two characters. For example, “out” in the first column will fit with “uts” in the second column, but not with “nts.”
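  • A small sketch of the 3-gram constraints described above, assuming each 3-gram is centered on one character of the parent string and that a space is used as the padding character. The helper names `to_trigrams` and `fits` are hypothetical.

```python
def to_trigrams(word, pad=" "):
    """Return one 3-gram per character, padding the first and last."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(word))]


def fits(left, right):
    """Adjacent 3-grams fit when they share two characters: the last two
    characters of `left` must equal the first two characters of `right`."""
    return left[1:] == right[:2]


trigrams = to_trigrams("outs")
print(trigrams)
print(fits("out", "uts"), fits("out", "nts"))
```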
  • the rollup function of the present invention is useful at every level of textual hierarchy. Rollup functions also avoid fatal problems often encountered by prior art string generators, which create strings from a series of possibility sets.
  • Existing string generators suffer from three major problems. First, they are combinatorically expensive in memory use, needing a place in memory for each possible string. Second, string generators must trim strings before generating all possible strings because of limited space to store the combinatorically-many strings. Therefore, it is possible for string generators to result in higher-confidence strings being abandoned while lower-confidence strings are preserved. Third, string generators do not guarantee that strings of the same confidence, once ordered, retain that order.
  • the rollup function is only geometrically expensive of memory, not combinatorically.
  • Tables generated by prior art systems grow as $L \cdot n^{L}$, where $n$ is the number of possibilities per possibility set and $L$ is the number of possibility sets (i.e., the string length). There are $n^{L}$ strings of length $L$ that can be generated.
  • The rollup matrix of the present invention grows as $2 \cdot CF_{max} \cdot L^{2}$, where $CF_{max}$ is the highest confidence value in any possibility set.
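  • To make the difference in growth concrete, the following sketch evaluates the two expressions, read here as L·n^L for a prior art string table and 2·CF_max·L^2 for the rollup matrix. That reading of the expressions, the function names, and the chosen values (n = 3, CF_max = 7) are assumptions made for illustration.

```python
def string_table_size(n, length):
    """Prior art estimate: n**length strings, each of the given length."""
    return length * n ** length


def rollup_matrix_size(cf_max, length):
    """Rollup estimate: 2 * CF_max * length squared."""
    return 2 * cf_max * length ** 2


for length in (4, 8, 16):
    print(length, string_table_size(3, length), rollup_matrix_size(7, length))
```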
  • candidate strings can be read out of a rollup table in their decreasing order of confidence without having to store unneeded strings in memory, while never skipping a higher-confidence parent candidate for a lower confidence one.
  • the rollup matrix does not change size with the number of generated strings. Therefore, all strings are preserved and there is no trimming of strings ever required.
  • no reordering of parent strings ever takes place because the rollup matrix is unchanging. Consequently, strings of the same confidence remain in their original order.
  • Parent candidates can be read from the rollup matrix in decreasing or increasing order of parent confidence.
  • a parent candidate having a desired confidence value can easily be selected from the matrix by a confidence stored in association with an entry node of the parent candidate.
  • Parent candidates having lesser (or greater) confidences can then be read until a desired lesser (or greater) confidence level is reached. This process can be repeated until a predetermined number of parent candidates have been obtained or until all possible parent candidates have been rolled-out.
  • the rollup function can be interrupted while reading out a parent candidate to handle some other process, such as verifying the most recently rolled-out parent candidate using a dictionary.
  • the rollup function easily returns to where it left off in the rollup matrix to read out the next-ranked parent candidate by returning to the location in the rollup matrix that was being accessed when the interruption occurred.
  • the rollup function of the present invention provides the above-described benefits without requiring the production of all of the parent candidates before subsequent ranking. If a particular child possibility occurs with at most one confidence value in a possibility set, then the last rolled-out string is the pointer structure. Even in the case of allowed duplication, returning to the rollup function is as simple as storing a pointer to the next entry point in the rollup matrix and storing a pointer to each position of the table, which may be accomplished by freezing the internal pointer structure.
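  • The interrupt-and-resume behavior can be sketched with a Python generator, which keeps its position between calls. This is an illustrative stand-in for the stored pointer structure, not the source's mechanism; `load`, `roll_out`, the sample confidences, and the one-word dictionary are all hypothetical.

```python
def load(pos_sets):
    """Build one {row: [(element, confidence), ...]} dict per column."""
    columns = []
    for pos_set in pos_sets:
        column = {}
        # The first column is offset from a virtual row 0.
        filled = sorted(columns[-1]) if columns else [0]
        for element, confidence in pos_set:
            for row in filled:
                column.setdefault(row + confidence, []).append((element, confidence))
        columns.append(column)
    return columns


def roll_out(columns):
    """Lazily yield (parent, confidence) in decreasing confidence order."""
    def assemble(col, row, tail):
        for element, confidence in columns[col].get(row, []):
            if col == 0:
                yield element + tail
            else:
                yield from assemble(col - 1, row - confidence, element + tail)

    last = len(columns) - 1
    for row in sorted(columns[last], reverse=True):
        for parent in assemble(last, row, ""):
            yield parent, row


pos_sets = [[("e", 3), ("c", 2)], [("h", 3)], [("a", 2), ("o", 1)], [("r", 3)]]
dictionary = {"char"}  # stand-in for a real dictionary routine

candidates = roll_out(load(pos_sets))
first = next(candidates)             # roll-out pauses here for verification
accepted = first if first[0] in dictionary else None
second = next(candidates)            # roll-out resumes where it left off
if accepted is None and second[0] in dictionary:
    accepted = second
print(first, second, accepted)
```

  • Only the candidates actually requested are assembled; the remaining ones stay implicit in the loaded columns.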
  • the rollup function of the present invention is, of course, not limited to strings.
  • Any parent entity can receive rollup-produced alt-sets from its child entities.
  • gene sequence information prepared from a human, an animal, a plant, or any other living organism may be parsed into its nucleotides, each of which may be represented by an alt-set.
  • Sibling nucleotide alt-sets can then be loaded into a rollup matrix for the parent gene.
  • the frequency of naturally-occurring nucleotide and coding sequence variations can easily be represented by the child confidences associated with child possibilities of each alt-set. Inaccuracies inherent in the gene sequencing process can be similarly represented by the child confidences.
  • FIG. 1 is an enlarged view of a hand printed glyph;
  • FIG. 2 is an enlarged view of a series of sibling glyphs;
  • FIG. 3 is a flow diagram depicting an OCR process for scanning, parsing, and recognizing handwritten data to create possibility sets for use with a data verification routine of the present invention;
  • FIG. 4 is a flow diagram showing detail of the data verification routine of FIG. 3 including a rollup function and dictionary routine in accordance with a preferred embodiment of the present invention;
  • FIG. 5 is a pictorial view of a three-dimensional data array in accordance with a first preferred embodiment of the present invention;
  • FIGS. 6A, 6B, 6C, and 6D are two-dimensional pictorial views of a rollup matrix in accordance with the present invention showing a loading sequence for loading the alt-sets of Table 3 into the rollup matrix;
  • FIG. 7 is an exploded three-dimensional view of the loaded rollup matrix of FIG. 6D;
  • FIGS. 8A, 8B, 8C, and 8D show a sequence of rolling out a parent candidate from the loaded rollup matrix of FIG. 6D;
  • FIG. 9 is a diagram of an alternative embodiment of the rollup matrix of FIG. 6D including a linked list implemented in a computer memory;
  • FIG. 10 is a flow diagram showing steps taken in preparation and validation of n-gram alt-sets for loading in a rollout matrix for a parent string of the n-grams;
  • FIG. 11 is a two-dimensional pictorial view showing nested rollup matrices;
  • FIG. 12 is a flow diagram showing steps for establishing and loading of the nested rollup matrices of FIG. 11;
  • FIG. 13 is a flow diagram showing parent candidates being rolled out from the nested rollup matrices of FIG. 11.
  • FIG. 3 is a flow diagram of an OCR process 30 in accordance with a first preferred embodiment of the present invention.
  • A document 32 bearing physical textual data is scanned using an optical scanner 34, which produces a digital pixel image of the physical data on document 32.
  • a segmentation process 36 of the OCR process 30 receives the pixel image from the optical scanner and segments the pixel image into data segments for processing by a recognizer 38 .
  • Recognizer 38 analyzes the data segments to produce a possibility set (“pos-set”) for each data segment.
  • Empirical uncertainty in the physical data and inaccuracies of the scanning, segmentation and recognition process are represented in the pos-sets by including multiple child possibilities in each pos-set and by assigning child confidences to the child possibilities.
  • Recognizer 38 separates a parent string (as in the parent word 24 of FIG. 2) into its sibling glyphs and outputs a pos-set for each glyph.
  • The pos-sets are output to a data verification routine 40, which uses a rollup function 60 (FIG. 4) and possibly one or more dictionaries 150 (FIG. 4) in accordance with the present invention.
  • FIG. 4 is a flow diagram of rollup function 60 of data verification routine 40 (FIG. 3).
  • a matrix initialization routine 62 of rollup function 60 receives pos-sets 64 from recognizer 38 .
  • FIG. 5 is a pictorial view of a three-dimensional data array 66, which represents a data matrix in accordance with the present invention. Data array 66 includes rows 70, columns 72, and tiers 74 that together form nodes 76.
  • Matrix initialization routine 62 establishes a size of data array 66 based on pos-sets 64. For purposes of a simple illustration, TABLE 3 presents four sibling pos-sets.
  • A first pos-set shown in TABLE 3 includes two child possibilities, “a” and “o”, which are assigned child confidences 2 and 1, respectively.
  • A second pos-set includes child possibilities “n” and “u”, having associated child confidences 1 and 0, respectively. And so on.
  • Data array 66 thus includes six rows 70, having row heights R0, R1, R2, R3, R4, and R5.
  • A width 82 of data array 66 is equal to the number of pos-sets 64.
  • A depth 84 of data array 66 is equal to the largest number of child possibilities in any of the pos-sets 64. In this example, three of the pos-sets are equally large, having two child possibilities.
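  • The sizing rules can be checked with a short sketch. TABLE 3 is not reproduced in this text, so the third and fourth pos-sets below are inferred from the roll-out description (a single “t” with confidence one, then “s” with confidence one and “5” with confidence zero). The function name and the convention of adding one row for R0 are assumptions.

```python
def size_rollup_table(pos_sets):
    """Return (width, height, depth) for a rollup table."""
    width = len(pos_sets)
    # Rows run from R0 up to the sum of the maximum child confidences.
    height = sum(max(conf for _, conf in pos_set) for pos_set in pos_sets) + 1
    depth = max(len(pos_set) for pos_set in pos_sets)
    return width, height, depth


# Pos-sets as inferred from the surrounding description of TABLE 3.
table_3 = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]
dimensions = size_rollup_table(table_3)
print(dimensions)
```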
  • FIGS. 6A, 6B, 6C, and 6D depict a loading sequence followed by loading routine 90.
  • A data table 92 provides a two-dimensional representation of the three-dimensional data array 66 of FIG. 5, including four columns C1, C2, C3, and C4, each of which is divided by broken lines to indicate tiers 74 of data array 66 (FIG. 5).
  • Loading routine 90 loads the child possibilities 94 of the first pos-set into the first column C1 so that each child possibility 94 is loaded in a node 96 at a row position equal to the child confidence 98 corresponding to the child possibility 94.
  • Child possibility “o”, which has an associated child confidence of one, is loaded at the node located at row R1.
  • Child possibility “a” is loaded at row R2 because it has an associated child confidence of two.
  • Each child possibility of the second pos-set is loaded in one node 96 of the second column (C2) for each row of the first column (C1) having filled nodes, but at a row height greater than the row height of the filled nodes 96 of column C1 by an amount equal to the child confidences being loaded.
  • Child possibility “u”, having a child confidence of zero, is loaded in nodes located at rows R1 and R2 of column C2, since rows R1 and R2 are filled in column C1.
  • Child possibility “n” is loaded in nodes located at rows R2 and R3 of column C2, which are greater than the row positions of the filled nodes (R1 and R2) of column C1 by an amount equal to the child confidence (one) associated with child possibility “n.” Because the node located at C2, R2, T0 is already filled with child possibility “u”, loading routine 90 loads child possibility “n” at node C2, R2, T1 so that no more than one child possibility is loaded in each node.
  • Loading routine 90 then continues to load successive pos-sets 64 in sequence in successive columns, as depicted in FIGS. 6C and 6D, until all pos-sets 64 have been loaded in data table 92.
  • Child possibilities 94 are loaded in nodes 96 located at row positions that are greater (by an amount equal to the child confidence of the child possibility being loaded) than the row position(s) of rows of the immediately preceding column that have filled nodes.
  • Nodes of the last column (C4) that are loaded with child possibilities contain data entities that are known as terminal elements 100.
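  • The loading sequence can be sketched as below, using a dictionary keyed by (column, row) whose list positions play the role of tiers T0, T1, and so on. The pos-sets are inferred from the description because TABLE 3 is not reproduced here, the function name is hypothetical, and loading lower-confidence children first is an assumption chosen so that “u” lands in tier T0 ahead of “n”, as the text describes.

```python
def load_rollup_table(pos_sets):
    """Return {(column, row): [tier0, tier1, ...]} of (element, confidence)."""
    table = {}
    filled_rows = [0]  # the first column is offset from a virtual row 0
    for column, pos_set in enumerate(pos_sets, start=1):
        # Assumed order: lower-confidence child possibilities are loaded first.
        for element, confidence in sorted(pos_set, key=lambda pair: pair[1]):
            for row in filled_rows:
                # Appending places the element in the next unoccupied tier.
                table.setdefault((column, row + confidence), []).append((element, confidence))
        filled_rows = sorted(r for (c, r) in table if c == column)
    return table


pos_sets = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]
table = load_rollup_table(pos_sets)
for (column, row), tiers in sorted(table.items()):
    print(f"C{column} R{row}: {[element for element, _ in tiers]}")
```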
  • FIG. 7 is an exploded view of the loaded data table 92 of FIG. 6D showing its loaded data in a three-dimensional representation in accordance with three-dimensional data array 66 of FIG. 5 .
  • FIG. 8A depicts the steps taken by roll-out routine 110 in rolling out parent candidate “ants”, i.e., the parent candidate comprising the sibling characters “a”, “n”, “t”, and “s”.
  • Parent candidate “ants” has the greatest aggregate confidence of any of the parent candidates because its terminal element (“s”) 100 is located in the row of data table 92 having the greatest row position (R5), i.e., a maximal terminal element 112.
  • Roll-out routine 110 reads from columns C4, C3, C2, and C1, in the order opposite to which the columns were loaded.
  • Terminal element “s” 100 (which is also the maximal terminal element 112) is read initially.
  • Roll-out routine 110 reads next-to-last child element “t” 116 from the immediately previous column (C3) and from row R4, which has a row position less than the row position of terminal element “s” by the amount of the child confidence associated with terminal element “s” (i.e., one).
  • Roll-out routine 110 prepends next-to-last child element “t” to the terminal element “s” to form a string tail of “ts.”
  • The child confidence of one associated with next-to-last child element “t” 116 then directs roll-out routine 110 to read prefix element “n” 118 from row R3, column C2 (because row R3 has a row position one less than the row position of R4).
  • Roll-out routine 110 prepends prefix element “n” 118 to the string tail “ts” to form the partial string “nts.”
  • Element “a” 120 is then read because it is loaded in row R2, which is one less (the child confidence associated with prefix element “n” 118) than the row position of prefix element “n” 118.
  • Element “a” 120 is prepended to complete the formation of candidate parent string “ants”.
  • The parent confidence associated with “ants” is equal to five, which is the row position of the terminal element 100a used to extract “ants”.
  • FIG. 8B depicts the steps taken by roll-out routine 110 in rolling out parent candidate “ant5”.
  • Terminal element “5” has an associated child confidence of zero, which directs roll-out routine 110 to read next-to-last element “t” from the same row position (R4) in column C3.
  • The parent confidence associated with “ant5” is equal to four, which is the row position of terminal element “5” 100b used to extract “ant5”.
  • FIGS. 8C and 8D depict the steps taken by roll-out routine 110 in rolling out respective parent candidates “auts” and “onts.” Because there are two entries in row R2, column C2, roll-out routine 110 rolls out two unique parent candidates ending with terminal element “s” 100c, both having an associated parent confidence of four, which is equal to the row height of row R4, where terminal element “s” 100c is located.
  • FIG. 9 shows the loaded data table 92 of FIGS. 6D and 7 embodied as a linked-list rollup matrix 126 .
  • rollup matrix 126 includes a pointer structure 128 to nodes 96 .
  • roll-out routine 110 starts at an initial entry point 130 that includes terminal element 100 a (element “s” of maximal terminal element 112 ).
  • Roll-out routine 110 then reads out elements “t” 116 , “n” 118 , and “a” 120 by following respective pointers 134 , 136 , and 138 and prepends them to element “s” 100 a .
  • a return pointer 140 indicates to roll-out routine 110 that it has completed construction of the parent candidate.
  • a parent confidence 141 of the parent candidate “ants” is stored in association with the terminal element “s” 100 a .
  • All terminal elements of rollup matrix 126 serve as entry points 142 for rolling out one or more parent candidates.
  • two parent candidates can be rolled out of rollup matrix 126 by beginning with terminal element “s” 100 c .
  • a branch node 144 of rollup matrix 126 includes two pointers 146 , 148 , which indicate to roll-out routine 110 that two different parent candidates use branch node 144 and that roll-out routine 110 needs to execute a branch at branch node 144 .
  • branch node may clearly exist in rollup matrix, and that some branch nodes will have more than two pointers (if the matrix is “deeper” than 2 tiers).
  • rollup function may output each parent candidate to a dictionary routine 150 ( FIG. 4 ) for validation using an appropriate parser and dictionary.
  • An iteration step 154 is conditional upon whether the parent candidate output by roll-out routine 110 passes the dictionary test ( 160 ) and, if it does, whether some other stop limit 170 has been met.
  • stop limit 170 may trigger OCR process 30 ( FIG. 3 ) to terminate verification of the parent element represented by rollup matrix 126 (and rollup table 92 ), and to load the next series of pos-sets scanned and recognized from document 32 .
  • FIG. 10 is a flow diagram showing steps taken in preparation and validation of n-gram alt-sets for loading in a rollup matrix for a parent string of the n-grams.
  • an n-gram verification process 200 receives pos-sets from OCR system (step 210 ) and assembles them in computer memory to form a ranked list of n-gram candidates (step 212 ).
  • N-gram candidates within a single ranked list may have different lengths, for example when one of the pos-sets includes both an “m” possibility and an “rn” possibility.
  • a length gage routine 214 of n-gram verification process 200 determines the length of each n-gram candidate.
  • N-gram dictionary 216 is a specialized dictionary or collection of specialized dictionaries that includes information about frequency of occurrence of n-grams (for example 2-grams, 3-grams, etc.) in written language or some subset of written language.
  • N-gram dictionary 216 assigns an n-gram confidence to each n-gram candidate based on (i) the dictionary frequency rating for the n-gram and (ii) a child confidence associated with a central character of the n-gram candidate. N-gram and its associated n-gram confidence are then appended to an n-gram alt-set (step 218 ).
  • Steps 214 , 216 , and 218 are then repeated until all of the lists of n-gram parent candidates have been processed through the dictionary and output as n-gram alt-sets.
  • a string-sized rollup matrix is built using the alt-sets as sibling entities (step 220 ).
  • Parent string candidates can then be rolled out of string-sized rollup matrix in ranked order (step 222 ) and processed using a string dictionary (step 224 ) before outputting ranked parent strings (step 226 ).
  • FIG. 11 is a two-dimensional pictorial view showing nested rollup matrices 240 established in accordance with the present invention.
  • nested rollup matrices 240 include a child rollup matrix 250 nested within a parent rollup matrix 260 .
  • Child rollup matrix 250 is said to be “nested” because complete candidates that may be rolled out of child rollup matrix 250 are referenced by pointers within parent rollup matrix 260 .
  • child rollup matrix 250 represents candidate city names in a typical rollup matrix in accordance with the present invention. However, any child entity can be represented in a nested child rollup matrix.
  • Parent rollup matrix 260 is a typical rollup matrix in accordance with the present invention.
  • parent rollup matrix 260 includes sibling city, state, and zip-code alt-sets.
  • First and second city nodes 262 , 264 of parent rollup matrix 260 include respective first and second city pointers 266 , 268 to respective first and second entry points 270 , 272 of child rollup matrix 250 .
  • First and second entry points 270 , 272 are terminal nodes of child rollup matrix 250 having associated city confidences 274 , 276 .
  • Although the nested rollup matrices 240 of FIG. 11 include only one nested child matrix, it would be straightforward to nest multiple child matrices within a single parent rollup matrix. Likewise, it would be simple to create a hierarchy of nested rollup matrices including three or more layers of rollup matrices, rather than the two layers (child rollup matrix 250 and parent rollup matrix 260 ) of FIG. 11 .
  • child rollup matrix 250 is established before establishing parent rollup matrix 260 .
  • This order of establishing nested rollup matrices 240 ensures that city confidences 274 , 276 of child rollup matrix 250 may be taken into account when establishing, sizing, and loading parent rollup matrix 260 .
  • city confidences 274 , 276 of child rollup matrix 250 determine how parent rollup matrix 260 is loaded.
  • FIG. 12 is a flow diagram showing steps for establishing and loading of the nested rollup matrices of FIG. 11 .
  • a child rollup matrix is first established and loaded (step 300 ). Once loaded, entry points for child candidates of the child rollup matrix, and their associated child confidences are available. These child candidates, entry points, and child confidences are then taken into account in establishing and sizing parent rollup matrix (step 310 ).
  • Parent rollup matrix is then loaded (step 320 ).
  • parent rollup matrix 260 is loaded with a zip-code (postal code) alt-set in its terminal column and a state alt-set in its next-to-last column.
  • Parent rollup matrix is also loaded with city pointers 266 , 268 to appropriate entry points 270 , 272 of child rollup matrix 250 .
  • ranked parent candidates may then be rolled out (step 330 ) for processing by a dictionary.
  • the dictionary required for use with the nested rollup matrices 240 shown in the example of FIG. 11 would be a city-state-zip dictionary for verifying specific city-state-zip combinations.
  • FIG. 13 is a flow diagram showing a sequence of steps for rolling out a parent candidate from the nested rollup matrices 240 of FIG. 11 .
  • a nested roll-out routine 400 starts at an entry point, which is a terminal parent node of a linked list of parent matrix (step 410 ). All subsequent steps shown in FIG. 13 are identical regardless of whether the current node is a terminal node or another node of nested rollup matrices 240 .
  • Nested roll-out routine 400 next determines whether the parent node includes a pointer to a nested child matrix (step 420 ).
  • If the parent node does not include such a pointer, nested roll-out routine 400 reads the element stored in the current node (step 430 ) and prepends it to a parent candidate tail. Nested roll-out routine 400 then determines whether the node includes a return pointer that would indicate completion of the parent candidate (step 440 ). If not, then nested roll-out routine advances to the next node in the linked list (step 450 ) and returns to step 420 . If a parent node includes a nested matrix pointer to a nested rollup matrix (at step 420 ) then nested roll-out routine 400 proceeds to store in memory an address of the parent node that includes the nested matrix pointer (step 460 ).
  • Nested roll-out routine 400 then rolls out a child candidate from the nested child matrix (step 470 ) and prepends the child candidate to the parent candidate tail (step 480 ). Nested roll-out routine then restores the address of the last-read parent node, which was previously stored in memory, and returns to the parent rollup function (step 490 ), continuing on at the last-read parent node.
  • nested roll-out routine completes its assembly of parent candidate and processes it using dictionary process 500 . If the parent candidate passes the dictionary test, it is output. The nested roll-out function can be repeated for each terminal node of parent roll-out matrix to complete roll out of all parent candidates.
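  • By way of illustration only, the nested roll-out steps of FIG. 13 might be sketched as follows. The class `ParentNode`, the function `nested_roll_out`, the callable standing in for a nested matrix pointer, and the city, state, and zip-code values are all hypothetical names and data chosen for this sketch; they are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ParentNode:
    element: Optional[str] = None               # data stored at this node, if any
    nested: Optional[Callable[[], str]] = None  # stands in for a pointer into a child matrix
    next: Optional["ParentNode"] = None         # None plays the role of the return pointer


def nested_roll_out(entry: ParentNode) -> Tuple[str, ...]:
    """Walk the parent linked list from a terminal node, prepending as we go."""
    fields: List[str] = []
    node: Optional[ParentNode] = entry           # step 410: start at an entry point
    while node is not None:
        if node.nested is not None:              # step 420: nested matrix pointer?
            saved = node                         # step 460: remember the parent node
            fields.insert(0, node.nested())      # steps 470, 480: roll out child, prepend
            node = saved                         # step 490: resume at the parent node
        else:
            fields.insert(0, node.element)       # step 430: read element, prepend
        node = node.next                         # steps 440, 450: done, or advance
    return tuple(fields)


def best_city() -> str:
    """Hypothetical child roll-out returning the top-ranked city candidate."""
    return "Redmond"


# Parent matrix path: zip code (terminal column) -> state -> pointer to city matrix.
city = ParentNode(nested=best_city)
state = ParentNode(element="WA", next=city)
zip_node = ParentNode(element="98052", next=state)

# Hypothetical city-state-zip dictionary used to validate the candidate.
CITY_STATE_ZIP = {("Redmond", "WA", "98052")}
candidate = nested_roll_out(zip_node)
print(candidate, candidate in CITY_STATE_ZIP)
```

  • In this sketch the child roll-out returns only its best candidate; a fuller version would iterate over every entry point of the child matrix.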

Abstract

Methods of organizing a series of sibling data entities in a digital computer are provided for preserving sibling ranking information associated with the sibling data entities and for attaching the sibling ranking information to a joint parent of the sibling data entities to facilitate on-demand generation of ranked parent candidates. A rollup function of the present invention builds a rollup matrix (126) that embodies information about the sibling entities and the sibling ranking information and provides a method for reading out the ranked parent candidates from the rollup matrix in order of their parent confidences (141). Parent confidences are based on the sibling ranking information, either alone or in combination with n-gram dictionary ranking or other ranking information.

Description

    RELATED APPLICATIONS
  • This application is a continuation of prior pending U.S. application Ser. No. 10/410,015 filed Apr. 8, 2003, which is a continuation of U.S. application Ser. No. 09/528,749 filed Mar. 20, 2000, now issued as U.S. Pat. No. 6,597,809, all of which claim priority to U.S. provisional application Nos. 60/125,352 filed Mar. 19, 1999 and 60/125,257 filed Mar. 19, 1999 and all are incorporated herein by this reference.
  • TECHNICAL FIELD
  • The present invention relates to computer-implemented methods and data structures for producing candidate parent entities that are ranked in accordance with ranking information associated with given child entities and, in particular, to such methods for use with software parsers and data dictionaries, for example, of the kind utilized in a system for automated reading, validation, and interpretation of hand print, machine print, and electronic data streams.
  • BACKGROUND OF THE INVENTION
  • Optical character recognition (OCR) systems and digital image processing systems are known for use in automatic forms processing. These systems deal with three kinds of data: physical data, textual data, and logical data. Physical data may be pixels on a page or positional information related to those pixels. In general, physical data is not in a form to be effectively used by a computer for information processing. Physical data by itself has neither useable content nor meaning. Textual data is data in textual form. It may have a physical location associated with it. It occurs in, for example, ASCII strings. It has content but no meaning. We know what textual data says, but not what it means. Logical data has both content and meaning. It often has a name for what it is.
  • For example, there may be a region of black pixels in a certain location on an image. Both the value of the pixels and their location are physical data. It may be determined that those pixels, when properly passed through a recognizer, say: “(425) 867-0700.” Content has been derived from the physical data to generate textual data. If we now know that text of this format (or possibly at this location on a preprinted form) is a telephone number, the textual data becomes logical data.
  • To facilitate reconciliation of imperfections in physical data and shortcomings of the recognition process, each recognized element of textual data, e.g., a character, may be represented by a ranked group of unique candidates called a “possibility set.” A possibility set includes one or more candidate information pairs, each including a “possibility” and an associated confidence. In the context of an OCR system, the confidence is typically assigned as part of the recognition process. For computational efficiency, the confidences may be assigned within an appropriate base-2 range, e.g., 0 to 255, or a more compact range, such as 0 to 7. For example, FIG. 1 shows an enlarged view of an individual glyph 20 that may be physically embodied as a handwritten character or as a digital pixel image of the handwritten character. From glyph 20, an optical character recognition process may generate the possibility set shown in TABLE 1 by assigning possibilities and associated confidences:
  • TABLE 1
    possibility confidence
    c 200
    e 123
    o 100
  • FIG. 2 shows a series of sibling glyphs 22, which are known as “siblings” because they share the same parent word 24. The sibling glyphs 22 can be represented by the four possibility sets as shown in the following TABLE 2:
  • TABLE 2
    poss  conf    poss  conf    poss  conf    poss  conf
    c     200     h     190     o     100     r     125
    o     150     n     100     a     80      n     100
    e     100     r     80

    The possibilities of these four possibility sets can be readily combined to form 36 unique strings: “chor”, “ohor”, “ehor”, “cnor”, “cror”, etc. The number of unique strings is determined by the product of the number of character possibilities in each possibility set, i.e., 3×3×2×2=36.
  • To gage or verify their accuracy, the unique “candidate” strings may be processed by a “dictionary” of valid outcomes. In the context of OCR, a dictionary is a filter. It has content and rules. Each candidate string processed by the dictionary is subject to one of three possible outcomes: it is passed, it is rejected, or it is modified into a similar string that passes. One example of a dictionary is based on the English language. For parent word 24 of FIG. 2, the candidate strings “chor” and “ehar” would be rejected by such a dictionary, while “char” would be passed.
  • Because dictionaries often have a very large amount of content against which a candidate string is compared, it may be unduly time-consuming to apply the dictionary to all possible strings. To improve efficiency, it is desirable, before applying a dictionary, to rank the candidate strings in order of some confidence based on the accuracy of recognition. In this way the candidate strings having the highest confidence of having been accurately recognized are processed by the dictionary first. Rules can then be used to determine when to stop dictionary processing, e.g., when enough candidate strings have been processed to have isolated the best candidate strings (with a certain probability). A convenient way to rank candidate strings is to calculate string confidences based on the confidences of the component character possibilities that make up each candidate string. A set of candidate strings and their associated string confidences is referred to as an “alt-set.”
  • One way to rank parent candidates for creating an alt-set is to add the child confidences for each parent candidate. In the above example, “chor” would have a ranking of 615 (the sum of the confidences associated with the individual characters c-h-o-r), “ohor” would have a ranking of 565, “ehor” would have a ranking of 515, etc. Combining the possibility sets to form the 36 unique strings and to calculate their rankings is simple in this example. However, there is no obvious way to read the strings out in ranked order. The strings must first be assigned a ranking, then ordered or sorted based on their assigned rank. This ordering or sorting step becomes especially problematic for longer strings formed from sibling possibility sets having a greater number of possibilities. By way of illustration, a hypothetical 10-character parent word in which each child possibility set includes 10 possibilities would result in 10 billion unique strings. It would be a very time-consuming and computationally expensive task to rank and order 10 billion 10-character strings.
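  • The generate-then-sort approach described above can be sketched as follows, using the possibility sets of TABLE 2. The function name `brute_force_alt_set` is a hypothetical name for this illustration. The sketch shows why the approach scales poorly: every candidate must exist in memory before the first ranked string can be read.

```python
from itertools import product

# Possibility sets from TABLE 2: (possibility, confidence) pairs per glyph.
POSS_SETS = [
    [("c", 200), ("o", 150), ("e", 100)],
    [("h", 190), ("n", 100), ("r", 80)],
    [("o", 100), ("a", 80)],
    [("r", 125), ("n", 100)],
]


def brute_force_alt_set(poss_sets):
    """Build every candidate string, sum its child confidences, then sort."""
    candidates = []
    for combo in product(*poss_sets):
        text = "".join(poss for poss, _ in combo)
        rank = sum(conf for _, conf in combo)
        candidates.append((text, rank))
    # The full list has to be built before it can be ordered by rank.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates


alt_set = brute_force_alt_set(POSS_SETS)
print(len(alt_set))  # 36
print(alt_set[0])    # ('chor', 615)
```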
  • Another known way of improving the efficiency of dictionaries is to use specialized dictionaries that contain smaller amounts of content than a more generalized dictionary but that are limited in their application. One such specialized dictionary is an “n-gram” dictionary, which includes information about the frequency with which certain character sequences (e.g., two-letter, three-letter, etc.) occur in the English language. For example, the two-letter combination “Qu” (a 2-gram) occurs in English words much more frequently than “Qo.” To benefit from an n-gram dictionary, the confidence assigned to an n-gram is some combination of (1) the aggregate character confidences and (2) the n-gram frequency provided by the n-gram dictionary. Thus, recognition may have produced Oueen and Queen where the first character has the possibility set: poss=O, conf=200; poss=Q, conf=100, but in the English language “Qu” happens much more often than “Ou”, so the 2-gram dictionary would help determine that Queen is the more likely parent string.
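  • As a rough illustration of combining character confidences with n-gram frequency, consider the following sketch. The text does not specify the combination rule, so simple addition is assumed here. The frequency scores in `BIGRAM_FREQ` and the confidence of 150 for the second character are invented for the example; only the first-character confidences (O=200, Q=100) come from the text.

```python
# Hypothetical 2-gram frequency scores; real values would come from an
# n-gram dictionary and are not given in the text.
BIGRAM_FREQ = {"Qu": 180, "Ou": 20}


def bigram_confidence(first, first_conf, second, second_conf, freq=BIGRAM_FREQ):
    """Add the aggregate character confidence to the dictionary frequency.

    Addition is an assumed combination rule, used only for illustration.
    """
    return first_conf + second_conf + freq.get(first + second, 0)


# First-character possibility set from the text: O=200, Q=100.
# The confidence of 150 for "u" is made up for this sketch.
scores = {
    first + "u": bigram_confidence(first, conf, "u", 150)
    for first, conf in [("O", 200), ("Q", 100)]
}
best = max(scores, key=scores.get)
print(scores, best)
```

  • Under these assumed numbers the recognizer alone prefers “Ou”, while the combined score prefers “Qu”.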
  • A need exists for a method of generating candidate strings in ranked order on an as-needed basis and, more generally, for a method of generating ranked parent candidates on an on-demand basis from a series of sibling possibilities. A need also exists for such a method that can be used with data at different logical levels in a logical data hierarchy, such as n-grams, words, and phrases.
  • SUMMARY OF THE INVENTION
  • In accordance with the present invention, methods of organizing a series of sibling data entities are provided for preserving sibling ranking information associated with the sibling data entities and for attaching the sibling ranking information to a joint parent of the sibling data entities to facilitate on-demand generation of ranked parent candidates. A rollup function of the present invention builds a rollup matrix containing information about the sibling entities and the sibling ranking information and provides a method for reading out the ranked parent candidates from the rollup matrix in order of their parent confidences, which are based on the sibling ranking information. Parent confidences may also be based, in part, on n-gram ranking or other ranking information.
  • External to the rollup function of the present invention, sibling entities are generated and passed to the rollup function for processing. Generation of a series of sibling entities may, in the context of OCR, involve optical scanning, recognition processing, and parsing. Each sibling entity comprises one or more ranked child possibilities, each having an associated child confidence. The number of child possibilities in a sibling entity is referred to as the “child population” of the sibling entity. Each sibling entity may include a range of child confidences, one of which is the maximum child confidence.
  • In one aspect of the invention the rollup function is implemented in computer software operable on a digital computer. The rollup matrix is modeled as a three-dimensional data array called a rollup table. The rollup table serves as a convenient visual aid to understanding the nature of the rollup matrix and operation of the rollup function. It should be understood that nothing in the foregoing description of the rollup table should be construed as limiting the scope of the invention to implementation of the rollup matrix in data arrays. Other data structures, such as linked lists, are also suitable for implementing the rollup function of the present invention. It should be understood, therefore, that the term “rollup matrix” as used herein shall mean data tables, linked lists, and any other device for defining relationships between nodes in a data structure, where such nodes include one or more elements of data and one or more relationships to other nodes, procedures, or nested rollup functions. Furthermore, it will be apparent from the foregoing description of the invention that while the invention is suitable for use with OCR technology, it is also suitable for use with processing of other types of content-bearing data in which uncertainty in the data content is sought to be resolved. Non-OCR applications of the invention involving resolution of empirical uncertainty may include, for example, bioinformatics systems for analyzing gene sequencing information.
  • After receiving a series of sibling data entities, a matrix initialization routine of the rollup function establishes a rollup table and sizes it based on properties of the sibling entities. In particular, the rollup table is sized to include a series of “columns” equal in number to the number of sibling entities received. The dimension of the rollup table spanned by the columns is referred to as the “width” of the table. The rollup table is sized in a “height” dimension based on a number of “rows,” with each having a row position indicating its position along the height dimension of the data table. The number of rows, and consequently the height of the table, is based on the sum of the maximum child confidences of the sibling entities. In practice, the number of rows may be established as equal to the sum of the maximum child confidences plus one. The rollup table is sized in a “depth” dimension based on the largest of the child populations of the sibling entities. The rollup table is a collection of “nodes,” each located in the rollup table at a position defined by column, row position, and a depth position in the depth dimension.
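  • The sizing rules above can be sketched in a few lines. The function name `size_rollup_table` is hypothetical, and the small confidences in `SIBLINGS` are an assumption chosen to be consistent with the “ants” walk-through of FIGS. 8A-8D rather than values stated in this passage.

```python
# Sibling entities as lists of (child possibility, child confidence) pairs.
# These values are assumed for illustration.
SIBLINGS = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def size_rollup_table(sibling_entities):
    """Return (width, height, depth) for a rollup table."""
    # Width: one column per sibling entity.
    width = len(sibling_entities)
    # Height: sum of the maximum child confidences, plus one row for position zero.
    height = sum(max(conf for _, conf in entity) for entity in sibling_entities) + 1
    # Depth: based on the largest child population.
    depth = max(len(entity) for entity in sibling_entities)
    return width, height, depth


print(size_rollup_table(SIBLINGS))  # (4, 6, 2)
```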
  • Once the rollup function has established the rollup table, a loading routine of the rollup function then loads the sibling entities into the rollup table in a predetermined loading sequence beginning with loading a first sibling entity in a first column of the series of columns. Each sibling entity is loaded in sequence, from the first sibling entity to the last sibling entity in the series. If the sibling entities have no serial relationship, then an arbitrary, but ordered sequence of loading is chosen. Each child possibility of the first sibling entity is loaded into a node of the rollup table located at the first column and at the row having a row position corresponding to the child confidence of the child possibility being loaded. The rollup function then proceeds to load the second sibling entity in the series in a second column. For the second and each subsequent sibling entity and column, the rollup function loads each child possibility in one row of the current column for each row of the immediately preceding column having a filled node. The child possibilities of the second sibling entity are loaded in rows of the second column that have row positions offset from the row positions of filled nodes of the immediately preceding column (i.e., the first column) by an offset amount corresponding to the child confidence of the child possibility being loaded in the second column. The child possibilities of the third sibling entity are loaded in rows of the third column having row positions offset from the row positions of filled nodes of the second column by an offset amount corresponding to the child confidence of the child possibility being loaded in the third column, and so on, until the last sibling entity has been loaded in the last column of the rollup table. Each entry in the last column of the rollup table is a terminal element. 
Due to different confidence values that may be associated with multiple child possibilities of each of the sibling entities, the loading sequence may result in the loading of multiple elements in a particular column and row position. During loading, if a node has already been filled with a child possibility, the loading routine offsets in the depth of the rollup table until it reaches an unoccupied node, then fills that node.
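  • A minimal sketch of the loading sequence follows. It models the rollup table as a dictionary keyed by (column, row), with a list at each key standing in for the depth dimension; that representation, the zero-based column numbering, the function name `load_rollup_table`, and the example confidences are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed example data: (child possibility, child confidence) per sibling entity.
SIBLINGS = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def load_rollup_table(sibling_entities):
    """Return {(column, row): [(possibility, confidence), ...]}."""
    table = defaultdict(list)
    # First column: the row position equals the child confidence.
    for poss, conf in sibling_entities[0]:
        table[(0, conf)].append((poss, conf))
    # Later columns: offset from every filled row of the preceding column.
    for col in range(1, len(sibling_entities)):
        filled_rows = sorted({row for (c, row) in table if c == col - 1})
        for poss, conf in sibling_entities[col]:
            for row in filled_rows:
                # Appending to the list is the "offset in depth" step.
                table[(col, row + conf)].append((poss, conf))
    return dict(table)


table = load_rollup_table(SIBLINGS)
last = len(SIBLINGS) - 1
for (col, row), nodes in sorted(table.items()):
    if col == last:
        print(row, nodes)  # terminal elements by row position
```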
  • Upon completion of the loading sequence, another aspect of the invention involves a roll-out routine of the rollup function, which may be used to read parent candidates from the rollup table according to their parent confidences. The reading of parent candidates, known as “roll-out,” begins with a terminal element known as an entry point. Each parent candidate is assembled in a sequence opposite the sequence in which the rollup table was loaded, as follows: After reading a terminal element from the last column, the roll-out routine then reads a next-to-last element from the node located at a next-to-last column immediately preceding the last column and at a row position less than the row position of the entry point by an amount equal to the child confidence associated with the terminal element. The next-to-last element is then prepended to the terminal element to form a string tail. A prefix element is read from a node located in the column immediately preceding the next-to-last column and at a row position less than the node of the next-to-last element by an amount equal to the confidence of the next-to-last element. The prefix element is then prepended to the string tail. If the sibling entities forming the rollup table have no serial relationship, then prepending involves combining the elements in reverse order of their loading in the rollup table. This reading process is repeated until the roll-out routine reaches the first column, completing roll-out of the parent candidate. 
If more than one element is located at a particular column and row location (i.e., elements are stored at more than one depth position), then the roll-out routine will continue reading parent candidates beginning from the same entry point until elements at all occupied nodes at all depths in the appropriate columns and rows have been read and all parent candidates having the same parent confidence have been rolled out, or until the desired number of parent candidates have been rolled out. The roll-out process is merely repeated for further parent candidates.
  • The method of loading the data table dictates that each row position corresponds to the parent rank of each parent candidate assembled from a terminal element located at that row position. The parent candidate (or candidates) with the greatest parent confidence may be read from the rollup matrix by beginning at a maximal node located at the last column and at the row of greatest row position. Consequently, parent candidates may be read in decreasing order of parent rank by merely assembling parent candidates in sequence, beginning with terminal element(s) located at the maximal node and continuing to read from the rollup table at entry points of decreasing row position until all parent candidates have been assembled. The process of building a rollup matrix and rolling-out parent candidates to form alt-sets can be repeated at each level in the data hierarchy. If desired, rollup functions can be nested by storing a nested “child” rollup function pointer at a node of a parent roll-up table.
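  • The roll-out in decreasing order of parent confidence might be sketched as below. The sketch repeats a compact loader so that it runs on its own; the dictionary-of-lists table, the function names `load`, `tails`, and `roll_out`, and the example confidences are assumptions for illustration. Candidates are produced lazily by a Python generator, so none is built until requested.

```python
from collections import defaultdict

# Assumed example data: (child possibility, child confidence) per sibling entity.
SIBLINGS = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def load(sibling_entities):
    """Load the table as {(column, row): [(possibility, confidence), ...]}."""
    table = defaultdict(list)
    for poss, conf in sibling_entities[0]:
        table[(0, conf)].append((poss, conf))
    for col in range(1, len(sibling_entities)):
        rows = sorted({r for (c, r) in table if c == col - 1})
        for poss, conf in sibling_entities[col]:
            for r in rows:
                table[(col, r + conf)].append((poss, conf))
    return table


def tails(table, col, row):
    """Yield every string that ends with an element stored at (col, row)."""
    for poss, conf in table.get((col, row), []):
        if col == 0:
            yield poss
        else:
            # Step back one column and down by this element's confidence;
            # the earlier elements end up in front of the current one.
            for head in tails(table, col - 1, row - conf):
                yield head + poss


def roll_out(table, width):
    """Yield (parent candidate, parent confidence), highest confidence first."""
    last = width - 1
    entry_rows = sorted({r for (c, r) in table if c == last}, reverse=True)
    for row in entry_rows:
        for candidate in tails(table, last, row):
            yield candidate, row


ranked = list(roll_out(load(SIBLINGS), len(SIBLINGS)))
print(ranked[0])  # ('ants', 5)
```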
  • Given the foregoing description of the invention, the use of software counters to facilitate the loading of the rollup matrix and the roll-out of parent candidates will be understood by those skilled in the art.
  • In another aspect of the invention, the rollup matrix is established in a computer memory using a plurality of memory pointers in place of the 3-dimensional data array of the rollup table. In this aspect of the invention, the terms “rows” and “columns” are arbitrary but are used herein to denote memory locations within the rollup matrix. In reality, each node of the rollup matrix includes a pointer to other nodes which contain a child possibility of an adjacent sibling entity. If a node must point to more than one child possibility, as in the case of multiple child possibilities at a particular column and row position, the node will include multiple pointers. When these multi-pointer nodes are encountered by the roll-out routine, a branch is indicated so that all pointers of each node are followed before moving to the next entry point.
  • Nodes occupying entry points shall be referred to as “entry nodes.” Entry nodes further include a parent confidence which the roll-out routine recognizes as assigned to the parent candidate assembled beginning with the entry node. Entry nodes may also include a pointer to the next entry node in the matrix, which may have the same parent confidence or a lesser parent confidence. Nodes in the “first column,” loaded with a child possibility of the first possibility set, may include a return pointer that may direct the roll-out routine to output the completed parent candidate for verification (e.g., using a dictionary) or to proceed to the next entry node for generation of the next parent candidate. Nodes at any location in the rollup matrix may also include a pointer to an entry node of a nested rollup matrix.
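  • A pointer-based rollup matrix might be sketched as follows. The `Node` class, the `follow` function, and the use of an empty pointer list in place of an explicit return pointer are assumptions of this sketch. The example wires up only the nodes needed to roll out the two candidates “auts” and “onts” from a single entry node.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    element: str
    # Pointers toward the first column; more than one pointer marks a branch.
    pointers: List["Node"] = field(default_factory=list)
    # Stored only on entry nodes.
    parent_confidence: Optional[int] = None


def follow(node: Node, tail: str = "") -> Iterator[str]:
    """Prepend this node's element, then follow every pointer in turn."""
    tail = node.element + tail
    if not node.pointers:
        # An empty pointer list stands in for the return pointer.
        yield tail
    for nxt in node.pointers:
        yield from follow(nxt, tail)


# First-column nodes.
a, o = Node("a"), Node("o")
# Second-column nodes, each pointing at its own first-column element.
u, n = Node("u", [a]), Node("n", [o])
# Branch node: two pointers, so two parent candidates pass through it.
t = Node("t", [u, n])
# Entry node carrying the parent confidence shared by both candidates.
entry = Node("s", [t], parent_confidence=4)

print(list(follow(entry)), entry.parent_confidence)
```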
  • In yet another aspect of the invention, n-gram possibility sets are generated using an n-gram rollup function in accordance with the present invention. Comparison of parent candidate n-grams against an n-gram dictionary allows n-gram candidates to be weighted in accordance with their relative frequencies of occurrence in the context of, for example, the English language. Possibility sets including n-grams are readily accommodated in establishing the rollup matrix. For 3-grams, the nodes are loaded with the 3-grams at a row position which is the aggregate of the confidence of the central character (of the 3-gram) and the dictionary-provided frequency of the 3-gram. In this aspect of the invention, child possibilities in the first and last columns of the rollup matrix must be prepended and appended, respectively, with nulls (or spaces) so that all child possibilities are 3-grams. Further, the 3-gram child possibilities must be loaded in the rollup matrix so that when the parent candidates are rolled-out, all adjacent 3-grams assembled in a parent candidate share two characters. For example, “out” in the first column will fit with “uts” in the second column, but not with “nts.”
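  • The two-character overlap requirement for adjacent 3-grams can be checked with a short sketch. The function names `trigrams_fit` and `padded_trigrams` are hypothetical, and a space is assumed as the padding character.

```python
def trigrams_fit(left: str, right: str) -> bool:
    """Adjacent 3-grams fit when the last two characters of the left
    one equal the first two characters of the right one."""
    return left[1:] == right[:2]


def padded_trigrams(word: str, pad: str = " ") -> list:
    """Split a word into one 3-gram per character, padding both ends."""
    text = pad + word + pad
    return [text[i:i + 3] for i in range(len(word))]


print(trigrams_fit("out", "uts"))  # True
print(trigrams_fit("out", "nts"))  # False
print(padded_trigrams("outs"))     # [' ou', 'out', 'uts', 'ts ']
```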
  • In the context of OCR, the rollup function of the present invention is useful at every level of textual hierarchy. Rollup functions also avoid fatal problems often encountered by prior art string generators, which create strings from a series of possibility sets. Existing string generators suffer from three major problems. First, they are combinatorically expensive in memory use-needing a place in memory for each possible string. Second, string generators must trim strings before generating all possible strings because of limited space to store the combinatorically-many strings. Therefore, it is possible for string generators to result in higher-confidence strings being abandoned while lower-confidence strings are preserved. Third, string generators do not guarantee that strings of the same confidence, once ordered, retain that order.
  • The present invention gets around all these problems in a natural way. First, the rollup function is only geometrically expensive of memory, not combinatorically. Tables generated by prior art systems grow as $L \times n^{L}$, where n is the number of possibilities per possibility set and L is the number of possibility sets (i.e., the string length). There are $n^{L}$ strings of length L that can be generated. By comparison, the rollup matrix of the present invention grows as $2 \times CF_{max} \times L^{2}$, where $CF_{max}$ is the highest confidence value in any possibility set, a significant savings over prior art systems. For L=10, n=3, and $CF_{max}$=20, and allowing 1 byte per ASCII character, approximately 590,490 bytes would be required for ranking tables of prior art systems, while only 12,000 bytes are required for the rollup matrix, a savings of 98%. Second, candidate strings can be read out of a rollup table in their decreasing order of confidence without having to store unneeded strings in memory, while never skipping a higher-confidence parent candidate for a lower confidence one. The rollup matrix does not change size with the number of generated strings. Therefore, all strings are preserved and there is no trimming of strings ever required. Third, no reordering of parent strings ever takes place because the rollup matrix is unchanging. Consequently, strings of the same confidence remain in their original order.
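  • The arithmetic behind these figures can be checked with a short Python sketch. The variable names are hypothetical. Note that the rollup expression as printed evaluates to 4,000 for these inputs; the quoted 12,000-byte figure is reproduced here only under the assumption that one such plane is counted for each of the n = 3 tiers, which the text does not state explicitly.

```python
L = 10        # number of possibility sets (string length)
n = 3         # possibilities per possibility set
cf_max = 20   # highest confidence value in any possibility set

# Prior-art ranking table: n**L strings of L one-byte characters each.
prior_art_bytes = L * n ** L

# Rollup matrix growth exactly as the expression is printed in the text.
rollup_as_printed = 2 * cf_max * L ** 2

# ASSUMPTION (not stated in the text): one such plane per tier, n tiers.
rollup_with_tiers = rollup_as_printed * n

# Savings computed from the two byte counts quoted in the text.
savings = 1 - 12_000 / prior_art_bytes

print(prior_art_bytes, rollup_as_printed, rollup_with_tiers)
print(f"{savings:.1%}")
```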
  • Parent candidates can be read from the rollup matrix in decreasing or increasing order of parent confidence. First, a parent candidate having a desired confidence value can easily be selected from the matrix by a confidence stored in association with an entry node of the parent candidate. Parent candidates having lesser (or greater) confidences can then be read until a desired lesser (or greater) confidence level is reached. This process can be repeated until a predetermined number of parent candidates have been obtained or until all possible parent candidates have been rolled-out. The rollup function can be interrupted while reading out a parent candidate to handle some other process, such as verifying the most recently rolled-out parent candidate using a dictionary. The rollup function easily returns to where it left off in the rollup matrix to read out the next-ranked parent candidate by returning to the location in the rollup matrix that was being accessed when the interruption occurred. The rollup function of the present invention provides the above-described benefits without requiring the production of all of the parent candidates before subsequent ranking. If a particular child possibility occurs with at most one confidence value in a possibility set, then the last rolled-out string is the pointer structure. Even in the case of allowed duplication, returning to the rollup function is as simple as storing a pointer to the next entry point in the rollup matrix and storing a pointer to each position of the table, which may be accomplished by freezing the internal pointer structure.
  • The rollup function of the present invention is, of course, not limited to strings. Any parent entity can receive rollup-produced alt-sets from its child entities. For example, gene sequence information prepared from a human, an animal, a plant, or any other living organism may be parsed into its nucleotides, each of which may be represented by an alt-set. Sibling nucleotide alt-sets can then be loaded into a rollup matrix for the parent gene. In this way, the frequency of naturally-occurring nucleotide and coding sequence variations can easily be represented by the child confidences associated with child possibilities of each alt-set. Inaccuracies inherent in the gene sequencing process can be similarly represented by the child confidences.
  • Additional aspects and advantages of this invention will be apparent from the following detailed description of preferred embodiments thereof, which proceeds with reference to the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is an enlarged view of a hand printed glyph;
  • FIG. 2 is an enlarged view of a series of sibling glyphs;
  • FIG. 3 is a flow diagram depicting an OCR process for scanning, parsing, and recognizing handwritten data to create possibility sets for use with a data verification routine of the present invention;
  • FIG. 4 is a flow diagram showing detail of the data verification routine of FIG. 3 including a rollup function and dictionary routine in accordance with a preferred embodiment of the present invention;
  • FIG. 5 is a pictorial view of a three-dimensional data array in accordance with a first preferred embodiment of the present invention;
  • FIGS. 6A, 6B, 6C, and 6D are two-dimensional pictorial views of a rollup matrix in accordance with the present invention showing a loading sequence for loading the alt-sets of Table 3 into the rollup matrix;
  • FIG. 7 is an exploded three-dimensional view of the loaded rollup matrix of FIG. 6D;
  • FIGS. 8A, 8B, 8C, and 8D show a sequence of rolling out a parent candidate from the loaded rollup matrix of FIG. 6D;
  • FIG. 9 is a diagram of an alternative embodiment of the rollup matrix of FIG. 6D including a linked list implemented in a computer memory;
  • FIG. 10 is a flow diagram showing steps taken in preparation and validation of n-gram alt-sets for loading in a rollout matrix for a parent string of the n-grams;
  • FIG. 11 is a two-dimensional pictorial view showing nested rollup matrices;
  • FIG. 12 is a flow diagram showing steps for establishing and loading of the nested rollup matrices of FIG. 11; and
  • FIG. 13 is a flow diagram showing parent candidates being rolled out from the nested rollup matrices of FIG. 11.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • FIG. 3 is a flow diagram of an OCR process 30 in accordance with a first preferred embodiment of the present invention. With reference to FIG. 3, a document 32 bearing physical textual data is scanned using an optical scanner 34, which produces a digital pixel image of the physical data on document 32. A segmentation process 36 of the OCR process 30 receives the pixel image from the optical scanner and segments the pixel image into data segments for processing by a recognizer 38. Recognizer 38 analyzes the data segments to produce a possibility set (“pos-set”) for each data segment. Empirical uncertainty in the physical data and inaccuracies of the scanning, segmentation and recognition process are represented in the pos-sets by including multiple child possibilities in each pos-set and by assigning child confidences to the child possibilities. For example, recognizer 38 separates a parent string (as in the parent word 24 of FIG. 2) into its sibling glyphs and outputs a pos-set for each glyph. The pos-sets are output to a data verification routine 40, which uses a rollup function 60 (FIG. 4) and possibly one or more dictionaries 150 (FIG. 4) in accordance with the present invention.
  • FIG. 4 is a flow diagram of rollup function 60 of data verification routine 40 (FIG. 3). With reference to FIG. 4, a matrix initialization routine 62 of rollup function 60 receives pos-sets 64 from recognizer 38. FIG. 5 is a pictorial view of a three-dimensional data array 66, which represents a data matrix in accordance with the present invention. Data array 66 includes rows 70, columns 72, and tiers 74 that together form nodes 76. With reference to FIGS. 4 and 5, matrix initialization routine 62 establishes a size of data array 66 based on pos-sets 64. For purposes of a simple illustration, TABLE 3 presents four sibling pos-sets.
  • TABLE 3
    Pos-set 1     Pos-set 2     Pos-set 3     Pos-set 4
    poss  conf    poss  conf    poss  conf    poss  conf
    a     2       n     1       t     1       s     1
    o     1       u     0                     5     0

    A first pos-set shown in TABLE 3 includes two child possibilities, “a” and “o”, which are assigned child confidences 2 and 1, respectively. A second pos-set includes child possibilities n and u, having associated child confidences 1 and 0, respectively. And so on. The matrix initialization routine calculates a sum of the maximum confidences of the four pos-sets (2+1+1+1=5) and adds one (5+1=6) to establish a height 80 of data array 66. Data array 66, thus, includes six rows 70, having row heights R0, R1, R2, R3, R4, and R5. A width 82 of data array 66 is equal to the number of pos-sets 64. A depth 84 of data array 66 is equal to the largest number of child possibilities in any of the pos-sets 64. In this example, three of the pos-sets are equally large, having two child possibilities.
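    The sizing rule just described can be sketched in Python as follows. This is a minimal illustration, not the patented implementation; the names `POS_SETS` and `matrix_dimensions` are hypothetical, and each pos-set is assumed to be a list of (child possibility, child confidence) pairs.

```python
# Pos-sets of TABLE 3 as lists of (child possibility, child confidence) pairs.
POS_SETS = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def matrix_dimensions(pos_sets):
    """Return (height, width, depth) of the data array for the given pos-sets."""
    # Height: sum of the maximum confidence of each pos-set, plus one (row R0).
    height = sum(max(conf for _, conf in ps) for ps in pos_sets) + 1
    # Width: one column per pos-set.
    width = len(pos_sets)
    # Depth: the largest number of child possibilities in any pos-set.
    depth = max(len(ps) for ps in pos_sets)
    return height, width, depth


print(matrix_dimensions(POS_SETS))
```

    For the pos-sets of TABLE 3 this gives six rows, four columns, and two tiers, matching the dimensions stated above.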
  • Once data array 66 has been established and sized, a loading routine 90 of rollup function 60 loads pos-sets 64 into data array 66. FIGS. 6A, 6B, 6C, and 6D depict a loading sequence followed by loading routine 90. With reference to FIG. 6A, a data table 92 provides a two-dimensional representation of the three-dimensional data array 66 of FIG. 5, including four columns C1, C2, C3, and C4, each of which is divided by broken lines to indicate tiers 74 of data array 66 (FIG. 5). Loading routine 90 loads the child possibilities 94 of the first pos-set into the first column C1 so that each child possibility 94 is loaded in a node 96 at a row position equal to the child confidence 98 corresponding to the child possibility 94. Thus, child possibility “o”, which has an associated child confidence of one, is loaded at the node located at row R1, and child possibility “a” is loaded at row R2 because it has an associated child confidence of two.
  • When loading routine 90 completes loading of the first pos-set (FIG. 6A), it proceeds to load the second pos-set into data table 92. With reference to FIG. 6B, each child possibility of the second pos-set is loaded in one node 96 of the second column (C2) for each row of the first column (C1) having filled nodes, but at a row height greater than the row height of the filled nodes 96 of column C1 by an amount equal to the child confidences being loaded. Thus, child possibility “u” having a child confidence of zero is loaded in nodes located at rows R1 and R2 of column C2, since rows R1 and R2 are filled in column C1. Child possibility “n” is loaded in nodes located at rows R2 and R3 of column C2, which are greater than the row positions of the filled nodes (R1 and R2) of column C1 by an amount equal to the child confidence (one) associated with child possibility “n.” Because the node located at C2, R2, T0 is already filled with child possibility “u”, loading routine 90 loads child possibility “n” at node C2, R2, T1 so that no more than one child possibility is loaded in each node.
  • Loading routine 90 then continues to load successive pos-sets 64 in sequence in successive columns, as depicted in FIGS. 6C and 6D, until all pos-sets 64 have been loaded in data table 92. As in column C2, child possibilities 94 are loaded in nodes 96 located at row positions that are greater (by an amount equal to the child confidence of the child possibility being loaded) than the row position(s) of rows of the immediately preceding column that have filled nodes. Nodes of the last column (C4) that are loaded with child possibilities contain data entities that are known as terminal elements 100.
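  • The loading sequence can be sketched in Python as follows. This is a simplified model under stated assumptions: each column is a dictionary mapping a row position to a list of tiers, the function name `load` is hypothetical, and the order in which tiers are filled within a row may differ from the figures.

```python
# Pos-sets of TABLE 3 as lists of (child possibility, child confidence) pairs.
POS_SETS = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def load(pos_sets):
    """Return one dict per column mapping row position -> list of tiers.

    Each tier holds a (child possibility, child confidence) pair.
    """
    columns = []
    prev_rows = {0}  # virtual starting row, so the first column loads at row == confidence
    for pos_set in pos_sets:
        column = {}
        for row in sorted(prev_rows):
            for child, conf in pos_set:
                # Load the child above each filled row of the previous column,
                # offset by its own confidence; a new tier is used if the row is taken.
                column.setdefault(row + conf, []).append((child, conf))
        columns.append(column)
        prev_rows = set(column)
    return columns


columns = load(POS_SETS)
for index, column in enumerate(columns, start=1):
    print(f"C{index}", {f"R{row}": tiers for row, tiers in sorted(column.items())})
```

  • In this sketch the second column holds “u” at rows R1 and R2 and “n” at rows R2 and R3, so row R2 of column C2 uses two tiers, as in the description of FIG. 6B.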
  • FIG. 7 is an exploded view of the loaded data table 92 of FIG. 6D showing its loaded data in a three-dimensional representation in accordance with three-dimensional data array 66 of FIG. 5.
  • To extract parent candidate strings from data table 92, a roll-out routine 110 of rollup function 60 is provided (FIG. 4). FIG. 8A depicts the steps taken by roll-out routine 110 in rolling out parent candidate “ants”, i.e., the parent candidate comprising the sibling characters “a”, “n”, “t”, and “s”. Parent candidate “ants” has the greatest aggregate confidence of any of the parent candidates because its terminal element (“s”) 100 is located in the row of data table 92 having the greatest row position (R5), i.e., a maximal terminal element 112. With reference to FIG. 8A, roll-out routine 110 reads from columns C4, C3, C2, and C1, in the order opposite to which the columns were loaded. Terminal element “s” 100 (which is also the maximal terminal element 112) is read initially. Next, roll-out routine 110 reads next-to-last child element “t” 116 from the immediately previous column (C3) and from row R4, which has a row position less than the row position of terminal element “s” by the amount of the child confidence associated with terminal element “s” (i.e., one). Roll-out routine 110 prepends next-to-last child element “t” to the terminal element “s” to form a string tail of “ts.” The child confidence of one associated with next-to-last child element “t” 116 then directs roll-out routine 110 to read prefix element “n” 118 from row R3, column C2 (because row R3 has a row position one less than the row position of R4). Roll-out routine 110 prepends prefix element “n” 118 to the string tail “ts” to form the partial string “nts.” Element “a” 120 is then read because it is loaded in row R2, which is one less (the child confidence associated with prefix element “n” 118) than the row position of prefix element “n” 118. Element “a” 120 is prepended to complete the formation of candidate parent string “ants”. The parent confidence associated with “ants” is equal to five, which is the row position of the terminal element 100a used to extract “ants”.
  • FIG. 8B depicts the steps taken by roll-out routine 110 in rolling out parent candidate “ant5”. With reference to FIG. 8B, terminal element “5” has an associated child confidence of zero, which directs roll-out routine 110 to read next-to-last element “t” from the same row position (R4) in column C3. The parent confidence associated with “ant5” is equal to four, which is the row position of terminal element “5” 100b used to extract “ant5”.
  • FIGS. 8C and 8D depict the steps taken by roll-out routine 110 in rolling out respective parent candidates “auts” and “onts.” Because there are two entries in row R2, column C2, roll-out routine 110 rolls out two unique parent candidates ending with terminal element “s” 100c, both having an associated parent confidence of four, which is equal to the row height of row R4, where terminal element “s” 100c is located.
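  • The roll-out sequence of FIGS. 8A through 8D can be sketched in Python as follows. This is an illustrative model, not the patented routine: the helper `load` and the generator `roll_out` are hypothetical names, columns are modeled as dictionaries of rows, and the order of candidates that share a confidence depends on the tier order chosen by this sketch.

```python
POS_SETS = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def load(pos_sets):
    """Build columns mapping row position -> list of (child, confidence) tiers."""
    columns, prev_rows = [], {0}
    for pos_set in pos_sets:
        column = {}
        for row in sorted(prev_rows):
            for child, conf in pos_set:
                column.setdefault(row + conf, []).append((child, conf))
        columns.append(column)
        prev_rows = set(column)
    return columns


def roll_out(columns):
    """Yield (parent candidate, parent confidence) in decreasing confidence."""

    def heads(col, row):
        # Every partial string that ends in column `col` at row position `row`.
        if col < 0:
            yield ""
            return
        for child, conf in columns[col].get(row, []):
            # Step back one column and down by this child's confidence,
            # then append the child to each string found there.
            for head in heads(col - 1, row - conf):
                yield head + child

    last = len(columns) - 1
    for row in sorted(columns[last], reverse=True):  # highest terminal row first
        for candidate in heads(last, row):
            yield candidate, row


results = list(roll_out(load(POS_SETS)))
for candidate, confidence in results:
    print(confidence, candidate)
```

  • Because `roll_out` is a generator, it pauses after each candidate and resumes from the same position when the next one is requested, which loosely mirrors the interruptible read-out described earlier, and no candidate is built before it is asked for.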
  • In accordance with an alternative embodiment of the present invention, FIG. 9 shows the loaded data table 92 of FIGS. 6D and 7 embodied as a linked-list rollup matrix 126. With reference to FIG. 9, rollup matrix 126 includes a pointer structure 128 to nodes 96. To roll-out the parent candidate “ants”, roll-out routine 110 starts at an initial entry point 130 that includes terminal element 100a (element “s” of maximal terminal element 112). Roll-out routine 110 then reads out elements “t” 116, “n” 118, and “a” 120 by following respective pointers 134, 136, and 138 and prepends them to element “s” 100a. A return pointer 140 indicates to roll-out routine 110 that it has completed construction of the parent candidate. A parent confidence 141 of the parent candidate “ants” is stored in association with the terminal element “s” 100a. All terminal elements of rollup matrix 126 serve as entry points 142 for rolling out one or more parent candidates. As in the roll-out sequences shown in FIGS. 8C and 8D, two parent candidates can be rolled out of rollup matrix 126 by beginning with terminal element “s” 100c. A branch node 144 of rollup matrix 126 includes two pointers 146, 148, which indicate to roll-out routine 110 that two different parent candidates use branch node 144 and that roll-out routine 110 needs to execute a branch at branch node 144. Those skilled in the art will understand that more than one branch node may exist in rollup matrix 126, and that some branch nodes will have more than two pointers (if the matrix is “deeper” than 2 tiers).
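  • A linked-list form of the rollup matrix can be sketched in Python as follows. The `Node` class and `roll_out_from` function are hypothetical names for this illustration; an empty pointer list stands in for the return pointer, and placing the branch at the “t” node is an assumption made here, since the text does not identify which element branch node 144 holds.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    element: str
    # Links to nodes in the preceding column; an empty list plays the
    # role of the return pointer that ends a parent candidate.
    pointers: list = field(default_factory=list)
    # Stored only on terminal elements, which serve as entry points.
    parent_confidence: Optional[int] = None


def roll_out_from(entry):
    """Yield every parent candidate reachable from a terminal entry node."""
    if not entry.pointers:
        yield entry.element
        return
    for previous in entry.pointers:  # more than one pointer means a branch
        for head in roll_out_from(previous):
            yield head + entry.element


# Entry point for terminal element "s" at row R4 (parent confidence four).
a, o = Node("a"), Node("o")
n, u = Node("n", [o]), Node("u", [a])
t = Node("t", [n, u])                     # assumed location of the branch
s_r4 = Node("s", [t], parent_confidence=4)

candidates = list(roll_out_from(s_r4))
print(s_r4.parent_confidence, candidates)
```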
  • After rolling out each parent candidate (typically in decreasing order of parent confidence), rollup function 60 may output each parent candidate to a dictionary routine 150 (FIG. 4) for validation using an appropriate parser and dictionary. One embodiment of handling dictionary processing is shown in FIG. 4, and includes conditional iteration of roll-out routine 110. An iteration step 154 is conditional upon whether the parent candidate output by roll-out routine 110 passes the dictionary test (160) and, if it does, whether some other stop limit 170 has been met. For example, stop limit 170 may trigger OCR process 30 (FIG. 3) to terminate verification of the parent element represented by rollup matrix 126 (and rollup table 92), and to load the next series of pos-sets scanned and recognized from document 32.
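  • One possible reading of this conditional iteration can be sketched in Python as follows. The function `verify`, the `max_reads` stop limit, and the small word set are all hypothetical; the sketch assumes iteration stops at the first candidate that passes the dictionary test or when the stop limit is reached, which is only one way to interpret FIG. 4.

```python
def verify(ranked_candidates, dictionary, max_reads=None):
    """Return the first (candidate, confidence) found in the dictionary, else None.

    `ranked_candidates` is any iterable of (candidate, confidence) pairs,
    already in decreasing order of parent confidence.
    """
    for reads, (candidate, confidence) in enumerate(ranked_candidates, start=1):
        if candidate in dictionary:      # dictionary test
            return candidate, confidence
        if max_reads is not None and reads >= max_reads:
            break                        # stop limit reached
    return None


# Candidates in the order of FIGS. 8A through 8D, with their parent confidences.
ranked = [("ants", 5), ("ant5", 4), ("auts", 4), ("onts", 4)]

print(verify(ranked, {"ants", "outs"}))
print(verify(ranked, {"outs"}, max_reads=3))
```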
  • FIG. 10 is a flow diagram showing steps taken in preparation and validation of n-gram alt-sets for loading in a rollout matrix for a parent string of the n-grams. With reference to FIG. 10, an n-gram verification process 200 receives pos-sets from the OCR system (step 210) and assembles them in computer memory to form a ranked list of n-gram candidates (step 212). N-gram candidates within a single ranked list may have different lengths, for example when one of the pos-sets includes both an “m” possibility and an “rn” possibility. To accommodate n-gram candidates having different lengths, a length gage routine 214 of n-gram verification process 200 determines the length of each n-gram candidate. The n-gram candidates are then processed by an appropriate n-gram dictionary 216. N-gram dictionary 216 is a specialized dictionary or collection of specialized dictionaries that includes information about frequency of occurrence of n-grams (for example 2-grams, 3-grams, etc.) in written language or some subset of written language. N-gram dictionary 216 assigns an n-gram confidence to each n-gram candidate based on (i) the dictionary frequency rating for the n-gram and (ii) a child confidence associated with a central character of the n-gram candidate. The n-gram and its associated n-gram confidence are then appended to an n-gram alt-set (step 218). Steps 214, 216, and 218 are then repeated until all of the lists of n-gram parent candidates have been processed through the dictionary and output as n-gram alt-sets. After all n-gram alt-sets have been completed, a string-sized rollup matrix is built using the alt-sets as sibling entities (step 220). Parent string candidates can then be rolled out of the string-sized rollup matrix in ranked order (step 222) and processed using a string dictionary (step 224) before outputting ranked parent strings (step 226).
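  • The assignment of n-gram confidences (steps 216 and 218) can be sketched in Python as follows. The frequency ratings, the function name `ngram_alt_set`, and the choice of simple addition to combine the rating with the central-character confidence are all assumptions made for illustration; the text says only that the confidence is based on both quantities.

```python
# Hypothetical frequency ratings; a real n-gram dictionary would supply these.
TRIGRAM_RATING = {"ant": 3, "nts": 2, "aut": 1}


def ngram_alt_set(candidates, ratings):
    """Build an n-gram alt-set from (n-gram, central-character confidence) pairs.

    ASSUMPTION: the n-gram confidence is the sum of the dictionary rating
    and the child confidence of the central character.
    """
    alt_set = []
    for ngram, central_conf in candidates:
        rating = ratings.get(ngram, 0)   # n-grams not in the dictionary rate zero
        alt_set.append((ngram, central_conf + rating))
    # Rank the alt-set by decreasing n-gram confidence.
    return sorted(alt_set, key=lambda pair: pair[1], reverse=True)


alt_set = ngram_alt_set([("aut", 0), ("ant", 1), ("omt", 0)], TRIGRAM_RATING)
print(alt_set)
```

  • The resulting alt-set has the same (possibility, confidence) shape as a character pos-set, so it can serve as a sibling entity when the string-sized rollup matrix is built.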
  • FIG. 11 is a two-dimensional pictorial view showing nested rollup matrices 240 established in accordance with the present invention. With reference to FIG. 11, nested rollup matrices 240 include a child rollup matrix 250 nested within a parent rollup matrix 260. Child rollup matrix 250 is said to be “nested” because complete candidates that may be rolled out of child rollup matrix 250 are referenced by pointers within parent rollup matrix 260. In this example, child rollup matrix 250 represents candidate city names in a typical rollup matrix in accordance with the present invention. However, any child entity can be represented in a nested child rollup matrix. Parent rollup matrix 260 is a typical rollup matrix in accordance with the present invention. In this example, parent rollup matrix 260 includes sibling city, state, and zip-code alt-sets. First and second city nodes 262, 264 of parent rollup matrix 260 include respective first and second city pointers 266, 268 to respective first and second entry points 270, 272 of child rollup matrix 250. First and second entry points 270, 272 are terminal nodes of child rollup matrix 250 having associated city confidences 274, 276. While the nested rollup matrices 240 of FIG. 11 include only one nested child matrix, it would be straightforward to nest multiple child matrices within a single parent rollup matrix. Likewise, it would be simple to create a hierarchy of nested rollup matrices including three or more layers of rollup matrices, rather than the two layers (child rollup matrix 250 and parent rollup matrix 260) of FIG. 11.
  • In setting up nested rollup matrices 240, child rollup matrix 250 is established before establishing parent rollup matrix 260. This order of establishing nested rollup matrices 240 ensures that city confidences 274, 276 of child rollup matrix 250 may be taken into account when establishing, sizing, and loading parent rollup matrix 260. When loading first and second city pointers 266, 268 in parent rollup matrix 260, city confidences 274, 276 of child rollup matrix 250 determine how parent rollup matrix 260 is loaded.
  • FIG. 12 is a flow diagram showing steps for establishing and loading of the nested rollup matrices of FIG. 11. With reference to FIG. 12, a child rollup matrix is first established and loaded (step 300). Once loaded, entry points for child candidates of the child rollup matrix, and their associated child confidences are available. These child candidates, entry points, and child confidences are then taken into account in establishing and sizing parent rollup matrix (step 310). Parent rollup matrix is then loaded (step 320). In the example of FIG. 11, parent rollup matrix 260 is loaded with a zip-code (postal code) alt-set in its terminal column and a state alt-set in its next-to-last column. Parent rollup matrix is also loaded with city pointers 266, 268 to appropriate entry points 270, 272 of child rollup matrix 250. After parent rollup matrix has been loaded (step 320), ranked parent candidates may then be rolled out (step 330) for processing by a dictionary. The dictionary required for use with the nested rollup matrices 240 shown in the example of FIG. 11 would be a city-state-zip dictionary for verifying specific city-state-zip combinations.
  • FIG. 13 is a flow diagram showing a sequence of steps for rolling out a parent candidate from the nested rollup matrices 240 of FIG. 11. With reference to FIG. 13, a nested roll-out routine 400 starts at an entry point, which is a terminal parent node of a linked list of the parent matrix (step 410). All subsequent steps shown in FIG. 13 are identical regardless of whether the current node is a terminal node or another node of nested rollup matrices 240. Nested roll-out routine 400 next determines whether the parent node includes a pointer to a nested child matrix (step 420). If not, then nested roll-out routine 400 reads the element stored in the current node (step 430) and prepends it to a parent candidate tail. Nested roll-out routine 400 then determines whether the node includes a return pointer that would indicate completion of the parent candidate (step 440). If not, then nested roll-out routine 400 advances to the next node in the linked list (step 450) and returns to step 420. If a parent node includes a nested matrix pointer to a nested rollup matrix (at step 420), then nested roll-out routine 400 proceeds to store in memory an address of the parent node that includes the nested matrix pointer (step 460). Nested roll-out routine 400 then rolls out a child candidate from the nested child matrix (step 470) and prepends the child candidate to the parent candidate tail (step 480). Nested roll-out routine 400 then restores the address of the last-read parent node, which was previously stored in memory, and returns to the parent rollup function (step 490), continuing on at the last-read parent node.
  • When a parent node includes a return pointer (step 440), nested roll-out routine completes its assembly of parent candidate and processes it using dictionary process 500. If the parent candidate passes the dictionary test, it is output. The nested roll-out function can be repeated for each terminal node of parent roll-out matrix to complete roll out of all parent candidates.
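  • The nested roll-out of FIG. 13 can be sketched in Python as follows. This is a simplified, single-path illustration: the `Node` class, its field names, the `sep` parameter, and the sample city, state, and zip-code values are hypothetical, branching is omitted, and the saved parent-node address of steps 460 and 490 is modeled by the local variable that survives the recursive call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    element: str = ""
    # Next node toward the front of the candidate; None acts as the return pointer.
    next: Optional["Node"] = None
    # Entry point (terminal node) of a nested child matrix, if any.
    nested_entry: Optional["Node"] = None


def nested_roll_out(entry, sep=""):
    """Assemble one candidate by walking the linked list from a terminal node."""
    parts = []
    node = entry
    while node is not None:
        if node.nested_entry is not None:
            # Steps 460-490: `node` stays saved while the child candidate is
            # rolled out, then the walk continues from the same parent node.
            parts.insert(0, nested_roll_out(node.nested_entry))
        else:
            parts.insert(0, node.element)   # step 430: prepend the element
        node = node.next                     # steps 440/450: follow or return
    return sep.join(parts)


# Child matrix path for a city name, linked from its terminal character backwards.
k = Node("K")
e = Node("e", next=k)
n = Node("n", next=e)
t = Node("t", next=n)                        # entry point of the child matrix

# Parent matrix path: zip code (terminal) -> state -> city pointer.
city = Node(nested_entry=t)
state = Node("WA", next=city)
zip_code = Node("98032", next=state)

print(nested_roll_out(zip_code, sep=" "))
```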
  • It will be obvious to those having skill in the art that many changes may be made to the details of the above-described embodiments of this invention without departing from the underlying principles thereof. The scope of the present invention should, therefore, be determined only by the following claims.

Claims (9)

1. A computer-implemented system for organizing a set of sibling entities each having one or more child possibilities, at least one of the sibling entities including multiple child possibilities having a relative rank or confidence value and from which multiple parent candidates can be generated, each of the parent candidates having a relative rank, and for generating an ordered series of parent candidates from the child possibilities, comprising:
a means for initializing a plurality of nodes in a computer-readable data storage medium for storing the child possibilities of the sibling entities;
a means for loading the sibling entities into the nodes to form a rollup matrix having an organization that represents the relative ranking of the parent candidates; and
a means for reading from the nodes to generate a series of parent candidates in order of their ranking.
2. The system of claim 1, further comprising:
a means for calculating a parent candidate confidence for at least some of the parent candidates;
a means for storing the parent candidate confidences in the rollup matrix in association with the corresponding parent candidates; and
in which the means for reading from the nodes generates the series of parent candidates based on the stored parent candidate confidences.
3. The system of claim 1, further comprising a means for comparing the generated parent candidates against a dictionary.
4. The system of claim 1 in which:
at least one of the sibling entities includes a nested child matrix having an entry point; and
the means for loading includes a means for loading the nested child matrix into one or more of the nodes, a means for creating a pointer to the entry point, and a means for storing the pointer in the rollup matrix.
5. A computer-implemented method for organizing a set of sibling entities each having one or more child possibilities, at least one of the sibling entities including multiple child possibilities having a relative rank or confidence value and from which multiple parent candidates can be generated, each of the parent candidates having a relative rank, and for generating an ordered series of parent candidates from the child possibilities, comprising:
initializing a plurality of nodes in a computer-readable data storage medium for storing the child possibilities of the sibling entities;
loading the sibling entities into the nodes to form a rollup matrix having an organization that represents the relative ranking of the parent candidates; and
reading from the nodes to generate a series of parent candidates in order of their ranking; and outputting at least one of the parent candidates.
6. The method of claim 5, further comprising:
calculating a parent candidate confidence for at least some of the parent candidates;
storing the parent candidate confidences in the rollup matrix in association with the corresponding parent candidates; and
reading from the nodes generates the series of parent candidates based on the stored parent candidate confidences.
7. The method of claim 5, further comprising comparing the generated parent candidates against a dictionary.
8. The method of claim 5 in which:
at least one of the sibling entities includes a nested child matrix having an entry point; and
the loading of the sibling entities into the nodes includes loading the nested child matrix into one or more of the nodes, creating a pointer to the entry point, and storing the pointer in the rollup matrix.
9. A method for character recognition in an OCR system, the method comprising:
optically scanning a document to obtain data defining an image;
segmenting the image to determine a plurality of sibling glyphs;
each sibling glyph comprising an associated possibility set, the possibility set consisting of at least one alphanumeric character candidate information pair, each pair consisting of a respective candidate and an associated confidence value;
identifying a plurality of parent candidates based on the sibling glyphs, each parent candidate representing a candidate word;
calculating a parent candidate confidence value for at least some of the parent candidates;
storing the parent candidate confidences in a rollup matrix in association with the corresponding parent candidates; and
reading from the nodes so as to generate a series of parent candidate words based on the stored parent candidate confidence values.
US12/106,779 1999-03-19 2008-04-21 Rollup functions for efficient storage, presentation, and analysis of data Abandoned US20080228469A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/106,779 US20080228469A1 (en) 1999-03-19 2008-04-21 Rollup functions for efficient storage, presentation, and analysis of data

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US12535299P 1999-03-19 1999-03-19
US12525799P 1999-03-19 1999-03-19
US09/528,749 US6597809B1 (en) 1999-03-19 2000-03-20 Rollup functions for efficient storage presentation and analysis of data
US10/410,015 US7379603B2 (en) 1999-03-19 2003-04-08 Rollup functions and methods
US12/106,779 US20080228469A1 (en) 1999-03-19 2008-04-21 Rollup functions for efficient storage, presentation, and analysis of data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/410,015 Continuation US7379603B2 (en) 1999-03-19 2003-04-08 Rollup functions and methods

Publications (1)

Publication Number Publication Date
US20080228469A1 true US20080228469A1 (en) 2008-09-18

Family

ID=26823413

Family Applications (3)

Application Number Title Priority Date Filing Date
US09/528,749 Expired - Lifetime US6597809B1 (en) 1999-03-19 2000-03-20 Rollup functions for efficient storage presentation and analysis of data
US10/410,015 Expired - Lifetime US7379603B2 (en) 1999-03-19 2003-04-08 Rollup functions and methods
US12/106,779 Abandoned US20080228469A1 (en) 1999-03-19 2008-04-21 Rollup functions for efficient storage, presentation, and analysis of data

Country Status (3)

Country Link
US (3) US6597809B1 (en)
AU (1) AU3907300A (en)
WO (1) WO2000057350A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035211A1 (en) * 2009-08-07 2011-02-10 Tal Eden Systems, methods and apparatus for relative frequency based phrase mining
US20110158548A1 (en) * 2009-12-29 2011-06-30 Omron Corporation Word recognition method, word recognition program, and information processing device
US20140355835A1 (en) * 2013-05-28 2014-12-04 Xerox Corporation System and method for ocr output verification

Families Citing this family (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000057350A1 (en) * 1999-03-19 2000-09-28 Raf Technology, Inc. Rollup functions for efficient storage, presentation, and analysis of data
JP2001052116A (en) * 1999-08-06 2001-02-23 Toshiba Corp Device and method for matching pattern stream, device and method for matching character string
WO2001035667A1 (en) 1999-11-10 2001-05-17 Launch Media, Inc. Internet radio and broadcast method
US6389467B1 (en) 2000-01-24 2002-05-14 Friskit, Inc. Streaming media search and continuous playback system of media resources located by multiple network addresses
US7162482B1 (en) 2000-05-03 2007-01-09 Musicmatch, Inc. Information retrieval engine
US8352331B2 (en) 2000-05-03 2013-01-08 Yahoo! Inc. Relationship discovery engine
US7251665B1 (en) * 2000-05-03 2007-07-31 Yahoo! Inc. Determining a known character string equivalent to a query string
US8271333B1 (en) 2000-11-02 2012-09-18 Yahoo! Inc. Content-related wallpaper
US7707221B1 (en) 2002-04-03 2010-04-27 Yahoo! Inc. Associating and linking compact disc metadata
CN1875377A (en) 2003-09-10 2006-12-06 音乐匹配公司 Music purchasing and playing system and method
US20050112530A1 (en) * 2003-11-25 2005-05-26 International Business Machines Corporation Computer-implemented method, system and program product for performing branched rollup for shared learning competencies in a learning environment
US7302441B2 (en) * 2004-07-20 2007-11-27 International Business Machines Corporation System and method for gradually bringing rolled in data online with incremental deferred integrity processing
EP1854048A1 (en) * 2005-02-28 2007-11-14 ZI Decuma AB Recognition graph
US7565491B2 (en) * 2005-08-04 2009-07-21 Saffron Technology, Inc. Associative matrix methods, systems and computer program products using bit plane representations of selected segments
WO2007037710A2 (en) * 2005-09-30 2007-04-05 Manabars Ip Limited A computational device for the management of sets
US8340430B2 (en) * 2007-07-10 2012-12-25 Sharp Laboratories Of America, Inc. Methods and systems for identifying digital image characteristics
US8160365B2 (en) * 2008-06-30 2012-04-17 Sharp Laboratories Of America, Inc. Methods and systems for identifying digital image characteristics
US8306327B2 (en) * 2008-12-30 2012-11-06 International Business Machines Corporation Adaptive partial character recognition
JP5371565B2 (en) * 2009-06-15 2013-12-18 キヤノン株式会社 Data processing apparatus, data processing method, and program
US9443298B2 (en) 2012-03-02 2016-09-13 Authentect, Inc. Digital fingerprinting object authentication and anti-counterfeiting system
US8774455B2 (en) 2011-03-02 2014-07-08 Raf Technology, Inc. Document fingerprinting
CN103268490B (en) * 2013-05-30 2016-01-13 电子科技大学 A kind of digit recognition method adopting both sides three quant's sign
WO2014204339A1 (en) * 2013-06-18 2014-12-24 Abbyy Development Llc Methods and systems that generate feature symbols with associated parameters in order to convert document images to electronic documents
RU2643465C2 (en) * 2013-06-18 2018-02-01 Общество с ограниченной ответственностью "Аби Девелопмент" Devices and methods using a hierarchically ordered data structure containing unparametric symbols for converting document images to electronic documents
US20160188541A1 (en) * 2013-06-18 2016-06-30 ABBYY Development, LLC Methods and systems that convert document images to electronic documents using a trie data structure containing standard feature symbols to identify morphemes and words in the document images
US10037537B2 (en) 2016-02-19 2018-07-31 Alitheon, Inc. Personal history in track and trace system
EP3236401A1 (en) 2016-04-18 2017-10-25 Alitheon, Inc. Authentication-triggered processes
US10740767B2 (en) 2016-06-28 2020-08-11 Alitheon, Inc. Centralized databases storing digital fingerprints of objects for collaborative authentication
US10915612B2 (en) 2016-07-05 2021-02-09 Alitheon, Inc. Authenticated production
US10902540B2 (en) 2016-08-12 2021-01-26 Alitheon, Inc. Event-driven authentication of physical objects
US10839528B2 (en) 2016-08-19 2020-11-17 Alitheon, Inc. Authentication-based tracking
US10176399B1 (en) * 2016-09-27 2019-01-08 Matrox Electronic Systems Ltd. Method and apparatus for optical character recognition of dot text in an image
US10176400B1 (en) 2016-09-27 2019-01-08 Matrox Electronic Systems Ltd. Method and apparatus for locating dot text in an image
US10192132B1 (en) 2016-09-27 2019-01-29 Matrox Electronic Systems Ltd. Method and apparatus for detection of dots in an image
US10223618B1 (en) 2016-09-27 2019-03-05 Matrox Electronic Systems Ltd. Method and apparatus for transformation of dot text in an image into stroked characters based on dot pitches
US11062118B2 (en) 2017-07-25 2021-07-13 Alitheon, Inc. Model-based digital fingerprinting
EP3514715A1 (en) 2018-01-22 2019-07-24 Alitheon, Inc. Secure digital fingerprint key object database
US10963670B2 (en) 2019-02-06 2021-03-30 Alitheon, Inc. Object change detection and measurement using digital fingerprints
EP3734506A1 (en) 2019-05-02 2020-11-04 Alitheon, Inc. Automated authentication region localization and capture
EP3736717A1 (en) 2019-05-10 2020-11-11 Alitheon, Inc. Loop chain digital fingerprint method and system
US11238146B2 (en) 2019-10-17 2022-02-01 Alitheon, Inc. Securing composite objects using digital fingerprints
EP3859603A1 (en) 2020-01-28 2021-08-04 Alitheon, Inc. Depth-based digital fingerprinting
EP3885982A3 (en) 2020-03-23 2021-12-22 Alitheon, Inc. Hand biometrics system and method using digital fingerprints
EP3885984A1 (en) 2020-03-23 2021-09-29 Alitheon, Inc. Facial biometrics system and method of using digital fingerprints
US11948377B2 (en) 2020-04-06 2024-04-02 Alitheon, Inc. Local encoding of intrinsic authentication data
US11663849B1 (en) 2020-04-23 2023-05-30 Alitheon, Inc. Transform pyramiding for fingerprint matching system and method
US11700123B2 (en) 2020-06-17 2023-07-11 Alitheon, Inc. Asset-backed digital security tokens

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4718102A (en) * 1983-01-19 1988-01-05 Communication Intelligence Corporation Process and apparatus involving pattern recognition
US5329609A (en) * 1990-07-31 1994-07-12 Fujitsu Limited Recognition apparatus with function of displaying plural recognition candidates
US5644652A (en) * 1993-11-23 1997-07-01 International Business Machines Corporation System and method for automatic handwriting recognition with a writer-independent chirographic label alphabet
US5710916A (en) * 1994-05-24 1998-01-20 Panasonic Technologies, Inc. Method and apparatus for similarity matching of handwritten data objects
US5768451A (en) * 1993-12-22 1998-06-16 Hitachi, Ltd Character recognition method and apparatus
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5802205A (en) * 1994-09-09 1998-09-01 Motorola, Inc. Method and system for lexical processing
US5805911A (en) * 1995-02-01 1998-09-08 Microsoft Corporation Word prediction system
US5835635A (en) * 1994-09-22 1998-11-10 International Business Machines Corporation Method for the recognition and completion of characters in handwriting, and computer system
US5963666A (en) * 1995-08-18 1999-10-05 International Business Machines Corporation Confusion matrix mediated word prediction
US6205261B1 (en) * 1998-02-05 2001-03-20 At&T Corp. Confusion set based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique
US6400805B1 (en) * 1998-06-15 2002-06-04 At&T Corp. Statistical database correction of alphanumeric identifiers for speech recognition and touch-tone recognition
US6597809B1 (en) * 1999-03-19 2003-07-22 Raf Technology, Inc. Rollup functions for efficient storage presentation and analysis of data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5442350A (en) * 1992-10-29 1995-08-15 International Business Machines Corporation Method and means providing static dictionary structures for compressing character data and expanding compressed data

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4718102A (en) * 1983-01-19 1988-01-05 Communication Intelligence Corporation Process and apparatus involving pattern recognition
US5329609A (en) * 1990-07-31 1994-07-12 Fujitsu Limited Recognition apparatus with function of displaying plural recognition candidates
US5644652A (en) * 1993-11-23 1997-07-01 International Business Machines Corporation System and method for automatic handwriting recognition with a writer-independent chirographic label alphabet
US5768451A (en) * 1993-12-22 1998-06-16 Hitachi, Ltd Character recognition method and apparatus
US5710916A (en) * 1994-05-24 1998-01-20 Panasonic Technologies, Inc. Method and apparatus for similarity matching of handwritten data objects
US5802205A (en) * 1994-09-09 1998-09-01 Motorola, Inc. Method and system for lexical processing
US5835635A (en) * 1994-09-22 1998-11-10 International Business Machines Corporation Method for the recognition and completion of characters in handwriting, and computer system
US5805911A (en) * 1995-02-01 1998-09-08 Microsoft Corporation Word prediction system
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5963666A (en) * 1995-08-18 1999-10-05 International Business Machines Corporation Confusion matrix mediated word prediction
US6205261B1 (en) * 1998-02-05 2001-03-20 At&T Corp. Confusion set based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique
US6400805B1 (en) * 1998-06-15 2002-06-04 At&T Corp. Statistical database correction of alphanumeric identifiers for speech recognition and touch-tone recognition
US6597809B1 (en) * 1999-03-19 2003-07-22 Raf Technology, Inc. Rollup functions for efficient storage presentation and analysis of data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035211A1 (en) * 2009-08-07 2011-02-10 Tal Eden Systems, methods and apparatus for relative frequency based phrase mining
US20110158548A1 (en) * 2009-12-29 2011-06-30 Omron Corporation Word recognition method, word recognition program, and information processing device
US8855424B2 (en) * 2009-12-29 2014-10-07 Omron Corporation Word recognition method, word recognition program, and information processing device
US20140355835A1 (en) * 2013-05-28 2014-12-04 Xerox Corporation System and method for ocr output verification
US9384423B2 (en) * 2013-05-28 2016-07-05 Xerox Corporation System and method for OCR output verification

Also Published As

Publication number Publication date
AU3907300A (en) 2000-10-09
WO2000057350A1 (en) 2000-09-28
US20030190077A1 (en) 2003-10-09
US7379603B2 (en) 2008-05-27
US6597809B1 (en) 2003-07-22

Similar Documents

Publication Title
US7379603B2 (en) Rollup functions and methods
US4991094A (en) Method for language-independent text tokenization using a character categorization
US5655129A (en) Character-string retrieval system and method
JP3077765B2 (en) System and method for reducing search range of lexical dictionary
US6721451B1 (en) Apparatus and method for reading a document image
US7240062B2 (en) System and method for creating a searchable word index of a scanned document including multiple interpretations of a word at a given document location
US7769778B2 (en) Systems and methods for validating an address
JP3302988B2 (en) Character processing method and character identification method
EP0764305B1 (en) System and method for portable document indexing using n-gram word decomposition
US7359851B2 (en) Method of identifying the language of a textual passage using short word and/or n-gram comparisons
CN1122243C (en) Automatic language identification system for multilingual optical character recognition
US6643647B2 (en) Word string collating apparatus, word string collating method and address recognition apparatus
EP1559061A2 (en) Post-processing system and method for correcting machine recognized text
KR100459832B1 (en) Systems and methods for indexing portable documents using the N-GRAMWORD decomposition principle
WO2009005492A1 (en) Systems and methods for validating an address
CN112417851A (en) Text error correction word segmentation method and system and electronic equipment
RU2166207C2 (en) Method for using auxiliary data arrays in conversion and/or verification of character-expressed computer codes and respective subpictures
JPH11328318A (en) Probability table generating device, probability system language processor, recognizing device, and record medium
KR950001059B1 (en) Korean character address recognition method and apparatus
CN116070596B (en) PDF file generation method and device based on dynamic data and related medium
CN114547151A (en) Company name matching method
JPH05505270A (en) A fast approximate string matching method for multiple error spelling correction
JP4389332B2 (en) Machine translation analysis result selection device
Takasu An approximate string match for garbled text with various accuracy
CN117195875A (en) Data processing method, terminal and storage medium

Legal Events

Code Title Description
AS Assignment

Owner name: RAF TECHNOLOGY, INC., WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ROSS, DAVID J.;BILLESTER, STEPHEN E.M.;SMITH, BRENT R.;REEL/FRAME:020856/0662

Effective date: 20000615

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: MATTHEWS INTERNATIONAL CORPORATION, PENNSYLVANIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:RAF TECHNOLOGY, INC.;REEL/FRAME:043976/0297

Effective date: 20170228