WO2003091828A2 - Method and system for searching documents with numbers - Google Patents

Method and system for searching documents with numbers

Info

Publication number
WO2003091828A2
Authority
WO
WIPO (PCT)
Prior art keywords
query
document
documents
numbers
distance
Prior art date
Application number
PCT/GB2003/001482
Other languages
French (fr)
Other versions
WO2003091828A3 (en)
Inventor
Rakesh Agrawal
Ramakrishnan Srikant
Original Assignee
International Business Machines Corporation
IBM United Kingdom Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation, IBM United Kingdom Limited
Priority to AU2003222598A
Publication of WO2003091828A2
Publication of WO2003091828A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99937 Sorting

Definitions

  • the present invention relates to searching documents using numbers.
  • a method for searching documents including receiving a query that has at least one query number $q_i$ but that does not necessarily specify an attribute name or unit for the query number.
  • each query number $q_i$ is matched with at most one document number $n_m$ and vice-versa, with each document number being a number contained in the document or infinity, such that a distance score is minimised.
  • the method returns at least one document in response to the query.
  • the present invention provides for accessing the documents over the Internet.
  • the present invention provides for the distance score being derived from a distance function that may require, as input, only numbers.
  • the distance function may also account for attribute names and/or units that might be associated with the query numbers.
  • the matching of query numbers to document numbers can be undertaken using a bipartite graph or using heuristic rules or by other means.
  • the present invention provides for the method to include limiting the number of documents that are processed using lower bounds on the distance scores.
  • the method can process the documents to support database access and index access and can create, for each query number $q_i$, an ordered list $L_i$ of document numbers n. Round-robin access on the lists $L_i$ can be performed.
  • a database access for the document is executed and the document numbers matched as described above.
  • the round-robin access is halted when a number of documents have been detected having distance scores less than a threshold value t.
  • the present invention provides for a computer being programmed with instructions for retrieving numbers from a corpus of documents.
  • the instructions include, in response to a numeric query containing at least one numeric query string that is not necessarily associated with an attribute name or unit name, accessing at least some documents in the corpus.
  • the instructions further include comparing each numeric query string with one or more document strings. Each numeric query string is associated with one and only one document string to optimize at least one distance function, with documents being returned based thereon.
  • the present invention provides for a computer program device having computer readable code thereon for searching a set of documents.
  • the code includes means for receiving a user query including at least one query number corresponding to respective desired attribute values.
  • the query does not necessarily include attribute names or unit names. Means return documents with values close to the query number, using only the query number.
  • the present invention provides for a computer for searching a set of unstructured documents (e.g., Web pages, specification sheets) and/or semi-structured documents (e.g., spreadsheets with column labels or text documents with variations in style or formatting) and/or structured documents (e.g., database tables including those that have been converted to XML), which computer includes instructions for receiving a user query consisting of a set of numbers corresponding to desired attribute values, and also consisting of respective attribute names of the numbers. Instructions are provided for returning documents containing document values close to the set of numbers in the query using the query numbers and attribute names.
  • the present invention provides for a computer-implemented method for searching a set of documents in response to a user query including a set of query numbers corresponding to desired attribute values together with units of the values, which method includes returning documents with values approximating the set of query numbers, using the query numbers and units.
  • the present invention provides for a system for searching a set of documents that includes means for receiving a query comprising at least one query number, at least one unit of the query number, and at least one attribute name associated with the query number. Means are provided for returning documents with at least one document value approximating the query number, using the query number, the attribute name, and the unit. Means are provided whereby the comparing and associating use at least one bipartite graph.
  • Figure 1 is a block diagram of the present system architecture
  • Figure 2 is a flow chart of the preferred document matching logic
  • Figure 3 is a flow chart of the logic for limiting the set of the documents that are to be matched.
  • a system for searching documents, including unstructured and semi-structured documents, in one or more databases 12 (only a single database 12 shown for clarity) using numeric queries.
  • a computer 14 for executing the queries accesses the database 12 over a network 15.
  • the network 15 can be the Internet.
  • the computer 14 can include an input device 16, such as a keyboard or mouse, for inputting data to the computer 14, as well as an output device 18, such as a monitor.
  • the computer 14 can be a personal computer made by International Business Machines Corporation (IBM) of Armonk, N.Y. that can have, by way of non-limiting example, a 933 MHz Pentium® III processor with 512 MB of memory.
  • digital processors such as a laptop computer, mainframe computer, palmtop computer, personal assistant, or any other suitable processing apparatus such as but not limited to a Sun® Hotspot™ server.
  • input devices including keypads, trackballs, and voice recognition devices can be used, as can other output devices, such as printers, other computers or data storage devices, and computer networks.
  • the processor of the computer 14 accesses a query module 20 to undertake certain of the logic of the present invention.
  • the module 20 may be executed by a processor as a series of computer-executable instructions.
  • the instructions may be contained on a data storage device with a computer readable medium, such as a computer diskette having a computer usable medium with code elements stored thereon.
  • the instructions may be stored on random access memory (RAM) of the computer 14, on a DASD array, or on magnetic tape, conventional hard disk drive, electronic read-only memory, optical storage device, or other appropriate data storage device.
  • the computer-executable instructions may be lines of C++ code or JAVA.
  • the flow charts herein illustrate the structure of the logic of the present invention as embodied in computer program software.
  • the flow charts illustrate the structures of computer program code elements including logic circuits on an integrated circuit, that function according to this invention.
  • the invention is practised in its essential embodiment by a machine component that renders the program code elements in a form that instructs a digital processing apparatus (that is, a computer) to perform a sequence of function steps corresponding to those shown.
  • the query-document matching logic of the present invention can be seen. It is to be understood that the present logic works best, but not necessarily exclusively, with data spaces that exhibit low reflectivity, i.e., that, for a point $x_i$ in the data space that is described by one or more numbers, the data space does not contain very many permutations of the numbers. For example, assume computer memory capacity in a data space typically has values of 32, 64, 128, 256, or 512 megabytes, and further assume that disk space, which might appear in the same space, has values of 10, 20, or 30 gigabytes. Under this hypothetical, the only area that might result in confusion if the axes were overlaid on each other or exchanged is that the memory value of "32" might be confused as being close to the disk space value of "30".
  • each query number is matched with a document number "n" such that a distance score generated by a distance function, such as the preferred, non-limiting function set forth below, is minimized.
  • each query number is matched with one and only one document number, and each document number likewise preferably is matched with one and only one query number.
  • some query numbers can be matched with infinity in accordance with the disclosure below, which means that the score for that document will be infinity and consequently that the document is highly unlikely to be returned in the results set. This process is repeated for each document to be tested, and the highest ranking documents are returned at block 30.
  • $F(Q,D) = \left( \sum_{i=1}^{k} w(q_i, n_{j_i})^p \right)^{1/p}$, wherein D is a document that includes m numbers $n_1, \ldots, n_m$, Q is a query that includes k numbers $q_1, \ldots, q_k$ such that there is a set of matching document numbers $n_{j_1}, \ldots, n_{j_k}$ in the j-th document, $w(q_i, n_{j_i})$ is the distance between $q_i$ and $n_{j_i}$, and $F(Q,D)$ consequently is the distance score output by the distance function F with the $L_p$ norm ($1 \le p \le \infty$).
  • the above logic is undertaken by constructing a weighted bipartite graph having k source vertices labelled $q_1, \ldots, q_k$, corresponding to the k number strings in the query, and m target vertices labelled $n_1, \ldots, n_m$, corresponding to the m number strings in the document. If there are more query numbers than document numbers (i.e., $m < k$), $(k - m)$ target vertices are added with values of infinity.
  • an edge is constructed to the k closest target vertices and assigned a weight $w(q_i, n_j)^p$.
  • the edge weights resulting in the lowest distance score are used to define the one-to-one matching between each source vertex (query number) and a corresponding target vertex (document number).
  • heuristic algorithms i.e., dynamic programming, can be used as set forth further below.
  • $F(Q,D) = \left( \sum_{i=1}^{k} \left[ w(q_i, n_{j_i})^p + B \times v(A_i, H_{j_i})^p + B_u \times u(A_i^u, H_{j_i}^u)^p \right] \right)^{1/p}$
  • $H_j$ is the set of attribute names associated with the number $n_j$, $A_i$ is the set of attribute names associated with the query number $q_i$, and $v(A_i, H_j)$ is a function that determines the distance between the set of attribute names associated with the query number and the set of attribute names associated with the document number (preferably, but not necessarily, equal to 0 if $A_i \cap H_j \neq \emptyset$ or if $A_i = \emptyset$, and 1 otherwise)
  • B is a user-defined parameter that balances the importance between the match on the numbers and the match on the attribute names
  • $H_j^u$ is the set of unit names associated with the number $n_j$, $A_i^u$ is the set of unit names associated with the query number $q_i$, and $u(A_i^u, H_j^u)$ is a function that determines the distance between the two sets
  • $B_u$ is a user-defined parameter that balances the importance between the match on the numbers and the match on the unit names.
  • B can be determined as follows. Suppose a web site provides enough summary with each answer that the user is likely to click on relevant answers. By tracking clicks, a set of queries can be obtained for which the true answers (with high probability) are known. This set can be used to "tune" the algorithm above by using different values for B and selecting the one that yields the highest accuracy against the tune set. A tune set per query size might be required. Additionally, in the presence of a "hint" such as attribute names, the weight of edge $(q_i, n_j)$ in the bipartite graph becomes $w(q_i, n_j)^p + B \times v(A_i, H_j)^p$.
  • Figure 3 shows the logic that can be used if desired to limit the number of documents that must be processed.
  • the documents are processed to support database access for each document (wherein, given a document ID, the multi-set of numbers present in the document is returned), and index access (wherein, given a number, the set of documents having that number is returned).
  • for index access, the numbers are kept sorted, and a B-tree can be used if desired if the index becomes excessively large.
  • block 34 indicates that if desired, the documents can be further processed to account for "hints", i.e., for the presence of attribute and/or unit names, wherein a hint access index is created such that, given a number together with, e.g., its attribute name, the set of documents is returned in which the number is present and the number's attribute name (in the document) is included in the set of attribute names associated with the number.
  • a list $L_i$ for the query number $q_i$ is defined to be the documents obtained from index lookup of $n_i^1, n_i^2, \ldots$, sorted in ascending value of score, lowest score first. These lists need not be materially realized.
  • a round-robin access to each of the k sorted lists $L_i$ is executed.
  • a database access is executed for the document and the document is processed in accordance with the logic of Figure 2 to render a document distance score F(Q,D).
  • $n_i'$ is the number last looked at from the index for the query number $q_i$.
  • a threshold value t is defined. Moving to decision diamond 46, it is determined whether t documents have been discovered with scores less than the threshold value t. If so, the round-robin is halted at block 48 and those documents having distance scores less than t are returned, if they haven't been already, as the results set. If the test at decision diamond 46 is negative the logic loops back to block 40 to retrieve the next document in the round-robin for test.
  • the threshold value t can be defined to be $\left( \sum_{i=1}^{k} w(q_i, n_i')^p \right)^{1/p}$.
  • the closest number to each query term $q_i$ must be at least as far from $q_i$ as is $n_i'$, and hence the distance between the document and the query must be at least as high as the threshold value t.
  • the $S_i$ score is the lower bound on the distance between query and document numbers, not necessarily the actual distance.
  • the threshold value t can be defined to be $\left( \sum_{i=1}^{k} \left[ w(q_i, n_i')^p + B \times v(A_i, a_i')^p \right] \right)^{1/p}$, wherein $(n_i', a_i')$ is the entry in the index last looked at for the query term $q_i$.
  • bipartite graph is one method for matching query numbers to document numbers
  • the present invention is not necessarily limited to a single method.
  • a fast heuristic that obtains good matching by exploiting the fact that the user query is likely to contain a small number of terms is given below.
  • the numbers present in a document can be sorted and saved. This preprocessing can be performed for every document in the database.
  • the query numbers are sorted at run-time. For query numbers that match document numbers, the two sorted lists are traversed, making greedy best assignments for each query number.
  • a binary search can be performed within the sorted document numbers to locate the closest match for each query number, and then a scan to either side can be executed if that number has already been matched with an earlier query number. To reduce execution time, the index of the last free value before the current match can be recorded.
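
The fast heuristic described in the last few items (pre-sorted document numbers, query numbers sorted at run-time, a binary search for the closest value, then a scan to either side past values already taken) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `greedy_match` is hypothetical, plain absolute difference is used to pick the nearer neighbour (for a fixed query number this orders candidates the same way as a relative distance would), and the "last free value" bookkeeping mentioned above is replaced by a simple rescan.

```python
from bisect import bisect_left

def greedy_match(query, sorted_doc):
    """Greedily pair each query number with the nearest still-free document number.

    `sorted_doc` is assumed to be sorted ascending (the per-document
    preprocessing step). Returns a list of (query_number, document_number)
    pairs; a query number left without a partner is paired with infinity.
    """
    used = [False] * len(sorted_doc)
    pairs = []
    for q in sorted(query):                      # query numbers sorted at run-time
        pos = bisect_left(sorted_doc, q)         # binary search for the closest value
        left, right = pos - 1, pos
        # Scan to either side past values already matched to earlier query numbers.
        while left >= 0 and used[left]:
            left -= 1
        while right < len(sorted_doc) and used[right]:
            right += 1
        candidates = [i for i in (left, right) if 0 <= i < len(sorted_doc)]
        if not candidates:
            pairs.append((q, float("inf")))      # no free document number remains
            continue
        best = min(candidates, key=lambda i: abs(sorted_doc[i] - q))
        used[best] = True
        pairs.append((q, sorted_doc[best]))
    return pairs
```

For example, `greedy_match([500, 20], [3, 18, 495])` pairs 20 with 18 and 500 with 495. Being greedy, the sketch is not guaranteed to find the assignment with the lowest overall distance score; it trades that guarantee for speed on short queries.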

Abstract

A system and method for using numbers to query a corpus of documents, particularly but not exclusively for data spaces that have low reflectivity, i.e., for a point $x_i$ described by one or more numbers, the data space does not contain very many permutations of the numbers. For each document to be searched, each query number is matched with one and only one document number, preferably using a bipartite graph or heuristic rule, such that a distance function is minimised. The distance function can, but need not, take into account attribute names and unit names. A limiting algorithm can be used to limit the number of documents that must be searched.

Description

METHOD AND SYSTEM FOR SEARCHING DOCUMENTS WITH NUMBERS
Field of the Invention
The present invention relates to searching documents using numbers.
Background of the Invention
A large fraction of data on the Web is numeric, yet current search techniques are actually quite primitive at searching for numbers.
Essentially, if a person desires information on a numeric query, the person must enter exactly the number desired, because all current search engines do is treat numbers as character strings to find exact matches. To illustrate the problem, consider the number 6798.32, which, when input to current search engines, correctly returns pages relating to the lunar nutation cycle, but when input as 6798.320 produces no page at all in response to the numeric query. As another example, consider a person who wants to find a specification sheet on a particular semiconductor device that has a set-up speed of 18 nanoseconds at a power rating of 495 mW. If the exact numeric values are provided, many search engines can find matching character strings and thereby return a relevant page, but if the user can only supply approximate values, e.g., a query of "20 nanoseconds, 500 mW", current search engines are unable to return relevant pages. Unfortunately, it is often the case that a person knows only approximate values for numeric information he or she seeks, and thus is seldom helped by current search engines.
Roussopoulos et al., "Nearest Neighbor Queries", Proc. of the 1995 ACM SIGMOD Int'l Conf. on Management of Data, pp. 71-79 (1995) propose storing attribute-value pairs in a database so that queries can be executed against them and answered using nearest-neighbor techniques. Unfortunately, as recognised herein, such solutions do not adequately address the problem of searching with numbers, because very often different documents refer to the same attribute by different names, making it difficult at best to establish correspondences between attribute names and values. Indeed, the major content companies in the electronics industry employ a host of people to manually extract such parametric information. The present invention recognizes, however, that it is not necessary to establish exact correspondences between attribute names and numbers, or indeed to specify attribute names at all; numeric queries that only approximate the desired values can still produce meaningful query results.
Disclosure of the Invention
In accordance with the present invention there is now provided a method for searching documents, with the method including receiving a query that has at least one query number $q_i$ but that does not necessarily specify an attribute name or unit for the query number. For documents to be searched, each query number $q_i$ is matched with at most one document number $n_m$ and vice-versa, with each document number being a number contained in the document or infinity, such that a distance score is minimised. Based on the distance scores, the method returns at least one document in response to the query.
Preferably the present invention provides for accessing the documents over the Internet.
Preferably the present invention provides for the distance score being derived from a distance function that may require, as input, only numbers. However, if desired the distance function may also account for attribute names and/or units that might be associated with the query numbers. The matching of query numbers to document numbers can be undertaken using a bipartite graph or using heuristic rules or by other means.
Preferably the present invention provides for the method to include limiting the number of documents that are processed using lower bounds on the distance scores. In this implementation, the method can process the documents to support database access and index access and can create, for each query number $q_i$, an ordered list $L_i$ of document numbers n. Round-robin access on the lists $L_i$ can be performed. When a document D is detected in a list, a database access for the document is executed and the document numbers matched as described above. The round-robin access is halted when a number of documents have been detected having distance scores less than a threshold value t.
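A minimal sketch of this limiting scheme is given below, purely as an illustration. It assumes a relative distance |q - n| / |q + eps| between a query number and a document number, scores each newly seen document by brute force over all one-to-one assignments, and takes as threshold the score that the numbers most recently read from each list would produce, which is a lower bound for any document not yet seen. The names `limited_search`, `score`, `want` and `eps` are hypothetical, and the in-memory sorting stands in for a real index.

```python
import math
from itertools import permutations

def w(q, n, eps=1e-9):
    """Assumed relative distance; infinite when n is an infinity placeholder."""
    return math.inf if math.isinf(n) else abs(q - n) / abs(q + eps)

def score(query, numbers, p=1.0):
    """Brute-force distance score; short documents are padded with infinity."""
    nums = list(numbers) + [math.inf] * max(0, len(query) - len(numbers))
    return min(
        sum(w(q, nums[j]) ** p for q, j in zip(query, pick)) ** (1.0 / p)
        for pick in permutations(range(len(nums)), len(query))
    )

def limited_search(query, docs, want, p=1.0):
    """docs maps document id -> list of numbers; returns up to `want` (score, id) pairs."""
    # Index access: one ordered list per query number, nearest numbers first.
    postings = [(n, d) for d, nums in docs.items() for n in nums]
    lists = [sorted(postings, key=lambda e, q=q: w(q, e[0])) for q in query]
    last = [None] * len(query)       # number last read from each list
    scored = {}                      # document id -> distance score
    for depth in range(len(postings)):
        for i in range(len(query)):  # one round of round-robin access
            n, d = lists[i][depth]
            last[i] = n
            if d not in scored:      # database access on first sighting only
                scored[d] = score(query, docs[d], p)
        # Lower bound on the score of every document not yet seen.
        bound = sum(w(q, n) ** p for q, n in zip(query, last)) ** (1.0 / p)
        hits = sorted((s, d) for d, s in scored.items() if s <= bound)
        if len(hits) >= want:
            return hits[:want]       # halt: enough documents at or under the bound
    return sorted((s, d) for d, s in scored.items())[:want]
```

With `docs = {"a": [18, 495], "b": [100, 900], "c": [20, 500], "d": [1, 2]}` and the query `[20, 500]`, asking for one document stops after a single round, because document "c" already scores 0. The sketch compares with "at or under" the bound rather than strictly under it, a small departure from the wording above chosen so that exact matches halt the scan immediately.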
Viewed from another aspect the present invention provides for a computer being programmed with instructions for retrieving numbers from a corpus of documents. The instructions include, in response to a numeric query containing at least one numeric query string that is not necessarily associated with an attribute name or unit name, accessing at least some documents in the corpus. The instructions further include comparing each numeric query string with one or more document strings. Each numeric query string is associated with one and only one document string to optimize at least one distance function, with documents being returned based thereon.
Viewed from another aspect the present invention provides for a computer program device having computer readable code thereon for searching a set of documents. The code includes means for receiving a user query including at least one query number corresponding to respective desired attribute values. The query does not necessarily include attribute names or unit names. Means return documents with values close to the query number, using only the query number.
Viewed from another aspect the present invention provides for a computer for searching a set of unstructured documents (e.g., Web pages, specification sheets) and/or semi-structured documents (e.g., spreadsheets with column labels or text documents with variations in style or formatting) and/or structured documents (e.g., database tables including those that have been converted to XML), which computer includes instructions for receiving a user query consisting of a set of numbers corresponding to desired attribute values, and also consisting of respective attribute names of the numbers. Instructions are provided for returning documents containing document values close to the set of numbers in the query using the query numbers and attribute names.
Viewed from another aspect the present invention provides for a computer-implemented method for searching a set of documents in response to a user query including a set of query numbers corresponding to desired attribute values together with units of the values, which method includes returning documents with values approximating the set of query numbers, using the query numbers and units.
Viewed from another aspect the present invention provides for a system for searching a set of documents that includes means for receiving a query comprising at least one query number, at least one unit of the query number, and at least one attribute name associated with the query number. Means are provided for returning documents with at least one document value approximating the query number, using the query number, the attribute name, and the unit. Means are provided whereby the comparing and associating use at least one bipartite graph.
Brief description of the drawings
The invention will now be described by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a block diagram of the present system architecture;
Figure 2 is a flow chart of the preferred document matching logic; and
Figure 3 is a flow chart of the logic for limiting the set of the documents that are to be matched.
Detailed description of the preferred embodiments of the Invention
Referring initially to Figure 1, a system is shown, generally designated 10, for searching documents, including unstructured and semi-structured documents, in one or more databases 12 (only a single database 12 shown for clarity) using numeric queries. A computer 14 for executing the queries accesses the database 12 over a network 15. The network 15 can be the Internet. The computer 14 can include an input device 16, such as a keyboard or mouse, for inputting data to the computer 14, as well as an output device 18, such as a monitor. The computer 14 can be a personal computer made by International Business Machines Corporation (IBM) of Armonk, N.Y. that can have, by way of non-limiting example, a 933 MHz Pentium® III processor with 512 MB of memory. Other digital processors, however, may be used, such as a laptop computer, mainframe computer, palmtop computer, personal assistant, or any other suitable processing apparatus such as but not limited to a Sun® Hotspot™ server. Likewise, other input devices, including keypads, trackballs, and voice recognition devices can be used, as can other output devices, such as printers, other computers or data storage devices, and computer networks.
In any case, the processor of the computer 14 accesses a query module 20 to undertake certain of the logic of the present invention. The module 20 may be executed by a processor as a series of computer-executable instructions. The instructions may be contained on a data storage device with a computer readable medium, such as a computer diskette having a computer usable medium with code elements stored thereon. Or, the instructions may be stored on random access memory (RAM) of the computer 14, on a DASD array, or on magnetic tape, conventional hard disk drive, electronic read-only memory, optical storage device, or other appropriate data storage device. In an illustrative embodiment of the invention, the computer-executable instructions may be lines of C++ code or JAVA.
Indeed, the flow charts herein illustrate the structure of the logic of the present invention as embodied in computer program software. Those skilled in the art will appreciate that the flow charts illustrate the structures of computer program code elements including logic circuits on an integrated circuit, that function according to this invention. Manifestly, the invention is practised in its essential embodiment by a machine component that renders the program code elements in a form that instructs a digital processing apparatus (that is, a computer) to perform a sequence of function steps corresponding to those shown.
Now referring to Figure 2, the query-document matching logic of the present invention can be seen. It is to be understood that the present logic works best, but not necessarily exclusively, with data spaces that exhibit low reflectivity, i.e., that, for a point $x_i$ in the data space that is described by one or more numbers, the data space does not contain very many permutations of the numbers. For example, assume computer memory capacity in a data space typically has values of 32, 64, 128, 256, or 512 megabytes, and further assume that disk space, which might appear in the same space, has values of 10, 20, or 30 gigabytes. Under this hypothetical, the only area that might result in confusion if the axes were overlaid on each other or exchanged is that the memory value of "32" might be confused as being close to the disk space value of "30".
More formally, for a set D of m-dimensional points $x_i$ (having numeric coordinates $\vec{n}_i$), if $\eta(\vec{n}_i)$ is the number of points within a distance r of $\vec{n}_i$ in m-dimensional space and $\rho(\vec{n}_i)$ is the number of points in D that have at least one reflection within a distance r of $\vec{n}_i$, then the (m, r)-reflectivity of D is $1 - \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(\vec{n}_i)}{\rho(\vec{n}_i)}$. More generally, for a k-dimensional subspace S of D, if $\vec{n}_i^S$ represents the co-ordinates of point $x_i$ projected onto the subspace S, $\eta(S, \vec{n}_i^S)$ is the number of points in D whose projections onto the subspace S are within a distance r of $\vec{n}_i^S$ in k-dimensional space, and $\rho(S, \vec{n}_i^S)$ is the number of points in D that have at least one k-reflection within a distance r of $\vec{n}_i^S$, then the (S, r)-reflectivity is $1 - \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(S, \vec{n}_i^S)}{\rho(S, \vec{n}_i^S)}$, and the reflectivity of D for N k-dimensional subspaces S is $\frac{1}{N} \sum_{S} \mathrm{Reflectivity}(S, r)$. Low reflectivity data spaces are preferred but not essential.
Commencing at block 22, k query numbers $q_1, \ldots, q_k$ are received. By "number" is meant a string of one or more numerals. Moving to block 24, for each document D to be tested (default, without the logic of Figure 3, is to test all documents in the data space), a DO loop is entered. Block 26 indicates that if desired and if they are available, attribute and unit names that are associated with the sought-after numbers can, but need not, be received to modify the distance function set forth further below.
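As an illustration of the full-dimensional reflectivity measure defined above, the following sketch counts, for every point, how many points lie within distance r and how many points have at least one reflection within distance r, then averages the ratio. It rests on assumptions that the text does not state explicitly: a reflection is taken to be any permutation of a point's coordinates (the identity included), "within a distance r" is read as Euclidean distance at most r, and the function name `reflectivity` is hypothetical.

```python
import math
from itertools import permutations

def reflectivity(points, r):
    """Return 1 - (1/|D|) * sum over points of near_count / reflect_count."""
    def near(a, b):
        return math.dist(a, b) <= r

    total = 0.0
    for x in points:
        # Points within distance r of x (x itself is always counted).
        near_count = sum(1 for y in points if near(x, y))
        # Points with at least one coordinate permutation within distance r of x.
        reflect_count = sum(
            1 for y in points
            if any(near(x, perm) for perm in permutations(y))
        )
        total += near_count / reflect_count
    return 1.0 - total / len(points)
```

Under these assumptions, the two-point set `[(32, 10), (64, 20)]` has reflectivity 0 for r = 1, since no permuted point lands near another, while `[(1, 2), (2, 1)]` has reflectivity 0.5 for r = 0.5, since each point is a reflection of the other. The permutation loop grows factorially with the dimension, so the sketch is only practical for points with a handful of coordinates.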
At block 28, each query number is matched with a document number "n" such that a distance score generated by a distance function, such as the preferred, non-limiting function set forth below, is minimized. Preferably, each query number is matched with one and only one document number, and each document number likewise preferably is matched with one and only one query number. In the case where the number of query numbers is greater than the number of numeric strings in the document, some query numbers can be matched with infinity in accordance with the disclosure below, which means that the score for that document will be infinity and consequently that the document is highly unlikely to be returned in the results set. This process is repeated for each document to be tested, and the highest ranking documents are returned at block 30.
One non-limiting distance function F is given as follows:
$F(Q,D) = \left(\sum_{i=1}^{k} w(q_i, n_{j_i})^p\right)^{1/p}$, wherein $D$ is a document that includes $m$ numbers $n_1, \ldots, n_m$, $Q$ is a query that includes $k$ numbers $q_1, \ldots, q_k$ such that there is a set of matching document numbers $n_{j_1}, \ldots, n_{j_k}$ in the document, $w(q_i, n_{j_i})$ is the distance between $q_i$ and $n_{j_i}$, and $F(Q,D)$ consequently is the distance score output by the distance function $F$ with the $L_p$ norm ($1 \le p \le \infty$).
In the preferred, non-limiting embodiment the above logic is undertaken by constructing a weighted bipartite graph having $k$ source vertices labelled $q_1, \ldots, q_k$, corresponding to the $k$ number strings in the query, and $m$ target vertices labelled $n_1, \ldots, n_m$, corresponding to the $m$ number strings in the document. If there are more query numbers than document numbers (i.e., $m < k$), $(k - m)$ target vertices are added with values of infinity. Then, from each source vertex an edge is constructed to the $k$ closest target vertices and assigned a weight $w(q_i, n_j)^p$, wherein $w(q_i, n_j)$ is assumed to be $|q_i - n_j| / |q_i + \epsilon|$, wherein $\epsilon$ is a user-defined bound. The edge weights resulting in the lowest distance score are used to define the one-to-one matching between each source vertex (query number) and a corresponding target vertex (document number). If desired, however, heuristic algorithms, e.g., dynamic programming, can be used as set forth further below.
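As a non-limiting illustration, the matching and scoring just described might be sketched in Python as follows. The names `w` and `best_matching`, the default value of `eps`, and the brute-force search over assignments are all assumptions made for this sketch; the disclosure itself describes a weighted bipartite graph, for which a proper minimum-cost matching solver would normally be used.

```python
import math
from itertools import permutations

def w(q, n, eps=1e-9):
    """Relative distance |q - n| / |q + eps|; infinite for a padded target."""
    if math.isinf(n):
        return math.inf
    return abs(q - n) / abs(q + eps)

def best_matching(query, doc_numbers, p=1.0):
    """Return (score, matched numbers) minimizing the L_p distance score.

    Brute force over one-to-one assignments, so suitable for small k only.
    """
    targets = list(doc_numbers)
    # Pad with infinity when the document has fewer numbers than the query.
    targets += [math.inf] * max(0, len(query) - len(targets))
    best_score, best_assignment = math.inf, None
    for assignment in permutations(targets, len(query)):
        total = sum(w(q, n) ** p for q, n in zip(query, assignment))
        score = total ** (1.0 / p)
        if best_assignment is None or score < best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment
```

For example, `best_matching([256, 700], [750, 250, 1024])` pairs 256 with 250 and 700 with 750, while a query with more numbers than the document yields an infinite score because at least one query number is matched with infinity.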
When attribute names and, in some cases, unit names as well are available, the above distance function becomes:
$F(Q,D) = \left(\sum_{i=1}^{k} \left[ w(q_i, n_{j_i})^p + B \times v(A_i, H_{j_i})^p + B_u \times u(A_i^u, H_{j_i}^u)^p \right]\right)^{1/p}$
wherein $H_j$ is the set of attribute names associated with the number $n_j$, $A_i$ is the set of attribute names associated with the query number $q_i$, $v(A_i, H_j)$ is a function that determines the distance between the set of attribute names associated with the query number and the set of attribute names associated with the document number (preferably, but not necessarily, equal to 0 if $A_i \cap H_j \neq \emptyset$ or if $A_i = \emptyset$, and 1 otherwise), $B$ is a user-defined parameter that balances the importance between the match on the numbers and the match on the attribute names, $H_j^u$ is the set of unit names associated with the number $n_j$, $A_i^u$ is the set of unit names associated with the query number $q_i$, $u(A_i^u, H_j^u)$ is a function that determines the distance between the two sets, and $B_u$ is a user-defined parameter that balances the importance between the match on the numbers and the match on the unit names.
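A minimal sketch of how the attribute and unit terms could enter the score is given below. The function names are hypothetical, and because the text does not specify the unit-name distance function, this sketch assumes it behaves exactly like the attribute-name function (0 on overlap or when the query supplies no names, 1 otherwise).

```python
def v(query_attrs, doc_attrs):
    """0 if the name sets overlap or the query gives no names, else 1."""
    if not query_attrs or (set(query_attrs) & set(doc_attrs)):
        return 0.0
    return 1.0

def hinted_edge_weight(q, n, q_attrs, n_attrs, q_units, n_units,
                       B=1.0, Bu=1.0, p=1.0, eps=1e-9):
    """One summand of F(Q, D): number term plus attribute and unit penalties."""
    number_term = (abs(q - n) / abs(q + eps)) ** p
    attr_term = B * v(q_attrs, n_attrs) ** p
    unit_term = Bu * v(q_units, n_units) ** p  # assumption: u behaves like v
    return number_term + attr_term + unit_term

def hinted_score(matched, B=1.0, Bu=1.0, p=1.0):
    """matched: list of (q, n, q_attrs, n_attrs, q_units, n_units) tuples."""
    total = sum(hinted_edge_weight(*m, B=B, Bu=Bu, p=p) for m in matched)
    return total ** (1.0 / p)
```

With these assumed definitions, an exact numeric match whose attribute names disagree is penalised by `B` alone, so larger values of `B` make the hint count for more relative to numeric closeness.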
"B" can be determined as follows. Suppose a web site provides enough summary with each answer that the user is likely to click on relevant answers. By tracking clicks, a set of queries can be obtained for which the true answers (with high probability) are known. This set can be used to "tune" the algorithm above by using different values for $B$ and selecting the one that yields the highest accuracy against the tune set. A tune set per query size might be required. Additionally, in the presence of a "hint" such as attribute names, the weight of edge $(q_i, n_j)$ in the bipartite graph becomes $w(q_i, n_j)^p + B \times v(A_i, H_j)^p$.
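The tuning procedure can be pictured as a simple search over candidate values. In the sketch below, `tune_B`, the `rank` callable, and the use of top-1 accuracy as the accuracy measure are assumptions introduced only for illustration; the text requires only that the value giving the highest accuracy against the tune set be selected.

```python
def tune_B(candidates, tune_set, rank):
    """Pick the candidate B with the highest top-1 accuracy on the tune set.

    tune_set: list of (query, true_document_id) pairs, e.g. from click logs.
    rank: hypothetical callable rank(query, B) -> document ids, best first.
    """
    def accuracy(B):
        hits = 0
        for query, truth in tune_set:
            ranked = rank(query, B)
            if ranked and ranked[0] == truth:
                hits += 1
        return hits / len(tune_set)

    # max() keeps the first candidate among ties.
    return max(candidates, key=accuracy)
```

Since a tune set per query size might be required, such a function could be called once per query size with the corresponding subset of tracked queries.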
Figure 3 shows the logic that can be used if desired to limit the number of documents that must be processed. Commencing at block 32, the documents are processed to support database access for each document (wherein, given a document ID, the multi-set of numbers present in the document is returned) and index access (wherein, given a number, the set of documents having that number is returned). For index access, the numbers are kept sorted, and a B-tree can be used if desired if the index becomes excessively large.
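The two access paths can be illustrated with a small in-memory structure. The class and method names below are hypothetical, and a sorted Python list searched with `bisect` stands in for the sorted index (or B-tree) mentioned above.

```python
import bisect
from collections import defaultdict

class NumberIndex:
    """In-memory stand-in for the database access and index access paths."""

    def __init__(self, documents):
        # documents: dict mapping document ID -> list of numbers (a multi-set).
        self._docs = {doc_id: list(nums) for doc_id, nums in documents.items()}
        postings = defaultdict(set)
        for doc_id, nums in self._docs.items():
            for n in nums:
                postings[n].add(doc_id)
        self._postings = dict(postings)
        self.sorted_numbers = sorted(self._postings)  # kept sorted for lookup

    def database_access(self, doc_id):
        """Given a document ID, return the multi-set of numbers it contains."""
        return list(self._docs[doc_id])

    def index_access(self, number):
        """Given a number, return the set of documents containing it."""
        return set(self._postings.get(number, ()))

    def position(self, q):
        """Index in sorted_numbers where q would be inserted (binary search)."""
        return bisect.bisect_left(self.sorted_numbers, q)
```

Keeping the distinct numbers sorted is what later allows the numbers nearest a query number to be enumerated without scanning every document.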
Moreover, block 34 indicates that if desired, the documents can be further processed to account for "hints", i.e., for the presence of attribute and/or unit names, wherein a hint access index is created such that, given a number together with, e.g., its attribute name, the set of documents is returned in which the number is present and the number's attribute name (in the document) is included in the set of attribute names associated with the number.
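A hint access index of this kind might be sketched as a mapping keyed by (number, attribute name) pairs. The function names are hypothetical, and filing numbers that carry no attribute name under an empty string is an assumption of this sketch, made so that such numbers remain reachable.

```python
from collections import defaultdict

def build_hint_index(documents):
    """documents: dict doc_id -> list of (number, set_of_attribute_names)."""
    hint_index = defaultdict(set)
    for doc_id, entries in documents.items():
        for number, attrs in entries:
            for name in attrs:
                hint_index[(number, name)].add(doc_id)
            # Assumption: numbers without attribute names are filed under
            # the empty name.
            if not attrs:
                hint_index[(number, "")].add(doc_id)
    return hint_index

def hint_access(hint_index, number, attribute_name):
    """Return the documents in which number appears with attribute_name."""
    return set(hint_index.get((number, attribute_name), ()))
```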
Moving to block 36, for each query number $q_i$ an ordered list is created of document numbers $n_i^1, n_i^2, \ldots$, arranged in order of increasing distance from $q_i$. At block 38, a lower bound $s_i^j$ on the distance between $q_i$ and $n_i^j$ is associated with every document returned by index access on $n_i^j$. A list $L_i$ for the query number $q_i$ is defined to be the documents obtained from index lookup of $n_i^1, n_i^2, \ldots$, sorted in ascending value of score, lowest score first. These lists need not be materially realized.
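Because the lists need not be materially realized, they can be produced lazily. The generator below is a sketch under the assumption that the distinct indexed numbers are available as a sorted Python list; the name `numbers_by_distance` is hypothetical.

```python
import bisect

def numbers_by_distance(sorted_numbers, q):
    """Yield (number, |q - number|) from nearest to farthest from q.

    For a fixed q the relative distance |q - n| / |q + eps| grows with
    |q - n|, so this order is also ascending in the distance to q.
    """
    right = bisect.bisect_left(sorted_numbers, q)
    left = right - 1
    while left >= 0 or right < len(sorted_numbers):
        take_left = right >= len(sorted_numbers) or (
            left >= 0
            and q - sorted_numbers[left] <= sorted_numbers[right] - q)
        if take_left:
            yield sorted_numbers[left], q - sorted_numbers[left]
            left -= 1
        else:
            yield sorted_numbers[right], sorted_numbers[right] - q
            right += 1
```

Each yielded number can then be passed to index access to obtain the documents that belong at that position of the list for the query number.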
In the case where "hints" are present, the operation at block 38 is modified as follows. Assume that at least one attribute name $a_i$ is available for a document number $n_i$. For each query term $q_i$, an ordered list $\langle n_i^1, a_i^1 \rangle, \langle n_i^2, a_i^2 \rangle, \ldots$ is created and a score $s_i^j$ is associated with the entry $\langle n_i^j, a_i^j \rangle$ such that $s_i^j \le s_i^{j+1}$, wherein $s_i^j = w(q_i, n_i^j)^p + B \times v(A_i, a_i^j)^p$. This can be done efficiently by using hint access for each attribute name in the set of "hints" (in this case, the attribute names $A_i$ associated with the query term $q_i$), and also for the empty attribute name $\emptyset$.
Moving to block 40, a round-robin access to each of the $k$ sorted lists $L_i$ is executed. Moving to block 42, for each iteration in the round-robin, as a document $D$ is seen in a list, a database access is executed for the document and the document is processed in accordance with the logic of Figure 2 to render a document distance score $F(Q,D)$.
Next proceeding to block 44, assume $n_i'$ is the number last looked at from the index for the query number $q_i$. A threshold value $\tau$ is defined. Moving to decision diamond 46, it is determined whether $t$ documents have been discovered with scores less than the threshold value $\tau$. If so, the round-robin is halted at block 48 and those documents having distance scores less than $\tau$ are returned, if they have not been already, as the results set. If the test at decision diamond 46 is negative, the logic loops back to block 40 to retrieve the next document in the round-robin for test.
In one non-limiting embodiment, the threshold value $\tau$ can be defined to be $\left(\sum_{i=1}^{k} w(q_i, n_i')^p\right)^{1/p}$. At this point, for any document that has not been seen in the index, the closest number to each query term $q_i$ must be at least as far from $q_i$ as is $n_i'$, and hence the distance between the document and the query must be at least as high as the threshold value $\tau$. Note that the $s_i^j$ score is the lower bound on the distance between query and document numbers, not necessarily the actual distance.
When "hints", e.g., attribute names, are present, the threshold value $\tau$ can be defined to be $\left(\sum_{i=1}^{k} \left[ w(q_i, n_i')^p + B \times v(A_i, a_i')^p \right]\right)^{1/p}$, wherein $\langle n_i', a_i' \rangle$ is the entry in the index last looked at for the query term $q_i$.
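The round-robin access with early halting can be sketched end to end for the case without hints. Everything named below (`top_documents`, `nearest_first`, `score`, the default `eps`) is hypothetical, the per-document matching is done by brute force rather than by a bipartite graph, and the threshold test is applied once per full round; this is an illustration under those assumptions, not the patented implementation.

```python
import bisect
import math
from itertools import permutations

def w(q, n, eps=1e-9):
    return math.inf if math.isinf(n) else abs(q - n) / abs(q + eps)

def score(query, numbers, p=1.0):
    """Distance score F(Q, D) by brute-force one-to-one matching."""
    targets = list(numbers) + [math.inf] * max(0, len(query) - len(numbers))
    return min(sum(w(q, n) ** p for q, n in zip(query, perm)) ** (1.0 / p)
               for perm in permutations(targets, len(query)))

def nearest_first(sorted_numbers, q):
    """Yield indexed numbers in order of increasing |q - n|."""
    right = bisect.bisect_left(sorted_numbers, q)
    left = right - 1
    while left >= 0 or right < len(sorted_numbers):
        if right >= len(sorted_numbers) or (
                left >= 0
                and q - sorted_numbers[left] <= sorted_numbers[right] - q):
            yield sorted_numbers[left]
            left -= 1
        else:
            yield sorted_numbers[right]
            right += 1

def top_documents(query, documents, t, p=1.0):
    """Return up to t (score, doc_id) pairs, best first, stopping early."""
    postings = {}
    for doc_id, nums in documents.items():
        for n in nums:
            postings.setdefault(n, set()).add(doc_id)
    sorted_numbers = sorted(postings)
    lists = [nearest_first(sorted_numbers, q) for q in query]
    last_w = [0.0] * len(query)   # distance to the number last looked at
    exhausted = [False] * len(query)
    seen = {}                     # doc_id -> F(Q, D)
    while not all(exhausted):
        for i, q in enumerate(query):        # round-robin over the lists
            n = next(lists[i], None)
            if n is None:
                exhausted[i] = True
                continue
            last_w[i] = w(q, n)
            for doc_id in postings[n]:       # index access
                if doc_id not in seen:       # database access and scoring
                    seen[doc_id] = score(query, documents[doc_id], p)
        threshold = sum(x ** p for x in last_w) ** (1.0 / p)
        if sum(1 for s in seen.values() if s < threshold) >= t:
            break
    return sorted((s, d) for d, s in seen.items())[:t]
```

The early exit is safe for the reason given above: any document not yet seen contains, for every query number, only numbers at least as far away as the one last looked at, so its score cannot fall below the threshold.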
As mentioned above, while the use of a bipartite graph is one method for matching query numbers to document numbers, the present invention is not necessarily limited to a single method. For example, a fast heuristic that obtains good matching by exploiting the fact that the user query is likely to contain a small number of terms is given below.
First, in an off-line pre-processing step, the numbers present in a document can be sorted and saved. This preprocessing can be performed for every document in the database. Upon receipt of one or more query numbers, the query numbers are sorted at run-time. For query numbers that match document numbers, the two sorted lists are traversed, making greedy best assignments for each query number.
If the size of the document space is large, a binary search can be performed within the sorted document numbers to locate the closest match for each query number, and then a scan to either side can be executed if that number has already been matched with an earlier query number. To reduce execution time, the index of the last free value before the current match can be recorded.
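One way to picture the fast heuristic is the greedy routine below. The function name is hypothetical, the document numbers are assumed to have been sorted in the off-line step, and the optimisation of recording the index of the last free value is omitted for brevity, so the outward scan here is a plain linear walk.

```python
import bisect
import math

def greedy_match(query, sorted_doc_numbers):
    """Greedily assign each query number its nearest unused document number.

    Returns a list of (query number, document number) pairs; a query number
    with no free document number left is paired with infinity.
    """
    used = [False] * len(sorted_doc_numbers)
    pairs = []
    for q in sorted(query):                      # query sorted at run-time
        pos = bisect.bisect_left(sorted_doc_numbers, q)  # binary search
        left, right = pos - 1, pos
        # Scan to either side past numbers taken by earlier query numbers.
        while left >= 0 and used[left]:
            left -= 1
        while right < len(sorted_doc_numbers) and used[right]:
            right += 1
        candidates = [i for i in (left, right)
                      if 0 <= i < len(sorted_doc_numbers)]
        if not candidates:
            pairs.append((q, math.inf))
            continue
        best = min(candidates, key=lambda i: abs(q - sorted_doc_numbers[i]))
        used[best] = True
        pairs.append((q, sorted_doc_numbers[best]))
    return pairs
```

Being greedy, this routine can return a matching that is worse than the optimal one found by bipartite matching; it trades that accuracy for speed when the query contains only a few numbers.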
While the particular METHOD AND SYSTEM FOR SEARCHING DOCUMENTS WITH NUMBERS as herein shown and described in detail is fully capable of attaining the above-described objects of the invention, it is to be understood that it is the presently preferred embodiment of the present invention and is thus representative of the subject matter which is broadly contemplated by the present invention, that the scope of the present invention fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of the present invention is accordingly to be limited by nothing other than the appended claims, in which reference to an element in the singular is not intended to mean "one and only one" unless explicitly so stated, but rather "one or more" .
All structural and functional equivalents to the elements of the above-described preferred embodiment that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the present claims. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the present invention, for it to be encompassed by the present claims.

Claims

1. A method for searching documents comprising:
receiving a query having at least one query number $q_i$ but not necessarily specifying an attribute name or unit for the query number;
for at least one document, matching each number $q_i$ with a document number $n_m$, each document number being a number contained in the document, such that a distance score is minimised; and
based on the distance scores, returning at least one document in response to the query.
2. A method as claimed in claim 1, wherein the distance score is derived from a distance function.
3. A method as claimed in claim 2, wherein the distance function requires, as input, only numbers.
4. A method as claimed in claim 2, wherein the distance function accounts for at least one attribute name associated with at least one query number.
5. A method as claimed in claim 2, wherein the distance function accounts for at least one unit name associated with at least one number.
6. A method as claimed in claim 4, wherein the distance function accounts for at least one unit name associated with at least one number.
7. A method as claimed in claim 2, wherein each document number matches at most one query number, and the distance score of a first document is infinity if more query numbers exist than document numbers in the first document.
8. A method as claimed in claim 1, wherein the documents are selected from a set of documents, and the method further comprises limiting the number of documents in the set that are processed in the matching act using at least one lower bound on at least one distance score.
9. A method as claimed in claim 8, comprising processing the documents in the set of documents to support database access and index access.
10. A method as claimed in claim 9, wherein the index access supports hint access.
11. A method as claimed in claim 9, comprising creating, for each query number $q_i$, an ordered list $L_i$ of document numbers n, each list being ordered in ascending order of distance scores.
12. A method as claimed in claim 11, comprising performing round-robin access on the lists $L_i$ and, when a document D is detected in a list, performing a database access for the document and undertaking the matching and returning acts.
13. A method as claimed in claim 12, comprising halting the act of performing a round-robin access when a number of documents have been detected having distance scores less than a threshold value t.
14. A system for searching a set of documents, comprising:
means for receiving a query comprising at least one query number, at least one unit of the query number, and at least one attribute name associated with the query number;
means for returning documents with at least one document value approximating the query number, using the query number, the attribute name, and the unit.
15. A system as claimed in claim 14, wherein the means for returning includes means for returning a ranked list of documents, each document having a value associated with each query number, such that a distance function between the query numbers and corresponding values is minimised.
16. A system as claimed in claim 15, wherein the means for returning uses bipartite graph matching.
17. A system as claimed in claim 15, wherein the means for returning uses at least one heuristic rule.
18. A system as claimed in claim 15, wherein the means for returning includes means for building a sorted index over all values in all documents to limit a set of matches between the document values and the query numbers.
19. A computer program product directly loadable into the internal memory of a digital computer, comprising software code portions for performing the steps of any one of claim 1 to claim 13 when said product is run on a computer.
PCT/GB2003/001482 2002-04-26 2003-04-09 Method and system for searching documents with numbers WO2003091828A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003222598A AU2003222598A1 (en) 2002-04-26 2003-04-09 Method and system for searching documents with numbers

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/134,406 US7010520B2 (en) 2002-04-26 2002-04-26 Method and system for searching documents with numbers
US10/134,406 2002-04-26

Publications (2)

Publication Number Publication Date
WO2003091828A2 true WO2003091828A2 (en) 2003-11-06
WO2003091828A3 WO2003091828A3 (en) 2004-03-04

Family

ID=29249221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2003/001482 WO2003091828A2 (en) 2002-04-26 2003-04-09 Method and system for searching documents with numbers

Country Status (3)

Country Link
US (1) US7010520B2 (en)
AU (1) AU2003222598A1 (en)
WO (1) WO2003091828A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5963940A (en) * 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5757983A (en) 1990-08-09 1998-05-26 Hitachi, Ltd. Document retrieval method and system
US5265065A (en) 1991-10-08 1993-11-23 West Publishing Company Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query
US5915250A (en) * 1996-03-29 1999-06-22 Virage, Inc. Threshold-based comparison
US6038560A (en) 1997-05-21 2000-03-14 Oracle Corporation Concept knowledge base search and retrieval system
US6098066A (en) 1997-06-13 2000-08-01 Sun Microsystems, Inc. Method and apparatus for searching for documents stored within a document directory hierarchy
US6105023A (en) 1997-08-18 2000-08-15 Dataware Technologies, Inc. System and method for filtering a document stream
US6175829B1 (en) 1998-04-22 2001-01-16 Nec Usa, Inc. Method and apparatus for facilitating query reformulation
US6662180B1 (en) 1999-05-12 2003-12-09 Matsushita Electric Industrial Co., Ltd. Method for searching in large databases of automatically recognized text
US6285994B1 (en) 1999-05-25 2001-09-04 International Business Machines Corporation Method and system for efficiently searching an encoded vector index
US6381594B1 (en) 1999-07-12 2002-04-30 Yahoo! Inc. System and method for personalized information filtering and alert generation
US20010032200A1 (en) * 2000-02-25 2001-10-18 Greyvenstein Lourence Cornelius Johannes Method and apparatus for providing continuously updated information about an item
US6529903B2 (en) * 2000-07-06 2003-03-04 Google, Inc. Methods and apparatus for using a modified index to provide search results in response to an ambiguous search query

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AGRAWAL R ET AL: "Searching with Numbers" WWW2002, [Online] 7 - 11 May 2002, XP002265092 Honolulu,Hawaii,USA Retrieved from the Internet: <URL:www2002.org/presentations/srikant.pdf > [retrieved on 2003-12-15] *
NAVARRO G: "Searching in metric spaces by spatial approximation" STRING PROCESSING AND INFORMATION RETRIEVAL SYMPOSIUM, 1999 AND INTERNATIONAL WORKSHOP ON GROUPWARE CANCUN, MEXICO 22-24 SEPT. 1999, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 22 September 1999 (1999-09-22), pages 141-148, XP010353487 ISBN: 0-7695-0268-7 *

Also Published As

Publication number Publication date
WO2003091828A3 (en) 2004-03-04
AU2003222598A1 (en) 2003-11-10
US7010520B2 (en) 2006-03-07
US20030204494A1 (en) 2003-10-30
AU2003222598A8 (en) 2003-11-10

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP