A METHOD AND APPARATUS FOR INDEXING DOCUMENTS
Field of the Invention
The present invention relates to a method and apparatus for indexing documents to facilitate searching in a computing system environment.
Background of the Invention
In computer searching, a search engine uses a query submitted by a user to identify a number of documents matching the query. In the majority of cases, the query will include "key words". The search engine identifies documents which contain those key words. The search engine will also usually sort the documents by relevance before presenting the search results to the user. This aids user review of the search results. The relevance of a document is determined by giving the document a "score", usually based on the ratio of the frequency of the key words in the query to the total number of words in the document. The score is also increased if the title of the document contains any of the key words.
A problem with this approach is that long documents, in particular, suffer from miscalculation, as they have a low ratio of key word frequency to the number of words in the document. This is the case even where a large document has a small section that is extremely relevant to the search. The document will be given a low relevancy score, and the user may therefore not appreciate the pertinency of the document.
There is a need for an approach which may result in a more adequate judgement of the relevancy of documents located in computer searching.
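The long-document problem described above can be illustrated with a minimal sketch. The scoring function below is an assumption made for illustration only: it takes the "score" to be the plain ratio of key word occurrences to total words, ignores any title weighting, and uses hypothetical sample text. Real search engines may score differently.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def relevancy_score(text, key_words):
    """Assumed score: key-word occurrences divided by total word count."""
    words = tokenize(text)
    if not words:
        return 0.0
    wanted = {k.lower() for k in key_words}
    hits = sum(1 for w in words if w in wanted)
    return hits / len(words)


# Hypothetical sample text: one highly relevant section inside a long document.
relevant_section = "Asthma treatment in children relies on inhaled treatment."
filler = "Unrelated general material appears here. " * 50
long_document = filler + relevant_section

query = ["asthma", "treatment"]
section_score = relevancy_score(relevant_section, query)
document_score = relevancy_score(long_document, query)

print(f"section alone:  {section_score:.3f}")
print(f"whole document: {document_score:.3f}")
```

Under these assumptions the same three key word hits are divided by 8 words for the section alone but by 258 words for the whole document, so the long document scores far lower even though it contains the relevant section in full.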
Summary of the Invention
In a first aspect, the present invention provides a method of indexing documents in a computer system, for facilitating assessment of relevance of documents in a search, comprising the steps of identifying subsections of a main document as sub-documents and indexing the sub-documents as documents in their own right, whereby a search engine assessing a relevancy score in a search will assess a separate relevancy score for each sub-document.
This has the advantage that, where the main document has a subsection which is relevant to a search query but the rest of the main document is not as relevant, the subsection will nevertheless score relatively highly under an approach that scores relevancy on the basis of the ratio of key words to the total number of words in the document, and a user carrying out a search will therefore appreciate its relevance.
Preferably, the method comprises the further step of allocating a title to each sub-document, the title including words from the title of the main document. This has the advantage that, where a search engine allocates a higher score to titles than to the text of the document, the score may be increased because the title may contain appropriate key words. Further, users carrying out searches look at the titles of documents to determine whether a document is relevant or not.
Preferably, if the sub-document has a title, the title of the sub-document and the main document title are combined.
Preferably, the method includes the step of indexing the sub-documents so that in a search, identification of a relevant sub-document will be returned as a search result. In a search, for example, a list of titles of relevant sub-documents may be returned as a search result.
Preferably, the method includes the step of providing a link in the sub-document to the main document so that the user retrieving the sub-document in a search can access the main document. Preferably, if a user requests a sub-document, the link returns the main document, with the appropriate subsection that constitutes the sub-document being highlighted or otherwise identified. An advantage of returning the main document is that the user can then view the sub-document in the context of the main document.
Preferably, the step of indexing the sub-documents is such that when the search returns the results, it displays the individual relevancy score of each sub-document identified as a search result.
Preferably, the method also includes the step of determining the context of a sub-document and adding wording indicative of the context to the title of the sub-document. This can again boost the relevancy score assigned by a search engine and also provide the user reviewing the returned titles with more knowledge of that document.
In accordance with a second aspect, the present invention provides an apparatus for indexing documents in a computer system to facilitate assessment of relevance of the documents to a search, the apparatus comprising indexing means for identifying subsections of a main document as sub-documents, and indexing the sub-documents as separate documents, so that a search engine assessing a relevancy score will treat each of the sub-documents separately.
Preferably, the indexing means is arranged to allocate a title to each sub-document, the title of each sub-document including wording from the title of the main document.
Preferably, the indexing means is arranged to index each sub-document so that in a search, identification of a relevant sub-document will be returned as a search result. For example, in a search, a series of titles of relevant sub-documents may be returned as the identification.
Preferably, the indexing means is arranged to provide a link in the sub-document to the main document so that the user retrieving the sub-document in a search can access the main document. Preferably, the link is such that when the user selects the sub-document for display the main document is retrieved and the subsection of the main document which constitutes the sub-document is highlighted or otherwise identified.
Preferably, the indexing means is arranged to determine the context of a sub-document and add wording indicative of the context to the title of the sub-document.
In accordance with a third aspect, the present invention provides a computer program arranged to control a computing system to implement an apparatus in accordance with the second aspect of the present invention.
In accordance with a fourth aspect, the present invention provides a computer readable medium providing a computer program in accordance with the third aspect.
In accordance with a fifth aspect, the present invention provides a computer stored document indexed in accordance with the method of the first aspect of the present invention.
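The method of the first aspect can be sketched roughly as follows, using the "Asthma" document from the detailed description as input. This is an illustrative sketch only, not the claimed implementation: the `Section` and `SubDocument` classes, the ": " title separator, the `asthma.html` file name and the anchor naming scheme are all assumptions, and each sub-document here carries only its own text rather than the text of its nested subsections.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """A (hypothetical) node of the main document's section hierarchy."""
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)


@dataclass
class SubDocument:
    """A subsection prepared for indexing as a document in its own right."""
    title: str  # sub-document title combined with the main document title
    text: str   # body text seen by the search engine
    link: str   # link back into the main document (assumed URL scheme)


def make_sub_documents(main, url):
    """Walk the hierarchy and emit one SubDocument per subsection."""
    subs = []

    def walk(section, titles):
        for child in section.children:
            path = titles + [child.title]
            anchor = "-".join(path[1:])  # assumed anchor naming, e.g. "Adult-Diagnosis"
            subs.append(SubDocument(": ".join(path), child.text, f"{url}#{anchor}"))
            walk(child, path)

    walk(main, [main.title])
    return subs


def audience(name):
    """Build an audience section with Diagnosis and Treatment subsections."""
    return Section(name, children=[Section("Diagnosis"), Section("Treatment")])


asthma = Section("Asthma", children=[audience("Adult"), audience("Children")])
sub_documents = make_sub_documents(asthma, "asthma.html")

for sub in sub_documents:
    print(sub.title, "->", sub.link)
```

Each record can then be handed to an unmodified search engine as an ordinary document, while the stored link points at the main document rather than at a standalone copy of the subsection.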
Brief Description of the Drawings
Features and advantages of the present invention will become apparent from the following description of an embodiment thereof, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a computing system which may be used to implement an embodiment of the present invention;
Figure 2 is a flow diagram illustrating steps in a process in accordance with an embodiment of the present invention; and
Figure 3 is a diagram illustrating an example pathway through a document indexed in accordance with an embodiment of the present invention.
Detailed Description of Preferred Embodiment
Figure 1 is a schematic block diagram of an example computing system which may be utilised for implementation of a method and system in accordance with an embodiment of the present invention.
The illustrated computing system comprises a computer 1 which includes a processor 2 and memory 3. The processor 2 is arranged to process program instructions and data in a known manner. Memory 3 is arranged to store program instructions and data, also in a known manner. Processor 2 may constitute one or more processing means, such as integrated circuit processors. The memory 3 may comprise any known memory architecture and may include hard disk, IC memory (ROM, PROM, RAM, etc.), floppy disks and other types of additional memory such as CD ROM, and any other type of memory.
A BUS 4 is provided for communication between the processor 2 and memory 3 and also for communication with external components. In this case the external components include a user interface 5. The user interface 5 includes a visual display unit (VDU) 6 for displaying information to a user. The VDU 6 may display information in graphical format or any other format depending upon the program instructions being processed by processor 2.
The user interface 5 also includes user input means 7 which in this example include a keyboard 8 (which in this example may be a standard QWERTY keyboard) and a mouse 9. The mouse 9 may be used to manipulate a graphical user interface (GUI) if a GUI is provided by software running on the computer. A network connection 10 is also provided for connecting to a network which may include a communication network and other computers/computing systems.
The computing system of Figure 1 may be implemented by any known type of computing hardware such as, for example, a PC, by a number of networked PCs if required to implement a system of this embodiment, by a "mainframe architecture" including a remote computer and user workstations connected to the remote computer, by a client-server architecture, including a client computer accessing a server computer over a network, or by any
other computing architecture.
This embodiment of the present invention is implemented by appropriate software providing instructions for operation of the computing system hardware to implement the apparatus of the embodiment and implement the method of the embodiment. The computing system need not be connected to a network if this is not required by the software or computer architecture.
The apparatus of the present invention includes an indexing means, in this example being in the form of indexing software, for indexing computer stored documentation for subsequent searching purposes. Indexing software is well known. The indexing means of the present invention may be implemented using known indexing software, with modifications to implement the functionality described in the following description.
Figure 2 is a schematic flow diagram illustrating a "high level" view of an indexing process in accordance with an embodiment of the present invention. In the indexing process of this embodiment, firstly subsections of the main document are identified (step 20). These subsections are then treated separately as sub-documents. The search engine will then treat the sub-documents just as it would generally treat documents, and will assess the sub-documents and provide relevancy scores as if each sub-document were a standard document in any search system. Standard search engines will do this without requiring any modifications.
At step 21, titles of the sub-documents are created (or amended if the subsection already had a title) to include wording from the title of the main document. This increases the relevancy score of the sub-document where the search engine counts words in titles as being of
greater relevance than words in the document. It also provides the user viewing the title returned by the search engine with more information to determine the relevance of the document.
In step 22, a "context" of the sub-document is determined. For example, the sub-document may fall within a particular subject matter domain. Wording indicative of the context is then added to the title. This provides a user with yet further information to determine the relevancy of the document. It may also increase the relevancy score, depending upon the key words being used by the user in the search.
In step 23, the sub-document is linked to the main document, so that when a user selects a title of a sub-document which has been returned by a search engine, the main document is retrieved with the sub-document being highlighted or otherwise identified. A user can therefore read the sub-document in the context of the main document.
At step 24, the sub-documents are indexed.
Note that the steps 20, 21, 22, 23 and 24 need not be performed in the order shown in the flow diagram of Figure 2. They may be performed in any convenient order in the indexing process.
The following is an example of application of this embodiment of the present invention:
A document on asthma, with the title "Asthma", may have separate sections on the disease that are relevant to adults and children, titled "Adult" and "Children", respectively. Each of these sections could have subsections on the diagnosis and treatment of asthma, titled "Diagnosis" and "Treatment", respectively. So in accordance with this embodiment, a total of six sub-documents would be created, with the titles: "Asthma: Adult"; "Asthma: Children"; "Asthma: Adult: Diagnosis"; "Asthma: Adult: Treatment"; "Asthma: Children: Diagnosis"; and "Asthma: Children: Treatment". These documents are then indexed, along with the original main document, "Asthma".
A search on "asthma and children" will return high scores for all of the sub-documents, as they all contain the key word "asthma" in the title (as well as the body), but higher scores for the sub-documents "Asthma: Children", "Asthma: Children: Diagnosis", "Asthma:
Children: Treatment". A search on "asthma and treatment" would return high scores for "Asthma: Adult: Treatment" and "Asthma: Children: Treatment".
An extra step may be taken to boost the relevancy score. If the document appears in a certain context or specific domain, then this context can be "added" to the title just before indexing. Continuing the example above, if the Asthma document was to be indexed as part of a medical information site, then the context "Respiratory Disease" may be added to the titles of the sub-documents before they are indexed. So a search on "respiratory disease and diagnosis" would return "Asthma: Adult: Diagnosis" and "Asthma: Children: Diagnosis", even though the documents do not contain the words "respiratory disease".
When the search engine returns the results of the index query, it is not a good idea to present the sub-documents as they are. This is because the sub-document should be viewed within the context of the main document. For example, it is not a good idea to look only at the "Asthma: Children: Treatment" sub-document. The user will want to see how it fits into the "Asthma: Children" sub-document and, in fact, the whole "Asthma" document. So
instead of returning the sub-document, the whole original document is returned, and the user is then zoomed in or focused on the subsection. This allows the user to see the whole document, but be transported to the relevant section. This can be done easily in HTML by using the anchor tags (<a>). The HTML source of the above Asthma document could have this line to mark the "Asthma: Adult" subsection:
<h3><a name="Adult">Adult</a></h3>
A link to this anchor will load the Asthma document, but transport the user to the Adult section. This link would then be used in the index, rather than a link to the sub-document.
As discussed above, the calculation of the relevancy score by a search engine is usually based on the ratio of the frequency of the key words in a query to the total number of words in the document. Breaking up a main document into sub-documents in accordance with this embodiment of the present invention, and indexing the sub-documents as if they are separate documents, means that in any relevancy calculation for a sub-document the ratio of key word frequency to the number of words is likely to be increased, since the sub-documents are not as word bulky as the main document. Further, the sub-documents are allocated titles which are now exposed to the search engine. Whereas words in any title of a subsection would previously have been treated just like any other word in the calculation of the relevancy score, they will now be in a "title" and allocated appropriate relevancy (usually an increased relevancy over the words within the sub-document).
Figure 3 illustrates a pathway through a document, in this case a medical text book (Harrisons-On-Line) indexed
in accordance with an embodiment of the present invention. The various hierarchical levels (part, section, chapter, etc.) in the text book are represented by levels A, B, C, D and E. Subsections 60 through 75 have been indexed as sub-documents according to an embodiment of the invention. Sub-document 74 "Acne Vulgaris" is a subsection of "Acne" in "Chapter 56". Sub-document 74 has its own original title "Acne Vulgaris" but acquires a new extended title during the indexing process which includes titles of the sections above it. Chapter 56 (at Level C) is a chapter of "Section 9" of the document. A title of Section 9 (at Level B) will also be included in the extended title of sub-document 74.
The extended title of sub-document 74 "Acne Vulgaris" will contain the following: Level A (Cardinal manifestations and presentation of disease) + Level B (Alteration in the skin) + Level C (Eczema, Psoriasis, Cutaneous infections, Acne and other common skin disorders) + Level D (Acne) + Level E (Acne Vulgaris).
As discussed above, because sub-document 74 "Acne Vulgaris" is a small unit indexed separately on its own, the relevance of this section for a key word search including the word "Acne" would be increased. This is due to the same number of detected key words occurring within a much smaller document.
The extended title also becomes a pathway which is implemented by way of links and enables a viewer to view the sub-document 74 "Acne Vulgaris" within the context of the large document (Harrisons text book).
It will be appreciated that the term "document" does cover large documents such as text books. The text book may already be divided into a number of searchable
sections and each one of these searchable sections, for the purpose of the present invention, may be considered to be the "main document" from which sub-documents are identified and indexed in accordance with an embodiment of the present invention. Modifications and variations as would be apparent to a skilled addressee are deemed to be within the scope of the present invention.
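The extended title and pathway described for sub-document 74 in Figure 3 can be sketched by walking from a sub-document up through its parent levels. This is a minimal illustration under stated assumptions: the `Node` class, the ": " separator, the `harrisons.html` file name and the `level-a` to `level-e` anchor names are hypothetical and do not come from the original text; only the level titles are taken from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One hierarchical level of the document (hypothetical structure)."""
    title: str
    anchor: str                      # assumed anchor name inside the main document
    parent: Optional["Node"] = None  # next level up, or None at the top


def pathway(node):
    """Return the chain of levels from the top of the document down to node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()
    return chain


def extended_title(node, sep=": "):
    """Join the titles of every level above the sub-document with its own."""
    return sep.join(n.title for n in pathway(node))


def pathway_links(node, url):
    """Pair each level's title with a link into the main document."""
    return [(n.title, f"{url}#{n.anchor}") for n in pathway(node)]


level_a = Node("Cardinal manifestations and presentation of disease", "level-a")
level_b = Node("Alteration in the skin", "level-b", level_a)
level_c = Node(
    "Eczema, Psoriasis, Cutaneous infections, Acne and other common skin disorders",
    "level-c",
    level_b,
)
level_d = Node("Acne", "level-d", level_c)
acne_vulgaris = Node("Acne Vulgaris", "level-e", level_d)

print(extended_title(acne_vulgaris))
for title, link in pathway_links(acne_vulgaris, "harrisons.html"):
    print(f"{title} -> {link}")
```

The same chain serves both purposes described above: its titles form the extended title exposed to the search engine, and its links let a viewer step from the sub-document out to each enclosing level of the larger document.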