WO2005103950A1 - A method and apparatus for indexing documents - Google Patents

A method and apparatus for indexing documents

Info

Publication number
WO2005103950A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
sub
documents
accordance
title
Prior art date
Application number
PCT/AU2005/000553
Other languages
French (fr)
Inventor
Ken Nguyen
Victor Vickland
Original Assignee
Newsouth Innovations Pty Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2004902103A
Application filed by Newsouth Innovations Pty Limited
Publication of WO2005103950A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing

Abstract

A method of indexing documents in a computer system is disclosed for facilitating assessment of relevance of documents in a search, comprising the steps of identifying subsections of a main document as sub-documents and indexing the sub-documents as documents in their own right, whereby a search engine assessing a relevancy score in a search will assess a separate relevancy score for each sub-document.

Description

A METHOD AND APPARATUS FOR INDEXING DOCUMENTS
Field of the Invention
The present invention relates to a method and apparatus for indexing documents to facilitate searching in a computing system environment.
Background of the Invention
In computer searching, a search engine will use a query submitted by a user to identify a number of documents matching the query. In the majority of cases, for example, the query will include "key words". The search engine identifies documents which contain those key words. The search engine will also usually sort the documents by relevance before presenting the search results to the user. This aids user review of the search results. The relevance of the document is determined by giving the document a "score", usually based on the ratio of the frequency of the key words in the query to the total number of words in the document. Also, the score is increased if the title of the document contains any of the key words.
A problem with this approach is that long documents, in particular, suffer from miscalculation, as their ratio of key word frequency to the number of words in the document is low. This is the case even where the large document may have a small section that is extremely relevant to the search. The document will be given a low relevancy score, and the user may therefore not appreciate the pertinency of the document. There is a need for an approach which may result in a more adequate judgement of relevancy of documents located in computer searching.
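The scoring approach described above can be illustrated with a minimal Python sketch. It is not taken from any particular search engine: the function name `relevancy_score`, the `title_bonus` value and the sample texts are assumptions chosen only to show how a relevant section is diluted inside a long document.

```python
def relevancy_score(keywords, title, body, title_bonus=0.5):
    """Illustrative score: key word frequency divided by document length,
    increased by a fixed bonus (an assumed value) if the title contains a key word."""
    strip_chars = '.,;:"()'
    words = [w.strip(strip_chars) for w in body.lower().split()]
    if not words:
        return 0.0
    keys = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in keys)
    score = hits / len(words)
    title_words = {w.strip(strip_chars) for w in title.lower().split()}
    if keys & title_words:
        score += title_bonus
    return score


# A short, highly relevant section (7 words, 2 key word hits).
section = "asthma treatment in children uses inhaled medication"
# The same section buried in a long document of 999 words.
long_doc = section + " " + " ".join(["unrelated"] * 992)

short_score = relevancy_score(["asthma", "treatment"], "Section", section)
long_score = relevancy_score(["asthma", "treatment"], "Book", long_doc)
```

Here `short_score` is 2/7 while `long_score` is 2/999, even though both texts contain the same relevant passage.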
Summary of the Invention
In a first aspect, the present invention provides a method of indexing documents in a computer system, for facilitating assessment of relevance of documents in a search, comprising the steps of identifying subsections of a main document as sub-documents and indexing the sub-documents as documents in their own right, whereby a search engine assessing a relevancy score in a search will assess a separate relevancy score for each sub-document.
This has the advantage that where the main document may have a subsection which is relevant to a search query but the rest of the main document may not be as relevant to the search query, nevertheless using an approach for scoring relevancy on the basis of a key word to total number of words in the document ratio, the subsection will score relatively highly and therefore a user carrying out a search will appreciate its relevance.
Preferably, the method comprises the further step of allocating a title to each sub-document, the title including words from the title of the main document. This has the advantage that where a search engine allocates a higher score to titles than to text of the document, then the score may be increased because the title may contain appropriate key words. Further, users carrying out searches look at the titles of documents to determine whether a document is relevant or not.
Preferably, if the sub-document has a title, the title of the sub-document and the main document title are combined.
Preferably, the method includes the step of indexing the sub-documents so that in a search, identification of a relevant sub-document will be returned as a search result. In a search, for example, a list of titles of relevant sub-documents may be returned as a search result.
Preferably, the method includes the step of providing a link in the sub-document to the main document so that the user retrieving the sub-document in a search can access the main document.
Preferably, if a user requests a sub-document, the link returns the main document, with the appropriate sub-section that constitutes the sub-document being highlighted or otherwise identified. An advantage of returning the parent document is that the user can then view the sub-document in the context of the parent document.
Preferably, the step of indexing the sub-documents is such that when the search returns the results, it displays the individual relevancy scores of each sub-document identified as a search result.
Preferably, the method also includes the step of determining the context of a sub-document and adding wording indicative of the context to the title of the sub-document. This can again boost the relevancy score of a search engine and also provide the user reviewing the titles returned to him with more knowledge of that document.
In accordance with a second aspect, the present invention provides an apparatus for indexing documents in a computer system to facilitate assessment of relevance of the documents to a search, the apparatus comprising indexing means for identifying subsections of a main document as sub-documents, and indexing the sub-documents as separate documents, so that a search engine assessing a relevancy score will treat each of the sub-documents separately.
Preferably, the indexing means is arranged to allocate a title to each sub-document, the title of each sub-document including wording from the title of the main document.
Preferably, the indexing means is arranged to index each sub-document so that in a search, identification of a relevant sub-document will be returned as a search result. For example, in a search, a series of titles of relevant sub-documents may be returned as the identification.
Preferably, the indexing means is arranged to provide a link in the sub-document to the main document so that the user retrieving the sub-document in a search can access the main document.
Preferably, the link is such that when the user selects the sub-document for display the main document is retrieved and the subsection of the main document which constitutes the sub-document is highlighted or otherwise identified.
Preferably, the indexing means is arranged to determine the context of a sub-document and add wording indicative of the context to the title of the sub-document.
In accordance with a third aspect, the present invention provides a computer program arranged to control a computing system to implement an apparatus in accordance with the second aspect of the present invention.
In accordance with a fourth aspect, the present invention provides a computer readable medium providing a computer program in accordance with the third aspect.
In accordance with a fifth aspect, the present invention provides a computer stored document indexed in accordance with the method of the first aspect of the present invention.
Brief Description of the Drawings
Features and advantages of the present invention will become apparent from the following description of an embodiment thereof, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a computing system which may be used to implement an embodiment of the present invention;
Figure 2 is a flow diagram illustrating steps in a process in accordance with an embodiment of the present invention; and
Figure 3 is a diagram illustrating an example pathway through a document indexed in accordance with an embodiment of the present invention.
Detailed Description of Preferred Embodiment
Figure 1 is a schematic block diagram of an example computing system which may be utilised for implementation of a method and system in accordance with an embodiment of the present invention. The illustrated computing system comprises a computer 1 which includes a processor 2 and memory 3. The processor 2 is arranged to process program instructions and data in a known manner. Memory 3 is arranged to store program instructions and data also in a known manner. Processor 2 may constitute one or more processing means, such as integrated circuit processors. The memory 3 may comprise any known memory architecture and may include hard disk, IC memory (ROM, PROM, RAM, etc.), floppy disks and other types of additional memory such as CD ROM, and any other type of memory. A BUS 4 is provided for communication between the processor 2 and memory 3 and also communication with external components. In this case the external components include a user interface 5. The user interface 5 includes a visual display unit (VDU) 6 for displaying information to a user. The VDU 6 may display information in graphical format or any other format depending upon the program instructions being processed by processor 2. The user interface 5 also includes user input means 7 which in this example include a keyboard 8 (which in this example may be a standard QWERTY keyboard) and a mouse 9. The mouse 9 may be used to manipulate a graphical user interface (GUI) if a GUI is provided by software running on the computer. A network connection 10 is also provided for connecting to a network which may include a communication network and other computers/computing systems.
The computing system of Figure 1 may be implemented by any known type of computing hardware such as, for example, a PC, by a number of networked PCs if required to implement a system of this embodiment, by a "mainframe architecture" including a remote computer and user workstations connected to the remote computer, by a client-server architecture, including a client computer accessing a server computer over a network, or by any other computing architecture. This embodiment of the present invention is implemented by appropriate software providing instructions for operation of the computing system hardware to implement the apparatus of the embodiment and implement the method of the embodiment. The computing system need not be connected to a network if this is not required by the software or computer architecture.
The apparatus of the present invention includes an indexing means, in this example being in the form of indexing software, for indexing computer stored documentation for subsequent searching purposes. Indexing software is well known. The indexing means of the present invention may be implemented using known indexing software, with modifications to implement the functionality described in the following description.
Figure 2 is a schematic flow diagram illustrating a "high level" view of an indexing process in accordance with an embodiment of the present invention. In the indexing process of this embodiment, firstly subsections of the main document are identified (step 20). These subsections are then treated separately as sub-documents. The search engine will then treat the sub-documents merely as the search engine would generally treat documents and will assess the sub-documents and provide relevancy scores as if the sub-documents were a standard document in any search system. Standard search engines will do this without requiring any modifications.
At step 21, titles of the sub-documents are created (or amended if the sub-section already had a title) to include wording from the title of the main document. This increases the relevancy score of the sub-document where the search engine counts words in titles as being of greater relevance than words in the document. It also provides the user viewing the title returned by the search engine with more information to determine the relevance of the document.
In step 22, a "context" of the sub-document is determined. For example, the sub-document may fall within a particular subject matter domain. Wording indicative of the context is then added to the title. This provides a user with yet further information to determine the relevancy of the document. It may also increase the relevancy score, depending upon the key words being used by the user in the search.
In step 23, the sub-document is linked to the parent document, so that when a user selects a title of a sub-document which has been returned by a search engine, the main document is retrieved with the sub-document being highlighted or otherwise identified. A user can therefore read the sub-document in the context of the main document.
At step 24, the sub-documents are indexed. Note that the steps 20, 21, 22, 23 and 24 need not be performed in the order shown in the flow diagram of Figure 2. They may be performed in any convenient order of the indexing process.
The following is an example of application of this embodiment of the present invention: A document on asthma, with the title "Asthma", may have separate sections on the disease that are relevant to adults and children, titled "Adult" and "Children", respectively. Within these sections, there could be subsections on the diagnosis and treatment of asthma, titled "Diagnosis" and "Treatment", respectively.
So in accordance with this embodiment, a total of 6 sub-documents would be created, with the titles: "Asthma: Adult"; "Asthma: Children"; "Asthma: Adult: Diagnosis"; "Asthma: Adult: Treatment"; "Asthma: Children: Diagnosis"; and "Asthma: Children: Treatment". These documents are then indexed, along with the original, parent document, "Asthma". A search on "asthma and children" will return high scores for all of the sub-documents, as they all contain the key word "asthma" in the title (as well as the body), but higher scores for the sub-documents "Asthma: Children", "Asthma: Children: Diagnosis" and "Asthma: Children: Treatment". A search on "asthma and treatment" would return high scores for "Asthma: Adult: Treatment" and "Asthma: Children: Treatment".
An extra step may be taken to boost the relevancy score. If the document appears in a certain context or specific domain, then this context can be "added" to the title just before indexing. Continuing the example above, if the Asthma document was to be indexed as part of a medical information site, then the context "Respiratory Disease" may be added to the titles of the sub-documents before they are indexed. So a search on "respiratory disease and diagnosis" would return "Asthma: Adult: Diagnosis" and "Asthma: Children: Diagnosis", even though the documents do not contain the words "respiratory disease".
When the search engine returns the results of the index query, it is not a good idea to present the sub-documents as they are. This is because the sub-document should be viewed within the context of the parent document. For example, it is not a good idea to just look at the "Asthma: Children: Treatment" sub-document. You want to see how it fits into the "Asthma: Children" sub-document, and, in fact, the whole "Asthma" document. So instead of returning the sub-document, the whole original document is returned, and the user is then zoomed in or focused on the sub-section. This allows the user to see the whole document, but be transported to the relevant section. This can be done easily in HTML by using the anchor tags (<a>). The HTML source of the above Asthma document could have this line to mark the "Asthma: Adult" subsection:
<h3><a name="Adult">Adult</a></h3>
A link that targets this anchor will load the Asthma document, but transport us to the Adult section. This link would then be used in the index, rather than a link to the sub-document.
As discussed above, the calculation of the relevancy score by a search engine is usually based on the ratio of the frequency of the key words in a query to the total number of words in the document.
Breaking up a main document in accordance with this embodiment of the present invention into sub-documents, and indexing the sub-documents as if they are separate documents, means that in any relevancy calculation for a sub-document the ratio of key word frequency to the number of words is likely to be increased, since the sub-documents are not as word bulky as the main document. Further, the sub-documents are allocated titles which are now exposed to the search engine. Whereas words in any title of subsections would previously have been treated just like any other word in the calculation of the relevancy score, now they will be in a "title" and allocated appropriate relevancy (usually an increased relevancy over the words within the sub-document).
Figure 3 illustrates a pathway through a document, in this case a medical text book (Harrisons-On-Line) indexed in accordance with an embodiment of the present invention. The various hierarchical levels (part, section, chapter, etc.) in the text book are represented by levels A, B, C, D & E. Subsections 60 through 75 have been indexed as sub-documents according to an embodiment of the invention. Sub-document 74 "Acne Vulgaris" is a subsection of "Acne" in "Chapter 56". Sub-document 74 has its own original title "Acne Vulgaris" but acquires a new extended title during the indexing process which includes titles of the sections above it. Chapter 56 (at Level C) is a chapter of "Section 9" of the document. A title of Section 9 (at Level B) will also be included in the extended title of sub-document 74. The extended title of sub-document 74 "Acne Vulgaris" will contain the following: Level A (Cardinal manifestations and presentation of disease) + Level B (Alteration in the skin) + Level C (Eczema, Psoriasis, Cutaneous infections, Acne and other common skin disorders) + Level D (Acne) + Level E (Acne Vulgaris).
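The construction of the extended title from the levels above a sub-document can be sketched as follows. This is only an illustration: the function name `extended_title` is hypothetical, and the ": " separator is an assumption borrowed from the "Asthma: Adult: Diagnosis" style of the earlier example, since the description does not state how the level titles are joined.

```python
def extended_title(path_titles, separator=": "):
    """Join the titles along the pathway from the top level down to the
    sub-document into one extended title (separator is an assumption)."""
    cleaned = [t.strip() for t in path_titles]
    return separator.join(t for t in cleaned if t)


# Pathway for sub-document 74, using the level titles listed above.
acne_path = [
    "Cardinal manifestations and presentation of disease",  # Level A
    "Alteration in the skin",                                # Level B
    "Eczema, Psoriasis, Cutaneous infections, Acne and other common skin disorders",  # Level C
    "Acne",                                                  # Level D
    "Acne Vulgaris",                                         # Level E
]

acne_title = extended_title(acne_path)
```

Because every level contributes its wording, a title built this way exposes key words such as "skin" and "Acne" to a search engine that weights titles more heavily than body text.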
As discussed above, because sub-document 74 "Acne Vulgaris" is a small unit indexed separately on its own, the relevance of this section for a key word search including the word "Acne" would be increased. This is due to the same number of detected key words occurring within a much smaller document. The extended title also becomes a pathway which is implemented by way of links and enables a viewer to view the sub-document 74 "Acne Vulgaris" within the context of the larger document (the Harrisons text book). It will be appreciated that the term "document" does cover large documents such as text books. The text book may already be divided into a number of searchable sections and each one of these searchable sections, for the purpose of the present invention, may be considered to be the "main document" from which sub-documents are identified and indexed in accordance with an embodiment of the present invention. Modifications and variations as would be apparent to a skilled addressee are deemed to be within the scope of the present invention.

Claims

1. A method of indexing documents in a computer system, for facilitating assessment of relevance of documents in a search, comprising the steps of identifying subsections of a main document as sub-documents and indexing the sub-documents as documents in their own right, whereby a search engine assessing a relevancy score in a search will assess a separate relevancy score for each sub-document.
2. A method in accordance with claim 1, comprising the further step of allocating a title to each sub-document, the title including the word or words from the title of the main document.
3. A method in accordance with claim 1 or claim 2, wherein if the sub-section has a title, at least a portion of the title of the sub-section and at least a portion of the title of the main document are combined in the step of allocating a title to the sub-document.
4. A method in accordance with claim 1, 2 or 3 including the step of indexing the sub-document so that in a search, identification of the relevant sub-document will be returned as a search result.
5. A method in accordance with claim 4, including the further step of providing a link in the sub-document to the main document, so that a user retrieving a sub-document can access the main document.
6. A method in accordance with claim 5, wherein the step of providing the link includes providing the link so that when the user retrieves a sub-document, the main document is retrieved.
7. A method in accordance with claim 6, wherein the step of providing a link is such that when the main document is retrieved the sub-document is identified in the main document.
8. A method in accordance with any one of the preceding claims, including the further step of determining the context of the sub-document and adding wording indicative of the context to the title of the sub-document.
9. An apparatus for indexing documents in a computer system to facilitate assessment of relevance of the documents to a search, the apparatus comprising indexing means for identifying subsections of the main document as sub-documents, and indexing the sub-documents, so that a search engine assessing a relevancy score will treat each of the sub-documents separately.
10. An apparatus in accordance with claim 9, wherein the indexing means is arranged to allocate a title to each sub-document, the title of each sub-document including wording from the title of the main document.
11. An apparatus in accordance with claim 9 or claim 10, wherein the indexing means is arranged to index each sub-document so that in a search, identification of a relevant sub-document will be returned as a search result.
12. An apparatus in accordance with claim 11, wherein the indexing means is arranged to provide a link in each sub-document to the main document, so that a user retrieving the sub-document in a search can access the main document.
13. An apparatus in accordance with claim 12, wherein the indexing means is arranged to index each sub-document so that when a user retrieves the sub-document the main document is returned and the sub-document is identified within the main document.
14. An apparatus in accordance with any one of claims 10 to 12, wherein the indexing means is arranged to determine the context of a sub-document and add wording indicative of the context to the title of the sub-document.
15. A computer program arranged to control a computing system to implement an apparatus in accordance with any one of claims 10 to 14.
16. A computer readable medium providing a computer program in accordance with claim 15.
17. A computer stored document indexed in accordance with the method of any one of claims 1 to 9.
Dated this 19th day of April 2005 Unisearch Limited
By their Patent Attorneys GRIFFITH HACK
PCT/AU2005/000553 2004-04-20 2005-04-19 A method and apparatus for indexing documents WO2005103950A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2004902103A AU2004902103A0 (en) 2004-04-20 A method and apparatus for indexing documents
AU2004902103 2004-04-20

Publications (1)

Publication Number Publication Date
WO2005103950A1 (en) 2005-11-03

Family

ID=35197179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2005/000553 WO2005103950A1 (en) 2004-04-20 2005-04-19 A method and apparatus for indexing documents

Country Status (1)

Country Link
WO (1) WO2005103950A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1145254A (en) * 1997-07-25 1999-02-16 Just Syst Corp Document retrieval device and computer readable recording medium recorded with program for functioning computer as the device
US5999925A (en) * 1997-07-25 1999-12-07 Claritech Corporation Information retrieval based on use of sub-documents
US6256622B1 (en) * 1998-04-21 2001-07-03 Apple Computer, Inc. Logical division of files into multiple articles for search and retrieval
US6631373B1 (en) * 1999-03-02 2003-10-07 Canon Kabushiki Kaisha Segmented document indexing and search

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KASZIEL M. ET AL: "Efficient Passage Ranking for Document Databases", ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 17, no. 4, October 1999 (1999-10-01), pages 406 - 439 *
MOFFAT A. ET AL: "Retrieval of Partial Documents", PROC. SECOND TEXT RETRIEVAL CONFERENCE, 1993, pages 181 - 190 *
PROC. 25TH EUROPEAN CONFERENCE ON INFORMATION RETRIEVAL RESEARCH, April 2003 (2003-04-01) *
SALTON G. ET AL: "Approaches to Passage Retrieval in Full Text Information Systems", PROC. 16TH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, June 1993 (1993-06-01), pages 49 - 58, XP000603219 *
XIAO-LING Q. ET AL: "A New Method to Query Document Database by Content and Structure", JOURNAL OF SOFTWARE, vol. 13, no. 4, 2002, Retrieved from the Internet <URL:http://research.microsoft.com/asia/dload_files/group/mediasearching/2002p/wang_JS.pdf> *
YU S. ET AL: "Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation", PROC. 12TH INTERNATIONAL WORLD WIDE WEB CONFERENCE, 2003, pages 11 - 18, XP058239890, DOI: doi:10.1145/775152.775155 *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase