WO2007011140A1 - Method of extracting topics and issues and method and apparatus for providing search results based on topics and issues - Google Patents

Publication number
WO2007011140A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate phrases
phrases
documents
extracting
secondary candidate
Application number
PCT/KR2006/002787
Inventor
Eun-Young Lee
Mi-Na Han
Eui-Vin Park
Sung-Jin Lee
Hoon-Seok Son
Joong-Ho Shin
Original Assignee
Chutnoon Inc.
Application filed by Chutnoon Inc.

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution

Definitions

  • the similar candidate phrase eliminating unit 850 uses vectors consisting of document IDs of documents belonging to secondary candidate phrases with weight values greater than a predetermined value to calculate similarities between the secondary candidate phrases.
  • the similar candidate phrase eliminating unit 850 eliminates secondary candidate phrases with lower weight values among secondary candidate phrases with similarities greater than a predetermined value, and sets the remaining secondary candidate phrases as topics.
  • the topic output unit 860 sets the topics as titles and outputs the topics and documents corresponding to the topics.
  • the above-mentioned methods of extracting topics and issues may be written with computer programs. Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art.
  • the programs are stored in computer readable media, read and executed by computers, thereby implementing the methods of extracting topics and issues. Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.
  • the present invention can be efficiently applied to industrial fields related to a method and apparatus for extracting topics from search results and providing the search results based on the topics, and a method and apparatus for selecting and providing frequently appearing search results as issues.

Abstract

Disclosed is a method of displaying search results with respect to a search word, including: (a) referring to words contained in titles or content of search results matching with the search word to calculate similarities between the search results according to a predetermined similarity calculation method, and extracting representative phrases among combinations of words repeatedly contained in similar search results; and (b) displaying the representative phrases and the search results that belong to each of the representative phrases.

Description

METHOD OF EXTRACTING TOPICS AND ISSUES AND METHOD AND APPARATUS FOR PROVIDING SEARCH RESULTS BASED ON TOPICS AND ISSUES
Technical Field
[1] The present invention relates to an information search technology and, more particularly, to a method and apparatus for extracting topics from search results and providing the search results based on the topics, and a method and apparatus for selecting and providing frequently appearing search results as issues.
Background Art
[2] A conventional search system groups search results into groups based on their types, sequentially provides the search results based on similarities with search words, or places search results that are most similar to the search words at the top of search pages.
[3] However, the conventional search system has a problem in that too many redundant search results appear and most of the search results are useless, since users tend to view only the few search results appearing at the top of the search pages.
Disclosure of Invention
Technical Solution
[4] The present invention provides a method and apparatus for searching for information based on topics by extracting phrases constituting search results to select topics and outputting the search results topic-by-topic so that users can obtain desired information more easily.
[5] The present invention further provides a method of searching for information based on issues by outputting Internet search results as issues in order of appearance frequencies of the search results.
Advantageous Effects
[6] According to the present invention, users can use search results more efficiently since the users can easily grasp the search results and are not provided with repeated search results.
Brief Description of the Drawings
[7] The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
[8] Fig. 1 is a view for explaining a method of providing search results based on topics according to an embodiment of the present invention;
[9] Fig. 2 is a flow chart of a method of extracting topics according to an embodiment of the present invention;
[10] Figs. 3 to 9 are views for explaining a method of extracting topics according to an embodiment of the present invention;
[11] Fig. 10 is a flow chart of a method of extracting issues according to an embodiment of the present invention;
[12] Figs. 11 to 13 are views for explaining a method of extracting issues according to an embodiment of the present invention;
[13] Fig. 14 is an issue output result;
[14] Fig. 15 is another issue output result; and
[15] Fig. 16 is a block diagram of an information search apparatus according to an embodiment of the present invention.
Best Mode for Carrying Out the Invention
[16] According to an aspect of the present invention, there is provided a method of displaying search results with respect to a search word, including: (a) referring to words contained in titles or content of search results matching with the search word to calculate similarities between the search results according to a predetermined similarity calculation method, and extracting representative phrases among combinations of words repeatedly contained in similar search results; and (b) displaying the representative phrases and the search results that belong to each of the representative phrases.
[17] According to another aspect of the present invention, there is provided a method of extracting topics, including: (a) assigning document IDs to documents with respect to a search word based on appearance orders of the documents, and extracting documents with document IDs less than a predetermined value; (b) extracting words contained in titles or content of the extracted documents and appearance frequencies of the words; (c) extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents; (d) generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases; (e) calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases; and (f) eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
[18] According to another aspect of the present invention, there is provided a method of extracting issues, including: (a) extracting the same or similar data the number of which is greater than a predetermined threshold value among stored data; and (b) extracting as issue data a plurality of high-ranking data among the extracted data and displaying the issue data in order of writing time of the issue data or in order of a number of similar documents.
[19] According to another aspect of the present invention, there is provided an apparatus for providing search services based on extracted topics, including: a searching unit searching for stored documents; a primary candidate phrase extracting unit sequentially assigning document IDs to searched documents based on appearance orders of the searched documents, and extracting documents with document IDs less than a predetermined value; a secondary candidate phrase extracting unit extracting words contained in titles or content of the extracted documents and appearance frequencies of the words, extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents, generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases; and a similar candidate phrase eliminating unit calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases, eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
Mode for the Invention
[20] Exemplary embodiments in accordance with the present invention will now be described in detail with reference to the accompanying drawings.
[21] Fig. 1 is a view for explaining a method of providing search results based on topics according to an embodiment of the present invention.
[22] Referring to Fig. 1, search results are grouped into groups having similar phrases, and topics are extracted from the groups. For instance, it is assumed that 'A=Neowiz', 'B=Separated_search_service', 'C=Pmang', 'D=Special_force', 'E=Founding_anniversary', and 'F=Party'. When various search results including 'A', 'B', 'C', 'D', 'E', and 'F' are output, 'ABE', 'ABF', and 'ABD' may be grouped into a group, 'CDE', 'CDF', and 'CDG' may be grouped into a group, and 'AEFG', 'AEFH', and 'AEFI' may be grouped into a group. In this case, 'AB' becomes a topic 100, 'CD' becomes a topic, and 'AEF' becomes a topic. The term 'topic' means an expression indicating a subject of search results.
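The grouping in Fig. 1 can be illustrated with a minimal Python sketch. It assumes that each letter stands for one word and that a search result belongs to a topic when it contains every word of that topic; the function name `group_by_topic` and this membership rule are illustrative assumptions, not taken from the specification.

```python
from collections import defaultdict


def group_by_topic(results, topics):
    """Map each topic to the search results that contain all of its words."""
    groups = defaultdict(list)
    for result in results:
        for topic in topics:
            # Each character stands for one word (A = Neowiz, and so on).
            if set(topic) <= set(result):
                groups[topic].append(result)
    return dict(groups)


results = ["ABE", "ABF", "ABD", "CDE", "CDF", "CDG", "AEFG", "AEFH", "AEFI"]
topics = ["AB", "CD", "AEF"]
groups = group_by_topic(results, topics)
for topic, members in groups.items():
    print(topic, members)
```

With the letters of paragraph [22], each of the three topics collects exactly the three results listed for its group.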
[23] Fig. 2 is a flow chart of a method of extracting topics according to an embodiment of the present invention.
[24] Referring to Fig. 2, when a search word is input, similarities between search results are calculated according to a similarity calculation method by referring to words that are included in titles or content in the search results and match with the search word. Further, representative phrases are extracted among a combination of duplicate words in similar search results, and search results are displayed according to the extracted representative phrases.
[25] In more detail, document IDs are sequentially assigned to documents matching with a search word based on appearance orders of the documents, and documents with document IDs less than a predetermined value are extracted (operation S210). The predetermined value may vary based on the number of search results, i.e., documents, or the like. Data composed of 'words' which are included in titles or content of the documents, and 'appearance frequencies of the words' are stored (operation S220). Next, primary candidate phrases composed of words of appearance frequencies greater than a predetermined value in the titles or content of the documents are extracted (operation S230). The predetermined value may vary according to the number of primary candidate phrases to be extracted.
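Operations S210 to S230 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: titles are split on whitespace, a phrase needs at least two consecutive words, and the names `extract_primary_candidates`, `id_limit`, and `freq_threshold` are hypothetical. The sample titles are invented for illustration and are not the titles of Fig. 3.

```python
from collections import Counter


def extract_primary_candidates(titles, id_limit, freq_threshold):
    """Return word frequencies and primary candidate phrases with their document IDs."""
    # S210: document IDs follow appearance order (1-based); keep IDs below id_limit.
    docs = {doc_id: title.split()
            for doc_id, title in enumerate(titles, start=1)
            if doc_id < id_limit}
    # S220: appearance frequency of every word in the kept titles.
    freq = Counter(word for words in docs.values() for word in words)
    # S230: collect runs of consecutive words whose frequency exceeds the threshold.
    candidates = {}
    for doc_id, words in docs.items():
        run = []
        for word in words + [None]:  # None is a sentinel that flushes the final run
            if word is not None and freq[word] > freq_threshold:
                run.append(word)
                continue
            if len(run) >= 2:  # assumption: a phrase needs at least two words
                ids = candidates.setdefault(tuple(run), [])
                if doc_id not in ids:
                    ids.append(doc_id)
            run = []
    return freq, candidates


# Invented titles, for illustration only.
titles = [
    "Neowiz Yogurting RPG popularized",
    "Neowiz search service",
    "Neowiz Yogurting showdown",
    "Unrelated fourth title",
]
freq, candidates = extract_primary_candidates(titles, id_limit=4, freq_threshold=1)
print(candidates)
```

Here the fourth title is dropped because its document ID is not below `id_limit`, and words that appear only once never enter a phrase.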
[26] Next, secondary candidate phrases are generated from combinations of phrases composed of the words constituting the primary candidate phrases, and weight values of the secondary candidate phrases are calculated (operation S240). The weight values of the secondary candidate phrases are calculated by referring to the document IDs included in the secondary candidate phrases, the appearance frequencies of the words constituting the secondary candidate phrases, and the number of primary candidate phrases used in the secondary candidate phrases. For instance, since a document with a low document ID is important, its weight value becomes high. In addition, if the appearance frequency of the words constituting a secondary candidate phrase is high, it is regarded as an important document. Further, if a document ID included in a secondary candidate phrase is low, it is regarded as an important document.
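The specification names the three inputs to the weight value but gives no formula. The Python sketch below is therefore only one hypothetical scoring that respects the stated tendencies (lower document IDs and higher word frequencies raise the weight). The additive form, the assumption that more primary candidate phrases also raise the weight, and the names `phrase_weight` and `total_docs` are all illustrative assumptions.

```python
def phrase_weight(doc_ids, word_freqs, n_primary, total_docs):
    """Hypothetical weight for a secondary candidate phrase (not the patent's formula)."""
    # Lower document IDs contribute more: ID 1 scores total_docs, the last ID scores 1.
    rank_score = sum(total_docs - doc_id + 1 for doc_id in doc_ids)
    # Frequent constituent words raise the weight.
    freq_score = sum(word_freqs)
    # Assumption: support from more primary candidate phrases also raises the weight.
    return rank_score + freq_score + n_primary


early = phrase_weight(doc_ids=[1, 2], word_freqs=[3, 2], n_primary=1, total_docs=10)
late = phrase_weight(doc_ids=[9, 10], word_freqs=[3, 2], n_primary=1, total_docs=10)
print(early, late)
```

Two phrases with identical word frequencies then differ only through the rank of the documents that contain them.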
[27] Next, similarities between secondary candidate phrases with weight values greater than a predetermined value are calculated by use of vectors consisting of document IDs of documents that belong to the secondary candidate phrases (operation S250). That is, when there are several document IDs, the similarities are calculated by referring to the number of the same document IDs. Among secondary candidate phrases having similarities greater than a predetermined value, secondary candidate phrases with low weight values are eliminated and the remaining secondary candidate phrases are determined as topics (operation S260).
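Operations S250 and S260 can be sketched as a greedy filter. The specification does not name the similarity measure, so this sketch assumes the Dice coefficient over sets of document IDs; the names `dice`, `select_topics`, `min_weight`, and `max_similarity`, the phrase labels, and the sample weights are hypothetical.

```python
def dice(ids_a, ids_b):
    """Dice coefficient of two document ID collections (an assumed measure)."""
    a, b = set(ids_a), set(ids_b)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def select_topics(candidates, min_weight, max_similarity):
    """candidates maps phrase -> (weight, document IDs); returns the surviving topics."""
    # S250: only phrases above the weight threshold are compared.
    strong = [(phrase, weight, ids)
              for phrase, (weight, ids) in candidates.items()
              if weight > min_weight]
    strong.sort(key=lambda item: item[1], reverse=True)
    topics = []
    for phrase, _, ids in strong:
        # S260: drop a phrase that is too similar to a heavier phrase already kept.
        if all(dice(ids, kept_ids) <= max_similarity for _, kept_ids in topics):
            topics.append((phrase, ids))
    return [phrase for phrase, _ in topics]


candidates = {
    "p1": (30, [2, 4, 12]),
    "p2": (20, [2, 4, 8, 12]),  # overlaps heavily with p1 and weighs less
    "p3": (25, [7, 10]),
    "p4": (5, [1]),             # below the weight threshold
}
topics = select_topics(candidates, min_weight=10, max_similarity=0.8)
print(topics)
```

Processing phrases from heaviest to lightest guarantees that, within any pair of overly similar phrases, the one with the lower weight value is the one eliminated.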
[28] The topics and the documents belonging to individual topics are displayed.
[29] Figs. 3 to 9 are views for explaining a method of extracting topics according to an embodiment of the present invention.
[30] As shown in Fig. 3, when a search word 'Neowiz' is entered, titles 320 appear as search results and document IDs 310 are assigned to the titles based on appearance orders of the titles.
[31] As shown in Fig. 4, a database 330 is obtained from words constituting the titles 320 and appearance frequencies of the words. It can be seen from Fig. 4 that a word 'Neowiz' appears thirteen times and a word 'Yogurting' appears four times in the titles 320. The appearance frequencies of the other words are obtained in this manner. Words of appearance frequencies less than a predetermined value are eliminated. In Fig. 4, a word 'Showdown' appears once and is eliminated.
[32] Next, phrases composed of words of appearance frequencies greater than a predetermined value are extracted from the titles 320 to make primary candidate phrases 340. It can be seen from Fig. 5 that there are six titles each composed of a string of consecutive words 'Neowiz', 'Yogurting', 'RPG', 'Search_corporation', 'Jukeon', 'Popularized', 'Announces', 'Music', 'Service', and 'Mobile_carrier' among the fourteen titles 320 in Fig. 3.
[33] Next, as shown in Fig. 6, secondary candidate phrases 350 are created from combinations of phrases composed of the words. Appearance frequencies 351 of phrases including the secondary candidate phrases 350 in the primary candidate phrases 340 are extracted. As described in Fig. 2, weight values 352 of the secondary candidate phrases 350 are calculated by referring to document IDs included in the secondary candidate phrases 350, appearance frequencies of words constituting the secondary candidate phrases 350, and the number of primary candidate phrases 340 used in the secondary candidate phrases 350. It can be seen from Fig. 7 that the phrase 'Announces RPG yogurting popularized' has a weight value of 1732, the phrase 'Neowiz Jukeon' has a weight value of 1720, the phrase 'Neowiz search_corporation' has a weight value of 1710, and the phrase 'Neowiz Jukeon mobile_carrier' has a weight value of 1320. The phrase 'Jukeon mobile_carrier music' having a weight value of 1200 is discarded. Thus, a reference weight value to eliminate phrases is 1200.
[34] Referring to Fig. 8, strings 353 of document IDs of documents including the secondary candidate phrases 350 are extracted to calculate similarities between the secondary candidate phrases 350. For instance, it is assumed that documents containing the phrase 'Announces RPG yogurting popularized' are (7, 10), documents containing the phrase 'Neowiz yogurting' are (1, 5, 7, 10), documents containing the phrase 'Neowiz search_corporation' are (2, 4, 12), and documents containing the phrase 'Neowiz search' are (2, 4, 8, 12). In this case, since the similarity between the phrases 'Announces RPG yogurting popularized' and 'Neowiz yogurting' is 66%, the similarity is regarded to be low. Since the similarity between the phrases 'Neowiz search_corporation' and 'Neowiz search' is 82%, the phrase 'Neowiz search' having a lower weight value is eliminated from the secondary candidate phrases. In this manner, topics 361 and search results topic-by-topic are obtained as shown in Fig. 9.
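The specification does not state which similarity formula yields the 66% and 82% figures. As a hedged check, the Python sketch below evaluates three common set-overlap measures (Dice, Jaccard, and cosine over binary document ID vectors) on the document ID strings quoted in paragraph [34]. The choice of these three measures and the function names are assumptions made for illustration.

```python
import math


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def cosine(a, b):
    # Cosine similarity of binary vectors reduces to |a ∩ b| / sqrt(|a| * |b|).
    return len(a & b) / math.sqrt(len(a) * len(b))


# Document ID strings quoted in paragraph [34].
yogurting_pair = ({7, 10}, {1, 5, 7, 10})
search_pair = ({2, 4, 12}, {2, 4, 8, 12})

for name, (a, b) in {"yogurting": yogurting_pair, "search": search_pair}.items():
    scores = [round(f(a, b), 3) for f in (dice, jaccard, cosine)]
    print(name, scores)
# prints: yogurting [0.667, 0.5, 0.707]
# prints: search [0.857, 0.75, 0.866]
```

Dice comes closest to the 66% quoted for the first pair, but none of the three measures gives 82% for the second pair, so the formula actually used in the embodiment may differ from all of them. Every measure does rank the second pair as more similar than the first, which is consistent with eliminating 'Neowiz search' and keeping both phrases of the first pair.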
[35] Fig. 10 is a flow chart of a method of extracting issues according to an embodiment of the present invention.
[36] First, data for which the number of the same or similar data items is greater than a predetermined threshold value is extracted from stored data. A plurality of high-ranking data items is then extracted from the extracted data as issue data. The issue data is displayed in order of writing time of the issue data or in order of the number of similar documents. The stored data may be all Internet documents, specific blogs, data on news sites, or data obtained by predetermined search methods.
[37] In more detail, target documents on the Internet or target documents matching a search word are extracted (operation S410). The extracted documents may be the same as or similar to one another. After the number of same or similar documents is calculated, documents having appearance frequencies greater than a predetermined value are extracted (operation S420).
[38] Documents ranking high in the number of same or similar documents are extracted as issues (operation S430). The extracted issues are output in order of writing time of the documents or in order of the number of same or similar documents (operation S440).
[39] Figs. 11 to 13 are views for explaining a method of extracting issues according to an embodiment of the present invention.
[40] When there are Internet data 510 as shown in Fig. 11, the data 510 are arranged by document title 520 and appearance frequency 521 as shown in Fig. 12. Documents with appearance frequencies less than a predetermined value are eliminated. In this case, documents with appearance frequencies less than two hundred are eliminated. The remaining documents are selected as issues and output in order of most recent writing date as shown in Fig. 13.
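Operations S420 to S440, with the threshold of two hundred used in this example, may be sketched as follows. How same or similar documents are identified is not detailed here, so the sketch assumes that documents with identical titles count as the same. The function `extract_issues`, its parameters, and the sample titles and dates are hypothetical.

```python
from collections import Counter
from datetime import date


def extract_issues(documents, min_frequency, top_n=None):
    """documents is a list of (title, writing_date) pairs.

    Identical titles stand in for "same or similar" documents (assumption).
    Returns (title, frequency, latest writing date), most recent first.
    """
    counts = Counter(title for title, _ in documents)
    latest = {}
    for title, written in documents:
        if title not in latest or written > latest[title]:
            latest[title] = written
    # S420: eliminate titles whose appearance frequency is below the threshold.
    issues = [(t, c, latest[t]) for t, c in counts.items() if c >= min_frequency]
    # S430: keep the high-ranking titles by appearance frequency.
    issues.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        issues = issues[:top_n]
    # S440: output in order of most recent writing date.
    issues.sort(key=lambda item: item[2], reverse=True)
    return issues


docs = (
    [("Title A", date(2005, 7, 10))] * 250
    + [("Title B", date(2005, 7, 12))] * 300
    + [("Title C", date(2005, 7, 11))] * 150
)
issues = extract_issues(docs, min_frequency=200)
for title, frequency, written in issues:
    print(written, frequency, title)
```

Sorting by the number of same or similar documents instead of writing date would only require omitting the final sort.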
[41] Fig. 14 is an issue output result.
[42] Issues may be extracted from all target documents on the Internet and displayed as described above. Alternatively, as described with reference to Figs. 2 to 9, topics may be extracted from the target documents, and issues may be extracted from the topics and displayed.
[43] Fig. 15 is another issue output result.
[44] Issues and topics may be displayed as shown in Fig. 15. For instance, issues 720 and topics 730 corresponding to a search word 'Neowiz' 710 may be displayed at different positions.
[45] Fig. 16 is a block diagram of an information search apparatus according to an embodiment of the present invention.
[46] The information search apparatus includes a web data storage unit 810, a searching unit 820, a primary candidate phrase extracting unit 830, a secondary candidate phrase extracting unit 840, a similar candidate phrase eliminating unit 850, and a topic output unit 860.
[47] The web data storage unit 810 collects and stores documents on the Internet. The searching unit 820 uses typical search methods to search for the documents. The primary candidate phrase extracting unit 830 sequentially assigns document IDs to the documents in appearance order of the documents, and extracts documents having document IDs less than a predetermined value. A method of extracting the primary candidate phrases is described above in detail with reference to Fig. 2. The secondary candidate phrase extracting unit 840 extracts words contained in titles or content of the documents and appearance frequencies of the words, extracts documents containing words of appearance frequencies greater than a predetermined value in the titles or content as primary candidate phrases, generates secondary candidate phrases composed of combinations of phrases obtained from the words constituting the primary candidate phrases, and calculates weight values of the secondary candidate phrases.
[48] The similar candidate phrase eliminating unit 850 uses vectors consisting of document IDs of documents belonging to secondary candidate phrases with weight values greater than a predetermined value to calculate similarities between the secondary candidate phrases. The similar candidate phrase eliminating unit 850 eliminates secondary candidate phrases with lower weight values among secondary candidate phrases with similarities greater than a predetermined value, and sets the remaining secondary candidate phrases as topics. The topic output unit 860 sets the topics as titles and outputs the topics and documents corresponding to the topics.
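The extraction of primary candidate phrases performed by the secondary candidate phrase extracting unit 840 may be sketched as follows, under stated assumptions. Whitespace tokenization, the minimum run length of two words, and the sample titles are assumptions for illustration; the function name `primary_candidate_phrases` and its parameters are hypothetical.

```python
from collections import Counter


def primary_candidate_phrases(titles, min_word_frequency, min_length=2):
    """Return (document ID, phrase) pairs for runs of consecutive frequent words.

    Document IDs are assigned sequentially in appearance order, starting at 1.
    A word is "frequent" when its count over all titles exceeds min_word_frequency.
    """
    tokenized = [title.split() for title in titles]  # whitespace split (assumed)
    freq = Counter(word for words in tokenized for word in words)
    phrases = []
    for doc_id, words in enumerate(tokenized, start=1):
        run = []
        for word in words + [None]:  # the sentinel flushes the final run
            if word is not None and freq[word] > min_word_frequency:
                run.append(word)
            else:
                if len(run) >= min_length:
                    phrases.append((doc_id, " ".join(run)))
                run = []
    return phrases


titles = [
    "Neowiz search_corporation launches portal",
    "Neowiz search_corporation news",
    "Neowiz yogurting popularized",
]
phrases = primary_candidate_phrases(titles, min_word_frequency=1)
print(phrases)
```

Keeping the document ID with each phrase allows the later stages to build the document ID vectors used for the similarity calculation.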
[49] The above-mentioned methods of extracting topics and issues may be written as computer programs. Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art. In addition, the programs are stored in computer readable media and are read and executed by computers, thereby implementing the methods of extracting topics and issues. Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.
[50] While the present invention has been described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the present invention as defined by the following claims.
Industrial Applicability
[51] The present invention can be efficiently applied to industrial fields related to a method and apparatus for extracting topics from search results and providing the search results based on the topics, and a method and apparatus for selecting and providing frequently appearing search results as issues.

Claims

[1] A method of displaying search results with respect to a search word, comprising:
(a) referring to words contained in titles or content of search results matching with the search word to calculate similarities between the search results according to a predetermined similarity calculation method, and extracting representative phrases among combinations of words repeatedly contained in similar search results; and
(b) displaying the representative phrases and the search results that belong to each of the representative phrases.
[2] The method of claim 1, wherein the operation (a) comprises:
(a1) extracting words contained in titles or content of the search results matching with the search word, and extracting primary candidate phrases in which at least one of the words consecutively appears; and
(a2) generating secondary candidate phrases from words constituting the primary candidate phrases, calculating significance of the secondary candidate phrases based on appearance orders of the search results, appearance frequencies of the words, and the number of primary candidate phrases used in the secondary candidate phrases, and extracting representative phrases by eliminating similar candidate phrases from the secondary candidate phrases of higher significance.
[3] A method of extracting topics, comprising:
(a) assigning document IDs to documents with respect to a search word based on appearance orders of the documents, and extracting documents with document IDs less than a predetermined value;
(b) extracting words contained in titles or content of the extracted documents and appearance frequencies of the words;
(c) extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents;
(d) generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases;
(e) calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases; and
(f) eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
[4] The method of claim 3, further including (g) displaying the topics as titles and documents that belong to each of the topics.
[5] The method of claim 3, wherein the operation (d) comprises: generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases; and calculating weight values of the secondary candidate phrases based on document IDs contained in the secondary candidate phrases, appearance frequencies of the words constituting the secondary candidate phrases, and the number of the primary candidate phrases used in the secondary candidate phrases.
[6] A method of extracting issues, comprising:
(a) extracting the same or similar data the number of which is greater than a predetermined threshold value among stored data; and
(b) extracting as issue data a plurality of high-ranking data among the extracted data and displaying the issue data in order of writing time of the issue data or in order of a number of similar documents.
[7] The method of claim 6, wherein the stored data is data obtained by a predetermined search method.
[8] The method of claim 6, wherein the operation (a) includes determining the same or similar data based on words contained in titles or content of stored data, and extracting the same or similar data the number of which is greater than a predetermined threshold value.
[9] An apparatus for providing search services based on extracted topics, comprising: a searching unit searching for stored documents; a primary candidate phrase extracting unit sequentially assigning document IDs to searched documents based on appearance orders of the searched documents, and extracting documents with document IDs less than a predetermined value; a secondary candidate phrase extracting unit extracting words contained in titles or content of the extracted documents and appearance frequencies of the words, extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents, generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases; and a similar candidate phrase eliminating unit calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases, eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
[10] The apparatus of claim 9, further including a topic output unit displaying the topics as titles and documents that belong to each of the topics.
[11] The apparatus of claim 9, wherein the secondary candidate phrase extracting unit generates secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculates weight values of the secondary candidate phrases based on document IDs contained in the secondary candidate phrases, appearance frequencies of the words constituting the secondary candidate phrases, and the number of the primary candidate phrases used in the secondary candidate phrases.
[12] Computer readable media storing programs for executing on a computer the method of claim 1 or 2.
PCT/KR2006/002787 2005-07-15 2006-07-14 Method of extracting topics and issues and method and apparatus for providing search results based on topics and issues WO2007011140A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2005-0064515 2005-07-15
KR20050064515 2005-07-15

Publications (1)

Publication Number Publication Date
WO2007011140A1 true WO2007011140A1 (en) 2007-01-25

Family

ID=37668993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/002787 WO2007011140A1 (en) 2005-07-15 2006-07-14 Method of extracting topics and issues and method and apparatus for providing search results based on topics and issues

Country Status (1)

Country Link
WO (1) WO2007011140A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5924090A (en) * 1997-05-01 1999-07-13 Northern Light Technology Llc Method and apparatus for searching a database of records
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
KR20000050225A (en) * 2000-05-29 2000-08-05 전상훈 Internet information searching system and method by document auto summation
US6212517B1 (en) * 1997-07-02 2001-04-03 Matsushita Electric Industrial Co., Ltd. Keyword extracting system and text retrieval system using the same
KR20040029895A (en) * 2002-10-02 2004-04-08 씨씨알 주식회사 Search system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008098282A1 (en) * 2007-02-16 2008-08-21 Funnelback Pty Ltd Search result sub-topic identification system and method
AU2008215153B2 (en) * 2007-02-16 2012-02-16 Squiz Pty Ltd Search result sub-topic identification system and method
AU2008215153B9 (en) * 2007-02-16 2012-03-01 Squiz Pty Ltd Search result sub-topic identification system and method
US8214347B2 (en) 2007-02-16 2012-07-03 Funnelback Pty Ltd. Search result sub-topic identification system and method
JP2014059865A (en) * 2012-09-14 2014-04-03 Hon Hai Precision Industry Co Ltd Retrieval system and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1)EPC

122 Ep: pct application non-entry in european phase

Ref document number: 06783312

Country of ref document: EP

Kind code of ref document: A1