WO2007011129A1 - Information search method and information search apparatus on which information value is reflected

Info

Publication number
WO2007011129A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
documents
text
similarities
groups
Prior art date
Application number
PCT/KR2006/002758
Other languages
French (fr)
Inventor
Seung-Jun Lee
Hyung-Gon Kim
Byung-Hak Kim
Seo-Dong Nam
Joong-Ho Shin
Original Assignee
Chutnoon Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chutnoon Inc.
Priority to JP2008521324A (patent JP4896132B2)
Publication of WO2007011129A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to an information search technology and, more particularly, to an information search method and information search apparatus for providing or recommending information to users based on the significance of information.
  • search engines play an important role in extracting the desired information.
  • Conventional search engines were designed to retrieve as much information as possible, whereas current search engines are required to find and selectively offer the information users want. For this purpose, information must be offered to users according to its significance.
  • a conventional search method calculates a similarity between a search word and a search document based on the number of times the search word appears in the document. For example, when the search word "Neowiz" appears ten times in one document and five times in another, the similarities of the two documents are 100% and 50%, respectively.
  • the Boolean search model, extended Boolean search model, vector space model, probability distribution, Poisson model, and Lagrangian model may be used to calculate the similarities.
  • since these methods calculate the similarities simply from the number of times search words appear, they cannot produce similarities that reflect the significance of information.
  • the significance of information may be calculated using hyperlinked web pages. That is, the significance of information is calculated from the number of Internet links referring to it. For instance, the more often other sites refer to a search document, the more significant the search document is.
  • the method cannot be applied to all kinds of information. For instance, since relatively fewer sites link to Korean documents than to English documents, the method cannot be applied equally to both.
  • the present invention provides an information search method and information search apparatus for grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups.
  • Fig. 1 is a view for explaining a method of grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups;
  • Fig. 2 is a flow chart of a text search method on which the significance of information is reflected, according to an embodiment of the present invention;
  • Fig. 3 is a detailed flow chart of the text search method of Fig. 2;
  • Figs. 4 to 6 illustrate a process of extracting a set of index keywords from a text document;
  • Figs. 7 and 8 are views for explaining a method of calculating similarities between documents with a set of index keywords and searching for the same and similar documents;
  • Fig. 9 is a flow chart of a method of reducing the number of documents of which similarities are to be calculated;
  • Fig. 10 illustrates a text search apparatus on which the significance of information is reflected, according to an embodiment of the present invention; and
  • Fig. 11 shows a result obtained from a text search method according to the present invention.
  • an information search method including: (a) calculating similarities between a plurality of information; (b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and (c) displaying information search results on which the value is reflected.
  • the operation (a) may include: (a1) dividing the text information into groups based on the number of words and particles contained in the text information; (a2) generating inverted files with respect to each of the words in the groups; (a3) removing text information of frequencies less than a predetermined threshold value from analysis of the inverted files to select text information of which similarities are to be calculated; and (a4) calculating similarities between the selected text information and grouping information regarded as substantially the same text information into groups.
  • the operation (a4) may put a higher weight value on the title than on the main body to calculate the similarities.
  • an information search apparatus including: a text document storage unit storing text documents among information collected on the Internet; a similarity analyzing unit calculating similarities between the text documents; a representative document extracting unit grouping documents regarded as the same documents into groups based on the similarities and extracting a representative document from each of the groups; a similar document extracting unit extracting documents regarded as similar documents based on the similarities; and a searching unit displaying representative documents and similar documents corresponding to a search word in order of a representative document of higher appearance frequency and providing the similar documents that are linked.
  • Fig. 1 is a view for explaining a method of grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups.
  • Information collected on the Internet is grouped into groups each having the same content.
  • the term 'same content' does not imply 'exactly identical content' but 'content having a similarity more than a predetermined threshold value', i.e., 'substantially the same content.' That is, where several sites carry information having the same content with respect to a search word, the information is grouped into one group. For instance, for the search word "Neowiz", there may be several Internet sites containing the content "...there appears a new search engine that can search for all the information on the Internet..."
  • a group A 110 may include the above-mentioned information.
  • a group B 120 may include a set of information containing the content "...[Neowiz/Sayclub] Introduction to Neowiz and E-community..."
  • a group C 130 may include a set of information containing the content "...Management of Neowiz, Card, Casual, Mobile, Go-Stop game..."
  • Representative information is extracted from the group and displayed to a user.
  • the representative information implies information representative of the group, and may be the most recent information or information containing images in the group.
  • Search results are displayed to the user based on the number of times information containing the search word appears, such that valuable information is easily found by the user.
  • FIG. 2 is a flow chart of a text search method on which the significance of information is reflected, according to an embodiment of the present invention.
  • desired information is collected (operation S210), and a similarity between the information is calculated (operation S220).
  • 100 × 100 calculations are required to obtain similarities between the hundred pieces of information.
  • a method of calculating the similarity will be described in detail with reference to Figs. 3 to 8.
  • a plurality of information having the same content is grouped into a group, duplicate information is removed, and representative information is extracted (operation S230).
  • the significance of information is calculated based on the number of substantially the same information (operation S240).
  • the representative information is output based on the significance of group (operation S250).
  • Representative information of a group of high appearance frequency of information is regarded as information with high significance, such that the representative information is placed at the top of an output screen or is highlighted on the output screen.
  • FIG. 3 is a detailed flow chart of the text search method shown in Fig. 2.
  • Index keywords are extracted from texts in documents to calculate similarities between the documents (operation S310).
  • the keywords are compared with each other to calculate the similarities between the documents (operation S320).
  • the similarities may be calculated by assigning different weight values to a title and a main body of the document. For instance, when both documents have a lot of keywords similar to each other in their titles, there is a good possibility that both of the documents are similar to each other. Thus, the weight value may be assigned to the title upon calculating the similarities.
  • the same and similar documents are determined based on the similarities (operation S330).
  • a representative document is extracted from each of the groups (operation S340).
  • the representative document is provided to users based on its significance (operation S350).
  • Figs. 4 to 6 illustrate a process of extracting a set of index keywords from a text document.
  • a document 410 consists of a word string 401 with respect to a title, and a word string 402 with respect to a main body.
  • the title is "Search Business Separated from Neowiz" 421, and the main body is "A new service provider, lnoon, separated from Neowiz launches services on a full scale.
  • the lnoon (lnoon.com) expects to conduct a beta test as early as next month and then commence formal services on this coming October. From this year,"
  • a set of index keywords 430 includes 'Neowiz, Search, and Separated' that are extracted as keywords for the title, and 'Neowiz, Separated, Search, lnoon, Test, and Commence' that are extracted as keywords for the main body.
  • FIGs. 7 and 8 are views for explaining a method of calculating similarities between documents with a set of index keywords and searching for the same and similar documents.
  • Fig. 7 is a view for explaining a similarity comparison with reference to Figs. 4 to 6.
  • Similarities between documents A and B, between documents A and C, and between documents A and D are 75%, 4%, and 96%, respectively.
  • the similarities can be calculated according to the above-mentioned methods. For instance, the similarities can be calculated by comparing the keywords for title and for content under the same condition, or by putting a higher weight value on the keyword for title.
  • Fig. 8 is a view for explaining a method of searching for documents identical and similar to each of documents based on the similarities shown in Fig. 7.
  • a reference similarity used to determine the same and similar documents may vary.
  • Fig. 8 shows that the number of the same documents as document A is twenty five, the same documents are documents B, D and so on, and similar documents are documents X, T and so on.
  • Fig. 9 is a flow chart of a method of reducing the number of documents of which similarities are to be calculated.
  • documents are first grouped into groups based on the number of words and particles constituting the documents (operation S610). If documents are similar to one another in the number of words and particles constituting documents, there is a good possibility that the documents are similar to one another. Thus, the documents are grouped into the same groups.
  • the grouping criterion may vary. For instance, documents may be grouped in bins of five words and particles each, or in bins of a different size.
  • An inverted file is generated for each of the groups (operation S620).
  • the inverted file is generated by extracting words constituting the documents and collecting IDs of documents containing the words. For instance, in a case where there are documents DocID 1, DocID 2, ..., and DocID 100, and DocID 1 includes words A, B, C, ..., and J, the following inverted files may be generated to search for documents similar to document DocID 1:
  • the inverted files are analyzed, and documents of frequency less than a threshold value are removed (operation S630).
  • document DocID 4 of low appearance frequency is excluded and no longer compared.
  • documents DocIDs of low appearance frequency are excluded, thereby greatly reducing the number of documents to be compared with document DocID 1.
  • Fig. 10 is a text search apparatus on which the significance of information is reflected, according to an embodiment of the present invention.
  • the text search apparatus includes a web data storage unit 710, a text document storage unit 720, a similarity analyzing unit 730, a representative document extracting unit 740, a similar document extracting unit 750, a searching unit 760, and an information recommendation unit 770.
  • the web data storage unit 710 collects and stores information existing on the Internet.
  • the text document storage unit 720 stores text documents among the information.
  • the similarity analyzing unit 730 groups the text documents into groups based on the number of words and particles contained in the text documents, generates inverted files with respect to each of the words, removes text documents of frequency less than a predetermined threshold value to select text documents of which similarities are to be calculated, and calculates the similarities between the documents.
  • the representative document extracting unit 740 groups documents regarded as the same text documents into the same groups, and extracts a representative document from each of the groups. As described above, the most recent document or document containing images may be extracted as the representative document.
  • the similar document extracting unit 750 extracts, as similar text documents, documents having similarities more than a predetermined value determined based on similarities calculated by the similarity analyzing unit 730.
  • the searching unit 760 searches the representative document extracting unit 740 and the similar document extracting unit 750. At this time, a valuable one of the representative documents is placed at the top of a search result page. Information concerning similar documents is linked so that the user can view it.
  • the information recommendation unit 770 outputs valuable information according to a predetermined condition. For instance, information frequently appearing on the Internet is determined as valuable information, and is thus automatically output among representative documents even though the user does not enter the search word. For example, since documents appearing more than one thousand times a day are regarded as an important issue, the documents need to be automatically output.
  • Fig. 11 is a result obtained from a text search method according to the present invention.
  • a sentence of high similarity is placed at the top of an output screen.
  • the similarity is determined based on the above-mentioned method, and a document having a lot of similar sentences is regarded as a relatively significant document. For instance, if "Park Ji-Sung" is entered in a search window, search results are output in order of document significance.
  • the most significant document is a document of highest appearance frequency. For example, in Fig. 11, the most significant document is an item "Photos in Park's home" 810. If an item "Similar documents" 820 is selected, its detailed content 820-1 is displayed on a new or current window.
  • Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art.
  • the programs are stored in computer readable media, read and executed by computers, thereby implementing the text search method.
  • Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.
  • the present invention can be applied to industrial fields related to an information search method of providing or recommending information to users based on the significance of information.

Abstract

An information search method is provided, including: (a) calculating similarities between a plurality of information; (b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and (c) displaying information search results on which the value is reflected.

Description

INFORMATION SEARCH METHOD AND INFORMATION SEARCH APPARATUS ON WHICH INFORMATION VALUE IS REFLECTED
Technical Field
[1] The present invention relates to an information search technology and, more particularly, to an information search method and information search apparatus for providing or recommending information to users based on the significance of information.
Background Art
[2] With the explosive increase in information providers and users on the Internet, the Internet is now flooded with information of every kind. Search engines therefore play an important role in extracting the desired information. Conventional search engines were designed to retrieve as much information as possible, whereas current search engines are required to find and selectively offer the information users want. For this purpose, information must be offered to users according to its significance.
[3] A conventional search method calculates a similarity between a search word and a search document based on the number of times the search word appears in the document. For example, when the search word "Neowiz" appears ten times in one document and five times in another, the similarities of the two documents are 100% and 50%, respectively.
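The frequency-based scoring described above can be illustrated with a short sketch. The text gives only the resulting percentages, so normalizing each count by the highest count in the result set is an assumption, as are the function name `relative_tf_similarity` and the whitespace tokenization.

```python
def relative_tf_similarity(query, documents):
    """Score each document by how often `query` occurs in it.

    Assumption: scores are normalized by the highest count among the
    documents, which reproduces the 100% / 50% example in the text.
    """
    needle = query.lower()
    counts = [doc.lower().split().count(needle) for doc in documents]
    top = max(counts, default=0)
    if top == 0:
        return [0.0 for _ in counts]
    return [count / top for count in counts]


# Two toy documents in which the search word appears ten and five times.
docs = ["neowiz " * 10, "neowiz " * 5]
scores = relative_tf_similarity("Neowiz", docs)
```

Such a score depends only on term counts, which is the limitation the following paragraphs raise.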
[4] Meanwhile, the Boolean search model, extended Boolean search model, vector space model, probability distribution, Poisson model, and Lagrangian model may be used to calculate the similarities. However, since these methods calculate the similarities simply from the number of times search words appear, they cannot produce similarities that reflect the significance of information.
[5] Meanwhile, the significance of information may be calculated using hyperlinked web pages. That is, the significance of information is calculated from the number of Internet links referring to it. For instance, the more often other sites refer to a search document, the more significant the search document is. However, the method cannot be applied to all kinds of information. For instance, since relatively fewer sites link to Korean documents than to English documents, the method cannot be applied equally to both.
Disclosure of Invention
Technical Solution
[6] The present invention provides an information search method and information search apparatus for grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups.
Advantageous Effects
[7] According to the present invention, since the significance of information is determined from the number of pieces of the same information and information is displayed to users in order of significance, users can be provided with the information they want. In addition, since similar documents are linked, search results are convenient to refer to. Further, since duplicate information is excluded from the search results, the time and effort required to check the search results can be reduced.
Brief Description of the Drawings
[8] The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
[9] Fig. 1 is a view for explaining a method of grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups;
[10] Fig. 2 is a flow chart of a text search method on which the significance of information is reflected, according to an embodiment of the present invention;
[11] Fig. 3 is a detailed flow chart of the text search method of Fig. 2;
[12] Figs. 4 to 6 illustrate a process of extracting a set of index keywords from a text document;
[13] Figs. 7 and 8 are views for explaining a method of calculating similarities between documents with a set of index keywords and searching for the same and similar documents;
[14] Fig. 9 is a flow chart of a method of reducing the number of documents of which similarities are to be calculated;
[15] Fig. 10 illustrates a text search apparatus on which the significance of information is reflected, according to an embodiment of the present invention; and
[16] Fig. 11 shows a result obtained from a text search method according to the present invention.
Best Mode for Carrying Out the Invention
[17] According to an aspect of the present invention, there is provided an information search method including: (a) calculating similarities between a plurality of information; (b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and (c) displaying information search results on which the value is reflected.
[18] The operation (a) may include: (a1) dividing the text information into groups based on the number of words and particles contained in the text information; (a2) generating inverted files with respect to each of the words in the groups; (a3) removing text information of frequencies less than a predetermined threshold value from analysis of the inverted files to select text information of which similarities are to be calculated; and (a4) calculating similarities between the selected text information and grouping information regarded as substantially the same text information into groups.
[19] The operation (a4) may put a higher weight value on the title than on the main body to calculate the similarities.
[20] According to another aspect of the present invention, there is provided an information search apparatus including: a text document storage unit storing text documents among information collected on the Internet; a similarity analyzing unit calculating similarities between the text documents; a representative document extracting unit grouping documents regarded as the same documents into groups based on the similarities and extracting a representative document from each of the groups; a similar document extracting unit extracting documents regarded as similar documents based on the similarities; and a searching unit displaying representative documents and similar documents corresponding to a search word in order of a representative document of higher appearance frequency and providing the similar documents that are linked.
Mode for the Invention
[21] Exemplary embodiments in accordance with the present invention will now be described in detail with reference to the accompanying drawings.
[22] Fig. 1 is a view for explaining a method of grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups.
[23] Information collected on the Internet is grouped into groups each having the same content. The term 'same content' does not imply 'exactly identical content' but 'content having a similarity more than a predetermined threshold value', i.e., 'substantially the same content.' That is, where several sites carry information having the same content with respect to a search word, the information is grouped into one group. For instance, for the search word "Neowiz", there may be several Internet sites containing the content "...there appears a new search engine that can search for all the information on the Internet. One of search service companies, 'lnoon (http://www.lnoon.com)' which has been established by a second large stockholder of Neowiz, Byung-Kyu Chang, has recently developed a search engine that allows all the users to conveniently search for the Internet..." A group A 110 may include the above-mentioned information. A group B 120 may include a set of information containing the content "...[Neowiz/Sayclub] Introduction to Neowiz and E-community..." A group C 130 may include a set of information containing the content "...Management of Neowiz, Card, Casual, Mobile, Go-Stop game..."
[24] That is, a plurality of information having the same content is grouped into a group. Representative information is extracted from the group and displayed to a user. The representative information implies information representative of the group, and may be the most recent information or information containing images in the group.
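As a minimal sketch of choosing representative information from a group, the snippet below prefers items that contain images and breaks ties by recency. The text names both criteria only as alternatives, so combining them in this order is an assumption, and the `Item` fields and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    source: str        # hypothetical identifier of the originating site
    published: date    # publication date of the item
    has_image: bool    # whether the item contains an image


def pick_representative(group):
    """Pick one item to stand for the whole group.

    Assumption: items with images are preferred, and the most recent
    item wins among equals.
    """
    return max(group, key=lambda item: (item.has_image, item.published))


group = [
    Item("site-a", date(2006, 7, 1), False),
    Item("site-b", date(2006, 6, 1), True),
    Item("site-c", date(2006, 5, 1), True),
]
representative = pick_representative(group)
```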
[25] Search results are displayed to the user based on the number of times information containing the search word appears, such that valuable information is easily found by the user.
[26] Fig. 2 is a flow chart of a text search method on which the significance of information is reflected, according to an embodiment of the present invention.
[27] First, desired information is collected (operation S210), and a similarity between the information is calculated (operation S220). Conventionally, when a hundred pieces of information are collected, 100 × 100 calculations are required to obtain the similarities between them. A method of calculating the similarity will be described in detail with reference to Figs. 3 to 8. After the similarity calculation, a plurality of information having the same content is grouped into a group, duplicate information is removed, and representative information is extracted (operation S230). The significance of the information is calculated based on the number of pieces of substantially the same information (operation S240). The representative information is output based on the significance of its group (operation S250). The representative information of a group in which information appears with high frequency is regarded as highly significant, and is therefore placed at the top of the output screen or highlighted on it.
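Operations S220 to S250 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: Jaccard overlap of word sets stands in for the unspecified similarity measure, the 0.8 threshold is arbitrary, grouping is greedy against the first member of each group, and the first member is used as the representative.

```python
def jaccard(a, b):
    """Word-set overlap, used here as a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def group_items(items, threshold=0.8, similarity=jaccard):
    """Greedily group items whose similarity exceeds the threshold (S230)."""
    groups = []
    for item in items:
        for group in groups:
            if similarity(item, group[0]) > threshold:
                group.append(item)
                break
        else:
            groups.append([item])
    return groups


def rank_groups(groups):
    """Order groups by size, i.e. by significance (S240, S250)."""
    return sorted(groups, key=len, reverse=True)


items = [
    "new search engine launched this week",
    "game portal update announced",
    "new search engine launched this week",
    "new search engine launched this week today",
]
ranked = rank_groups(group_items(items))
# Each result pairs a representative item with its group's significance.
results = [(group[0], len(group)) for group in ranked]
```

Because duplicates collapse into one group, each distinct story appears once in `results`, with the group size serving as its significance.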
[28] Fig. 3 is a detailed flow chart of the text search method shown in Fig. 2.
[29] A method of calculating a similarity between text documents and providing its search results will be described. Index keywords are extracted from texts in documents to calculate similarities between the documents (operation S310). The keywords are compared with each other to calculate the similarities between the documents (operation S320). The more the same index keywords appear, the more similar the documents are presumed to be. The similarities may be calculated by assigning different weight values to a title and a main body of the document. For instance, when both documents have a lot of keywords similar to each other in their titles, there is a good possibility that both of the documents are similar to each other. Thus, the weight value may be assigned to the title upon calculating the similarities. As a result, the same and similar documents are determined based on the similarities (operation S330). A representative document is extracted from each of the groups (operation S340). The representative document is provided to users based on its significance (operation S350).
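The title-weighted comparison of operation S320 might look like the sketch below. The text does not give a formula, so the use of set overlap for each field, the linear combination, and the `title_weight` value of 0.7 are all assumptions made for illustration.

```python
def overlap(a, b):
    """Fraction of shared keywords between two keyword collections."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(doc_a, doc_b, title_weight=0.7):
    """Combine title and body keyword overlap, weighting the title higher.

    Assumption: a linear combination whose weights sum to 1.
    """
    title_score = overlap(doc_a["title"], doc_b["title"])
    body_score = overlap(doc_a["body"], doc_b["body"])
    return title_weight * title_score + (1 - title_weight) * body_score


doc_a = {
    "title": ["Neowiz", "Search", "Separated"],
    "body": ["Neowiz", "Separated", "Search", "Test", "Commence", "Service"],
}
doc_b = {
    "title": ["Neowiz", "Search"],
    "body": ["Neowiz", "Search", "Test"],
}
weighted = weighted_similarity(doc_a, doc_b, title_weight=0.7)
unweighted = weighted_similarity(doc_a, doc_b, title_weight=0.5)
```

Here the titles overlap more than the bodies, so raising the title weight raises the overall similarity.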
[30] Figs. 4 to 6 illustrate a process of extracting a set of index keywords from a text document.
[31] Referring to Fig. 4, a document 410 consists of a word string 401 with respect to a title, and a word string 402 with respect to a main body. Referring to Fig. 5, it is assumed that the title is "Search Business Separated from Neowiz" 421, and the main body is "A new service provider, lnoon, separated from Neowiz launches services on a full scale. The lnoon (lnoon.com) expects to conduct a beta test as early as next month and then commence formal services on this coming October. From this year,..." Referring to Fig. 6, a set of index keywords 430 includes 'Neowiz, Search, and Separated' that are extracted as keywords for the title, and 'Neowiz, Separated, Search, lnoon, Test, and Commence' that are extracted as keywords for the main body.
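The text does not state how index keywords are selected, so the following sketch assumes they are drawn from a predefined vocabulary (the hypothetical `VOCABULARY` set) and kept separately for the title and the main body. It is meant only to show the shape of the index keyword set 430, not the selection rule actually used.

```python
import re

# Hypothetical vocabulary of candidate index keywords (an assumption).
VOCABULARY = {"neowiz", "search", "separated", "lnoon", "test", "commence"}


def extract_keywords(text, vocabulary=VOCABULARY):
    """Return vocabulary terms found in `text`, in order of first appearance."""
    found = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in vocabulary and token not in found:
            found.append(token)
    return found


title = "Search Business Separated from Neowiz"
body = ("A new service provider, lnoon, separated from Neowiz "
        "launches services on a full scale.")

# Title and body keywords are kept apart so they can be weighted differently.
index_keywords = {
    "title": extract_keywords(title),
    "body": extract_keywords(body),
}
```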
[32] Figs. 7 and 8 are views for explaining a method of calculating similarities between documents with a set of index keywords and searching for the same and similar documents.
[33] Fig. 7 is a view for explaining a similarity comparison with reference to Figs. 4 to 6. Similarities between documents A and B, between documents A and C, and between documents A and D are 75%, 4%, and 96%, respectively. The similarities can be calculated according to the above-mentioned methods. For instance, the similarities can be calculated by comparing the keywords for title and for content under the same condition, or by putting a higher weight value on the keyword for title.
[34] Fig. 8 is a view for explaining a method of searching for documents identical or similar to each document based on the similarities shown in Fig. 7. The reference similarity used to determine the same and similar documents may vary. Fig. 8 shows that twenty-five documents are the same as document A, that the same documents include documents B and D, and that the similar documents include documents X and T.
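By way of example only, the determination of the same and similar documents from a reference similarity may be sketched as follows. The two threshold values and the function name classify are assumptions made for this sketch, since the description states only that the reference similarity may vary. The scores are those given for Fig. 7.

```python
def classify(similarities, same_threshold=0.70, similar_threshold=0.40):
    """Split candidate documents into 'same' and 'similar' lists.

    `similarities` maps a document ID to its similarity with the
    reference document. Both thresholds are hypothetical.
    """
    same, similar = [], []
    for doc_id, score in similarities.items():
        if score >= same_threshold:
            same.append(doc_id)
        elif score >= similar_threshold:
            similar.append(doc_id)
    return same, similar


# Similarities of documents B, C and D to document A, as in Fig. 7.
scores_vs_a = {"B": 0.75, "C": 0.04, "D": 0.96}
same_docs, similar_docs = classify(scores_vs_a)
```

Under these assumed thresholds, documents B and D fall in the same-document list and document C falls in neither list.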
[35] Fig. 9 is a flow chart of a method of reducing the number of documents of which similarities are to be calculated.
[36] Making an index keyword list for all documents and calculating similarities between all of the documents requires a large amount of calculation and time. Thus, it is necessary to reduce the number of documents of which similarities are to be calculated. For this purpose, documents are first grouped based on the number of words and particles constituting the documents (operation S610). If documents contain a similar number of words and particles, the documents are likely to be similar to one another, and are therefore placed in the same group. The reference for grouping may vary. For instance, the groups may be formed at intervals of five words and particles, or at intervals of a different number of words and particles.
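Operation S610 may be sketched, under stated assumptions, as follows. The function name group_by_length, the use of integer division to form the groups, and the sample counts are hypothetical. The interval of five follows the example given above.

```python
from collections import defaultdict


def group_by_length(token_counts, bin_width=5):
    """Group document IDs by their number of words and particles.

    `token_counts` maps a document ID to its count of words and
    particles. Documents whose counts fall in the same interval of
    `bin_width` are placed in the same group.
    """
    groups = defaultdict(list)
    for doc_id, n_tokens in token_counts.items():
        groups[n_tokens // bin_width].append(doc_id)
    return dict(groups)


# Hypothetical counts: d1 and d2 fall in the interval 10 to 14.
length_groups = group_by_length({"d1": 12, "d2": 14, "d3": 15, "d4": 3})
```

Only documents within the same group are compared later, so a document of 12 words and particles is never compared with one of 3.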
[37] An inverted file is generated for each of the groups (operation S620). The inverted file is generated by extracting the words constituting the documents and collecting the IDs of the documents containing each word. For instance, in a case where there are documents DocID 1, DocID 2, ..., and DocID 100, and DocID 1 includes words A, B, C, ..., and J, the following inverted files may be generated to search for documents similar to document DocID 1:
[38] Inverted file of word A: documents DocID 2 and DocID 3
[39] Inverted file of word B: documents DocID 2, DocID 3, DocID 4, and DocID 5
[40] Inverted file of word C: documents DocID 2, DocID 3, DocID 5, DocID 6, and DocID 7
[41]
[42] Inverted file of word J: documents DocID 2, DocID 3, DocID 5, DocID 7, DocID 10,..., and DocID 85.
[43] After the inverted files are generated, they are analyzed, and documents whose appearance frequency in the inverted files is less than a threshold value are removed (operation S630). In the above-mentioned embodiment, when the inverted files of words A and B are compared with each other and the inverted file of word C is then compared, document DocID 4, which has a low appearance frequency, is excluded and no longer compared. After the remaining inverted files up to that of word J are compared in this manner, documents of low appearance frequency are excluded, thereby greatly reducing the number of documents to be compared with document DocID 1.
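Operation S630 may be sketched as follows. This is a simplified single-pass variant that counts, for each candidate document, the number of inverted files in which it appears. The description instead compares the inverted files one after another and excludes documents along the way. The function name prune_candidates and the threshold of two appearances are assumptions. The data are the inverted files of words A, B and C listed above.

```python
from collections import Counter


def prune_candidates(inverted_files, query_words, min_hits=2):
    """Keep documents that appear in at least `min_hits` inverted files.

    `inverted_files` maps a word to the IDs of documents containing it.
    `query_words` are the words of the document being compared.
    """
    hits = Counter()
    for word in query_words:
        hits.update(inverted_files.get(word, []))
    return sorted(doc_id for doc_id, n in hits.items() if n >= min_hits)


# Inverted files of words A, B and C for document DocID 1.
files_for_doc1 = {
    "A": [2, 3],
    "B": [2, 3, 4, 5],
    "C": [2, 3, 5, 6, 7],
}
candidates = prune_candidates(files_for_doc1, ["A", "B", "C"])
```

With the assumed threshold, document DocID 4 appears in only one of the three inverted files and is excluded, which agrees with the example above.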
[44] Fig. 10 illustrates a text search apparatus on which the significance of information is reflected, according to an embodiment of the present invention.
[45] The text search apparatus includes a web data storage unit 710, a text document storage unit 720, a similarity analyzing unit 730, a representative document extracting unit 740, a similar document extracting unit 750, a searching unit 760, and an information recommendation unit 770.
[46] The web data storage unit 710 collects and stores information existing on the Internet. The text document storage unit 720 stores text documents among the information. The similarity analyzing unit 730 groups the text documents into groups based on the number of words and particles contained in the text documents, generates inverted files with respect to each of the words, removes text documents of frequency less than a predetermined threshold value to select text documents of which similarities are to be calculated, and calculates the similarities between the selected documents. The representative document extracting unit 740 groups documents regarded as the same text documents into the same groups, and extracts a representative document from each of the groups. As described above, the most recent document or a document containing images may be extracted as the representative document. The similar document extracting unit 750 extracts, as similar text documents, documents whose similarities calculated by the similarity analyzing unit 730 exceed a predetermined value.
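The extraction of a representative document by the representative document extracting unit 740 may be sketched as follows. The description states that the most recent document or a document containing images may be chosen. The rule of preferring a document with images and then the most recent date, the field names, and the sample dates are assumptions made for this sketch.

```python
from datetime import date


def pick_representative(group):
    """Return the representative document of a group of same documents.

    Assumed rule: a document containing images is preferred, and among
    equally ranked documents the most recent one is chosen.
    """
    return max(group, key=lambda d: (d["has_image"], d["published"]))


# Hypothetical group of documents regarded as the same document.
same_group = [
    {"id": "A", "has_image": True, "published": date(2005, 7, 1)},
    {"id": "B", "has_image": False, "published": date(2005, 7, 3)},
    {"id": "D", "has_image": True, "published": date(2005, 7, 2)},
]
representative = pick_representative(same_group)
```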
[47] When a user enters a search word in the searching unit 760, the searching unit 760 searches the documents extracted by the representative document extracting unit 740 and the similar document extracting unit 750. At this time, the more valuable representative documents are placed at the top of a search result page. Information concerning similar documents is linked so that the user can view it. The information recommendation unit 770 outputs valuable information according to a predetermined condition. For instance, information frequently appearing on the Internet is determined to be valuable information, and is thus automatically output from among the representative documents even though the user does not enter a search word. For example, since documents appearing more than one thousand times a day are regarded as an important issue, such documents need to be automatically output.
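The ordering performed by the searching unit 760 and the condition applied by the information recommendation unit 770 may be sketched as follows. The function names, the field names, and the sample counts are hypothetical. The threshold of one thousand appearances a day follows the example given above.

```python
def rank_representatives(groups):
    """Order groups so that the most frequently appearing comes first.

    Each group is a dict with a 'representative' title and a 'count'
    of documents regarded as the same document.
    """
    return sorted(groups, key=lambda g: g["count"], reverse=True)


def recommend(groups, daily_threshold=1000):
    """Return representatives appearing more than `daily_threshold` times."""
    return [g["representative"] for g in rank_representatives(groups)
            if g["count"] > daily_threshold]


# Hypothetical daily appearance counts for three groups.
daily_groups = [
    {"representative": "doc-x", "count": 40},
    {"representative": "doc-y", "count": 1500},
    {"representative": "doc-z", "count": 1000},
]
ranked = [g["representative"] for g in rank_representatives(daily_groups)]
recommended = recommend(daily_groups)
```

A group with exactly one thousand appearances is not recommended in this sketch, because the example above refers to documents appearing more than one thousand times a day.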
[48] Fig. 11 shows a result obtained from a text search method according to the present invention.
[49] According to the text search method, a sentence of high similarity is placed at the top of an output screen. The similarity is determined based on the above-mentioned method, and a document having a lot of similar sentences is regarded as a relatively significant document. For instance, if "Park Ji-Sung" is entered in a search window, search results are output in order of document significance. As described above, the most significant document is a document of highest appearance frequency. For example, in Fig. 11, the most significant document is an item "Photos in Park's home" 810. If an item "Similar documents" 820 is selected, its detailed content 820-1 is displayed on a new or current window.
[50] The above-mentioned text search method may be implemented as computer programs. Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art. In addition, the programs are stored in computer readable media and are read and executed by computers, thereby implementing the text search method. Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.
[51] While the present invention has been described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the present invention as defined by the following claims.
Industrial Applicability
[52] The present invention can be applied to industrial fields related to an information search method of providing or recommending information to users based on the significance of information.

Claims
[1] An information search method comprising:
(a) calculating similarities between a plurality of information;
(b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and
(c) displaying information search results on which the value is reflected.
[2] The information search method of claim 1, wherein the information is text information.
[3] The information search method of claim 2, wherein the operation (a) uses titles and main bodies of the information to calculate the similarities between the text information.
[4] The information search method of claim 3, wherein the operation (a) comprises:
(a1) dividing the text information into groups based on the number of words and particles contained in the text information;
(a2) generating inverted files with respect to each of the words in the groups;
(a3) removing text information of frequencies less than a predetermined threshold value from analysis of the inverted files to select text information of which similarities are to be calculated; and
(a4) calculating similarities between the selected text information and grouping information regarded as substantially the same text information into groups.
[5] The information search method of claim 4, wherein the operation (a4) puts a higher weight value on the title than on the main body to calculate the similarities.
[6] The information search method of any one of claims 1 to 5, wherein the operation (b) groups substantially the same information into groups based on the similarities, and estimates the value of information based on the number of information that is regarded as substantially the same information in each of the groups.
[7] The information search method of any one of claims 1 to 5, wherein the operation (b) groups the same information into groups based on the similarities, and extracts most recent information or information containing images in each of the groups as a representative document of each of the groups.
[8] The information search method of claim 1, wherein the operation (c) receives a search word from a user to conduct an information search, and displays information search results in order of more valuable information.
[9] The information search method of claim 1, wherein the operation (c) displays a representative document of a group containing the most valuable information to a user under a predetermined condition when the user does not enter a search word.
[10] An information search apparatus comprising: a text document storage unit storing text documents among information collected on the Internet; a similarity analyzing unit calculating similarities between the text documents; a representative document extracting unit grouping documents regarded as the same documents into groups based on the similarities and extracting a representative document from each of the groups; a similar document extracting unit extracting documents regarded as similar documents based on the similarities; and a searching unit displaying representative documents and similar documents corresponding to a search word in order of a representative document of higher appearance frequency and providing the similar documents that are linked.
[11] The information search apparatus of claim 10, further including an information recommendation unit regarding, as valuable documents, documents of appearance frequency more than a predetermined value among the representative documents extracted by the representative document extracting unit, and outputting the valuable documents without any request from users.
[12] The information search apparatus of claim 10, wherein the similarity analyzing unit groups the text documents into groups based on the number of words and particles contained in the text documents, generating inverted files with respect to each of the words in the groups, removing text documents of frequencies less than a predetermined threshold value to select text documents of which similarities are to be calculated, calculating similarities between the selected text documents, grouping text documents regarded as the same text document into groups and outputting similar text documents.
[13] The information search apparatus of claim 12, wherein a higher weight value is put on a title than on a main body of a text document to calculate the similarities.
[14] A computer-readable medium storing a program configured to execute on a computer the information search method of claim 1.
PCT/KR2006/002758 2005-07-15 2006-07-13 Information search method and information search apparatus on which information value is reflected WO2007011129A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008521324A JP4896132B2 (en) 2005-07-15 2006-07-13 Information retrieval method and apparatus reflecting information value

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2005-0064495 2005-07-15
KR1020050064495A KR100645614B1 (en) 2005-07-15 2005-07-15 Search method and apparatus considering a worth of information

Publications (1)

Publication Number Publication Date
WO2007011129A1 WO2007011129A1 (en) 2007-01-25

Family

ID=37654523

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/002758 WO2007011129A1 (en) 2005-07-15 2006-07-13 Information search method and information search apparatus on which information value is reflected

Country Status (3)

Country Link
JP (2) JP4896132B2 (en)
KR (1) KR100645614B1 (en)
WO (1) WO2007011129A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009048350A (en) * 2007-08-17 2009-03-05 Nec Corp Apparatus, method and program for evaluating information
WO2020095776A1 (en) * 2018-11-06 2020-05-14 株式会社 東芝 Knowledge information creation assistance device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5682113B2 (en) * 2010-01-22 2015-03-11 カシオ計算機株式会社 Information display device and program
KR101544142B1 (en) * 2010-04-06 2015-08-17 네이버 주식회사 Searching method and system based on topic
CN102411583B (en) * 2010-09-20 2013-09-18 阿里巴巴集团控股有限公司 Method and device for matching texts
JP5834815B2 (en) * 2011-11-22 2015-12-24 株式会社リコー Information processing apparatus, program, and method for retrieving documents
KR101527198B1 (en) * 2012-01-06 2015-06-09 (주)광개토연구소 Patent Intelligence System and its Method on Making Systemtic Relation on Technological Problems and Technical Solution
JP5921379B2 (en) 2012-08-10 2016-05-24 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Text processing method, system, and computer program.
JP2015092398A (en) * 2015-01-13 2015-05-14 カシオ計算機株式会社 Information display controller and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5924090A (en) * 1997-05-01 1999-07-13 Northern Light Technology Llc Method and apparatus for searching a database of records
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
KR20040029895A (en) * 2002-10-02 2004-04-08 씨씨알 주식회사 Search system
KR20040078632A (en) * 2004-08-23 2004-09-10 현인호 Apparatus and method for reconstructuring search research result using search engines
US20060020607A1 (en) * 2004-07-26 2006-01-26 Patterson Anna L Phrase-based indexing in an information retrieval system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010104873A (en) * 2000-05-16 2001-11-28 임갑철 System for internet site search service using a meta search engine
KR100643979B1 (en) * 2000-05-18 2006-11-13 엘지전자 주식회사 Information providing method for information searching result in an internet
JP2003044490A (en) * 2001-07-30 2003-02-14 Toshiba Corp Knowledge analytic system and overlapped knowledge registration setting method for the same
JP4142881B2 (en) * 2002-03-07 2008-09-03 富士通株式会社 Document similarity calculation device, clustering device, and document extraction device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009048350A (en) * 2007-08-17 2009-03-05 Nec Corp Apparatus, method and program for evaluating information
WO2020095776A1 (en) * 2018-11-06 2020-05-14 株式会社 東芝 Knowledge information creation assistance device
CN112567364A (en) * 2018-11-06 2021-03-26 株式会社东芝 Knowledge information creation support device

Also Published As

Publication number Publication date
JP4896268B2 (en) 2012-03-14
KR100645614B1 (en) 2006-11-14
JP4896132B2 (en) 2012-03-14
JP2009500764A (en) 2009-01-08
JP2011253572A (en) 2011-12-15

Similar Documents

Publication Publication Date Title
JP5083669B2 (en) Information extraction system, information extraction method, information extraction program, and information service system
WO2007011129A1 (en) Information search method and information search apparatus on which information value is reflected
US20060212441A1 (en) Full text query and search systems and methods of use
KR100898456B1 (en) Method for offering result of search and system for executing the method
CN100595753C (en) Text subject recommending method and device
CN101692223A (en) Refining a search space inresponse to user input
CN108090104B (en) Method and device for acquiring webpage information
JP5442401B2 (en) Behavior information extraction system and extraction method
KR20070009338A (en) Image search method and apparatus considering a similarity among the images
US20170228378A1 (en) Extracting topics from customer review search queries
CN111475725A (en) Method, apparatus, device, and computer-readable storage medium for searching for content
JP2006318398A (en) Vector generation method and device, information classifying method and device, and program, and computer readable storage medium with program stored therein
CN114330329A (en) Service content searching method and device, electronic equipment and storage medium
JP4883644B2 (en) RECOMMENDATION DEVICE, RECOMMENDATION SYSTEM, RECOMMENDATION DEVICE CONTROL METHOD, AND RECOMMENDATION SYSTEM CONTROL METHOD
Boughareb et al. A graph-based tag recommendation for just abstracted scientific articles tagging
Moumtzidou et al. Discovery of environmental nodes in the web
Jain et al. Organizing query completions for web search
JP5679400B2 (en) Category theme phrase extracting device, hierarchical tagging device and method, program, and computer-readable recording medium
Gao et al. Web-based citation parsing, correction and augmentation
WO2006046195A1 (en) Data processing system and method
WO2007011140A1 (en) Method of extracting topics and issues and method and apparatus for providing search results based on topics and issues
WO2024074760A1 (en) Content management arrangement
Alli Result Page Generation for Web Searching: Emerging Research and
Yang A Webpage Classification Algorithm Concerning Webpage Design Characteristics.
JP5485856B2 (en) Browsing log analysis device and browsing log analysis program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref document number: 2008521324

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS EPO FORM 1205A DATED 22.07.2008.

122 Ep: pct application non-entry in european phase

Ref document number: 06769275

Country of ref document: EP

Kind code of ref document: A1