WO2007011129A1 - Information search method and information search apparatus on which information value is reflected

Info

Publication number: WO2007011129A1
Application number: PCT/KR2006/002758
Authority: WO
Inventors: Seung-Jun Lee; Hyung-Gon Kim; Byung-Hak Kim; Seo-Dong Nam; Joong-Ho Shin
Original assignee: Chutnoon Inc.
Priority date: 2005-07-15
Filing date: 2006-07-13
Publication date: 2007-01-25
Also published as: JP4896268B2; KR100645614B1; JP4896132B2; JP2009500764A; JP2011253572A

Abstract

An information search method is provided, including: (a) calculating similarities between a plurality of information; (b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and (c) displaying information search results on which the value is reflected.

Description

INFORMATION SEARCH METHOD AND INFORMATION SEARCH APPARATUS ON WHICH INFORMATION VALUE IS REFLECTED

Technical Field

[1] The present invention relates to an information search technology and, more particularly, to an information search method and information search apparatus for providing or recommending information to users based on the significance of information.

Background Art

[2] With the explosive increase in information providers and users on the Internet, the Internet is now flooded with information of every kind. Search engines therefore play an important role in extracting the desired information. Whereas conventional search engines were designed to retrieve as much information as possible, current search engines are required to find and selectively offer the information that users actually want. For this purpose, it is necessary to offer information to users based on the significance of the information.

[3] A conventional search method calculates a similarity between a search word and a search document. That is, the similarity is calculated based on the number of times the search word appears in the search document. In a case when a search word of "Neowiz" appears ten times in a document and appears five times in another document, the similarities are 100% and 50% in the document and the other document, respectively.
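The arithmetic of this example can be sketched as follows. This is only an illustration: the text does not state how the counts are normalized, so the sketch assumes each count is divided by the highest count among the compared documents. The function name and the whitespace tokenization are likewise assumptions.

```python
def term_frequency_similarity(search_word, documents):
    """Score each document by how often the search word appears,
    normalized by the highest count (an assumed normalization)."""
    word = search_word.lower()
    counts = [doc.lower().split().count(word) for doc in documents]
    highest = max(counts, default=0)
    if highest == 0:
        return [0.0 for _ in counts]
    return [count / highest for count in counts]


# Two toy documents in which "Neowiz" appears ten times and five times.
doc_a = " ".join(["Neowiz"] * 10 + ["news"])
doc_b = " ".join(["Neowiz"] * 5 + ["news"])
print(term_frequency_similarity("Neowiz", [doc_a, doc_b]))  # [1.0, 0.5]
```

The scores depend only on how often the word occurs, which is the limitation discussed in the following paragraphs.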

[4] Meanwhile, the Boolean search model, extended Boolean search model, vector space model, probability distribution, Poisson model, and Lagrangian model may be used to calculate the similarities. However, since these methods calculate the similarities simply from the number of times search words appear, they cannot produce similarities that reflect the significance of information.

[5] Meanwhile, the significance of information may be calculated using hyperlinked web pages. That is, the significance of information is calculated based on the number of Internet links referring to the information. For instance, the more often other sites refer to a search document, the more significant the search document is considered to be. However, this method cannot be applied to all kinds of information. For instance, since the number of sites linked to Korean documents is relatively smaller than the number of sites linked to English documents, the method cannot be applied equally to Korean documents.

Disclosure of Invention

Technical Solution

[6] The present invention provides an information search method and information search apparatus for grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups.

Advantageous Effects

[7] According to the present invention, since the significance of information is determined based on the number of the same information and information is displayed to users in order of the significance of information, it is possible to provide the users with desired information. In addition, since similar documents are linked, search results are convenient to refer to. Further, since duplicate information is excluded from the search results, it is possible to reduce the time and effort required to check the search results.

Brief Description of the Drawings

[8] The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:

[9] Fig. 1 is a view for explaining a method of grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups;

[10] Fig. 2 is a flow chart of a text search method on which the significance of information is reflected, according to an embodiment of the present invention;

[11] Fig. 3 is a detailed flow chart of the text search method of Fig. 2;

[12] Figs. 4 to 6 illustrate a process of extracting a set of index keywords from a text document;

[13] Figs. 7 and 8 are views for explaining a method of calculating similarities between documents with a set of index keywords and searching for the same and similar documents;

[14] Fig. 9 is a flow chart of a method of reducing the number of documents of which similarities are to be calculated;

[15] Fig. 10 is a text search apparatus on which the significance of information is reflected, according to an embodiment of the present invention; and

[16] Fig. 11 is a result obtained from a text search method according to the present invention.

Best Mode for Carrying Out the Invention

[17] According to an aspect of the present invention, there is provided an information search method including: (a) calculating similarities between a plurality of information; (b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and (c) displaying information search results on which the value is reflected.

[18] The operation (a) may include: (a1) dividing the text information into groups based on the number of words and particles contained in the text information; (a2) generating inverted files with respect to each of the words in the groups; (a3) removing text information of frequencies less than a predetermined threshold value from analysis of the inverted files to select text information of which similarities are to be calculated; and (a4) calculating similarities between the selected text information and grouping information regarded as substantially the same text information into groups.

[19] The operation (a4) may put a higher weight value on the title than on the main body to calculate the similarities.

[20] According to another aspect of the present invention, there is provided an information search apparatus including: a text document storage unit storing text documents among information collected on the Internet; a similarity analyzing unit calculating similarities between the text documents; a representative document extracting unit grouping documents regarded as the same documents into groups based on the similarities and extracting a representative document from each of the groups; a similar document extracting unit extracting documents regarded as similar documents based on the similarities; and a searching unit displaying representative documents and similar documents corresponding to a search word in order of a representative document of higher appearance frequency and providing the similar documents that are linked.

Mode for the Invention

[21] Exemplary embodiments in accordance with the present invention will now be described in detail with reference to the accompanying drawings.

[22] Fig. 1 is a view for explaining a method of grouping information containing the same content into groups, extracting representative information from each of the groups, and offering the information to users based on the significance of information of each of the groups.

[23] Information collected on the Internet is grouped into groups, each having the same content. The term 'same content' does not imply 'exactly identical content' but 'content having a similarity more than a predetermined threshold value', i.e., 'substantially the same content.' That is, in a case where some sites commonly have information having the same content with respect to a search word, the information is grouped into a group. For instance, in case of a search word of "Neowiz", there may be several Internet sites containing the content that "...there appears a new search engine that can search for all the information on the Internet. One of search service companies, 'lnoon (http://www.lnoon.com)' which has been established by a second large stockholder of Neowiz, Byung-Kyu Chang, has recently developed a search engine that allows all the users to conveniently search for the Internet..." A group A 110 may include the above-mentioned information. A group B 120 may include a set of information containing the content that "...[Neowiz/Sayclub] Introduction to Neowiz and E-community..." A group C 130 may include a set of information containing the content that "...Management of Neowiz, Card, Casual, Mobile, Go-Stop game..."

[24] That is, a plurality of information having the same content is grouped into a group. Representative information is extracted from the group and displayed to a user. The representative information implies information representative of the group, and may be the most recent information or information containing images in the group.
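Selecting representative information from a group might be sketched as below. The record fields (`url`, `published`, `has_image`), the `prefer_images` switch, and the tie-breaking rule are assumptions made for illustration; the text only says that the representative may be the most recent item or an item containing images.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    url: str          # hypothetical identifier of the collected information
    published: date   # collection or publication date
    has_image: bool   # whether the item contains images


def pick_representative(group, prefer_images=False):
    """Return one item standing for the whole group.

    By default the most recent item wins. With prefer_images=True, items
    containing images are considered first, and the most recent of them wins.
    """
    candidates = group
    if prefer_images:
        with_images = [item for item in group if item.has_image]
        if with_images:
            candidates = with_images
    return max(candidates, key=lambda item: item.published)


group_a = [
    Item("site-1", date(2005, 7, 1), has_image=True),
    Item("site-2", date(2005, 7, 10), has_image=False),
    Item("site-3", date(2005, 6, 20), has_image=True),
]
print(pick_representative(group_a).url)                      # site-2
print(pick_representative(group_a, prefer_images=True).url)  # site-1
```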

[25] Search results are displayed to the user based on the number of times information containing the search word appears, such that valuable information is easily found by the user.

[26] Fig. 2 is a flow chart of a text search method on which the significance of information is reflected, according to an embodiment of the present invention.

[27] First, desired information is collected (operation S210), and a similarity between the information is calculated (operation S220). Conventionally, in a case when a hundred pieces of information are collected, 100 × 100 calculations are required to obtain similarities between the hundred pieces of information. A method of calculating the similarity will be described in detail with reference to Figs. 3 to 8. After the similarity calculation, a plurality of information having the same content is grouped into a group, duplicate information is removed, and representative information is extracted (operation S230). The significance of information is calculated based on the number of substantially the same information (operation S240). The representative information is output based on the significance of the group (operation S250). Representative information of a group in which information appears with high frequency is regarded as information with high significance, such that the representative information is placed at the top of an output screen or is highlighted on the output screen.
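Operations S230 to S250 can be illustrated with a minimal sketch. It assumes a pairwise similarity function is already available, uses a simple greedy grouping against the first member of each group, and takes 0.9 as the threshold for "substantially the same"; none of these choices is specified in the text.

```python
def group_and_rank(doc_ids, similarity, same_threshold=0.9):
    """Group documents regarded as the same and rank groups by size.

    A document joins the first group whose first member it matches at or
    above the threshold; otherwise it starts a new group. The group size
    stands in for the significance of the information.
    """
    groups = []
    for doc_id in doc_ids:
        for group in groups:
            if similarity(group[0], doc_id) >= same_threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    # Larger groups mean the content appears more often, so they rank first.
    return sorted(groups, key=len, reverse=True)


# Hypothetical pairwise scores; unlisted pairs are treated as dissimilar.
pair_scores = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "D")): 0.96,
    frozenset(("C", "E")): 0.92,
}


def toy_similarity(first, second):
    return pair_scores.get(frozenset((first, second)), 0.0)


print(group_and_rank(["C", "A", "B", "D", "E"], toy_similarity))
# [['A', 'B', 'D'], ['C', 'E']]
```

The first member of each ranked group could then serve as a stand-in for the representative information shown to the user.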

[28] Fig. 3 is a detailed flow chart of the text search method shown in Fig. 2.

[29] A method of calculating a similarity between text documents and providing its search results will be described. Index keywords are extracted from texts in documents to calculate similarities between the documents (operation S310). The keywords are compared with each other to calculate the similarities between the documents (operation S320). The more the same index keywords appear, the more similar the documents are presumed to be. The similarities may be calculated by assigning different weight values to a title and a main body of the document. For instance, when both documents have a lot of keywords similar to each other in their titles, there is a good possibility that both of the documents are similar to each other. Thus, the weight value may be assigned to the title upon calculating the similarities. As a result, the same and similar documents are determined based on the similarities (operation S330). A representative document is extracted from each of the groups (operation S340). The representative document is provided to users based on its significance (operation S350).
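A weighted comparison of title and body keywords might look like the following sketch. The Jaccard overlap measure, the weight values, and the dictionary layout are assumptions chosen for illustration; the text states only that a higher weight may be given to the title.

```python
def overlap(keywords_a, keywords_b):
    """Jaccard overlap between two keyword collections (0.0 to 1.0)."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def weighted_similarity(doc_a, doc_b, title_weight=2.0, body_weight=1.0):
    """Combine title and body overlap, weighting the title more heavily."""
    title_score = overlap(doc_a["title"], doc_b["title"])
    body_score = overlap(doc_a["body"], doc_b["body"])
    total = title_weight + body_weight
    return (title_weight * title_score + body_weight * body_score) / total


doc_a = {
    "title": ["Neowiz", "Search", "Separated"],
    "body": ["Neowiz", "Separated", "Search", "lnoon", "Test", "Commence"],
}
# doc_b is a made-up second document with the same title keywords.
doc_b = {
    "title": ["Neowiz", "Search", "Separated"],
    "body": ["Neowiz", "Search", "lnoon"],
}
print(round(weighted_similarity(doc_a, doc_b), 3))            # 0.833
print(round(weighted_similarity(doc_a, doc_b, 1.0, 1.0), 3))  # 0.75
```

With identical titles, raising the title weight moves the score from 0.75 to about 0.83, which reflects the presumption that matching titles indicate similar documents.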

[30] Figs. 4 to 6 illustrate a process of extracting a set of index keywords from a text document.

[31] Referring to Fig. 4, a document 410 consists of a word string 401 with respect to a title, and a word string 402 with respect to a main body. Referring to Fig. 5, it is assumed that the title is "Search Business Separated from Neowiz" 421, and the main body is "A new service provider, lnoon, separated from Neowiz launches services on a full scale. The lnoon (lnoon.com) expects to conduct a beta test as early as next month and then commence formal services on this coming October. From this year,..." Referring to Fig. 6, a set of index keywords 430 includes 'Neowiz, Search, and Separated' that are extracted as keywords for the title, and 'Neowiz, Separated, Search, lnoon, Test, and Commence' that are extracted as keywords for the main body.

[32] Figs. 7 and 8 are views for explaining a method of calculating similarities between documents with a set of index keywords and searching for the same and similar documents.

[33] Fig. 7 is a view for explaining a similarity comparison with reference to Figs. 4 to 6. Similarities between documents A and B, between documents A and C, and between documents A and D are 75%, 4%, and 96%, respectively. The similarities can be calculated according to the above-mentioned methods. For instance, the similarities can be calculated by comparing the keywords for title and for content under the same condition, or by putting a higher weight value on the keyword for title.

[34] Fig. 8 is a view for explaining a method of searching for documents identical and similar to each of documents based on the similarities shown in Fig. 7. A reference similarity used to determine the same and similar documents may vary. Fig. 8 shows that the number of the same documents as document A is twenty five, the same documents are documents B, D and so on, and similar documents are documents X, T and so on.
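The classification into the same and similar documents can be sketched with two reference similarities. The threshold values 0.7 and 0.4 are assumptions, chosen only so that documents B and D count as the same as document A, as described for Fig. 8. The score for document X is invented for illustration.

```python
def classify_against(reference_scores, same_threshold=0.7, similar_threshold=0.4):
    """Split other documents into 'same' and 'similar' relative to one document."""
    same, similar = [], []
    for doc_id, score in reference_scores.items():
        if score >= same_threshold:
            same.append(doc_id)
        elif score >= similar_threshold:
            similar.append(doc_id)
    return same, similar


# Scores against document A: B, C and D follow the Fig. 7 description;
# the score for X is a made-up value.
scores_for_a = {"B": 0.75, "C": 0.04, "D": 0.96, "X": 0.55}
same, similar = classify_against(scores_for_a)
print(same, similar)  # ['B', 'D'] ['X']
```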

[35] Fig. 9 is a flow chart of a method of reducing the number of documents of which similarities are to be calculated.

[36] It takes a large amount of calculation and a lot of time to make an index keyword list of all documents and to calculate similarities between the documents. Thus, it is necessary to reduce the number of the documents of which similarities are to be calculated. For this purpose, documents are first grouped into groups based on the number of words and particles constituting the documents (operation S610). If documents are similar to one another in the number of words and particles constituting them, there is a good possibility that the documents are similar to one another. Thus, such documents are grouped into the same group. The reference for grouping may vary. For instance, documents may be grouped in intervals of five words and particles, or in intervals of a different size.
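Operation S610 might be sketched as follows, assuming an interval of five as in the example. Splitting on whitespace is a simplification: counting Korean words and particles would require a morphological analyzer, which is outside this sketch.

```python
from collections import defaultdict


def bucket_by_length(documents, bucket_size=5):
    """Group document IDs whose token counts fall in the same interval."""
    buckets = defaultdict(list)
    for doc_id, text in documents.items():
        token_count = len(text.split())  # simplified word/particle count
        buckets[token_count // bucket_size].append(doc_id)
    return dict(buckets)


documents = {
    1: "a b c d e f",    # 6 tokens -> interval 1 (5 to 9 tokens)
    2: "a b c d e f g",  # 7 tokens -> interval 1
    3: "a b",            # 2 tokens -> interval 0 (0 to 4 tokens)
}
print(bucket_by_length(documents))  # {1: [1, 2], 0: [3]}
```

Only documents within the same interval would then be compared with one another. A fixed interval separates documents whose lengths fall just either side of a boundary, which may be one reason the reference for grouping is left variable.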

[37] An inverted file is generated for each of the groups (operation S620). The inverted file is generated by extracting words constituting the documents and collecting IDs of documents containing the words. For instance, in a case where there are documents DocID 1, DocID2,..., and DocID 100, and DocID 1 includes words A, B, C,..., and J, the following inverted files may be generated to search for documents similar to document DocID 1 :

[38] Inverted file of word A: documents DocID 2 and DocID 3

[39] Inverted file of word B : documents DocID 2, DocID 3, DocID 4, and DocID 5

[40] Inverted file of word C: documents DocID 2, DocID 3, DocID 5, DocID 6, and DocID 7

[41]

[42] Inverted file of word J: documents DocID 2, DocID 3, DocID 5, DocID 7, DocID 10,..., and DocID 85.
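Generating inverted files of this kind can be sketched as below. The input layout (a mapping from document ID to its list of words) is an assumption. Unlike the listing above, which omits DocID 1 because it is the document being compared, this sketch lists every document that contains the word.

```python
from collections import defaultdict


def build_inverted_files(documents):
    """Map each word to the sorted IDs of the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, words in documents.items():
        for word in set(words):
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


# Toy group of four documents, each given as a list of extracted words.
toy_documents = {
    1: ["A", "B"],
    2: ["A", "B"],
    3: ["A"],
    4: ["B"],
}
inverted_files = build_inverted_files(toy_documents)
print(inverted_files["A"])  # [1, 2, 3]
print(inverted_files["B"])  # [1, 2, 4]
```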

[43] After generating the inverted files, the inverted files are analyzed, and documents of frequency less than a threshold value are removed (operation S630). In the above-mentioned embodiment, when the inverted files of words A and B are compared with each other and the inverted file of word C is then compared, document DocID 4 of low appearance frequency is excluded and no longer compared. After the inverted file of word J is compared in this manner, documents DocIDs of low appearance frequency are excluded, thereby greatly reducing the number of documents to be compared with document DocID 1.
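The pruning of operation S630 might be sketched as follows. The text describes excluding documents step by step as each inverted file is compared; this sketch simplifies that to a single pass that counts in how many inverted files each document appears and keeps those meeting an assumed threshold of two.

```python
from collections import Counter


def select_candidates(inverted_files, min_shared_words=2):
    """Keep documents that occur in at least min_shared_words inverted files."""
    appearances = Counter()
    for doc_ids in inverted_files.values():
        appearances.update(set(doc_ids))
    return sorted(
        doc_id for doc_id, count in appearances.items()
        if count >= min_shared_words
    )


# Inverted files of words A, B and C from the example for document DocID 1.
files_for_doc1 = {
    "A": [2, 3],
    "B": [2, 3, 4, 5],
    "C": [2, 3, 5, 6, 7],
}
print(select_candidates(files_for_doc1))  # [2, 3, 5]
```

DocID 4 appears in only one of the three inverted files and is therefore dropped, in line with the example above.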

[44] Fig. 10 is a text search apparatus on which the significance of information is reflected, according to an embodiment of the present invention.

[45] The text search apparatus includes a web data storage unit 710, a text document storage unit 720, a similarity analyzing unit 730, a representative document extracting unit 740, a similar document extracting unit 750, a searching unit 760, and an information recommendation unit 770.

[46] The web data storage unit 710 collects and stores information existing on the Internet. The text document storage unit 720 stores text documents among the information. The similarity analyzing unit 730 groups the text documents into groups based on the number of words and particles contained in the text documents, generates inverted files with respect to each of the words, removes text documents of frequency less than a predetermined threshold value to select text documents of which similarities are to be calculated, and calculates the similarities between the documents. The representative document extracting unit 740 groups documents regarded as the same text documents into the same groups, and extracts a representative document from each of the groups. As described above, the most recent document or document containing images may be extracted as the representative document. The similar document extracting unit 750 extracts, as similar text documents, documents having similarities more than a predetermined value determined based on similarities calculated by the similarity analyzing unit 730.

[47] When a user enters a search word in the searching unit 760, the searching unit 760 searches the representative document extracting unit 740 and the similar document extracting unit 750. At this time, a valuable one of the representative documents is placed at the top of a search result page. Information concerning similar documents is linked so that the user can view it. The information recommendation unit 770 outputs valuable information according to a predetermined condition. For instance, information frequently appearing on the Internet is determined as valuable information, and is thus automatically output among representative documents even though the user does not enter the search word. For example, since documents appearing more than one thousand times a day are regarded as an important issue, the documents need to be automatically output.
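The behaviour of the information recommendation unit 770 can be illustrated with a short sketch. The record layout and the field name `appearances_today` are hypothetical; the threshold of one thousand appearances a day is taken from the example in the text.

```python
def recommend(representatives, daily_threshold=1000):
    """Return representative documents that appeared more than
    daily_threshold times in a day, most frequent first."""
    hot = [
        doc for doc in representatives
        if doc["appearances_today"] > daily_threshold
    ]
    return sorted(hot, key=lambda doc: doc["appearances_today"], reverse=True)


# Hypothetical representative documents with invented daily counts.
representatives = [
    {"title": "doc-a", "appearances_today": 1500},
    {"title": "doc-b", "appearances_today": 200},
    {"title": "doc-c", "appearances_today": 3200},
]
print([doc["title"] for doc in recommend(representatives)])  # ['doc-c', 'doc-a']
```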

[48] Fig. 11 is a result obtained from a text search method according to the present invention.

[49] According to the text search method, a sentence of high similarity is placed at the top of an output screen. The similarity is determined based on the above-mentioned method, and a document having a lot of similar sentences is regarded as a relatively significant document. For instance, if "Park Ji-Sung" is entered in a search window, search results are output in order of document significance. As described above, the most significant document is a document of highest appearance frequency. For example, in Fig. 11, the most significant document is an item "Photos in Park's home" 810. If an item "Similar documents" 820 is selected, its detailed content 820-1 is displayed on a new or current window.
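The ordering of search results by appearance frequency, with similar documents attached, might be sketched as below. The index entries, field names, and substring matching are assumptions made for illustration and are not taken from Fig. 11.

```python
def search(search_word, groups):
    """Return groups whose representative text contains the search word,
    ordered by appearance frequency, with their similar documents attached."""
    word = search_word.lower()
    matches = [
        group for group in groups
        if word in group["representative"].lower()
    ]
    matches.sort(key=lambda group: group["appearance_count"], reverse=True)
    return [
        (group["representative"], group["appearance_count"], group["similar"])
        for group in matches
    ]


# Hypothetical index entries; titles, counts and links are invented.
groups = [
    {"representative": "Park Ji-Sung interview",
     "appearance_count": 12, "similar": ["doc-17"]},
    {"representative": "Photos of Park Ji-Sung at home",
     "appearance_count": 40, "similar": ["doc-3", "doc-8"]},
    {"representative": "Weather forecast",
     "appearance_count": 90, "similar": []},
]
results = search("Park Ji-Sung", groups)
for title, count, similar in results:
    print(title, count, similar)
```

The unrelated entry is filtered out despite its high count, and the remaining entries are listed with the most frequently appearing one first.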

[50] The above-mentioned text search method may be implemented as computer programs. Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art. In addition, the programs are stored in computer readable media, read and executed by computers, thereby implementing the text search method. Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.

[51] While the present invention has been described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the present invention as defined by the following claims.

Industrial Applicability

[52] The present invention can be applied to industrial fields related to an information search method of providing or recommending information to users based on the significance of information.

Claims

[1] An information search method comprising:

(a) calculating similarities between a plurality of information;

(b) grouping the same information into groups based on the similarities and calculating value of the information based on the number of information that are regarded as substantially the same information; and

(c) displaying information search results on which the value is reflected.

[2] The information search method of claim 1, wherein the information is text information.

[3] The information search method of claim 2, wherein the operation (a) uses titles and main bodies of the information to calculate the similarities between the text information.

[4] The information search method of claim 3, wherein the operation (a) comprises:

(a1) dividing the text information into groups based on the number of words and particles contained in the text information;

(a2) generating inverted files with respect to each of the words in the groups;

(a3) removing text information of frequencies less than a predetermined threshold value from analysis of the inverted files to select text information of which similarities are to be calculated; and

(a4) calculating similarities between the selected text information and grouping information regarded as substantially the same text information into groups.

[5] The information search method of claim 4, wherein the operation (a4) puts a higher weight value on the title than on the main body to calculate the similarities.

[6] The information search method of any one of claims 1 to 5, wherein the operation (b) groups substantially the same information into groups based on the similarities, and estimates the value of information based on the number of information that is regarded as substantially the same information in each of the groups.

[7] The information search method of any one of claims 1 to 5, wherein the operation (b) groups the same information into groups based on the similarities, and extracts most recent information or information containing images in each of the groups as a representative document of each of the groups.

[8] The information search method of claim 1, wherein the operation (c) receives a search word from a user to conduct an information search, and displays information search results in order of more valuable information.

[9] The information search method of claim 1, wherein the operation (c) displays a representative document of a group containing the most valuable information to a user under a predetermined condition when the user does not enter a search word.

[10] An information search apparatus comprising: a text document storage unit storing text documents among information collected on the Internet; a similarity analyzing unit calculating similarities between the text documents; a representative document extracting unit grouping documents regarded as the same documents into groups based on the similarities and extracting a representative document from each of the groups; a similar document extracting unit extracting documents regarded as similar documents based on the similarities; and a searching unit displaying representative documents and similar documents corresponding to a search word in order of a representative document of higher appearance frequency and providing the similar documents that are linked.

[11] The information search apparatus of claim 10, further including an information recommendation unit regarding, as valuable documents, documents of appearance frequency more than a predetermined value among the representative documents extracted by the representative document extracting unit, and outputting the valuable documents without any request from users.

[12] The information search apparatus of claim 10, wherein the similarity analyzing unit groups the text documents into groups based on the number of words and particles contained in the text documents, generating inverted files with respect to each of the words in the groups, removing text documents of frequencies less than a predetermined threshold value to select text documents of which similarities are to be calculated, calculating similarities between the selected text documents, grouping text documents regarded as the same text document into groups and outputting similar text documents.

[13] The information search apparatus of claim 12, wherein a higher weight value is put on a title than on a main body of a text document to calculate the similarities.

[14] A computer-readable medium storing a program configured to execute on a computer the information search method of claim 1.