CN1858737B

CN1858737B - Method and system for data searching

Info

Publication number: CN1858737B
Application number: CN2006100027599A
Authority: CN
Inventors: 朱鹏喜
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2006-01-25
Filing date: 2006-01-25
Publication date: 2010-06-02
Anticipated expiration: 2026-01-25
Also published as: WO2007085187A1; CN1858737A

Abstract

The invention provides a method for searching data, comprising: generating a first index file covering all the information to be searched; generating an independent second index file for each piece of information; and obtaining related information according to the input keywords. When an advanced search is carried out, the method further comprises: decomposing the input query condition; searching the first index file for the first sub-condition to obtain the identifiers of the information that satisfies it; and searching, according to the second sub-condition, in the second index files corresponding to the information that satisfies the first sub-condition. Because the invention maintains multiple second index files, it is easy to apply multitasking so that the queries run cooperatively.

Description

A method and system for data search

Technical field

The present invention relates to the field of data search, and in particular to a method of generating index files, a data search method that increases search speed by improving the index files, and a search engine system.

Background technology

With the rapid development of the Internet, electronic information keeps growing in volume and richness, yet it is scattered across servers at numerous nodes. For ordinary users, how to find the information they need quickly and accurately is a major problem in urgent need of a solution. Search engines build the bridge between users and information sources.

A search engine uses automatic crawling programs, such as web crawlers, spiders, or robots, to traverse the nodes of a wide area network (the Internet) or a local area network (an intranet). It uses global search technology to analyze the information fetched from each node, indexes and classifies it, builds a corresponding database, and keeps a copy of the information for users to query. The basic principle is to start from a set of known documents, determine new information points from each document's summary and hyperlinks, visit those information points with the search engine's traversal program, and then index and classify the documents found there and organize them into the index database. By this recursive traversal, all information can logically be added to the index database in the end. When a user enters a keyword into the search engine, the search program reads the index database, matches its contents against the user's keyword, retrieves the corresponding or related information, and outputs it to the user in some organized form.

Search engines are generally based on full-text retrieval, and full-text retrieval generally uses a B-tree as its data structure. How information is stored in the data structure, and whether search time can be reduced effectively, are questions every search engine must consider. Most current search engines store index information in a B-tree (see Fig. 1). Each index node stores the information of one word and is linked to the identifiers (IDs) of the documents containing that word; those document IDs are attached to the index node as a singly or doubly linked list, ordered by the word's frequency in each document. Here a "word" is the smallest unit that can express information in a search engine.

Fig. 1 shows the structure after a search engine has indexed some documents. All information from all documents is kept in the B-tree, and each node of the B-tree is a word. The word "I" points to a linked list that holds every document containing "I" together with the number of times "I" occurs in that document; in this figure the documents containing "I" are document 10298 and document 786. The word "China" likewise points to a linked list that holds every document containing "China" together with the number of times "China" occurs in that document; consistent with the example that follows, the documents containing "China" are document 786 and document 26543. The number of occurrences is expressed as a word frequency: "I" occurs 4 times in document 786, and "China" occurs once in document 786.

When searching with an index file generated this way, search speed suffers badly in advanced searches, for example in queries containing AND, OR, and NOT operations, where those operations must be applied to the corresponding document ID lists. For example, with the index file above, suppose the user wants the documents that contain both "I" and "China" and enters those keywords. The search quickly finds that the documents containing "I" are document 10298 and document 786, and that those containing "China" are document 786 and document 26543. The documents containing both are therefore the intersection of (10298, 786) and (786, 26543), and computing this intersection can seriously slow the search.
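The prior-art query described above can be sketched as follows. This is only an illustration under stated assumptions: a plain Python dict stands in for the B-tree and its linked lists, the function name `search_and` is hypothetical, and the frequencies for documents 10298 and 26543 are made up (the text gives frequencies only for document 786).

```python
# Single shared index as in Fig. 1: word -> {document id: word frequency}.
# A dict replaces the B-tree and linked lists purely for brevity (assumption).
big_index = {
    "I": {10298: 2, 786: 4},       # frequency for 10298 is illustrative
    "China": {786: 1, 26543: 3},   # frequency for 26543 is illustrative
}


def search_and(index, first_word, second_word):
    """Return the ids of documents containing both words."""
    first_ids = set(index.get(first_word, {}))
    second_ids = set(index.get(second_word, {}))
    # This intersection of two id lists is the step the background
    # section identifies as costly when the lists are long.
    return first_ids & second_ids


print(search_and(big_index, "I", "China"))  # prints {786}
```

With only two short lists the intersection is trivial; the concern raised in the text is its cost when each word maps to a very large number of documents.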

Summary of the invention

In view of the above problems, the technical problem to be solved by the present invention is to provide a method of generating index files that improves search speed in advanced searches.

Another object of the present invention is to provide a method of searching data based on the index files so generated, and a search engine system, both of which improve search speed in advanced searches.

To solve the above technical problem, the objects of the invention are achieved through the following technical solutions:

The invention provides a method of searching data, comprising the following steps. A first index file is generated for all the information documents to be searched; the first index file records which words each of those information documents contains. An independent second index file is generated for each of those information documents; the second index file records the words contained in the corresponding information document and the number of times each word occurs. Generating a second index file comprises steps A to F:

A. Determine whether the information document contains a next word; if not, finish generating this second index file; if so, go to step B;

B. Determine whether the B-tree of this second index file already has a node for this word; if so, go to step C; if not, go to step F;

C. Determine whether the linked list the word points to contains the identifier of this information document; if so, go to step D; if not, go to step E;

D. Add 1 to the word frequency recorded for this information document in the linked list the word points to;

E. Add the identifier of this information document to the linked list the word points to, and set its word frequency to 1;

F. Add a node for this word to the B-tree of this second index file, add the identifier of this information document to its linked list, and set the word frequency to 1.

When an advanced search is carried out: the input query condition is decomposed; the first index file is searched for the first sub-condition to obtain the identifiers of the information documents that satisfy it; and, according to the second sub-condition, the second index files corresponding to the information documents that satisfy the first sub-condition are searched to obtain the relevant information documents.

Preferably, in the method of searching data, if more than one information document satisfies the first sub-condition, multitasking may be used to search the second index files of those information documents separately.

Preferably, the method for described search data can also comprise: if there is next sub-condition, then according to next sub-condition, retrieve in pairing second index file of subconditional information document on satisfying.

Preferably, the first index file or the second index file may use a B-tree as its data structure.

Preferably, the method for described search data, all right: the number of times that occurs in information document according to querying condition sorts to result for retrieval and exports this result for retrieval, obtains preferable client's feedback.

The invention also discloses a search system, comprising:

a module for generating a first index file for all the information documents to be searched, the first index file recording which words each of those information documents contains;

a module for generating an independent second index file for each of those information documents, the second index file recording the words contained in the corresponding information document and the number of times each word occurs; this module comprises the following modules A to F:

Module A, for executing step A: determine whether the information document contains a next word; if not, finish generating this second index file; if so, go to step B;

Module B, for executing step B: determine whether the B-tree of this second index file already has a node for this word; if so, go to step C; if not, go to step F;

Module C, for executing step C: determine whether the linked list the word points to contains the identifier of this information document; if so, go to step D; if not, go to step E;

Module D, for executing step D: add 1 to the word frequency recorded for this information document in the linked list the word points to;

Module E, for executing step E: add the identifier of this information document to the linked list the word points to, and set its word frequency to 1;

Module F, for executing step F: add a node for this word to the B-tree of this second index file, add the identifier of this information document to its linked list, and set the word frequency to 1;

a module for decomposing the input query condition when an advanced search is carried out; a module for searching the first index file for the first sub-condition to obtain the identifiers of the information documents that satisfy it; and a module for searching, according to the second sub-condition, in the second index files corresponding to the information documents that satisfy the first sub-condition, to obtain the relevant information documents.

Compared with the prior art, the technical solution of the invention has the following advantages:

In addition to the index file built over all information, the invention also generates a separate, independent index for each information document. Multitask cooperative querying is therefore easier to apply in advanced searches, which can speed up both single and concurrent queries.

The prior art has to solve the problem of storing and accessing a very large index file when the operating system's physical disk addressing is limited. The invention reduces such operations by generating a separate, independent index for each information document, so it can reduce the memory and time they consume and increase search speed.

In the prior art, every secondary or advanced search loads the large index file into memory. The invention loads the large index file, i.e. the first index file, only once and then calls the independent index files of individual documents. Those index files hold little data, which saves physical overhead. Moreover, because the independent index file of a single document contains little information, searching it is fast, and the time-consuming computation of intersections between result sets is no longer needed, so search speed in advanced searches can be improved.

Description of drawings

Fig. 1 is a structural diagram of an index file generated by the prior art;

Fig. 2 is a flow chart of the steps by which the invention generates the first index file for all the information to be searched;

Fig. 3 is a flow chart of the steps by which the invention generates an independent index file for a single document;

Fig. 4 is a structural diagram of the independent index file that the invention generates for a single document;

Fig. 5 is a flow chart of the steps of a data retrieval method of the invention.

Embodiment

The core idea of the invention is as follows: in addition to the index file built over all information, a separate, independent index is generated for each information document. Multitask cooperative querying can therefore be used, which saves physical resources and increases retrieval speed.

Fig. 2 is a flow chart of the steps by which the invention generates the first index file for all the information to be searched.

The information to be retrieved generally comprises electronic information scattered across servers at numerous nodes. In practice it is stored in many forms, for example as documents, web pages, or database records. The invention is described below using web pages as the form of the information to be retrieved, but it is not limited to information in this form.

Each web page has to be analyzed in turn, and the words each web page contains are all recorded in the first index file, forming a B-tree with a relatively large amount of data; the first index file is therefore also commonly called the big index file.

An index file needs some data structure in which to store its index information; general full-text retrieval, for example, uses a B-tree. "Data structure" is a term used widely throughout computer science and technology. It describes the internal composition of data: which component data a piece of data consists of, how they are combined, and what structure they form. A data structure is the form in which data exists and generally reflects the logical relationships among its components. It is a way of organizing information whose purpose is to make algorithms more efficient, and it usually comes with a set of algorithms that perform operations on the data it holds. General full-text retrieval uses the B-tree precisely to obtain a higher search speed. The invention does not, of course, restrict the data structure of the index file, and any other workable data structure may be used, for example a binary search tree (BST), a balanced AVL tree, or a heap.

To describe clearly how an index file is generated for the web pages to be retrieved, Fig. 2 takes the B-tree structure as an example and shows one sequence of steps for generating the index file. The steps are described in detail below:

(1) Determine whether the web page contains a next word; if not, finish the analysis of this web page; if so, go to step (2);

(2) Determine whether the B-tree of the first index file already has a node for this word; if so, go to step (3); if not, go to step (6);

(3) Determine whether the linked list the word points to contains the identifier (ID) of this web page; if so, go to step (4); if not, go to step (5);

(4) Add 1 to the word frequency recorded for this web page in the linked list the word points to;

(5) Add the identifier of this web page to the linked list the word points to, and set its word frequency to 1;

(6) Add a node for this word to the B-tree of the first index file, add the identifier of this web page to its linked list, and set the word frequency to 1.

(7) Repeat the above steps until every web page to be retrieved has been analyzed, then stop; the index information of all web pages is now stored in the first index file.
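Steps (1) to (7) can be sketched roughly as below. This is a minimal illustration rather than the patented implementation: a dict replaces the B-tree and the linked lists, pages are assumed to be already split into words, and the names `add_word` and `build_first_index` are hypothetical.

```python
def add_word(index, word, page_id):
    """Apply steps (2) to (6) for one occurrence of `word` on page `page_id`."""
    postings = index.get(word)        # step (2): is there a node for this word?
    if postings is None:
        index[word] = {page_id: 1}    # step (6): new node, frequency set to 1
    elif page_id in postings:         # step (3): is this page already listed?
        postings[page_id] += 1        # step (4): add 1 to the word frequency
    else:
        postings[page_id] = 1         # step (5): list the page, frequency 1


def build_first_index(pages):
    """`pages` maps page id -> list of words.

    Returns the first index as word -> {page id: word frequency}.
    """
    index = {}
    for page_id, words in pages.items():  # step (7): repeat for every page
        for word in words:                # step (1): take the next word, if any
            add_word(index, word, page_id)
    return index


first_index = build_first_index({
    786: ["I", "love", "China", "I"],
    10298: ["I"],
})
print(first_index["I"])  # prints {786: 2, 10298: 1}
```

A dict gives average constant-time lookup, whereas the B-tree named in the text keeps words ordered and suits on-disk storage; the branching logic of the steps is the same either way.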

The above is one method of generating the first index file. Other methods known to those skilled in the art can of course be used, and the invention places no restriction on this. The index file may also use a different data structure, in which case the generation steps will differ; since that belongs to technology known in the art, it is not described further here.

After the steps shown in Fig. 2 are completed, all the information of all the web pages to be retrieved is stored in a B-tree like the one shown in Fig. 1. Each node of the B-tree is a word. For example, the word "I" points to a linked list that holds every web page containing "I" together with the number of times "I" occurs in that web page.

In the prior art, once the big index file has been generated, retrieval can proceed from the input keywords. To increase speed in advanced searches, however, the index generation method of the invention adds a step S2 after the first index file is generated: generating an independent second index file for each web page. The second index files are independent of one another, and each web page has its own second index file, which describes and records the words the web page contains and the number of times each word occurs. If the B-tree structure is still used, each second index file is a B-tree and is associated with the first index file through the identifier of the corresponding web page.

Fig. 3 takes the B-tree as an example and shows the steps for generating an independent second index file for each web page.

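(1) Determine whether the web page contains a next word; if not, finish generating the second index file for this web page; if so, go to step (2);

(2) Determine whether the B-tree of this second index file already has a node for this word; if so, go to step (3); if not, go to step (6);

(3) Determine whether the linked list the word points to contains the identifier of this web page; if so, go to step (4); if not, go to step (5);

(4) Add 1 to the word frequency recorded for this web page in the linked list the word points to;

(5) Add the identifier of this web page to the linked list the word points to, and set its word frequency to 1;

(6) Add a node for this word to the B-tree of this second index file, add the identifier of this web page to its linked list, and set the word frequency to 1.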
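The per-page procedure can be sketched in the same way as the first-index procedure, whose step numbers (1) to (6) are reused in the comments. The sketch assumes a dict in place of each B-tree and pre-split word lists; `build_second_index` and `build_all_second_indexes` are hypothetical names.

```python
def build_second_index(page_id, words):
    """Build the independent index of one page: word -> {page id: frequency}."""
    index = {}
    for word in words:                  # step (1): take the next word, if any
        postings = index.get(word)      # step (2): node for this word?
        if postings is None:
            index[word] = {page_id: 1}  # step (6): new node, frequency 1
        elif page_id in postings:       # step (3): page already listed?
            postings[page_id] += 1      # step (4): add 1 to the frequency
        else:
            postings[page_id] = 1       # step (5): list the page, frequency 1
    return index


def build_all_second_indexes(pages):
    """One small index per page, keyed by page id so that the identifiers
    returned by the first index lead directly to the matching second index."""
    return {pid: build_second_index(pid, words) for pid, words in pages.items()}


second_indexes = build_all_second_indexes({786: ["I", "China", "I"]})
print(second_indexes[786])  # prints {'I': {786: 2}, 'China': {786: 1}}
```

In this sketch the step (5) branch is never taken, because a node in a single-page index is always created together with that page's identifier; it is kept only to mirror the listed steps.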

Completing the above generation steps for a web page yields the index information storage structure shown in Fig. 4, which is a structural diagram of the independent second index file that the invention generates for a single web page.

The first index file or the second index file of the invention may preferably use a B-tree as its data structure. Other data structures known to those skilled in the art may of course be used as the storage form of the index file, and the invention places no restriction on this. The steps for generating the first and second index files are given above to show that the invention can be implemented. A core point of the invention is that, on top of the big index file, an independent second index file is also generated for each web page. Multitasking (multiple processes) can then be used more easily to query several index files at once, and repeated calls to the first index file, with its huge amount of data, are avoided, so data retrieval becomes faster, especially in advanced searches. An advanced search is generally used for finer and more precise retrieval of web pages, and its search condition may be a single condition or a compound condition. A compound condition is generally a combination of several sub-conditions, each of which is a single condition, and the sub-conditions may be joined by AND, OR, or NOT.

Referring to Fig. 5, the invention provides a method of data retrieval based on the above method of generating index files, comprising the following steps: generating a first index file for all the information to be retrieved; generating an independent second index file for each piece of information; and obtaining related information according to the input keywords.

In an ordinary search, it suffices to search the first index file according to the input query condition to obtain the required information. The method may also sort the search results by the number of times the keyword occurs in each document and output the sorted results.

An advanced search may further comprise: decomposing the input query condition; searching the first index file for the first sub-condition to obtain the identifiers of the information that satisfies it; and searching, according to the second sub-condition, in the second index files corresponding to the information that satisfies the first sub-condition. If more than one piece of information satisfies the first sub-condition, multitasking is used to search the second index files of those pieces of information separately.

For example, once the index generation steps shown in Fig. 3 have produced the B-tree shown in Fig. 4, an advanced search can first decompose the input query condition into several sub-conditions. The first sub-condition is searched in the big index file (the first index file), which yields the identifiers (IDs) of all documents that satisfy it. The second sub-condition is then searched, one after another or simultaneously, in the small index files (the second index files) corresponding to those document identifiers, and so on. If there is a next sub-condition, it is searched in the second index files corresponding to the documents that satisfy the previous sub-condition, which yields a more precise query result. Preferably, the first index file or the second index file may use a B-tree as its data structure.
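The sequential variant of this procedure might look like the sketch below. It assumes that every sub-condition is a single keyword and that the sub-conditions are joined by AND only; OR and NOT are not handled. Dicts stand in for the B-trees, `advanced_search` is a hypothetical name, and the frequencies for documents 10298 and 26543 are illustrative.

```python
# First (big) index: word -> {document id: frequency}.
first_index = {
    "I": {10298: 2, 786: 4},
    "China": {786: 1, 26543: 3},
}

# Second (small) indexes: one per document, keyed by document id.
second_indexes = {
    10298: {"I": {10298: 2}},
    786: {"I": {786: 4}, "China": {786: 1}},
    26543: {"China": {26543: 3}},
}


def advanced_search(first_index, second_indexes, sub_conditions):
    """Return ids of documents that satisfy every keyword sub-condition."""
    first, *rest = sub_conditions
    # First sub-condition: a single lookup in the big first index.
    candidates = list(first_index.get(first, {}))
    # Each further sub-condition is checked only in the small indexes of the
    # documents that satisfied the previous one; no id lists are intersected.
    for word in rest:
        candidates = [d for d in candidates if word in second_indexes[d]]
    return candidates


print(advanced_search(first_index, second_indexes, ["I", "China"]))  # prints [786]
```

The first index is consulted once per query, and the candidate list can only shrink with each further sub-condition.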

When the second index files are searched, the second index files of the documents that satisfy the previous sub-condition may be searched one after another, or multitasking may be used to search them simultaneously. Existing search methods have only one index file. When searching on two or more sub-conditions they can only queue the sub-conditions, call that index file for each in turn, and then compute the intersection of the sets of information satisfying each sub-condition to obtain the result for the compound query condition. Calling the huge first index file repeatedly and computing the intersection of the result sets makes the search slow. The invention changes the situation in which only one index file exists: it provides multiple second index files that can conveniently be searched by multiple processes, so retrieval can be faster. Each index file loaded is small, the large first index file need not be loaded every time, not much memory is taken up, and the step of merging the document IDs retrieved for each condition and taking their intersection is omitted, so retrieval time is saved and retrieval speed is greatly improved.

For example, when searching for documents that contain both "I" and "China", the search engine first finds in the big index file all documents that contain "I", namely documents 10298 and 786. It then starts two tasks (processes or threads) that search for the keyword "China" in the index files of document 10298 and document 786 respectively. For a single document such as 10298 there is little index information, so the search takes very little time. It quickly determines that document 10298 does not contain "China" and that document 786 does, so the document containing both "I" and "China" is document 786.
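The two-task lookup in this example can be sketched with a thread pool from Python's standard library. This is only one possible reading of "multitask": the text allows multiple processes or threads, and the dict-based indexes, the illustrative frequency for document 10298, and the names `contains` and `parallel_filter` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Per-document second indexes for the two documents that contain "I".
second_indexes = {
    10298: {"I": {10298: 2}},
    786: {"I": {786: 4}, "China": {786: 1}},
}


def contains(doc_id, word):
    """One task: look the word up in a single document's small index."""
    return doc_id, word in second_indexes[doc_id]


def parallel_filter(candidate_ids, word):
    """Run one lookup task per candidate document and keep the hits."""
    workers = len(candidate_ids) or 1  # the pool needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(contains, candidate_ids, [word] * len(candidate_ids))
        return [doc_id for doc_id, hit in results if hit]


# Documents 10298 and 786 satisfied the first sub-condition "I".
print(parallel_filter([10298, 786], "China"))  # prints [786]
```

Threads are used here only to show the task structure; for in-memory lookups this small, the cost of starting tasks would outweigh any gain, and the benefit described in the text presumes index files that are loaded and searched independently.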

The method of retrieving data may further sort the search results by the number of times the keyword occurs in each document and output the sorted results. To serve users' retrieval needs better, outputting only the documents that meet the query condition is not enough; the relevance between each document and the keyword must also be evaluated and the output sorted, so as to achieve a better user feedback mechanism. An inverted file keyed by word is built and matched continually against the search terms; the relevance of each document to the query is determined from the frequency and probability with which the user's keywords occur in the document; and the documents containing those search terms are sorted and output as the search result.
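Sorting by occurrence count can be sketched as follows. The text does not say how the counts of several keywords are combined, so summing them is an assumption made here, as are the function name `rank_by_frequency`, the dict-based indexes, and the sample document ids.

```python
def rank_by_frequency(doc_ids, second_indexes, keywords):
    """Sort documents by total keyword frequency, highest first."""
    def score(doc_id):
        index = second_indexes[doc_id]
        # Sum the recorded frequency of each keyword in this document;
        # a keyword that is absent contributes 0.
        return sum(index.get(word, {}).get(doc_id, 0) for word in keywords)

    return sorted(doc_ids, key=score, reverse=True)


# Hypothetical per-document indexes: word -> {document id: frequency}.
second_indexes = {
    1: {"China": {1: 2}},
    2: {"China": {2: 5}, "I": {2: 1}},
}
print(rank_by_frequency([1, 2], second_indexes, ["China", "I"]))  # prints [2, 1]
```

Because each document's frequencies already sit in its own second index, the score can be computed without going back to the first index.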

The invention also provides a search engine, which may comprise the following units:

A search module, used to acquire information. Through the corresponding program code, the search module discovers and gathers information on the network, fetches pages by following web page links, analyzes them, and adds them to the database, thereby acquiring electronic information.

An index module, which generates the first index file for the information acquired by the search module and generates an independent second index file for each piece of information. The index module is mainly used to interpret the information gathered by the search module, extract index entries from it, generate the descriptions and representations that stand for each document, build the index table of the documents, and form a unified physical index database, thereby giving structure to unstructured information. In general, the quality of a search engine depends mainly on how sound and effective its index is: the higher the quality of the index file, the higher the retrieval quality and speed, and the better the engine performs. The invention is precisely an improvement of the index module: on top of the single index file, an independent second index file is generated for each document, which reduces the number of calls to the first index file and allows multitask retrieval, and so increases retrieval speed.

A query module, which queries according to the input keywords and outputs the query result. The query module quickly finds documents in the index files according to the user's query, evaluates the relevance between each document and the keywords, and sorts the results to be output, so as to achieve a better user feedback mechanism. By scanning each word of the documents, the query module builds an inverted file keyed by word and matches it continually against the search terms; it determines the relevance of each document to the query from the frequency and probability with which the user's keywords occur in the document, sorts the documents containing those search terms, and outputs the search result.

An interface unit, used to input keywords and display query results.

Preferably, the query module may further: decompose the input query condition; search the first index file for the first sub-condition to obtain the identifiers of the information that satisfies it; and search, according to the second sub-condition, in the second index files corresponding to the information that satisfies the first sub-condition.

The method and system for data search provided by the invention have been described in detail above. Specific examples have been used to set out the principle and embodiments of the invention, and the description of the embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the invention.

Claims

1. the method for a search data is characterized in that, comprising:

Full detail document at the needs retrieval generates first index file, and described first index file writes down the situation of the word that each information document contains in the described full detail document;

Generate independently second index file at each information document in the described full detail document, described second index file writes down the number of times that the word that contains in the corresponding information document and this speech occur; The generation step of described second index file comprises the steps A to F:

A. determining whether the information document contains a next word; if not, finishing the generation of this second index file; if so, going to step B;

B. determining whether the B-tree of this second index file already has a node for this word; if so, going to step C; if not, going to step F;

C. determining whether the linked list the word points to contains the identifier of this information document; if so, going to step D; if not, going to step E;

D. adding 1 to the word frequency recorded for this information document in the linked list the word points to;

E. adding the identifier of this information document to the linked list the word points to, and setting its word frequency to 1;

F. adding a node for this word to the B-tree of this second index file, adding the identifier of this information document to its linked list, and setting the word frequency to 1;

when an advanced search is carried out, decomposing the input query condition; searching the first index file for the first sub-condition to obtain the identifiers of the information documents that satisfy the first sub-condition; and searching, according to the second sub-condition, in the second index files corresponding to the information documents that satisfy the first sub-condition, to obtain the relevant information documents.

2. the method for search data as claimed in claim 1, it is characterized in that, if when satisfying first subconditional information document, adopt multitask in described second index file that satisfies first subconditional information document correspondence, to retrieve respectively greater than one.

3. the method for search data as claimed in claim 1 is characterized in that, also comprises:

If there is next sub-condition, then, on satisfying, retrieve in pairing second index file of subconditional information document according to next sub-condition.

4. the method for the search data described in claim 1,2 or 3 is characterized in that, described first index file or second index file adopt the B tree as data structure.

5. the method for search data as claimed in claim 4 is characterized in that, also comprises:

The number of times that occurs in information document according to querying condition sorts to result for retrieval and exports this result for retrieval.

6. A search system, characterized by comprising: