CN102779160A - Mass data information indexing system and indexing construction method

Info

Publication number: CN102779160A
Application number: CN2012101997297A
Authority: CN
Inventors: 安旭; 叶嘉明; 陈德全
Original assignee: CENTRIN DATA SYSTEMS CO LTD
Current assignee: China Gold Data Systems Co., Ltd.; Yantai gold Data System Co., Ltd.
Priority date: 2012-06-14
Filing date: 2012-06-14
Publication date: 2012-11-14
Anticipated expiration: 2032-06-14
Also published as: CN102779160B

Abstract

The invention relates to a mass data information indexing system and a construction method. The system comprises a data distribution server cluster, an index-creation server cluster, a retrieval server cluster and a result-merging server cluster. The data distribution server cluster comprises a plurality of data distribution servers and is used to split the data to be indexed and distribute it to the index-creation servers. The index-creation server cluster comprises a plurality of index-creation servers and is used to receive the data distributed by the data distribution servers and build indexes for it. The retrieval server cluster comprises a plurality of retrieval servers and is used to receive the indexes built by the index-creation servers and search them. The result-merging server cluster comprises a plurality of result-merging servers and is used to receive search conditions and to receive and merge the results retrieved by the retrieval servers. The system allocates servers according to the functions that must be performed during indexing, which avoids resource contention. Once an error occurs during indexing, the faulty server can be located quickly from the cause of the error. The system is easy to maintain and reduces maintenance and operating costs.

Description

Mass data information indexing system and index construction method

Technical field

The present invention relates to a data indexing system and a construction method, and in particular to a mass data information indexing system and an index construction method.

Background technology

With the continuing development of technology, data volumes keep growing; with the emergence of cloud computing in particular, the volume of centralized data has become enormous. Indexing is therefore of great importance for quickly finding the specific data needed in a vast sea of data.

Chinese patent document CN101576915B discloses a distributed B+ tree index system and construction method. Specifically, it comprises a master server, a task server cluster, an index server cluster and a version control server. The task server cluster comprises a plurality of task servers, and the index server cluster comprises a plurality of index servers. The master server manages the metadata and performs load-balancing scheduling across the index server cluster; the task server cluster is responsible for transaction control over access to the index data of the distributed file system; and the index server cluster manages, reads and writes the index data of the distributed file system, thereby effectively implementing transactional functionality for index data in a concurrent environment.

The indexing technique disclosed in the above patent document merely builds the index on one or more index servers, and both index creation and index querying are carried out on those same servers. Because index-creation tasks and index-query tasks run on the same one or more index servers, they may contend for resources, leaving the index servers short of resources and reducing the efficiency of retrieval or index creation. Moreover, once the retrieval process goes wrong, it is impossible to determine which stage of retrieval has failed, which makes repair difficult.

Summary of the invention

To this end, the technical problem to be solved by the present invention is that performing all indexing functions on one or more index servers leaves the retrieval servers short of resources and makes repair difficult; a mass data information indexing system and an index construction method are provided to address it.

To solve the above technical problem, the present invention adopts the following technical solution:

A mass data information indexing system, comprising:

a data distribution server cluster, comprising a plurality of data distribution servers, used to split and distribute the data to be indexed;

an index-creation server cluster, comprising a plurality of index-creation servers, each of which receives the data distributed by the data distribution servers and creates an index for that data;

a retrieval server cluster, comprising a plurality of retrieval servers, which receive the indexes created by the index-creation servers and search the data indexes according to a search condition;

a result-merging server cluster, comprising a plurality of result-merging servers, which receive the search condition and receive and merge the results retrieved by the retrieval servers.

Each retrieval server includes a backup area used to back up the data retrieved by that retrieval server.

The system further comprises a backup data server cluster, comprising a plurality of backup data servers, which receive and back up the data on the data distribution servers, the index-creation servers, the retrieval servers and the result-merging servers.

The index created by the index-creation servers is an inverted index.

The data distribution server stores a distribution file that implements the data distribution function; the distribution file records the locations and information of all index-creation servers, retrieval servers and result-merging servers.

The distribution file is an XML file.
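As an illustration only, a distribution file of this kind could be read as sketched below. The source does not disclose the file's layout, so the element name `server`, the attributes `role`, `host` and `port`, and all host names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element names, attribute names and hosts are
# illustrative assumptions, not taken from the source.
DISTRIBUTION_XML = """
<cluster>
  <server role="index-creation" host="creator-01" port="8080"/>
  <server role="index-creation" host="creator-02" port="8080"/>
  <server role="retrieval" host="retrieval-01" port="8081"/>
  <server role="result-merging" host="merger-01" port="8082"/>
</cluster>
"""


def load_roles(xml_text):
    """Group (host, port) pairs by the role recorded for each server."""
    roles = {}
    for node in ET.fromstring(xml_text.strip()).iter("server"):
        roles.setdefault(node.get("role"), []).append(
            (node.get("host"), int(node.get("port")))
        )
    return roles


roles = load_roles(DISTRIBUTION_XML)
print(roles["index-creation"])
```

Grouping by role lets the data distribution server count the index-creation servers before splitting a file among them.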

A construction method for a mass data information index is also provided, comprising the following steps:

1. dividing the servers used for retrieval by function into data distribution servers, index-creation servers, retrieval servers and result-merging servers, and provisioning the number of servers for each function according to that function's computational load;

2. placing the data to be indexed on a data distribution server in the form of a file;

running the index-building script, which splits the file according to the number of index-creation servers and distributes the parts to the index-creation servers;

3. each index-creation server receives its file, builds an index for it, merges that index with the previously built index, and finally transfers the newly built index to the retrieval servers;

4. a search condition is sent to a result-merging server;

5. the result-merging server sends the search condition to all retrieval servers;

6. each retrieval server receives the search condition, searches the index that has been built, and returns its search results to the result-merging server;

7. the result-merging server merges the data after receiving all search results, and returns the merged result to the user who issued the search.

The method further comprises a step of backing up the data produced in each of the above steps.

The index created in step 3 is an inverted index.

In step 1, the division of server functions is implemented by storing on the data distribution server a distribution file that records the location information of the data distribution servers, index-creation servers, retrieval servers and result-merging servers.

The distribution file is an XML file; during execution, the XML file is parsed and the information for each function is distributed via SSH to the servers specified in the XML file.

Transmission between the steps is carried out over the HTTP protocol.

Compared with the prior art, the above technical solution of the present invention has the following advantages:

The mass data information indexing system and method of the present invention allocate servers according to the functions that must be performed during indexing, and configure different servers to perform different retrieval functions, which avoids resource contention. At the same time, the responsibility of each server becomes clearer; each server cluster can be configured specifically for the characteristics of its function, which helps improve retrieval efficiency; and once an error occurs during retrieval, the faulty server can be located quickly from the cause of the error, which simplifies maintenance and reduces maintenance and operating costs.

Description of drawings

To make the content of the present invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawing, in which:

Fig. 1 is a structural schematic diagram of the mass data information indexing system according to one embodiment of the invention;

The reference numerals denote: 1 - data distribution server, 2 - index-creation server, 3 - retrieval server, 4 - result-merging server.

Embodiment

Fig. 1 shows the mass data information indexing system of one embodiment of the present invention, which comprises a data distribution server cluster comprising two data distribution servers 1, used to split the data to be indexed and distribute it to the index-creation servers 2;

an index-creation server cluster, comprising four index-creation servers 2, which receive the data distributed by the data distribution servers 1 and create an inverted index for that data. The inverted index arises from the practical need to look up records by the value of an attribute: each entry in such an index table contains an attribute value and the addresses of all records having that attribute value. Because the attribute value determines the position of the record, rather than the record determining the attribute value, it is called an inverted index;

a retrieval server cluster, comprising four retrieval servers 3, which receive the data indexes created by the index-creation servers 2 and search the data indexes according to a search condition; each retrieval server 3 includes a backup area used to store the data retrieved by that retrieval server 3;

a result-merging server cluster, comprising two result-merging servers 4, which receive the results retrieved by the retrieval servers 3 and merge them.

As a variation of the above embodiment, the system further comprises a backup data server cluster comprising two backup data servers, which receive and back up the data on the data distribution servers 1, the index-creation servers 2, the retrieval servers 3 and the result-merging servers 4. In this variation the retrieval servers 3 have no backup function of their own; everything else is the same as in the above embodiment. This variation likewise achieves the object of the invention and falls within its scope of protection.

The XML file in the above embodiment may instead be a YAML (YAML Ain't Markup Language) file or a JSON (JavaScript Object Notation, a data interchange format) file.
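For the JSON alternative, a minimal sketch of a server layout file might look as follows. The key names and host names are hypothetical; only the server counts (two, four, four, two) come from this embodiment.

```python
import json

# Hypothetical JSON layout: role names map to lists of host names.
# The keys and hosts are assumptions; the counts follow the embodiment.
DISTRIBUTION_JSON = """
{
  "data-distribution": ["distributor-01", "distributor-02"],
  "index-creation": ["creator-01", "creator-02", "creator-03", "creator-04"],
  "retrieval": ["retrieval-01", "retrieval-02", "retrieval-03", "retrieval-04"],
  "result-merging": ["merger-01", "merger-02"]
}
"""

layout = json.loads(DISTRIBUTION_JSON)

# Number of servers provisioned for each function.
counts = {role: len(hosts) for role, hosts in layout.items()}
print(counts)
```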

The mass data information indexing system of the above embodiment provisions the number of servers according to the workload each function requires, which maximizes resource utilization and helps improve retrieval efficiency. At the same time, having different servers perform different functions avoids resource contention. Once an error occurs during retrieval, the faulty server can be located quickly from the cause of the error, which simplifies maintenance and reduces maintenance and operating costs. In addition, the system can dynamically add servers for a given function according to the retrieval workload, so it scales well.

The present invention also provides a construction method for the mass data information index of the above embodiment, comprising the following steps:

1. An XML file recording the location information of the data distribution servers 1, index-creation servers 2, retrieval servers 3 and result-merging servers 4 is stored on a data distribution server 1. The servers used for retrieval are divided by function into data distribution servers 1, index-creation servers 2, retrieval servers 3 and result-merging servers 4, and the number of servers for each function is provisioned according to that function's computational load. This embodiment comprises two data distribution servers 1, four index-creation servers 2, four retrieval servers 3, two result-merging servers 4 and two backup data servers. During execution, the XML file is parsed and the information for each function is distributed via SSH to the servers specified in the XML file. Here XML stands for Extensible Markup Language, and SSH stands for Secure Shell, a security protocol built on the application and transport layers. SSH may be replaced by any existing transfer protocol, such as HTTP; SSH is chosen because it is relatively simple and secure;

2. The data to be indexed is placed on a data distribution server 1 in the form of a file. The index-building script is run; it splits the file according to the number of index-creation servers 2 and distributes each part to an index-creation server 2 by HTTP request, where HTTP stands for Hypertext Transfer Protocol. The data to be indexed must be formatted: specifically, a data schema is defined on the data distribution server 1, giving the column information (name and type) of the data to be indexed, including an ID attribute that distinguishes each record. HTTP may also be replaced by any transfer protocol of the prior art, such as FTP (File Transfer Protocol);

3. Each index-creation server 2 receives its file, builds an inverted index for it, merges that index with the previously built index, and finally transfers the newly built index to the retrieval servers 3 by HTTP request;

4. A search condition is sent to a result-merging server 4; the data to be searched must be formatted data;

5. The result-merging server 4 sends the search condition to all retrieval servers 3 by HTTP request;

6. Each retrieval server 3 receives the search condition, searches the index that has been built, returns its search results to the result-merging server 4, and at the same time backs up the search results on the retrieval server 3;

7. The result-merging server 4 merges the data after receiving the search results of all HTTP responses, and returns the merged result to the user who issued the search.
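Steps 5 to 7 describe a scatter-gather pattern: the search condition goes to every retrieval server, and merging starts only after all replies have arrived. The sketch below illustrates that control flow with in-process stubs in place of real HTTP calls; the stub factory, the sample index contents and the thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_retrieval_server(local_index):
    """Return a stub standing in for one retrieval server's HTTP endpoint."""
    def search(field, value):
        return sorted(local_index.get(field, {}).get(value, set()))
    return search


# Two hypothetical retrieval servers, each holding part of the index.
servers = [
    make_retrieval_server({"filesize": {"1000M": {"001", "002"}}}),
    make_retrieval_server({"filesize": {"1000M": {"020"}, "5M": {"021"}}}),
]


def fan_out(field, value):
    """Query every server, wait for all replies, then merge them."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = list(pool.map(lambda s: s(field, value), servers))
    # Merge the partial ID lists into one sorted, duplicate-free list.
    return sorted(set().union(*partials))


print(fan_out("filesize", "1000M"))
```

Because `pool.map` yields results only as each call completes and `list(...)` drains it fully, the merge cannot begin until every stub has answered, which mirrors the wording of step 7.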

As a variation of the above method, whenever the data on any server changes, the changed index data is transferred to a backup server by HTTP request to prevent data loss.

The mass data information indexing system and construction method of the present invention are described in detail below with reference to a concrete example:

Take a file that actually needs to be processed as an example. The file contains 100 records, and the mass data information indexing system has five index-creation servers 2, five retrieval servers 3, one data distribution server 1 and one result-merging server 4:

The detailed process by which an index-creating user builds an index is as follows:

The file is placed on the data distribution server 1; the user then enters the directory on the data distribution server 1 in which the program is installed and runs the index-building command "sh bin/distdaemon.sh post {file location}".

The data distribution server 1 locates the file from the command argument and, through its "index splitting" module, splits the file into five parts; each part, called an "index data block", contains 20 records.

After the data distribution server 1 has split the file through the "index splitting" module, it distributes the five "index data blocks" in turn to the five index-creation servers 2.
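The splitting step can be sketched as follows. The source only gives the evenly divisible case (100 records, five servers, 20 records each); how a remainder would be handled is not stated, so giving the extra records to the first blocks is an assumption of this sketch, as are the function and record names.

```python
def split_into_blocks(records, server_count):
    """Split records into server_count contiguous blocks of near-equal size."""
    base, extra = divmod(len(records), server_count)
    blocks, start = [], 0
    for i in range(server_count):
        # Assumption: the first `extra` blocks take one additional record.
        size = base + (1 if i < extra else 0)
        blocks.append(records[start:start + size])
        start += size
    return blocks


# 100 placeholder records split for five index-creation servers.
records = [f"record-{n:03d}" for n in range(1, 101)]
blocks = split_into_blocks(records, 5)
print([len(block) for block in blocks])
```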

Take a single index-creation server 2 as an example (the other index-creation servers 2 process their blocks in the same way). When it receives the "index data block" (containing 20 records) sent to it by the data distribution server 1, it splits the block according to the column names and types defined in the previously defined schema file, obtaining an array in a hash-like format. For example, if the schema file defines the first column as the file name (filename) and the second column as the file size (filesize), and the first row of the "index data block" has file01 in the first column and 1000M in the second, then splitting the data yields an array { [0] => { 'filename' => 'file01', 'filesize' => '1000M' }, [1] => { ... }, ..., [20] => { ... } }.
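A minimal sketch of this schema-driven split is given below. The source does not say how columns are delimited within a block, so the tab delimiter, the in-memory form of the schema and the function name are assumptions.

```python
# Hypothetical schema contents: (column name, column type) in column order.
SCHEMA = [("filename", str), ("filesize", str)]


def parse_block(block_text, schema, delimiter="\t"):
    """Turn each line of an index data block into a {column: value} mapping."""
    rows = {}
    for position, line in enumerate(block_text.strip().splitlines()):
        values = line.split(delimiter)
        # Pair each value with its column name and convert it to the column type.
        rows[position] = {
            name: column_type(value)
            for (name, column_type), value in zip(schema, values)
        }
    return rows


block = "file01\t1000M\nfile02\t1000M\n"
rows = parse_block(block, SCHEMA)
print(rows[0])
```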

The hash-format array obtained is then converted into an "inverted index" by an inversion algorithm. For example, if the hash data is { '001' => { 'filename' => 'file01', 'filesize' => '1000M' }, '002' => { 'filename' => 'file02', 'filesize' => '1000M' }, ..., '020' => { 'filename' => 'file20', 'filesize' => '1000M' } }, then the form after "inversion" is { 'filename' => { 'file01' => { '001' }, 'file02' => { '002' }, ..., 'file20' => { '020' } }, 'filesize' => { '1000M' => { '001', '002', ..., '020' } }, ... }.
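The inversion can be sketched in a few lines. This is only one straightforward way to produce the structure shown in the example, not necessarily the algorithm used in the embodiment; the function name is hypothetical and only three of the twenty records are included.

```python
from collections import defaultdict


def invert(records):
    """Map each column name to {value: set of record IDs holding that value}."""
    index = defaultdict(lambda: defaultdict(set))
    for record_id, columns in records.items():
        for name, value in columns.items():
            index[name][value].add(record_id)
    # Convert the nested defaultdicts into plain dictionaries.
    return {name: dict(values) for name, values in index.items()}


records = {
    "001": {"filename": "file01", "filesize": "1000M"},
    "002": {"filename": "file02", "filesize": "1000M"},
    "020": {"filename": "file20", "filesize": "1000M"},
}
index = invert(records)
print(sorted(index["filesize"]["1000M"]))
```

Looking up a value now costs one dictionary access per column instead of a scan over every record, which is the point of inverting the table.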

After an index-creation server 2 has generated the inverted index, it merges the newly built index into the existing index through the "index merging" step.
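Assuming the inverted index is held as nested mappings from column name to value to a set of record IDs, as in the example above, merging amounts to a union of the ID sets. The source does not describe the merge procedure, so the sketch below is an assumption, as is its choice to return a new index rather than modify the existing one.

```python
def merge_index(existing, new):
    """Return a new index that unions the record-ID sets of both inputs."""
    # Copy the existing index so the caller's data is left untouched.
    merged = {
        name: {value: set(ids) for value, ids in values.items()}
        for name, values in existing.items()
    }
    for name, values in new.items():
        column = merged.setdefault(name, {})
        for value, ids in values.items():
            column.setdefault(value, set()).update(ids)
    return merged


existing = {"filesize": {"1000M": {"001", "002"}}}
new = {
    "filesize": {"1000M": {"020"}, "5M": {"021"}},
    "filename": {"file20": {"020"}},
}
merged = merge_index(existing, new)
print(sorted(merged["filesize"]["1000M"]))
```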

Finally, the newly built index is synchronized to the retrieval servers 3 over HTTP.

The detailed process by which a retrieval user queries the index data is as follows:

The retrieval user sends a search request to the result-merging server 4. For example, to retrieve all files whose size (filesize) is 1000M, the user sends a request to http://{merger-host}/{port}/{path}/{fn}?q=filesize%3A1000M.

After the result-merging server 4 receives the user's request, it forwards the request to each of the five retrieval servers 3, for example:

http://{slave-host}/{port}/{path}/{fn}?q=filesize%3A1000M
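The `%3A` in these URLs is the percent-encoded colon of the condition `filesize:1000M`. A sketch of building one such URL per retrieval server follows; it reproduces the URL shape shown in the source, while the host names, port, path and function name filled into the placeholders are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(host, port, path, fn, field, value):
    """Build a search URL in the shape shown in the source example."""
    # urlencode percent-encodes the colon in "field:value" as %3A.
    query = urlencode({"q": f"{field}:{value}"})
    return f"http://{host}/{port}/{path}/{fn}?{query}"


# Hypothetical host names for the five retrieval servers.
hosts = [f"slave-{n:02d}" for n in range(1, 6)]
urls = [
    build_search_url(host, "8080", "search", "query", "filesize", "1000M")
    for host in hosts
]
print(urls[0])
```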

Take a single retrieval server 3 as an example (the other retrieval servers 3 process the request in the same way). After receiving the request forwarded by the result-merging server 4, it parses the request and obtains the search condition that filesize equals 1000M.
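Parsing the request can be sketched with the standard library as below. Treating the `q` parameter as a single `field:value` pair is an assumption drawn from the one example given; the source mentions more complex conditions elsewhere but does not show their syntax. The URL used here is a hypothetical instance of the pattern above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_condition(url):
    """Extract (field, value) from a URL whose q parameter is 'field:value'."""
    query = parse_qs(urlsplit(url).query)  # parse_qs decodes %3A back to ':'
    field, _, value = query["q"][0].partition(":")
    return field, value


condition = parse_condition(
    "http://slave-01/8080/search/query?q=filesize%3A1000M"
)
print(condition)
```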

It then searches the existing "inverted index" with this condition. For example, it obtains the result { '001' => { 'filename' => 'file01', 'filesize' => '1000M' }, '002' => { 'filename' => 'file02', 'filesize' => '1000M' }, ..., '020' => { 'filename' => 'file20', 'filesize' => '1000M' } } and returns it to the result-merging server 4.
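A sketch of this lookup, assuming the retrieval server keeps both the inverted index and the records keyed by ID in memory (the source does not say how either is stored). The function name and the extra record 021 are illustrative.

```python
def search(index, records, field, value):
    """Return the records whose `field` equals `value`, keyed by record ID."""
    ids = index.get(field, {}).get(value, set())
    return {record_id: records[record_id] for record_id in sorted(ids)}


index = {"filesize": {"1000M": {"001", "002", "020"}, "5M": {"021"}}}
records = {
    "001": {"filename": "file01", "filesize": "1000M"},
    "002": {"filename": "file02", "filesize": "1000M"},
    "020": {"filename": "file20", "filesize": "1000M"},
    "021": {"filename": "file21", "filesize": "5M"},
}
hits = search(index, records, "filesize", "1000M")
print(list(hits))
```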

The result-merging server 4 obtains the data returned by the five retrieval servers 3 and sorts and merges them.
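The merge step can be sketched as combining the partial results and ordering them. Sorting by record ID is an assumption, since the source does not name the sort key, and the sample partial results are illustrative.

```python
def merge_results(partials):
    """Combine per-server result mappings and order them by record ID."""
    combined = {}
    for partial in partials:
        combined.update(partial)  # the same ID from two servers collapses to one
    return sorted(combined.items(), key=lambda item: item[0])


# Hypothetical replies from three of the retrieval servers.
partials = [
    {"020": {"filename": "file20", "filesize": "1000M"}},
    {
        "001": {"filename": "file01", "filesize": "1000M"},
        "002": {"filename": "file02", "filesize": "1000M"},
    },
    {},  # a server with no matching records
]
merged = merge_results(partials)
print([record_id for record_id, _ in merged])
```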

After merging is complete, the result is returned to the client and the search ends.

In actual tests:

One data distribution server 1 and three index-creation servers 2 built an index over 50 million records, each record having roughly 20 to 30 data features; index creation completed in 30 minutes;

With one result-merging server 4 and five retrieval servers 3 holding an index of about 100 million entries, and 100 users searching concurrently with complex search conditions including range queries and sorting, the average response time was within three seconds.
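The first test figure implies an aggregate indexing rate, worked out here as plain arithmetic. The per-server figure assumes the work was spread evenly over the three index-creation servers, which the source does not state.

```python
records = 50_000_000   # records indexed in the first test
seconds = 30 * 60      # 30 minutes
creators = 3           # index-creation servers used

total_rate = records / seconds        # records per second, all servers
per_server = total_rate / creators    # assumes an even split of the work

print(round(total_rate), round(per_server))
```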

Obviously, the above embodiments are merely examples given for clarity of description and do not limit the embodiments. Those of ordinary skill in the art can make other variations or modifications in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims

1. A mass data information indexing system, characterized in that it comprises:

a result-merging server cluster, comprising a plurality of result-merging servers, which receive and merge the results retrieved by the retrieval servers.

2. The mass data information indexing system according to claim 1, characterized in that each retrieval server includes a backup area used to back up the data retrieved by that retrieval server.

3. The mass data information indexing system according to claim 1, characterized in that it further comprises a backup data server cluster, comprising a plurality of backup data servers, which receive and back up the data on the data distribution servers, the index-creation servers, the retrieval servers and the result-merging servers.

4. The mass data information indexing system according to any one of claims 1-3, characterized in that the index created by the index-creation servers is an inverted index.

5. The mass data information indexing system according to any one of claims 1-4, characterized in that the data distribution server stores a distribution file that implements the data distribution function, and the distribution file records the locations and information of all index-creation servers, retrieval servers and result-merging servers.

6. The mass data information indexing system according to claim 5, characterized in that the distribution file is an XML file.

7. A construction method for the mass data information index of any one of claims 1-6, characterized in that it comprises the following steps:

4. a search condition is sent to a result-merging server;

8. The mass data information index construction method according to claim 7, characterized in that it further comprises a step of backing up the data produced in each step.

9. The mass data information index construction method according to claim 7 or 8, characterized in that the index created in step 3 is an inverted index.

10. The mass data information index construction method according to any one of claims 7-9, characterized in that in step 1 the division of server functions is implemented by storing on the data distribution server a distribution file that records the location information of the data distribution servers, index-creation servers, retrieval servers and result-merging servers.

11. The mass data information index construction method according to claim 10, characterized in that the distribution file is an XML file; during execution, the XML file is parsed and the information for each function is distributed via SSH to the servers specified in the XML file.

12. The mass data information index construction method according to any one of claims 7-11, characterized in that transmission between the steps is carried out over the HTTP protocol.