CN104008126A

CN104008126A - Method and device for segmentation on basis of webpage content classification

Info

Publication number: CN104008126A
Application number: CN201410126465.1A
Authority: CN
Inventors: 项碧波
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2014-08-27
Also published as: WO2015149533A1

Abstract

Embodiments of the invention provide a method and a device for word segmentation based on webpage content classification. The method comprises: extracting the text information of webpage content in search resources; determining the category of the text information according to the category of the webpage content; and segmenting the text information according to the segmentation dictionary corresponding to that category. Because the text information is classified and then segmented with a category-specific dictionary, segmentation adapts better to the language characteristics of each category, segmentation accuracy within each category improves, and local segmentation is optimized. More accurate segmentation also matches the user's intent more closely, which improves the user experience, reduces repeated input and repeated searches, simplifies operation, reduces the number of user operations the device must respond to, and lowers the consumption of device system resources.

Description

A method and device for word segmentation based on webpage content classification

Technical field

The present invention relates to the technical field of search, and in particular to a method and a device for word segmentation based on webpage content classification.

Background technology

With the rapid development of the Internet and the growing diversity of network applications, the amount of information online has increased sharply.

In many situations, users need to enter key information to obtain related information. For example, they enter keywords in a search engine to find webpages, or enter keywords in a forum to find posts.

Word segmentation is the foundation of information processing and information retrieval; all such work is carried out after segmentation. Segmentation errors therefore propagate into subsequent processing and are difficult to eliminate, which is why improving segmentation accuracy is a continuing effort. At the same time, inherent characteristics of the Chinese language (there is no clear definition of a word, there are no separators between words, and new words and proper nouns keep emerging) make 100% accurate segmentation hard to achieve.

The main method used by current word segmentation systems is statistical segmentation. Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects the credibility of a word reasonably well. The frequency of each combination of adjacently co-occurring characters in a corpus can be counted and their mutual information computed. The mutual information of two characters is defined from the adjacent co-occurrence probability of two Chinese characters X and Y, and it expresses how tightly the characters are bound together. When this tightness exceeds a certain threshold, the character group can be considered a possible word. This method only needs to count character-group frequencies in the corpus, but it has limitations: it often extracts frequently co-occurring character groups that are not words, such as "this", "one of", "having", "I", and "many"; its recognition accuracy for common words is poor; and its time and space overhead is large.
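The statistic described above can be sketched as follows. This is a minimal illustration, assuming the common pointwise definition of mutual information, log2(P(XY) / (P(X) P(Y))); the text does not give the exact formula, and the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def mutual_information(corpus, x, y):
    """Pointwise mutual information of character x immediately followed by y.

    corpus is an iterable of strings. The formula is an assumed, commonly
    used definition, not one quoted from the text.
    """
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)                       # single-character counts
        pairs.update(zip(sentence, sentence[1:]))    # adjacent-pair counts
    if pairs[(x, y)] == 0:
        return float("-inf")                         # the pair never co-occurs
    p_xy = pairs[(x, y)] / sum(pairs.values())
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    return math.log2(p_xy / (p_x * p_y))


# A character group is treated as a candidate word when the score
# exceeds a chosen threshold (the threshold value is an assumption).
score = mutual_information(["abab", "abcd"], "a", "b")
is_candidate_word = score > 1.0
```

Counting pairs this way is cheap, but as the paragraph above notes, a high score alone does not guarantee that the character group is a word.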

On the one hand, incorrect segmentation causes the related information obtained later to differ greatly from what was expected, so the user experience is poor and device system resources are wasted. On the other hand, a user who still needs the related information will enter the key information and search again, and the device must again search, compare, and filter massive amounts of information to obtain results relevant to the search keywords. This makes the user's operation more cumbersome and time-consuming, and it greatly increases the burden on the device and consumes more device resources.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and a corresponding device for word segmentation based on webpage content classification that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for word segmentation based on webpage content classification is provided, comprising:

Extracting the text information of webpage content in search resources;

Determining the category to which the text information belongs according to the category of the webpage content;

Performing word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.

Optionally, the method further comprises:

For the category, building an inverted index from the first segmented words obtained by the word segmentation.

Optionally, the step of building, for the category, an inverted index from the first segmented words obtained by the word segmentation comprises:

For the category, recording the positions at which the first segmented words corresponding to the category occur;

Recording the first segmented words and their corresponding positions in the inverted index.

Optionally, the segmentation dictionary is generated as follows:

Obtaining first training documents;

Determining the category to which each first training document belongs;

Performing word segmentation on the first training documents corresponding to the category to obtain second segmented words;

Counting the word frequency counts and the first co-occurrence rates of the second segmented words corresponding to the category;

Generating the segmentation dictionary corresponding to the category from the second segmented words and their first co-occurrence rates.

Optionally, the first co-occurrence rate comprises the ratio of a first word frequency count to a second word frequency count;

Wherein the first word frequency count comprises the number of times the current second segmented word appears after the target second segmented word, and the target second segmented word comprises one or more second segmented words that appear before the current second segmented word;

The second word frequency count comprises the total word frequency count of the target second segmented word.

Optionally, the segmentation dictionary is updated as follows:

Obtaining second training documents;

Determining the category to which each second training document belongs;

Performing word segmentation on the text information according to the segmentation dictionary corresponding to the category to obtain third segmented words;

Counting the word frequency counts and the second co-occurrence rates of the third segmented words corresponding to the category;

Updating the segmentation dictionary corresponding to the category with the third segmented words and their second co-occurrence rates.

Optionally, the second co-occurrence rate comprises the ratio of a third word frequency count to a fourth word frequency count;

Wherein the third word frequency count comprises the number of times the current third segmented word appears after the target third segmented word, and the target third segmented word comprises one or more third segmented words that appear before the current third segmented word;

The fourth word frequency count comprises the total word frequency count of the target third segmented word.

Optionally, the step of performing word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs comprises:

Performing word segmentation on the text information according to both the segmentation dictionary corresponding to the category to which the text information belongs and a general dictionary;

Taking the segmented word with the highest word frequency count after segmentation as the first segmented word obtained by the word segmentation.

Optionally, the step of performing word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs comprises:

When the text information belongs to multiple categories, performing word segmentation on the text information separately according to the segmentation dictionary corresponding to each category;

According to another aspect of the present invention, a device for word segmentation based on webpage content classification is provided, comprising:

An extraction module, adapted to extract the text information of webpage content in search resources;

A classification module, adapted to determine the category to which the text information belongs according to the category of the webpage content;

A word segmentation module, adapted to perform word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.

Optionally, the device further comprises:

A building module, adapted to build, for the category, an inverted index from the first segmented words obtained by the word segmentation.

Optionally, the building module is further adapted to:

Obtain first training documents;

Determine the category to which each first training document belongs;

Obtain second training documents;

Determine the category to which each second training document belongs;

Optionally, the word segmentation module is further adapted to:

Embodiments of the present invention classify the text information of webpage content in search resources and segment the text information with the segmentation dictionary of its category. This adapts better to the language characteristics of different categories, improves segmentation accuracy within each category, and optimizes local segmentation. In addition, more accurate segmentation matches the user's intent more closely and improves the user experience, which in turn reduces repeated input and repeated searches, simplifies operation, reduces the number of user operations the device must respond to, and lowers the consumption of device system resources.

Embodiments of the present invention classify the text information of webpage content in search resources, segment it with the segmentation dictionary of its category, and then build an inverted index from the first segmented words obtained by the segmentation. This avoids the uniformity and one-sidedness of an inverted index built over all text information, improves the accuracy of the inverted index within each category, and in turn improves index efficiency and reduces index time. Moreover, the text information of webpages in search resources includes new, unusual, and special text of all kinds that matches the language characteristics of each category. Using the collective knowledge of others gathered in the search resources makes up for the shortcomings of manually defined and manually maintained dictionaries and greatly reduces manual operating costs.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 shows a flow chart of the steps of a method embodiment for word segmentation based on webpage content classification according to one embodiment of the present invention; and

Fig. 2 shows a structural block diagram of a device embodiment for word segmentation based on webpage content classification according to one embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

Referring to Fig. 1, a flow chart of the steps of a method embodiment for word segmentation based on webpage content classification according to one embodiment of the present invention is shown. The method may comprise the following steps:

Step 101: extract the text information of webpage content in search resources;

The processing flow of a search engine can generally be divided into two parts: the first is front-end handling of user requests, and the second is back-end data preparation.

I. Front-end handling of user requests may comprise:

1. The user enters keywords;

2. Query analysis: the search engine segments the keywords;

3. Retrieval: according to the segmentation result, the relevant set of webpages is found in the index prepared in advance;

4. Ranking: the candidate webpages are ranked along dimensions such as content relevance and timeliness;

5. Presentation: the ranked webpages are presented.

II. Back-end data preparation may comprise:

1. Webpage crawling: a crawler follows the links between webpages to fetch and save webpages from the Internet;

2. Index building: the saved webpages are analyzed, the webpage titles and body text are segmented, and an inverted index is built from the segmentation results for front-end retrieval.

The webpages fetched by the crawler can be stored in a webpage database, forming a large body of search resources, and webpage content can contain a large amount of text information. In embodiments of the present invention, the text information of webpage content in the search resources can be extracted from the webpage database.

Step 102: determine the category to which the text information belongs according to the category of the webpage content;

In one case, the webpage category can be obtained from the webpage's URL, and the category of the text information is then determined from the webpage category. For example, URLs of animation webpages generally carry a domain label such as "comic", and URLs of sports webpages generally carry a domain label such as "sports", as in comic.XXX.com and sports.XXX.com. When the "comic" or "sports" label is detected in a webpage's URL, the webpage category can be identified as animation or sports, and the text information can then be assigned to the animation or sports category.
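The URL-based case can be sketched as follows. This is a minimal illustration only: the mapping table, the function name, and the example hosts are hypothetical, and a real system would need a much larger and more carefully curated mapping.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a domain label to a webpage category.
DOMAIN_LABEL_CATEGORIES = {
    "comic": "animation",
    "sports": "sports",
}


def category_from_url(url):
    """Return the category suggested by a domain label, or None if none matches."""
    host = urlparse(url).hostname or ""
    for label in host.split("."):
        category = DOMAIN_LABEL_CATEGORIES.get(label)
        if category is not None:
            return category
    return None


page_category = category_from_url("http://comic.example.com/list/1")
```

Text information extracted from the page would then be assigned to `page_category` when it is not `None`; otherwise one of the other signals (labels, title words, breadcrumb navigation) can be tried.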

In another case, the visited webpage usually carries label (tag) information, so the webpage category can be obtained from the label information, and the category of the text information is then determined from the webpage category. For example, if a webpage carries labels such as video, film, and comedy movie, its category can be identified as film, and the text information is then assigned to the film category.

In another case, the webpage category can be learned by analyzing specific words in the title (topic) of the webpage content. For example, a title containing specific words such as basketball, football, NBA, or World Cup indicates the sports category.

In another case, the webpage category can be learned from the webpage's navigation (for example, breadcrumb navigation, or Breadcrumb Trail), and the category of the text information is then determined from the webpage category. Breadcrumb navigation usually appears near the top of the page, generally below the title or the page header. It gives the user links back to any earlier page (these links are also a path by which the current page can be reached), usually the parent pages of the current page in the site hierarchy. Breadcrumb navigation offers the user a path back to the site home page or a portal page. The levels are usually separated by a greater-than sign (>), although some designs use other symbols (such as >>). For example, "Home > Category page > Subcategory page" or "Home >> Category page >> Subcategory page". The site's own classification of a webpage can be learned from its breadcrumb navigation. For example, if the breadcrumb navigation of a webpage is "XX portal > Sports > Chinese football > Chinese Super League", the corresponding webpage categories are Sports, Chinese football, and Chinese Super League, and the category that matches the categories used in the practical application of the embodiment is chosen.
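The breadcrumb case can be sketched as follows. The function names are hypothetical, and picking the deepest breadcrumb level that matches a known category is an assumption made for illustration; the text only says that a matching category is chosen.

```python
import re


def breadcrumb_categories(trail):
    """Split a breadcrumb trail on '>' or '>>' and drop the site home entry."""
    parts = [part.strip() for part in re.split(r">+", trail)]
    parts = [part for part in parts if part]
    return parts[1:]


def pick_category(trail, known_categories):
    """Return the deepest breadcrumb level that is a known category, else None."""
    for name in reversed(breadcrumb_categories(trail)):
        if name in known_categories:
            return name
    return None


trail = "XX portal > Sports > Chinese football > Chinese Super League"
chosen = pick_category(trail, {"Sports", "Chinese football"})
```

Here `chosen` is "Chinese football", because "Chinese Super League" is not among the categories assumed to have a dictionary in this example.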

It should be noted that embodiments of the present invention can set the category levels according to actual needs. For example, for sports, besides a category for the whole sports field, next-level categories such as basketball and football can be defined, and further next-level categories such as the NBA (National Basketball Association), the CBA (Chinese Basketball Association), La Liga (the Spanish football league), and the Chinese Super League (the Chinese football super league) can also be defined. Embodiments of the present invention place no limit on this.

Moreover, the finer the category levels, the lower the probability that the collected text information overlaps between categories and the more precise each category is, so the segmentation accuracy in embodiments of the present invention is also higher.

Step 103: perform word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.

In a specific implementation, each category may correspond to a specific segmentation dictionary, so that the text information is segmented into words that match the language characteristics of that category.

In a preferred embodiment of the present invention, the segmentation dictionary can be generated as follows:

Sub-step S11: obtain first training documents;

In embodiments of the present invention, a first training document can be the text information of a webpage in the search resources.

Sub-step S12: determine the category to which each first training document belongs;

In a specific implementation, when the first training document is the text information of a webpage in the search resources, its category can be determined from the category of that webpage.

It should be noted that sub-steps S11 and S12 are substantially similar to steps 101 and 102, so they are described only briefly; for the relevant details, refer to the description of steps 101 and 102, which is not repeated here.

Sub-step S13: perform word segmentation on the first training documents corresponding to the category to obtain second segmented words;

In applying embodiments of the present invention, a general segmentation dictionary can be used to segment the first training documents. A general segmentation dictionary is a segmentation dictionary for the general domain. It does not contain the technical terms of specific fields, such as Radix Angelicae Sinensis and Radix Aconiti Agrestis in the field of Chinese herbal medicine, and mainly contains general entries whose frequency of occurrence is above a preset threshold. Specifically, a general segmentation dictionary can contain general words and definite-meaning words. General words can include adjectives, conjunctions, and some verbs with general meanings, for example "glad", "but", and "participate". Definite-meaning words are words that express a definite range of meaning, usually nouns and verbs.

Sub-step S14: count the word frequency counts and the first co-occurrence rates of the second segmented words corresponding to the category;

In a preferred example of the embodiment of the present invention, an N-Gram model can be trained on the second segmented words.

In this example, the first co-occurrence rate can be the probability that two or more second segmented words occur together. Specifically, the first co-occurrence rate can comprise the ratio of the first word frequency count to the second word frequency count.

The N-Gram model is a language model commonly used in large-vocabulary continuous speech recognition. It is based on the Markov assumption: the occurrence of a word depends only on a limited number of words that precede it. For a sentence $T$ composed of the word sequence $W_1, W_2, W_3, \ldots, W_n$, the joint probability of the sentence is

$$P(T) = P(W_1 W_2 W_3 \cdots W_n) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_1 W_2 \cdots W_{n-1}).$$

If the occurrence of a word depends only on the one word that precedes it, the model is called a bigram, that is,

$$P(T) = P(W_1 W_2 W_3 \cdots W_n) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_1 W_2 \cdots W_{n-1}) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_2) \cdots P(W_n \mid W_{n-1}).$$

If the occurrence of a word depends only on the two words that precede it, the model is called a trigram. Practical applications of the N-Gram model mainly use bigrams and trigrams, and models of order four or higher are rarely used, because training a 4-gram model requires a much larger corpus, data sparsity is severe, time complexity is high, and precision improves very little.
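Under the bigram assumption, each conditional probability is commonly estimated from corpus counts by maximum likelihood. The following is a sketch of that standard estimate, where $C(\cdot)$ is a count function introduced here for illustration (it is not notation used in the text). It has the same form as the ratio that defines the first co-occurrence rate: the count of the current word appearing after the target word, divided by the total count of the target word.

```latex
P(W_i \mid W_{i-1}) \approx \frac{C(W_{i-1} W_i)}{C(W_{i-1})}
```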

The text information "I want to eat Chinese food lunch" is used as an example below:

The text information "I want to eat Chinese food lunch" in a certain category is segmented to obtain the second segmented words "I", "want", "to", "eat", "Chinese", "food", and "lunch". The word frequency counts of these second segmented words are shown in Table 1 and Table 2.

Table 1: Total word frequency counts of the second segmented words

Second segmented word    Total word frequency count
I                        3437
want                     1215
to                       3256
eat                      938
Chinese                  213
food                     1506
lunch                    459

Table 2: Word frequency counts of the current second segmented word (column) appearing after the target second segmented word (row)

           I     want   to    eat   Chinese   food   lunch
I          8     1087   0     13    0         0      0
want       3     0      786   0     6         8      6
to         3     0      10    860   3         0      12
eat        0     0      2     0     19        2      52
Chinese    2     0      0     0     0         120    1
food       19    0      17    0     0         0      0
lunch      4     0      0     0     0         1      0

For example, the 1087 in the second row and third column (counting the header row and header column) indicates that the current second segmented word "want" appears 1087 times after the target second segmented word "I".
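As a worked illustration of the ratio defined for the first co-occurrence rate, the sketch below computes it from the counts in Table 1 and Table 2. The variable and function names are hypothetical; only the numbers come from the tables.

```python
WORDS = ["I", "want", "to", "eat", "Chinese", "food", "lunch"]

# Table 1: total word frequency count of each second segmented word.
TOTAL = dict(zip(WORDS, [3437, 1215, 3256, 938, 213, 1506, 459]))

# Table 2: row = target word, column = current word that follows it.
FOLLOWS = {
    "I":       [8, 1087, 0, 13, 0, 0, 0],
    "want":    [3, 0, 786, 0, 6, 8, 6],
    "to":      [3, 0, 10, 860, 3, 0, 12],
    "eat":     [0, 0, 2, 0, 19, 2, 52],
    "Chinese": [2, 0, 0, 0, 0, 120, 1],
    "food":    [19, 0, 17, 0, 0, 0, 0],
    "lunch":   [4, 0, 0, 0, 0, 1, 0],
}


def first_co_occurrence_rate(target, current):
    """First word frequency count divided by second word frequency count."""
    first_count = FOLLOWS[target][WORDS.index(current)]   # current after target
    second_count = TOTAL[target]                          # total count of target
    return first_count / second_count


rate = first_co_occurrence_rate("I", "want")   # 1087 / 3437, about 0.316
```

So, with these counts, "want" follows "I" in roughly 32% of the occurrences of "I".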

Sub-step S15: generate the segmentation dictionary corresponding to the category from the second segmented words and their first co-occurrence rates.

In dictionary-based segmentation, commonly used lookup methods include hash lookup, TRIE tree lookup (a TRIE is also called a word lookup tree or key tree), binary search, and sequential search. Several lookup methods can be combined in actual segmentation. For example, a hash-based segmentation dictionary can combine hash lookup with binary search, and a TRIE-based segmentation dictionary can combine TRIE lookup with binary search. Combining lookup methods can improve lookup efficiency.

In embodiments of the present invention, the segmentation dictionary can be generated according to a lookup mechanism formed from one or more of hash lookup, TRIE lookup, binary search, sequential search, and similar methods, so that the dictionary supports that lookup mechanism.
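One of the lookup mechanisms mentioned above, the TRIE, can be sketched as follows. This is a minimal illustration that uses nested Python dicts as TRIE nodes; the class name, the method names, and the idea of storing a word frequency count at the end of each entry are assumptions, not details given in the text.

```python
_END = ""  # marker key for "an entry ends here"; never equal to a single character


class TrieDictionary:
    """A segmentation dictionary stored as a TRIE of nested dicts."""

    def __init__(self):
        self.root = {}

    def add(self, word, frequency):
        """Insert an entry together with its word frequency count."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = frequency

    def frequency(self, word):
        """Return the stored count for an entry, or None if it is not an entry."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(_END)


health = TrieDictionary()
health.add("ginseng", 120)
health.add("gin", 3)
```

A lookup walks one node per character, so its cost depends on the length of the candidate string rather than on the number of entries in the dictionary.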

In a preferred embodiment of the present invention, the segmentation dictionary can be updated as follows:

Sub-step S21: obtain second training documents;

In embodiments of the present invention, a second training document can be the text information of a webpage in the search resources.

Sub-step S22: determine the category to which each second training document belongs;

In a specific implementation, when the second training document is the text information of a webpage in the search resources, its category can be determined from the category of that webpage.

It should be noted that sub-steps S21 and S22 are substantially similar to steps 101 and 102, so they are described only briefly; for the relevant details, refer to the description of steps 101 and 102, which is not repeated here.

Sub-step S23: perform word segmentation on the text information according to the segmentation dictionary corresponding to the category to obtain third segmented words;

In applying embodiments of the present invention, the second training document can be segmented with the segmentation dictionary corresponding to its category. Given a character string to be segmented in the second training document, a substring is cut from the string according to some fixed rule, for example forward maximum matching (MM), reverse maximum matching (RMM), or bidirectional scanning. If the substring matches an entry in the segmentation dictionary, the substring is taken to be a third segmented word, a cut mark is inserted, and the remaining part continues to be segmented until the remainder is empty. Otherwise, the substring is not a third segmented word, and a new substring is cut from the string for the next match.
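Forward maximum matching, the first of the rules named above, can be sketched as follows. This is a minimal illustration: the function name and the sample dictionaries are hypothetical, and emitting a single character when no entry matches is an assumed fallback that the text does not specify.

```python
def forward_maximum_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink it by one character.
        for size in range(min(max_len, len(text) - start), 0, -1):
            piece = text[start:start + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)   # cut here and continue with the rest
                start += size
                break
    return tokens


health_entries = {"人参", "与", "当归"}      # hypothetical category dictionary
general_entries = {"人", "参与", "当归"}     # hypothetical general dictionary
by_category = forward_maximum_match("人参与当归", health_entries)
by_general = forward_maximum_match("人参与当归", general_entries)
```

With these sample dictionaries the same string is cut differently: `by_category` is 人参 / 与 / 当归, while `by_general` is 人 / 参与 / 当归.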

Sub-step S24: count the word frequency counts and the second co-occurrence rates of the third segmented words corresponding to the category;

In a preferred example of the embodiment of the present invention, an N-Gram model can be trained on the third segmented words.

In this example, the second co-occurrence rate can be the probability that two or more third segmented words occur together. Specifically, the second co-occurrence rate can comprise the ratio of the third word frequency count to the fourth word frequency count.

Sub-step S25: update the segmentation dictionary corresponding to the category with the third segmented words and their second co-occurrence rates.

In embodiments of the present invention, the segmentation dictionary can be updated according to a lookup mechanism formed from one or more of hash lookup, TRIE lookup, binary search, sequential search, and similar methods.

In a preferred embodiment of the present invention, the text information can be segmented directly with the segmentation dictionary corresponding to its category.

In a preferred embodiment of the present invention, step 103 can comprise the following sub-steps:

Sub-step S31: perform word segmentation on the text information according to both the segmentation dictionary corresponding to the category to which the text information belongs and a general dictionary;

Sub-step S32: take the segmented word with the highest word frequency count after segmentation as the first segmented word obtained by the word segmentation.

In embodiments of the present invention, the segmentation dictionary corresponding to the category of the text information and the general dictionary (general segmentation dictionary) can be used together to segment the text information.

For example, consider the text information "人参与当归" ("ginseng and angelica"). In the overall corpus, the word frequencies of "人" ("person") and "参与" ("participate") are bound to be greater than those of "人参" ("ginseng") and "与" ("and"), so segmentation based on the general dictionary cuts this text into "人", "参与", "当归" ("person", "participate", "angelica"), which is clearly wrong. The text "人参与当归" often appears in health care documents. If it is segmented with the segmentation dictionary of its health care category, it is cut into "人参", "与", "当归" ("ginseng", "and", "angelica"). Comparing the word frequency counts shows that the relative word frequency of "人参" in the health care category is significantly higher than its relative word frequency in the overall corpus, so "人参", "与", "当归" is finally chosen as the segmentation result of "人参与当归".
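The comparison in this example can be sketched as follows. This is only one possible reading of the selection rule: the function name, the decision rule (prefer the category segmentation when a word that only it produces is relatively more frequent in the category corpus than in the overall corpus), and all frequency values are assumptions made for illustration.

```python
def choose_segmentation(category_tokens, general_tokens,
                        category_freq, general_freq):
    """Pick between a category-dictionary and a general-dictionary segmentation.

    category_freq and general_freq map a word to its relative word frequency
    in the category corpus and in the overall corpus respectively.
    """
    # Words produced only by the category dictionary, such as "人参".
    distinctive = [t for t in category_tokens if t not in general_tokens]
    for token in distinctive:
        if category_freq.get(token, 0.0) > general_freq.get(token, 0.0):
            return category_tokens
    return general_tokens


# Hypothetical relative word frequencies, invented for this sketch.
health_freq = {"人参": 0.004, "与": 0.010}
overall_freq = {"人参": 0.0001, "与": 0.012}

result = choose_segmentation(["人参", "与", "当归"], ["人", "参与", "当归"],
                             health_freq, overall_freq)
```

With these invented numbers, "人参" is relatively far more frequent in the health care corpus than overall, so `result` is the category segmentation 人参 / 与 / 当归.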

In another preferred embodiment of the present invention, step 103 may comprise the following sub-steps:

Sub-step S41: when the text information belongs to multiple categories, perform word segmentation on the text information separately according to the segmentation dictionary corresponding to each of the categories;

Sub-step S42: take the segmented word with the highest word frequency count obtained after word segmentation as the first segmented word obtained by word segmentation.

In embodiments of the present invention, text information can be assigned to multiple categories, that is, it belongs to a cross domain. For example, text information about aircraft may be categorized under the mechanical field and also under the aviation field.

In such a cross-domain case, word segmentation can be performed on the text information separately with the segmentation dictionary of each category to which it belongs, and the result with the highest word frequency is finally taken as the segmentation result.

The main method used by current word segmentation systems is statistics-based segmentation. Briefly, it relies mainly on information such as word frequency and the transition probabilities between candidate words to decide where words are cut. Because the method is statistical, it necessarily satisfies the majority at the expense of the minority. In other words, it pursues the optimum in the overall statistical sense rather than the optimum of each local part, so the accuracy of segmentation in local cases is very low.

In a preferred embodiment of the present invention, the method may further comprise the following step:

Step 104: for the category, build an inverted index using the first segmented words obtained by word segmentation.

The inverted index arises from the practical need to look up records by the value of an attribute. Each entry in such an index table contains an attribute value and the addresses of all records having that attribute value. Because the attribute value is not determined by the record, but rather the position of the record is determined by the attribute value, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.

In an inverted file (inverted index), the indexed objects are the words in a document or a collection of documents (such as webpages). It stores the locations of these words in a document or a group of documents, and is a commonly used indexing mechanism for documents and document collections.

In a preferred embodiment of the present invention, step 104 may comprise the following sub-steps:

Sub-step S51: for the category, record the occurrence positions of the first segmented words corresponding to the category;

Sub-step S52: record the first segmented words and their corresponding occurrence positions in the inverted index.

In a specific implementation, the occurrence position of a first segmented word may comprise the webpage in which it occurs, or the webpage in which it occurs together with its position within that webpage.

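Taking English as an example, the following is the text information in the webpages to be indexed:

T1="it is what it is";

T2="what is it";

T3="it is a banana";

The following is the resulting inverted index, in which each pair gives the zero-based webpage number and the zero-based word position within that webpage: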

"a": {(2,2)}

"banana":{(2,3)}

"is": {(0,1),(0,4),(1,1),(2,1)}

"it": {(0,0),(0,3),(1,2),(2,0)}

"what":{(0,2),(1,0)}

Here, "banana": {(2,3)} means that "banana" occurs in the text information of the 3rd webpage (T3), where it is the 4th word (address 3).
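The index above can be reproduced with a short Python sketch. The function name `build_inverted_index` is hypothetical, and whitespace splitting stands in for the word segmentation step only because the example text is English. A real implementation would index the first segmented words produced by word segmentation instead.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to a list of (page number, word position) pairs.

    Both numbers are zero-based, matching the example index in the text.
    """
    index = defaultdict(list)
    for page_id, text in enumerate(pages):
        # Whitespace tokenization is an assumption made for this English example.
        for position, word in enumerate(text.split()):
            index[word].append((page_id, position))
    return dict(index)


pages = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(pages)
for word in sorted(index):
    print(word, index[word])
```

Looking up a word then reduces to a single dictionary access, for example `index["banana"]`, which yields the webpage and position where the word occurs.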

For simplicity of description, the method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Referring to Fig. 2, a structural block diagram of an embodiment of a device for performing word segmentation based on webpage content classification according to an embodiment of the present invention is shown. The device may comprise the following modules:

Extraction module 201, adapted to extract text information of webpage content in search resources;

Division module 202, adapted to determine the category to which the text information belongs according to the category of the webpage content;

Word segmentation module 203, adapted to perform word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.

In a preferred embodiment of the present invention, the device may further comprise the following module:

In a preferred embodiment of the present invention, the establishing module may further be adapted to:

Obtain a first training document;

Determine the category to which the first training document belongs;

In a preferred embodiment of the present invention, the first co-occurrence rate may comprise the ratio of a first word frequency count to a second word frequency count;

Obtain a second training document;

Determine the category to which the second training document belongs;

In a preferred embodiment of the present invention, the second co-occurrence rate may comprise the ratio of a third word frequency count to a fourth word frequency count;

In a preferred embodiment of the present invention, the word segmentation module 203 may further be adapted to:

Because the device embodiment is substantially similar to the method embodiment, its description is relatively brief. For relevant details, refer to the corresponding parts of the description of the method embodiment.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein can be implemented in various programming languages, and that the descriptions of specific languages above are given to disclose the best mode of the present invention.

In the specification provided here, numerous specific details are set forth. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention, features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single disclosed embodiment above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in the embodiments can be combined into one module or unit or component, and can furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

In addition, those skilled in the art will appreciate that, although some embodiments described herein include some features but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for performing word segmentation based on webpage content classification according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.

Claims

1. A method for performing word segmentation based on webpage content classification, comprising:

Extracting text information of webpage content in search resources;

2. The method of claim 1, characterized in that it further comprises:

3. The method of any one of claims 1 to 2, characterized in that the step of building, for the category, an inverted index using the first segmented words obtained by word segmentation comprises:

4. The method of any one of claims 1 to 3, characterized in that the segmentation dictionary is generated in the following manner:

Obtaining a first training document;

Determining the category to which the first training document belongs;

5. The method of any one of claims 1 to 4, characterized in that the first co-occurrence rate comprises the ratio of a first word frequency count to a second word frequency count;

6. The method of any one of claims 1 to 5, characterized in that the segmentation dictionary is updated in the following manner:

Obtaining a second training document;

Determining the category to which the second training document belongs;

7. The method of any one of claims 1 to 6, characterized in that the second co-occurrence rate comprises the ratio of a third word frequency count to a fourth word frequency count;

8. The method of any one of claims 1 to 7, characterized in that the step of performing word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs comprises:

9. The method of any one of claims 1 to 8, characterized in that the step of performing word segmentation on the text information according to the segmentation dictionary corresponding to the category to which the text information belongs comprises:

10. A device for performing word segmentation based on webpage content classification, comprising: