CN103049443A

CN103049443A - Method and device for mining hot-spot words

Info

Publication number: CN103049443A
Application number: CN2011103078466A
Authority: CN
Inventors: 罗侃; 陈洪亮; 杨志峰
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2011-10-12
Filing date: 2011-10-12
Publication date: 2013-04-17

Abstract

The invention discloses a method and a device for mining hot-spot words. The method includes: acquiring an input text stream; performing word segmentation on the text stream to obtain a candidate word set; counting the current frequency with which each candidate word in the candidate word set appears in the text stream, and acquiring the historical frequencies of each candidate word from pre-stored history data; and calculating a frequency abnormality value for each candidate word from its current frequency and its historical frequencies, storing the current frequency information of the candidate word in the history data, and outputting a preset number of candidate words with abnormal frequencies. The method and the device extend the mining range of hot-spot words and improve the efficiency of mining them.

Description

Method and device for mining hot-spot words

Technical field

The present invention relates to computer communication technology, and in particular to a method and a device for mining hot-spot words.

Background art

With the development of computer communication technology, and especially of 3G networks and intelligent mobile terminals, users' online lives have become increasingly rich. Chatting, reading news, watching films, playing games, searching, shopping, and publishing information online have increasingly become part of online life. For example, the microblog (MicroBlog) is a platform for sharing, spreading, and obtaining information based on user relationships: users can set up personal communities through WEB, WAP, and various clients, post updates of about 140 characters, and share them instantly.

Because web content is so abundant, network users spend more and more time finding relevant information in it. To improve users' network experience, operators use hot-spot word mining to acquire the latest news automatically and recommend it to network users in time. For example, the hot-spot words contained in the text stream input to a microblog are identified automatically, and hot information is recommended to the users who follow it. This improves the network service and effectively reduces the time users need to obtain hot information.

Fig. 1 is a schematic flowchart of an existing method for mining hot-spot words. Referring to Fig. 1, the flow includes:

Step 101: acquire the input text stream;

In this step, the content of web pages and microblogs is processed to obtain the corresponding text stream. The text stream may be acquired according to a predefined time period, or at random.

Step 102: perform word segmentation on the text stream to obtain a candidate word set;

In this step, word segmentation is performed on the text stream to obtain the words it contains; for details, refer to the related technical literature.

Step 103: match the obtained candidate word set against a preset hot-spot word vocabulary to obtain a hot-spot candidate word set, and count the frequency of each hot-spot candidate word;

In this step, words of interest that may appear in hot events, such as earthquake, fire, speech, accident, Beijing, tourism, and shopping, are collected and sorted manually in advance to form the hot-spot word vocabulary.

After the text stream is input, the candidate word set obtained by word segmentation is matched against the hot-spot word vocabulary. If a candidate word in the candidate word set is included in the hot-spot word vocabulary, it is placed in the hot-spot candidate word set as a hot-spot candidate word, and the number of times or the frequency with which it appears in the candidate word set is counted. That is, the frequency of the segmented words that appear in the hot-spot word vocabulary is counted.

Step 104: select the preset number of hot-spot candidate words with the highest frequencies and output them as hot-spot words.

In this step, the N hot-spot candidate words with the highest frequencies are output as hot-spot words.
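The existing flow of steps 103 and 104 can be illustrated with a minimal Python sketch. The function name `mine_by_vocabulary` and the sample words are hypothetical and chosen only for illustration; the sketch assumes word segmentation has already produced a list of candidate words.

```python
from collections import Counter


def mine_by_vocabulary(candidate_words, hot_vocabulary, n):
    """Keep only candidate words found in the preset vocabulary (step 103)
    and return the n most frequent of them (step 104)."""
    counts = Counter(w for w in candidate_words if w in hot_vocabulary)
    return [word for word, _ in counts.most_common(n)]


# Hypothetical segmented text stream and manually compiled vocabulary.
words = ["earthquake", "earthquake", "fire", "weibo", "weibo", "weibo"]
vocabulary = {"earthquake", "fire"}
top_words = mine_by_vocabulary(words, vocabulary, 2)
```

In this sketch the most frequent word, "weibo", can never be output because it is absent from the vocabulary, which is the limitation discussed in the following paragraph.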

As can be seen from the above, the existing method for mining hot-spot words requires the hot-spot word vocabulary to be compiled manually, which is a heavy workload. Moreover, many newly emerging person names, place names, and organization names may be unregistered words, that is, words not included in the hot-spot word vocabulary, even though such words are often the main components of a hot event or topic. A manually compiled hot-spot word vocabulary therefore gives a narrow mining range and cannot mine such hot events or topics, so the efficiency of hot-spot word mining is low. Further, many words in the vocabulary, such as Beijing, film, and scandal, are high-frequency words in ordinary use. Many different events may involve such a word, and on a microblog platform in particular these words are very likely to turn up in users' chat conversations, so they are mentioned or appear frequently without reflecting any hot event or topic. In other words, the frequency with which a word appears within a certain time cannot by itself truly reflect how hot the word is. In addition, each output hot-spot word is a single word, and without context a single word can hardly convey a hot event or topic. For example, when the output hot-spot word is Côte d'Ivoire, a user lacking the relevant background knowledge can hardly tell which hot event or topic the word represents.

Summary of the invention

In view of this, the main purpose of the present invention is to propose a method for mining hot-spot words that can extend the mining range of hot-spot words and improve the efficiency of hot-spot word mining.

Another purpose of the present invention is to propose a device for mining hot-spot words that can extend the mining range of hot-spot words and improve the efficiency of hot-spot word mining.

To achieve the above purposes, the present invention provides a method for mining hot-spot words, the method comprising:

acquiring an input text stream, and performing word segmentation on the text stream to obtain a candidate word set;

counting the current frequency with which each candidate word in the candidate word set appears in the text stream, and acquiring the historical frequencies of each candidate word from pre-stored history data;

calculating the frequency abnormality value of the candidate word from the current frequency and the historical frequencies of the candidate word, storing the current frequency information of the candidate word in the history data, and outputting a preset number of candidate words with abnormal frequencies.

After the candidate word set is obtained, the method further comprises:

matching a preset stop-word vocabulary against the obtained candidate word set, and filtering out the words in the candidate word set that match the stop-word vocabulary.

The stop-word vocabulary comprises meaningless words and/or words with a high document frequency.

Acquiring the historical frequencies of each candidate word from the pre-stored history data comprises:

if the history data stores the historical frequencies of the candidate word, reading the historical frequencies of the candidate word;

if the history data does not store historical frequencies of the candidate word, calculating the average of the historical frequencies of all candidate words stored in the history data, and using it as the historical frequencies of the candidate word.

Calculating the frequency abnormality value of the candidate word from the current frequency and the historical frequencies of the candidate word comprises:

obtaining the mean of the historical frequencies of the candidate word from its historical frequencies;

calculating the variance of the candidate word from its historical frequencies and the obtained mean of its historical frequencies;

obtaining the absolute value of the difference between the current frequency of the candidate word and the mean of its historical frequencies, and calculating the quotient of this absolute value and the variance to obtain the frequency abnormality value of the candidate word.

Outputting the preset number of candidate words with abnormal frequencies is:

aggregating the preset number of candidate words with abnormal frequencies into word clusters, each describing one event or topic, and outputting the word clusters.

Aggregating the preset number of candidate words with abnormal frequencies into word clusters, each describing one event or topic, comprises:

based on the preset number of candidate words with abnormal frequencies, counting the number of times the phrase formed by each pair of candidate words appears in the same text stream;

counting the number of times each of the two candidate words appears in the same text stream, and obtaining the product of these two counts;

obtaining the quotient of the number of times the phrase appears in the same text stream and the product, as the pointwise mutual information distance between the two candidate words;

if the obtained pointwise mutual information distance value is greater than a preset pointwise mutual information distance threshold, combining the two candidate words corresponding to that distance value into one word cluster.

The method further comprises:

based on the selected preset number of candidate words with abnormal frequencies, or on the word clusters formed by aggregating the candidate words, triggering a search of a preset external data source, and presenting the search results to the user together with the preset number of candidate words with abnormal frequencies or the word clusters.

A device for mining hot-spot words, the device comprising a word segmentation module, a history data storage module, and a frequency abnormality value processing module, wherein:

the word segmentation module is configured to acquire an input text stream and perform word segmentation on it to obtain a candidate word set;

the history data storage module is configured to store the historical frequencies of candidate words;

the frequency abnormality value processing module is configured to count the current frequency with which each candidate word in the candidate word set appears in the text stream, calculate the frequency abnormality value of the candidate word from its current frequency and the historical frequencies stored by the history data storage module, output the counted current frequency information of the candidate word to the history data storage module, and output a preset number of candidate words with abnormal frequencies.

The device further comprises:

a denoising module, configured to match a preset stop-word vocabulary against the candidate word set obtained by the word segmentation module, and to remove as noise the words in the candidate word set that match the stop-word vocabulary.

The device further comprises:

a candidate word aggregation module, configured to receive the preset number of candidate words with abnormal frequencies output by the frequency abnormality value processing module, and to aggregate the received candidate words into word clusters, each describing one event or topic.

The device further comprises:

a search module, configured to trigger a search of a preset data source using the obtained word clusters or candidate words as search keywords, and to present to the user either the word clusters and the search results, or the candidate words and the search results.

The frequency abnormality value processing module comprises a current frequency counting unit, a historical frequency mean calculation unit, a variance calculation unit, an abnormality value calculation unit, and a candidate word output judgment unit, wherein:

the current frequency counting unit is configured to count the current frequency with which each candidate word in the candidate word set appears in the input text stream, and to output the current frequency information to both the history data storage module and the abnormality value calculation unit;

the historical frequency mean calculation unit is configured to read the historical frequencies of each candidate word stored by the history data storage module, calculate the mean of the historical frequencies of each candidate word, and output it to the abnormality value calculation unit;

the variance calculation unit is configured to calculate the variance of each candidate word from the historical frequencies of the candidate word stored by the history data storage module and the mean of the historical frequencies of the candidate word calculated by the historical frequency mean calculation unit, and to output it to the abnormality value calculation unit;

the abnormality value calculation unit is configured to calculate the abnormality value of each candidate word from its current frequency and the mean and variance of its historical frequencies;

the candidate word output judgment unit is configured to output for display the candidate words whose abnormality values are greater than a preset abnormality value threshold, or the preset number of candidate words with the largest abnormality values.

As can be seen from the above technical solution, the method and device for mining hot-spot words provided by the present invention acquire an input text stream; perform word segmentation on the text stream to obtain a candidate word set; count the current frequency with which each candidate word in the candidate word set appears in the text stream, and acquire the historical frequencies of each candidate word from pre-stored history data; and calculate the frequency abnormality value of the candidate word from its current frequency and historical frequencies, store the current frequency information of the candidate word in the history data, and output a preset number of candidate words with abnormal frequencies. In this way, by recording the historical frequencies of each candidate word in the candidate word set and combining them with the current frequency of the candidate word, the frequency abnormality value is calculated and hot-spot words are mined by that value, which extends the mining range of hot-spot words and improves the efficiency of hot-spot word mining.

Brief description of the drawings

Fig. 1 is a schematic flowchart of an existing method for mining hot-spot words.

Fig. 2 is a schematic flowchart of a method for mining hot-spot words according to an embodiment of the present invention.

Fig. 3 is a schematic flowchart of a method for extracting hot-spot words according to an embodiment of the present invention.

Fig. 4 is a schematic flowchart of a method for expanding hot-spot words according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a device for mining hot-spot words according to an embodiment of the present invention.

Detailed description of the embodiments

To make the purposes, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

In the existing method for mining hot-spot words, after the candidate word set is matched against the hot-spot word vocabulary, the N hot-spot candidate words with the highest frequencies are output as hot-spot words. Because the hot-spot word vocabulary is updated only at long intervals, many hot-spot words in the candidate word set are filtered out by the vocabulary, so the mining range is narrow and the mining efficiency is low. The embodiments of the present invention record the historical frequencies of each candidate word in the candidate word set, combine them with the current frequency of the candidate word to calculate its degree of frequency abnormality, and mine hot-spot words by that degree. The mined hot-spot words are thus independent of any hot-spot word vocabulary, which extends the mining range and thereby improves the mining efficiency.

Fig. 2 is a schematic flowchart of a method for mining hot-spot words according to an embodiment of the present invention. Referring to Fig. 2, the flow includes:

Step 201: acquire the input text stream;

In the embodiment of the present invention, because mining is based on the historical frequencies of candidate words, the period over which frequencies are calculated needs to stay consistent. Preferably, therefore, the input text stream is acquired according to a preset time period; for example, with one day as the time period, the input text stream is acquired every day.

Step 202: perform word segmentation on the text stream to obtain a candidate word set;

In this step, the obtained candidate word set may contain a large amount of noise; for example, it may include meaningless words such as " ", " ", and " ". Such words contribute nothing to the hot-spot word output and are referred to as noise. To reduce the meaningless words in the final hot-spot word output, in the embodiment of the present invention the candidate word set can be denoised, after it is obtained, according to a preset stop-word vocabulary. That is, a stop-word vocabulary is set and matched against the obtained candidate word set, and the words in the candidate word set that match the stop-word vocabulary are removed (filtered out).

As mentioned above, words such as Beijing, film, and scandal have high frequencies but cannot reflect a hot event or topic. In the embodiment of the present invention, such words are additionally placed in the stop-word vocabulary. Specifically, a batch of words with a high document frequency can be selected by analyzing a large-scale text collection and added to the stop-word vocabulary. The stop-word vocabulary thus comprises meaningless words and words with a high document frequency.
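The denoising described above might be sketched in Python as follows. The text does not say how high a document frequency must be before a word is treated as a stop word, so the `min_ratio` cutoff, the function names, and the sample documents are all assumptions made for illustration.

```python
def high_document_frequency_words(documents, min_ratio):
    """Return words that occur in at least min_ratio of the documents.

    documents is a list of token lists from a large text collection;
    min_ratio is an assumed cutoff, not a value given in the text.
    """
    doc_count = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_count[word] = doc_count.get(word, 0) + 1
    return {w for w, c in doc_count.items() if c / len(documents) >= min_ratio}


def denoise(candidate_words, stop_words):
    """Filter out candidate words that match the stop-word vocabulary."""
    return [w for w in candidate_words if w not in stop_words]


# Hypothetical collection: "beijing" and "film" occur in many documents.
corpus = [["beijing", "quake"], ["beijing", "film"],
          ["beijing", "train"], ["film", "star"]]
stop_words = high_document_frequency_words(corpus, 0.5)
kept = denoise(["beijing", "quake", "film", "train"], stop_words)
```

A manually compiled list of meaningless words could be merged into `stop_words` with a set union before filtering.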

Of course, in practical applications, after the denoised candidate word set is obtained, it can additionally be matched against a preset hot-spot word vocabulary to obtain a hot-spot candidate word set, and the statistics can be computed on that set. This gives more precise output at the cost of some recall.

Step 203: count the current frequency with which each candidate word in the candidate word set appears in the text stream, and acquire the historical frequencies of each candidate word from pre-stored history data;

In this step, as mentioned above, if the input text stream is acquired according to the preset time period, the number of times each candidate word in the candidate word set appears in the text stream is counted, and this count is the current frequency of the candidate word. If the input text stream is acquired at random, the number of times each candidate word appears in the text stream is counted and converted to the corresponding count within the preset time period, and this converted count is the current frequency of the candidate word.

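Acquiring the historical frequencies of each candidate word from the pre-stored history data comprises: if the history data stores the historical frequencies of the candidate word, reading them; if the history data does not store historical frequencies of the candidate word, calculating the average of the historical frequencies of all candidate words stored in the history data and using it as the historical frequencies of the candidate word.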

Step 204: calculate the frequency abnormality value of the candidate word from the current frequency and the historical frequencies of the candidate word, store the current frequency information of the candidate word in the history data, and output a preset number of candidate words with abnormal frequencies.

In this step, the formula for calculating the frequency abnormality value is described in detail later.

As mentioned above, when each output hot-spot word is a single word, it can hardly convey a hot event or topic without context, and a user lacking the relevant background knowledge can hardly tell which hot event or topic the word represents. In the embodiment of the present invention, outputting the preset number of candidate words with abnormal frequencies may be:

aggregating the preset number of candidate words with abnormal frequencies into word clusters, each describing one event or topic, and outputting the word clusters. Here, a word cluster is two or more candidate words that belong to the same event or topic. For example, when the output word cluster is "Côte d'Ivoire sports", a user can understand that the cluster represents an event or topic about sports in Côte d'Ivoire even without the relevant background knowledge.

Further, in the embodiment of the present invention, a search of a preset external data source can be triggered based on the selected preset number of candidate words with abnormal frequencies or on the word clusters, and the search results can be presented to the user together with the candidate words or word clusters. The user can thus learn in detail about the hot event or topic to which the presented candidate words or word clusters belong, which improves the user experience.

As can be seen from the above, the method for mining hot-spot words of the embodiment of the present invention acquires an input text stream; performs word segmentation on the text stream to obtain a candidate word set; counts the current frequency with which each candidate word in the candidate word set appears in the text stream, and acquires the historical frequencies of each candidate word from pre-stored history data; and calculates the frequency abnormality value of the candidate word from its current frequency and historical frequencies, stores the current frequency information of the candidate word in the history data, and outputs a preset number of candidate words with abnormal frequencies. In this way, by recording the historical frequencies of each candidate word in the candidate word set and combining them with the current frequency of the candidate word, the degree of frequency abnormality is calculated and hot-spot words are mined by it, which extends the mining range of hot-spot words and improves the mining efficiency. At the same time, no hot-spot word vocabulary needs to be compiled manually, which reduces the workload. Further, filtering the candidate word set with the stop-word vocabulary avoids outputting words that appear frequently but cannot reflect a hot event or topic. In addition, the preset number of candidate words with abnormal frequencies are aggregated into word clusters that each describe one event or topic, and/or a search of a preset external data source is triggered based on the word clusters or candidate words and the search results are presented to the user together with the candidate words or word clusters. A user who lacks the relevant background knowledge can thus learn in detail about the hot event or topic to which the presented candidate words or word clusters belong, which improves the user experience.

As can be seen from the above description, in the method for mining hot-spot words of the embodiment of the present invention, steps 201 to 204 are mainly the extraction flow of hot-spot words, while word cluster synthesis and search are the expansion flow of hot-spot words. The two flows are described in detail below.

Fig. 3 is a schematic flowchart of a method for extracting hot-spot words according to an embodiment of the present invention. Referring to Fig. 3, the flow includes:

Step 301: acquire the input text stream;

Step 302: perform word segmentation on the text stream to obtain a candidate word set;

Steps 301 and 302 are the same as steps 101 and 102, respectively.

Step 303: denoise the obtained candidate word set according to a preset stop-word vocabulary;

In this step, the stop-word vocabulary comprises meaningless words and/or words with a high document frequency.

Step 304: count the current frequency with which each candidate word in the denoised candidate word set appears;

In this step, after the current frequency of each candidate word is counted, the counted current frequency information is output to the history data for storage.

Step 305: acquire the historical frequencies of each candidate word from the pre-stored history data;

In this step, the historical frequencies must use the same unit as the current frequency. If the units differ, the current frequency is converted to the unit of the historical frequencies.

Step 306: according to the current frequency and the historical frequencies, obtain the preset number of candidate words with abnormal frequencies and output them.

In this step, the N words with the most prominent frequency abnormality are found and output as hot-spot words.
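The history storage and unit conversion of steps 304 and 305 can be sketched in Python as below. The class `FrequencyHistory` is a hypothetical in-memory stand-in for the pre-stored history data, and the linear rescaling in `to_period_frequency` is an assumption, since the text only says that a count is converted to the preset time period.

```python
from collections import defaultdict


class FrequencyHistory:
    """Hypothetical in-memory stand-in for the pre-stored history data."""

    def __init__(self):
        self._frequencies = defaultdict(list)

    def historical(self, word):
        """Return a copy of the stored per-period frequencies (empty if new)."""
        return list(self._frequencies.get(word, []))

    def record(self, word, frequency):
        """Append the current frequency of a word to its history."""
        self._frequencies[word].append(frequency)


def to_period_frequency(count, observed_seconds, period_seconds=86400):
    """Rescale a count linearly to the preset period (assumed conversion)."""
    return count * period_seconds / observed_seconds


history = FrequencyHistory()
before = history.historical("Wenzhou")         # no history yet
current = to_period_frequency(50, 6 * 3600)    # 50 hits in 6 hours, per day
history.record("Wenzhou", current)
```

This sketch reads the history before recording the current value, so that the current period is not part of its own baseline; that ordering is a design choice of the sketch, not something the text prescribes.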

The Gaussian distribution (normal distribution) is the most common probability distribution of a continuous random variable. It has two parameters, μ and σ²: μ is the mean of the normally distributed random variable and σ² is its variance. The distribution is denoted N(μ, σ²).

Assume that the frequency of a candidate word follows a Gaussian distribution. Then, for each candidate word, the mean of the Gaussian distribution can be obtained by counting the frequency with which the candidate word appears in each unit time period (time period) in the history data, and the variance of the Gaussian distribution can then be calculated by maximum likelihood estimation. The mean is computed as follows:

Let μ_i be the frequency with which the candidate word appears in the i-th unit time period, i.e. the i-th historical frequency. The mean μ (the average of the historical frequencies) of the Gaussian distribution corresponding to the candidate word is then:

\mu = \frac{1}{n} \sum_{i=1}^{n} \mu_{i}

where

n is the number of unit time periods counted.

For calculating the variance of the Gaussian distribution by maximum likelihood estimation, refer to the related technical literature; the details are not repeated here. For example, with one day as the unit time period, suppose the average frequency with which the word "Beijing" appears per day in microblog data is 5.7e-4 and the variance is 1.4e-5. The word "Beijing" can then be regarded as following, over the unit time periods in the history data, a Gaussian distribution with mean 5.7e-4 and variance 1.4e-5.
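As a minimal Python sketch of the parameter estimation, the function below computes the mean defined above together with the standard maximum likelihood estimate of the Gaussian variance, (1/n) Σ (μ_i − μ)². The text defers the variance formula to the literature, so this estimator and the name `gaussian_mle` are assumptions of the sketch.

```python
def gaussian_mle(historical_frequencies):
    """Return (mean, variance) of the historical frequencies.

    The variance is the maximum likelihood estimate, which divides by n
    rather than n - 1.
    """
    n = len(historical_frequencies)
    if n == 0:
        raise ValueError("at least one historical frequency is required")
    mean = sum(historical_frequencies) / n
    variance = sum((x - mean) ** 2 for x in historical_frequencies) / n
    return mean, variance


# Hypothetical daily frequencies of one word over three days.
mean, variance = gaussian_mle([2.0, 4.0, 6.0])
```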

After the Gaussian distribution of each candidate word in the history data is obtained, the "abnormality value" of each candidate word in the unit time period is calculated with the following formula:

S = \frac{|f - \mu|}{\sigma^{2}}

where

S is the abnormality value of the candidate word in the unit time period;

f is the current frequency of the candidate word;

μ is the mean of the Gaussian distribution of the candidate word in the history data, i.e. the average of its historical frequencies;

σ² is the variance of the Gaussian distribution of the candidate word in the history data.
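The formula for S translates directly into Python. In the sketch below, the function name and the handling of a zero variance are assumptions, since the text does not discuss that case; the current frequency 9.9e-4 is a hypothetical value used with the "Beijing" mean and variance quoted earlier.

```python
import math


def abnormality_value(current, mean, variance):
    """Return S = |f - mu| / sigma^2 for one candidate word."""
    if variance == 0:
        # Assumed handling: any deviation from a constant history is
        # treated as infinitely abnormal, and no deviation as not abnormal.
        return math.inf if current != mean else 0.0
    return abs(current - mean) / variance


# Mean 5.7e-4 and variance 1.4e-5 from the "Beijing" example; the current
# frequency 9.9e-4 is hypothetical.
score = abnormality_value(9.9e-4, 5.7e-4, 1.4e-5)
```

With these numbers the deviation is 4.2e-4, so the score is about 30.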

For a new word that does not appear in the history data, the embodiment of the present invention, as mentioned above, uses a smoothing strategy to supply the mean and variance of the Gaussian distribution of the new word in the history data. The mean is the average, over all words in the history data, of each word's mean historical frequency, and the variance is the average of the variances of all words in the history data.
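The smoothing strategy for new words can be sketched as follows, assuming the per-word Gaussian parameters are already available as a dictionary from word to a (mean, variance) pair. The function names and sample values are hypothetical.

```python
def smoothed_parameters(word_stats):
    """Average the means and the variances of all words in the history."""
    n = len(word_stats)
    mean = sum(m for m, _ in word_stats.values()) / n
    variance = sum(v for _, v in word_stats.values()) / n
    return mean, variance


def parameters_for(word, word_stats):
    """Return the stored (mean, variance) of a word, or the smoothed
    values if the word has no history."""
    if word in word_stats:
        return word_stats[word]
    return smoothed_parameters(word_stats)


# Hypothetical per-word (mean, variance) pairs from the history data.
stats = {"beijing": (2.0, 1.0), "film": (4.0, 3.0)}
known = parameters_for("beijing", stats)
new_word = parameters_for("wenzhou", stats)
```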

The preset number can be determined as required. For example, from the denoised candidate word set, the N candidate words with the most prominent frequency abnormality, i.e. the N candidate words with the largest abnormality values, are found and used as the set of words for subsequent processing.
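Selecting the output words can be sketched in Python with the standard `heapq.nlargest` function. Both selection rules named in the text are shown: the N largest abnormality values, and all values above a preset threshold. The function names and the sample scores are hypothetical.

```python
import heapq


def top_n_abnormal(scores, n):
    """Return the n words with the largest abnormality values."""
    return heapq.nlargest(n, scores, key=scores.get)


def above_threshold(scores, threshold):
    """Return the words whose abnormality value exceeds the threshold."""
    return [word for word, s in scores.items() if s > threshold]


# Hypothetical abnormality values after denoising.
scores = {"wenzhou": 30.0, "train": 12.5, "weather": 0.4}
hot_words = top_n_abnormal(scores, 2)
```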

In practical applications, besides modeling with a Gaussian distribution and estimating its parameters by maximum likelihood to determine how the frequency abnormality value is calculated, other distributions and parameter estimation methods, for example the χ² distribution and its parameter estimation method, can be used to determine whether the frequency of a candidate word is abnormal. Of course, because word frequency is a basic measure of a word, a word-frequency-based approach can also be used; for example, the abnormality value of a candidate word can be calculated with the following formula:

S' = tf \times IDF

where

S' is the abnormality value of the candidate word;

tf is the word frequency of the candidate word;

IDF is the inverse document frequency of the candidate word.
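A short Python sketch of the word-frequency-based alternative follows. The text does not define how IDF is computed, so the common form log(N / df), where N is the number of documents and df the number of documents containing the word, is assumed here; the function name is also hypothetical.

```python
import math


def tf_idf_abnormality(tf, doc_freq, total_docs):
    """Return S' = tf * IDF, with IDF assumed to be log(N / df)."""
    return tf * math.log(total_docs / doc_freq)


# Hypothetical counts: the word occurs 3 times now and is found in 10 of
# 1000 reference documents.
rare = tf_idf_abnormality(3, 10, 1000)
# A word found in every document gets IDF = 0 and therefore a score of 0.
common = tf_idf_abnormality(5, 1000, 1000)
```

Under this assumed IDF, words that occur in most documents score near zero however often they appear, which matches the motivation for filtering words with a high document frequency.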

Fig. 4 is a schematic flowchart of a method for expanding hot-spot words according to an embodiment of the present invention. Referring to Fig. 4, the flow includes:

Step 401: obtain the preset number of candidate words with abnormal frequencies, and aggregate the obtained candidate words into word clusters, each describing one event or topic;

This step comprises hot-spot word collection and hot-spot word aggregation. Hot-spot word collection obtains the preset number of candidate words with abnormal frequencies. Hot-spot word aggregation uses a clustering algorithm to aggregate the obtained candidate words into word clusters, each describing one event or topic; the clustering algorithm can be K-means, single-pass clustering, agglomerative clustering, spectral clustering, or a similar algorithm. The candidate words in each word cluster together describe one event or topic. To form word clusters, for each pair of candidate words, the number of times the phrase formed by the two candidate words appears in the same text stream (document) is counted, and the distance between the two candidate words is calculated using pointwise mutual information (PMI). The PMI formula is:

S_{PMI} = \frac{N_{AB}}{N_{A} \times N_{B}}

where

S_PMI is the PMI distance value between candidate word A and candidate word B;

N_AB is the number of times the phrase formed by candidate word A and candidate word B appears in the same text stream;

N_A is the number of times candidate word A appears in that text stream;

N_B is the number of times candidate word B appears in that text stream.

For example, if candidate words A and B appear 30 times and 20 times, respectively, in the same text stream, and the phrase formed by A and B appears 6 times in that text stream, the PMI distance value between candidate word A and candidate word B is 6 / (30 × 20) = 0.01.
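The worked example can be reproduced with a few lines of Python. The function implements the ratio exactly as defined in this document, which has no logarithm and no normalization by corpus size, unlike the textbook definition of pointwise mutual information; the function name is hypothetical.

```python
def pmi_distance(n_ab, n_a, n_b):
    """Return S_PMI = N_AB / (N_A * N_B) as defined in the text."""
    return n_ab / (n_a * n_b)


# A appears 30 times, B appears 20 times, and the phrase A+B appears 6 times.
s_pmi = pmi_distance(6, 30, 20)
```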

If the calculated PMI distance value is greater than the preset PMI distance threshold, the two candidate words corresponding to that PMI distance value belong to the same event or topic and can therefore be combined into one word cluster. Of course, a hot spot can also be described by extracting the most relevant phrases, sentences, or whole passages of text from the microblogs.
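One way to turn the pairwise threshold test into word clusters is a union-find structure, sketched below in Python. The text only states that two words above the threshold form a cluster; merging transitively, so that A-B and B-C put A, B, and C in one cluster, is an assumption of this sketch, as are the function name and the sample scores.

```python
def cluster_words(words, pair_scores, threshold):
    """Group words whose pairwise score exceeds the threshold.

    pair_scores maps (word_a, word_b) tuples to PMI distance values.
    Pairs are merged transitively with a union-find structure.
    """
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the root, shortening the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), score in pair_scores.items():
        if score > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())


# Hypothetical hot-spot candidates and pairwise PMI distance values.
candidates = ["bullet train", "Wenzhou", "film"]
pair_scores = {("bullet train", "Wenzhou"): 0.01,
               ("bullet train", "film"): 0.0001}
clusters = cluster_words(candidates, pair_scores, 0.005)
```

Words that pass no threshold test, such as "film" here, remain in clusters of their own.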

In practical applications, besides using the PMI distance value to judge whether two candidate words belong to the same event or topic, other distance formulas, such as the Pearson coefficient, the chi-square test, cosine similarity, or the Jaccard distance, can be used to calculate the distance between two candidate words, and whether the two candidate words belong to the same event or topic is then judged from the calculated distance. For example, each candidate word can be semantically expanded. Suppose "bullet train" and "Wenzhou" are two candidate words. The semantic expansion proceeds as follows: count all words that co-occur with "bullet train" and with "Wenzhou" in the microblogs within a predetermined time period; then, for each candidate word, build a vector (tf×IDF) from the term frequency tf and the inverse document frequency IDF. The distance between "bullet train" and "Wenzhou" can then be obtained by calculating the cosine distance between the two vectors.

In practical applications, the PMI-based value and the semantic-expansion-based distance mentioned above can also be merged into a single value by a preset weighting formula. Of course, more general clustering algorithms, such as hierarchical clustering, can also be used to aggregate words into word clusters containing more candidate words. For details of aggregating words with such algorithms, refer to the related technical literature; they are not repeated here.
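The semantic-expansion approach described above can be sketched as follows. The text does not give the exact IDF formula or the weighting formula, so `log(n_docs / df)` and the linear blend in `combined_score` are assumptions chosen for illustration. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(cooccurring_words, doc_freq, n_docs):
    """Build a tf x IDF vector from the words that co-occur with a candidate word.

    Assumption: IDF = log(n_docs / document frequency); the source does not
    specify which IDF variant is used.
    """
    tf = Counter(cooccurring_words)
    vector = {}
    for word, count in tf.items():
        df = doc_freq.get(word, 0)
        if df > 0:  # skip words with no known document frequency
            vector[word] = count * math.log(n_docs / df)
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(pmi_value, cosine_value, weight=0.5):
    """Assumed weighting formula: a linear blend of the two values."""
    return weight * pmi_value + (1.0 - weight) * cosine_value
```

In use, one vector would be built per candidate word from its co-occurring words, and `cosine_similarity` applied to the two vectors; the result could then be blended with the PMI value through `combined_score` before comparison with a threshold.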

After hot-spot word aggregation, the output consists of word clusters based on events or topics. However, without context and background knowledge, the output word clusters may still be difficult for a reader to understand. Therefore, in the embodiment of the present invention, step 402 is further performed.

Step 402: using the obtained word cluster as the search keyword, trigger a search in a preset data source, and display the word cluster and the search results to the user.

In this step, the candidate words produced by hot-spot word aggregation generally represent a certain hot event. To present a more intuitive, higher-quality overview of the event to the user, the event matched by the hot-spot word aggregation method can be further supplemented and refined with additional information, which makes browsing more convenient for the user. The obtained word cluster can be searched for in the original text stream, and identical texts among the search results are deduplicated. In addition, the word cluster can be searched for in external data sources such as news and search logs, and the relevant search results are integrated to form the hot event or topic, which is presented in the final output. In practical applications, further external data sources, such as encyclopedia data and pictures, can also be searched to extend the output. Microblog content can also be integrated on the basis of the hot-spot words, so that microblogs describing the relevant hot event are found and displayed. Compared with existing search techniques, this method achieves higher search accuracy and stronger relevance.
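The search-and-deduplicate part of this step can be sketched as below. The matching rule (a text matches when it contains every word of the cluster) and the function name `search_and_dedupe` are assumptions for illustration; the text does not specify how the search is performed.

```python
def search_and_dedupe(cluster, texts):
    """Return the texts containing every word in the cluster, without duplicates.

    Assumption: plain substring matching stands in for the search engine;
    identical texts are kept only once, in their first-seen order.
    """
    seen = set()
    results = []
    for text in texts:
        if all(word in text for word in cluster) and text not in seen:
            seen.add(text)
            results.append(text)
    return results


# Hypothetical microblog texts.
texts = [
    "Libya truce announced",
    "Libya truce announced",
    "truce talks continue",
    "Libya agrees to truce",
]
print(search_and_dedupe(["Libya", "truce"], texts))
# ['Libya truce announced', 'Libya agrees to truce']
```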

For example, "Libya, truce" is a word cluster aggregated in the first stage (step 401) to describe an event or topic. Next, in the second stage (step 402), "Libya, truce" is used as the search query to search communities such as microblogs, and duplicate texts are filtered out. At the same time, a search of the search logs shows that terms such as "the Libyan War opposition faction U.S." have repeatedly appeared together recently. Then "Libya truce" is used to search the news, which returns news about the Libya truce event. Finally, the filtered search results, the related search terms that repeatedly appear together, and the titles, summaries, pictures, and other information of the news found are displayed together to the user. Of course, in practical applications, it is also sufficient to perform only one of these searches and display its results.

Fig. 5 is a schematic structural diagram of the device for mining hot-spot words according to an embodiment of the present invention. Referring to Fig. 5, the device comprises: a word segmentation module, a historical data storage module, and a frequency anomaly value processing module, wherein,

Preferably, the device further comprises:

The device also comprises:

Wherein, the frequency anomaly value processing module comprises: a current frequency statistics unit, a historical frequency mean calculation unit, a variance calculation unit, an anomaly value calculation unit, and a candidate word output judging unit (not shown), wherein,

The current frequency statistics unit is configured to count the current frequency with which each candidate word in the candidate word set occurs in the input text stream, and to output the current frequency to the historical data storage module and to the anomaly value calculation unit respectively;
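A minimal sketch of what the current frequency statistics unit computes is given below. It assumes the text stream has already been segmented into tokens and that "frequency" means a raw occurrence count; both points, and the function name `current_frequencies`, are assumptions rather than details from the text.

```python
from collections import Counter


def current_frequencies(tokens, candidates):
    """Count how often each candidate word occurs in the segmented text stream.

    Assumption: frequency is a raw count; candidates that never occur get 0.
    """
    counts = Counter(t for t in tokens if t in candidates)
    return {word: counts[word] for word in candidates}


tokens = ["Libya", "truce", "Libya", "the"]
print(current_frequencies(tokens, {"Libya", "truce", "Wenzhou"}))
```

The resulting mapping is what would be passed on to the historical data storage module and the anomaly value calculation unit.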

The above are only preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for mining hot-spot words, characterized in that the method comprises:

2. The method of claim 1, characterized in that, after the candidate word set is obtained, the method further comprises:

3. The method of claim 2, characterized in that the stop-word list comprises: meaningless words and/or words with a high document frequency.

4. The method of claim 1, characterized in that obtaining each historical frequency of each candidate word in the pre-stored historical data comprises:

5. The method of claim 1, characterized in that calculating the frequency anomaly value of the candidate word according to the current frequency and each historical frequency of the candidate word comprises:

6. The method of any one of claims 1 to 5, characterized in that outputting the preset number of candidate words with abnormal frequencies is:

7. The method of claim 6, characterized in that aggregating the preset number of candidate words with abnormal frequencies into word clusters that describe an event or topic comprises:

8. The method of any one of claims 1 to 5, characterized in that the method further comprises:

9. A device for mining hot-spot words, characterized in that the device comprises: a word segmentation module, a historical data storage module, and a frequency anomaly value processing module, wherein,

10. The device of claim 9, characterized in that the device further comprises:

11. The device of claim 9 or 10, characterized in that the device further comprises:

12. The device of claim 11, characterized in that the device further comprises:

13. The device of claim 12, characterized in that the frequency anomaly value processing module comprises: a current frequency statistics unit, a historical frequency mean calculation unit, a variance calculation unit, an anomaly value calculation unit, and a candidate word output judging unit, wherein