CN103049443A - Method and device for mining hot-spot words - Google Patents

Publication number
CN103049443A
Authority
CN
China
Prior art keywords
word
candidate word
frequency
candidate
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103078466A
Other languages
Chinese (zh)
Inventor
罗侃
陈洪亮
杨志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2011103078466A
Publication of CN103049443A

Abstract

The invention discloses a method and a device for mining hot-spot words. The method includes: acquiring an input text stream; performing word segmentation on the text stream to obtain a candidate word set; counting the current frequency with which each candidate word in the candidate word set appears in the text stream, and acquiring the historical frequencies of each candidate word from prestored historical data; and calculating a frequency anomaly value for each candidate word according to its current frequency and historical frequencies, storing the current frequency information of the candidate word in the historical data, and outputting a preset number of candidate words with anomalous frequencies. The method and the device can extend the mining range of hot-spot words and improve mining efficiency.

Description

Method and device for mining hot-spot words
Technical field
The present invention relates to computer communication technology, and in particular to a method and a device for mining hot-spot words.
Background
With the development of computer communication technology, and especially of 3G networks and intelligent mobile terminals, users' online lives have become increasingly rich. Chatting, browsing news, watching films, playing games, searching, shopping, and publishing information online are all becoming part of everyday online life. For example, a microblog (MicroBlog) is a platform for sharing, spreading, and obtaining information based on user relationships: a user can set up a personal community through WEB, WAP, and various clients, post updates of about 140 characters, and share them instantly.
Because web content is so abundant, network users spend more and more time finding relevant information. To improve the user experience, operators use hot-spot word mining to obtain the latest news automatically and recommend it to users in a timely manner. For example, the hot-spot words contained in the text stream of microblog posts are identified automatically, and the corresponding hot information is recommended to interested users. This improves the network service and also reduces the time users need to obtain hot information.
Fig. 1 is a schematic flowchart of an existing method for mining hot-spot words. Referring to Fig. 1, the flow includes:
Step 101: obtain the input text stream.
In this step, the content of web pages and microblogs is processed to obtain the corresponding text stream. The text stream can be obtained at a predefined time period or at random.
Step 102: perform word segmentation on the text stream to obtain a candidate word set.
In this step, segmenting the text stream yields the words it contains; for details, see the related technical literature.
Step 103: match the obtained candidate word set against a preset hot-spot vocabulary to obtain a hot-spot candidate word set, and count the frequency of each hot-spot candidate word.
In this step, words of interest that may appear in hot events, such as earthquake, fire, speech, accident, Beijing, tourism, and shopping, are collected and sorted manually in advance to form the hot-spot vocabulary.
After the text stream is input, the candidate word set obtained by word segmentation is matched against the hot-spot vocabulary. If a candidate word is contained in the hot-spot vocabulary, it is placed in the hot-spot candidate word set as a hot-spot candidate word, and the number of times (or the frequency) it occurs in the candidate word set is counted. In other words, the method counts the frequency of the segmented words that appear in the hot-spot vocabulary.
Step 104: select the preset number of hot-spot candidate words with the highest frequency and output them as hot-spot words.
In this step, the N hot-spot candidate words with the highest frequency are output as hot-spot words.
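The prior-art flow of steps 103 and 104 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `top_hot_words_by_vocabulary` and the sample data are hypothetical, and word segmentation is assumed to have already produced the token list.

```python
from collections import Counter

def top_hot_words_by_vocabulary(candidates, hot_vocabulary, n):
    """Keep only candidates found in a hand-built vocabulary, then rank by raw count."""
    counts = Counter(w for w in candidates if w in hot_vocabulary)
    return [word for word, _ in counts.most_common(n)]

# Hypothetical segmented text stream and manually compiled vocabulary.
candidates = ["earthquake", "hello", "earthquake", "shopping",
              "newname", "newname", "newname"]
hot_vocabulary = {"earthquake", "fire", "shopping"}

print(top_hot_words_by_vocabulary(candidates, hot_vocabulary, 2))
# ['earthquake', 'shopping']
```

Note that "newname" is the most frequent token but is never output, because it is not in the vocabulary.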
As can be seen from the above, the existing method requires the hot-spot vocabulary to be compiled manually, which is labor-intensive. In addition, many newly emerging personal names, place names, and organization names are unregistered words, that is, words not yet included in the hot-spot vocabulary, yet these words are often the core of a hot event or topic. A manually compiled hot-spot vocabulary therefore has a narrow mining range and cannot capture such events or topics, so mining efficiency is low. Furthermore, many words in the vocabulary, such as Beijing, film, and scandal, occur frequently in any case because many different events involve them. On a microblog platform in particular, words such as Beijing and scandal are very likely to turn up in ordinary conversation between users, so they are mentioned often, but their frequent occurrence does not reflect any one hot event or topic. In other words, the frequency of a word within a given period cannot by itself reflect how hot the word is. Finally, the output hot-spot words are single words, and without context a single word can hardly convey a hot event or topic. For example, if the output hot-spot word is Côte d'Ivoire, a user without the relevant background knowledge will find it hard to tell which hot event or topic the word stands for.
Summary of the invention
In view of this, the main object of the present invention is to provide a method for mining hot-spot words that can extend the mining range of hot-spot words and improve mining efficiency.
Another object of the present invention is to provide a device for mining hot-spot words that can extend the mining range of hot-spot words and improve mining efficiency.
To achieve the above objects, the present invention provides a method for mining hot-spot words, the method comprising:
obtaining an input text stream and performing word segmentation on the text stream to obtain a candidate word set;
counting the current frequency with which each candidate word in the candidate word set occurs in the text stream, and obtaining the historical frequencies of each candidate word from pre-stored historical data;
calculating a frequency anomaly value of the candidate word according to its current frequency and its historical frequencies, storing the current frequency information of the candidate word in the historical data, and outputting a preset number of candidate words with anomalous frequency.
After the candidate word set is obtained, the method further comprises:
matching a preset stop word list against the obtained candidate word set, and filtering out the words in the candidate word set that match the stop word list.
The stop word list comprises meaningless words and/or high-document-frequency words.
Obtaining the historical frequencies of each candidate word from the pre-stored historical data comprises:
if the historical data stores historical frequencies for the candidate word, reading those historical frequencies;
if the historical data stores no historical frequency for the candidate word, calculating the mean of the historical frequencies of all candidate words stored in the historical data and using it as the historical frequency of the candidate word.
Calculating the frequency anomaly value of the candidate word according to its current frequency and its historical frequencies comprises:
obtaining the mean of the candidate word's historical frequencies;
calculating the variance of the candidate word from its historical frequencies and the obtained mean;
obtaining the absolute value of the difference between the candidate word's current frequency and the mean of its historical frequencies, and dividing this absolute value by the variance to obtain the frequency anomaly value of the candidate word.
Outputting the preset number of candidate words with anomalous frequency is:
aggregating the preset number of candidate words with anomalous frequency into word clusters, each describing one event or topic, and outputting the word clusters.
Aggregating the preset number of candidate words with anomalous frequency into word clusters that each describe one event or topic comprises:
for the preset number of candidate words with anomalous frequency, counting the number of times the phrase formed by each pair of candidate words occurs in the same text stream;
counting the number of times each of the two candidate words occurs in the same text stream, and obtaining the product of the two counts;
dividing the number of times the phrase occurs in the same text stream by the product, and using the quotient as the pointwise mutual information distance between the two candidate words;
if the obtained pointwise mutual information distance value is greater than a preset threshold, merging the two corresponding candidate words into one word cluster.
The method further comprises:
based on the selected preset number of candidate words with anomalous frequency, or on the word clusters formed by aggregating them, triggering a search of a preset external data source, and presenting the search results to the user together with the candidate words or word clusters.
A device for mining hot-spot words, the device comprising a word segmentation module, a historical data storage module, and a frequency anomaly value processing module, wherein:
the word segmentation module is configured to obtain an input text stream and perform word segmentation on the text stream to obtain a candidate word set;
the historical data storage module is configured to store the historical frequencies of candidate words;
the frequency anomaly value processing module is configured to count the current frequency with which each candidate word in the candidate word set occurs in the text stream, calculate the frequency anomaly value of the candidate word according to its current frequency and the historical frequencies stored in the historical data storage module, output the current frequency information of the candidate word to the historical data storage module, and output a preset number of candidate words with anomalous frequency.
The device further comprises:
a denoising module, configured to match a preset stop word list against the candidate word set obtained by the word segmentation module and remove from the candidate word set the words that match the stop word list.
The device further comprises:
a candidate word aggregation module, configured to receive the preset number of candidate words with anomalous frequency output by the frequency anomaly value processing module and aggregate them into word clusters, each describing one event or topic.
The device further comprises:
a search module, configured to trigger a search of a preset data source using the obtained word clusters or candidate words as search keywords, and present to the user either the word clusters and the search results or the candidate words and the search results.
The frequency anomaly value processing module comprises a current frequency statistics unit, a historical frequency mean calculation unit, a variance calculation unit, an anomaly value calculation unit, and a candidate word output judging unit, wherein:
the current frequency statistics unit is configured to count the current frequency with which each candidate word in the candidate word set occurs in the input text stream, and output the current frequency information to both the historical data storage module and the anomaly value calculation unit;
the historical frequency mean calculation unit is configured to read the historical frequencies of each candidate word stored in the historical data storage module, calculate their mean, and output the mean to the anomaly value calculation unit;
the variance calculation unit is configured to calculate the variance of each candidate word from the historical frequencies stored in the historical data storage module and the mean calculated by the historical frequency mean calculation unit, and output the variance to the anomaly value calculation unit;
the anomaly value calculation unit is configured to calculate the anomaly value of each candidate word from its current frequency, the mean of its historical frequencies, and the variance;
the candidate word output judging unit is configured to output for display either the candidate words whose anomaly value exceeds a preset anomaly value threshold or the preset number of candidate words with the largest anomaly values.
As can be seen from the above technical solutions, the method and device for mining hot-spot words provided by the present invention obtain an input text stream; perform word segmentation on the text stream to obtain a candidate word set; count the current frequency with which each candidate word in the candidate word set occurs in the text stream, and obtain the historical frequencies of each candidate word from pre-stored historical data; calculate the frequency anomaly value of the candidate word according to its current frequency and its historical frequencies; store the current frequency information of the candidate word in the historical data; and output a preset number of candidate words with anomalous frequency. By recording the historical frequencies of each candidate word, combining them with the word's current frequency to calculate a frequency anomaly value, and mining hot-spot words by that value, the invention can extend the mining range of hot-spot words and improve mining efficiency.
Description of the drawings
Fig. 1 is a schematic flowchart of an existing method for mining hot-spot words.
Fig. 2 is a schematic flowchart of the method for mining hot-spot words according to an embodiment of the present invention.
Fig. 3 is a schematic flowchart of the method for extracting hot-spot words according to an embodiment of the present invention.
Fig. 4 is a schematic flowchart of the method for expanding hot-spot words according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the device for mining hot-spot words according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In the existing method for mining hot-spot words, after the candidate word set is matched against the hot-spot vocabulary, the N hot-spot candidate words with the highest frequency are output as hot-spot words. Because the hot-spot vocabulary is updated infrequently, many hot-spot words in the candidate word set are filtered out by the vocabulary, so the mining range is narrow and mining efficiency is low. The embodiment of the present invention instead records the historical frequencies of each candidate word in the candidate word set, combines them with the word's current frequency to calculate its degree of frequency anomaly, and mines hot-spot words by that degree. The hot-spot words mined in this way do not depend on a hot-spot vocabulary, which extends the mining range and thereby improves mining efficiency.
Fig. 2 is a schematic flowchart of the method for mining hot-spot words according to an embodiment of the present invention. Referring to Fig. 2, the flow includes:
Step 201: obtain the input text stream.
In the embodiment of the present invention, mining is based on the historical frequencies of candidate words, so the period over which frequencies are calculated should stay consistent. Preferably, therefore, the input text stream is obtained at a preset time period. For example, with one day as the time period, the input text stream is obtained every day.
Step 202: perform word segmentation on the text stream to obtain a candidate word set.
The candidate word set obtained in this step may contain a large amount of noise: it includes meaningless words that contribute nothing to hot-spot word output. To reduce the meaningless words in the final output, in the embodiment of the present invention the candidate word set can be denoised after it is obtained, using a preset stop word list. That is, the stop word list is matched against the obtained candidate word set, and the words in the candidate word set that match the stop word list are removed (filtered out).
As noted above, words such as Beijing, film, and scandal occur with high frequency but do not reflect any one hot event or topic. In the embodiment of the present invention, such words are also placed in the stop word list. Specifically, a batch of high-document-frequency words can be selected by analyzing a large text collection and added to the stop word list, so that the stop word list comprises both meaningless words and high-document-frequency words.
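A minimal sketch of this denoising step follows. It assumes that the "document rate" of a word is the share of documents in a reference collection that contain it; the cutoff `max_document_rate`, the function names, and the sample data are all hypothetical choices for illustration, not values given in the text.

```python
def build_high_document_rate_words(documents, max_document_rate):
    """Return words that occur in more than max_document_rate of the documents.

    documents: list of token lists from a large reference collection.
    """
    doc_count = {}
    for tokens in documents:
        for word in set(tokens):          # count each word once per document
            doc_count[word] = doc_count.get(word, 0) + 1
    total = len(documents)
    return {w for w, c in doc_count.items() if c / total > max_document_rate}

def remove_noise(candidates, stop_words):
    """Drop every candidate that matches the stop word list."""
    return [w for w in candidates if w not in stop_words]

# Hypothetical reference collection and hand-picked meaningless words.
documents = [["beijing", "train"], ["beijing", "film"],
             ["beijing", "rain"], ["the", "beijing"]]
meaningless_words = {"the"}
stop_words = meaningless_words | build_high_document_rate_words(documents, 0.75)

print(remove_noise(["the", "beijing", "train"], stop_words))
# ['train']
```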
Of course, in practice, the denoised candidate word set can additionally be matched against a preset hot-spot vocabulary to obtain a hot-spot candidate word set, with the statistics then computed on that set. This yields more precise output at the cost of some recall.
Step 203: count the current frequency with which each candidate word in the candidate word set occurs in the text stream, and obtain the historical frequencies of each candidate word from the pre-stored historical data.
In this step, as described above, if the input text stream is obtained at the preset time period, the number of times each candidate word occurs in the text stream is counted, and this count is the current frequency of the candidate word. If the input text stream is obtained at random, the number of times each candidate word occurs in the text stream is counted and converted to the corresponding count for the preset time period, and this converted count is the current frequency of the candidate word.
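The conversion for a randomly obtained text stream is not spelled out in the text. One plausible reading, sketched below, is a simple proportional scaling from the observed duration to the preset period; the function name `scale_to_period` and the one-day default are assumptions made for illustration.

```python
def scale_to_period(count, observed_seconds, period_seconds=86400):
    """Convert a count observed over observed_seconds to a count per preset period.

    Assumes occurrences are spread evenly over time (proportional scaling).
    """
    if observed_seconds <= 0:
        raise ValueError("observed_seconds must be positive")
    return count * period_seconds / observed_seconds

# A word seen 50 times in a 6-hour sample, scaled to a one-day period.
print(scale_to_period(50, 6 * 3600))
# 200.0
```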
Obtaining the historical frequencies of each candidate word from the pre-stored historical data includes:
if the historical data stores historical frequencies for the candidate word, reading those historical frequencies;
if the historical data stores no historical frequency for the candidate word, calculating the mean of the historical frequencies of all candidate words stored in the historical data and using it as the historical frequency of the candidate word.
Step 204: calculate the frequency anomaly value of each candidate word according to its current frequency and its historical frequencies, store the current frequency information of the candidate word in the historical data, and output a preset number of candidate words with anomalous frequency.
The formula for calculating the frequency anomaly value in this step is described in detail later.
As noted above, when the output hot-spot words are single words, a single word without context can hardly convey a hot event or topic, and a user without the relevant background knowledge will find it hard to tell which hot event or topic the word stands for. In the embodiment of the present invention, outputting the preset number of candidate words with anomalous frequency can therefore be:
aggregating the preset number of candidate words with anomalous frequency into word clusters, each describing one event or topic, and outputting the word clusters. A word cluster is two or more candidate words that belong to the same event or topic. For example, if the output word cluster is "Côte d'Ivoire sports", even a user without the relevant background knowledge can tell that the cluster stands for an event or topic about sports in Côte d'Ivoire.
Furthermore, in the embodiment of the present invention, a search of a preset external data source can be triggered based on the selected candidate words with anomalous frequency or on the word clusters, and the search results can be presented to the user together with the candidate words or word clusters. The user can then learn the details of the hot event or topic behind the displayed candidate words or word clusters, which improves the user experience.
As can be seen from the above, the method for mining hot-spot words according to the embodiment of the present invention obtains an input text stream; performs word segmentation on the text stream to obtain a candidate word set; counts the current frequency with which each candidate word in the candidate word set occurs in the text stream, and obtains the historical frequencies of each candidate word from pre-stored historical data; calculates the frequency anomaly value of the candidate word according to its current frequency and its historical frequencies; stores the current frequency information of the candidate word in the historical data; and outputs a preset number of candidate words with anomalous frequency. By recording the historical frequencies of each candidate word, combining them with the word's current frequency to calculate its degree of frequency anomaly, and mining hot-spot words by that degree, the method extends the mining range of hot-spot words and improves mining efficiency. It also removes the need to compile a hot-spot vocabulary manually, which reduces the workload. In addition, filtering the candidate word set with the stop word list prevents the output of words that occur frequently but do not reflect a hot event or topic. Finally, by aggregating the candidate words with anomalous frequency into word clusters that each describe one event or topic, and/or by triggering a search of a preset external data source based on the word clusters or candidate words and presenting the search results together with them, the method lets a user without the relevant background knowledge learn the details of the hot event or topic behind the displayed candidate words or word clusters, which improves the user experience.
From the above description, steps 201 to 204 of the method are mainly the extraction flow for hot-spot words, while word cluster formation and search make up the expansion flow for hot-spot words. Each is described in detail below.
Fig. 3 is a schematic flowchart of the method for extracting hot-spot words according to an embodiment of the present invention. Referring to Fig. 3, the flow includes:
Step 301: obtain the input text stream.
Step 302: perform word segmentation on the text stream to obtain a candidate word set.
Steps 301 and 302 are the same as steps 101 and 102, respectively.
Step 303: denoise the obtained candidate word set using a preset stop word list.
In this step, the stop word list comprises meaningless words and/or high-document-frequency words.
Step 304: count the current frequency with which each candidate word in the denoised candidate word set occurs.
In this step, after the current frequency of each candidate word is counted, the current frequency information is output to the historical data for storage.
Step 305: obtain the historical frequencies of each candidate word from the pre-stored historical data.
In this step, the historical frequency must use the same unit as the current frequency. If the two units differ, the current frequency is converted to the unit of the historical frequency.
Step 306: according to the current frequency and the historical frequencies, obtain and output a preset number of candidate words with anomalous frequency.
In this step, the N words with the most pronounced frequency anomaly are found and output as hot-spot words.
The Gaussian distribution (normal distribution) is one of the most common probability distributions of a continuous random variable. It has two parameters, $\mu$ and $\sigma^2$: $\mu$ is the mean of the normally distributed random variable and $\sigma^2$ is its variance. The distribution is denoted $N(\mu, \sigma^2)$.
Assume that the frequency of a candidate word follows a Gaussian distribution. Then, for each candidate word, the mean of the Gaussian distribution can be obtained from the historical data by counting the frequency with which the word occurs in each unit time period (time cycle), and the variance of the Gaussian distribution can then be calculated by maximum likelihood estimation. The mean is calculated as follows:
Let $\mu_i$ be the frequency with which the candidate word occurs in the $i$-th unit time period, that is, the $i$-th historical frequency. The mean $\mu$ of the Gaussian distribution for this candidate word (the mean of its historical frequencies) is then:
$$\mu = \frac{1}{n}\sum_{i=1}^{n}\mu_i$$
where
$n$ is the number of unit time periods counted.
For the calculation of the variance of the Gaussian distribution by maximum likelihood estimation, see the related technical literature; it is not repeated here. As an example, with one day as the unit time period, suppose the word "Beijing" occurs in the microblog data with a mean daily frequency of 5.7e-4 and a variance of 1.4e-5. The frequency of "Beijing" per unit time period in the historical data can then be regarded as following a Gaussian distribution with mean 5.7e-4 and variance 1.4e-5.
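For reference, the standard maximum likelihood estimates for a Gaussian are the sample mean and the variance with divisor n (not n - 1). The sketch below computes both from a list of historical frequencies; the function name `gaussian_mle` and the sample history are hypothetical.

```python
def gaussian_mle(history):
    """Maximum likelihood estimates (mean, variance) for a Gaussian.

    history: historical frequencies, one per unit time period.
    The variance uses divisor n, which is the maximum likelihood form.
    """
    n = len(history)
    if n == 0:
        raise ValueError("history must not be empty")
    mean = sum(history) / n
    variance = sum((x - mean) ** 2 for x in history) / n
    return mean, variance

# Hypothetical daily counts of one word over four days.
print(gaussian_mle([4.0, 6.0, 5.0, 5.0]))
# (5.0, 0.5)
```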
Once the Gaussian distribution of each candidate word in the historical data has been obtained, the "anomaly value" of each candidate word in the unit time period is calculated with the following formula:
$$S = \frac{|f - \mu|}{\sigma^2}$$
where
$S$ is the anomaly value of the candidate word in the unit time period;
$f$ is the current frequency of the candidate word;
$\mu$ is the mean of the candidate word's Gaussian distribution in the historical data, that is, the mean of its historical frequencies;
$\sigma^2$ is the variance of the candidate word's Gaussian distribution in the historical data.
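The anomaly value formula translates directly into code. In the sketch below, the function name `anomaly_value` and the small floor `min_variance` are assumptions added for illustration; the floor only guards against division by zero when a word's history is perfectly constant, a case the text does not address.

```python
def anomaly_value(current, mean, variance, min_variance=1e-12):
    """S = |f - mu| / sigma^2, with a hypothetical floor on the variance."""
    return abs(current - mean) / max(variance, min_variance)

# A word that usually occurs about 5 times per period (variance 0.5)
# and occurs 9 times in the current period.
print(anomaly_value(9.0, 5.0, 0.5))
# 8.0
```

Because the denominator is the variance, a stable word (small variance) that suddenly changes frequency scores much higher than a volatile word with the same absolute change.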
For new words that do not appear in the historical data, the embodiment of the present invention, as described above, uses a smoothing strategy to supply the mean and variance of the word's Gaussian distribution in the historical data: the mean is the average of the mean historical frequencies of all words in the historical data, and the variance is the average of the variances of all words in the historical data.
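The smoothing strategy for unseen words can be sketched as a lookup with a fallback. The layout of `history_stats` (a dictionary from word to a `(mean, variance)` pair) and the function name `stats_with_smoothing` are assumptions made for this example.

```python
def stats_with_smoothing(word, history_stats):
    """Return (mean, variance) for a word, falling back to global averages.

    history_stats: dict mapping each known word to its (mean, variance).
    For an unseen word, the mean is the average of all per-word means and
    the variance is the average of all per-word variances.
    """
    if word in history_stats:
        return history_stats[word]
    n = len(history_stats)
    if n == 0:
        raise ValueError("no historical data available for smoothing")
    avg_mean = sum(m for m, _ in history_stats.values()) / n
    avg_variance = sum(v for _, v in history_stats.values()) / n
    return avg_mean, avg_variance

# Hypothetical per-word statistics from the historical data.
history_stats = {"beijing": (6.0, 2.0), "film": (2.0, 1.0)}
print(stats_with_smoothing("newname", history_stats))
# (4.0, 1.5)
```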
The preset number can be chosen as required. For example, from the denoised candidate word set, the N candidate words with the most pronounced frequency anomaly, that is, the N candidate words with the largest anomaly values, are selected as the set of hot-spot words for subsequent processing.
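Selecting the N candidates with the largest anomaly values is a standard top-N problem. The sketch below uses `heapq.nlargest` from the Python standard library; the function name `top_n_anomalous` and the scores are hypothetical.

```python
import heapq

def top_n_anomalous(scores, n):
    """Return the n words with the largest anomaly values, largest first.

    scores: dict mapping each candidate word to its anomaly value.
    """
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical anomaly values for three denoised candidates.
scores = {"wenzhou": 12.5, "beijing": 0.3, "train": 9.1}
print(top_n_anomalous(scores, 2))
# ['wenzhou', 'train']
```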
In practice, besides modeling with a Gaussian distribution and estimating its parameters by maximum likelihood to determine how the frequency anomaly value is calculated, other distributions and parameter estimation methods can be used to determine whether the frequency of a candidate word is anomalous, for example the $\chi^2$ distribution and its parameter estimation method. Since word frequency is the basic measure of a word, a word-frequency-based approach can also be used. For example, the anomaly value of a candidate word can be calculated with the following formula:
$$S' = tf \times IDF$$
where
$S'$ is the anomaly value of the candidate word;
$tf$ is the term frequency of the candidate word;
$IDF$ is the inverse document frequency of the candidate word.
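The text does not define how tf and IDF are computed, so the sketch below makes two common assumptions: tf is the word's share of tokens in the current text stream, and IDF is the logarithm of the number of reference documents divided by the number of documents containing the word. The function name `tf_idf` and the data are hypothetical.

```python
import math

def tf_idf(word, current_tokens, documents):
    """S' = tf * IDF under assumed definitions of tf and IDF.

    current_tokens: token list for the current text stream.
    documents: list of token sets from a reference collection.
    """
    tf = current_tokens.count(word) / len(current_tokens)
    df = sum(1 for doc in documents if word in doc)
    # Assumption: treat an unseen word as occurring in one document
    # to avoid division by zero.
    idf = math.log(len(documents) / max(df, 1))
    return tf * idf

documents = [{"beijing", "film"}, {"beijing"}, {"wenzhou", "beijing"}, {"film"}]
current_tokens = ["wenzhou", "train", "wenzhou", "beijing"]

for word in ("wenzhou", "beijing"):
    print(word, tf_idf(word, current_tokens, documents))
```

Under these assumptions "wenzhou" scores higher than "beijing", because it is frequent in the current stream but rare in the reference collection.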
Fig. 4 is the method flow synoptic diagram of embodiment of the invention focus word expansion.Referring to Fig. 4, this flow process comprises:
Step 401 is obtained the candidate word of the predetermined number of frequency anomaly, the candidate word of obtaining is aggregated into to describe the word bunch of an event or theme;
This step comprises focus set of words and the polymerization of focus word, the focus set of words is namely obtained the candidate word of the predetermined number of frequency anomaly, the candidate word that the polymerization of focus word is about to obtain aggregates into to describe the word bunch of an event or theme by clustering algorithm, and clustering algorithm can be the K method of average (K-means), Once-clustering (Single-pass clustering), solidify cluster (Agglomerative clustering), spectral clustering scheduling algorithm.Candidate word in each word bunch can be in order to describe an event or theme, to per two candidate word, add up the number of times of phrase appearance in one text stream (document) of these two candidate word compositions, and mutual information (PMI between the use point, Pointwise Mutual Information) mode is calculated the distance between these two candidate word, in order to form word bunch, the computing formula of PMI is:
S_PMI = N_AB / (N_A × N_B)
In the formula,
S_PMI is the PMI distance value between candidate word A and candidate word B;
N_AB is the number of times the phrase formed by candidate word A and candidate word B appears in the same text stream;
N_A is the number of times candidate word A appears in that text stream;
N_B is the number of times candidate word B appears in that text stream.
For example, if candidate words A and B appear 30 times and 20 times respectively in the same text stream, and the phrase formed by A and B appears 6 times in that text stream, then the PMI distance value between candidate word A and candidate word B is 6/(30×20) = 0.01.
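The PMI distance value defined above can be sketched in Python as follows. The function name `pmi_distance` and the guard against zero counts are assumptions made for illustration; the formula itself is the quotient given in the text.

```python
def pmi_distance(n_ab, n_a, n_b):
    """PMI distance value as defined in the text: N_AB / (N_A * N_B).

    n_ab: times the phrase formed by A and B appears in the text stream.
    n_a, n_b: times A and B each appear in the same text stream.
    """
    if n_a <= 0 or n_b <= 0:
        # Assumption: a word that never appears cannot be paired.
        raise ValueError("n_a and n_b must be positive")
    return n_ab / (n_a * n_b)


# Worked example from the text: A appears 30 times, B 20 times, the phrase 6 times.
print(pmi_distance(6, 30, 20))  # 0.01
```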
If the calculated PMI distance value is greater than a preset PMI distance threshold, the two candidate words corresponding to that PMI distance value belong to the same event or topic and can therefore be merged into one word cluster. Alternatively, a hot spot can be described by extracting the most relevant phrases, sentences, or whole passages from the microblogs.
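The threshold rule can be sketched as follows. The text only states that two candidate words whose PMI distance value exceeds the threshold are merged into one word cluster; chaining such pairs transitively with a union-find structure is an assumption of this sketch, as are the names `cluster_words`, `distance`, and the sample values.

```python
def cluster_words(words, distance, threshold):
    """Group words into clusters; a pair is merged when distance(a, b) > threshold.

    distance: callable returning the PMI distance value of a pair (hypothetical).
    Pairs are chained transitively (an assumption, not stated in the text).
    """
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster representative, with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if distance(a, b) > threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())


# Illustrative PMI distance values, not taken from any real data.
pair_scores = {
    frozenset(("Libya", "truce")): 0.01,
    frozenset(("motor-car", "Wenzhou")): 0.02,
}
clusters = cluster_words(
    ["Libya", "truce", "motor-car", "Wenzhou", "weather"],
    lambda a, b: pair_scores.get(frozenset((a, b)), 0.0),
    threshold=0.005,
)
```

A word that exceeds the threshold with no other word remains in a cluster of its own.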
In practice, besides using the PMI distance value to judge whether two candidate words belong to the same event or topic, other distance formulas, such as the Pearson coefficient, the chi-square test, cosine similarity, or the Jaccard distance, can be used to calculate the distance between two candidate words, and whether the two candidate words belong to the same event or topic can be judged from the calculated distance. For example, each candidate word can be semantically expanded. Suppose "motor-car" and "Wenzhou" are two candidate words. The semantic expansion proceeds as follows: collect all words that co-occur with "motor-car" and with "Wenzhou" in the microblogs within a predetermined time period; then build a vector (tf × IDF) for each candidate word from the word frequency tf and the inverse document frequency IDF. The distance between "motor-car" and "Wenzhou" can then be calculated as the cosine distance between the two vectors. In practice, the PMI-based value and the semantic-expansion distance described above can also be combined into a single value by a preset weighting formula. More general clustering algorithms, such as hierarchical clustering, can also be used to aggregate words into word clusters containing more candidate words; for details, refer to the relevant technical literature, which is not repeated here.
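A minimal Python sketch of the semantic-expansion distance follows. It assumes that each candidate word is represented by a sparse vector mapping co-occurring words to tf × IDF weights, and that the preset weighting formula is a linear combination with a weight `alpha`; neither detail is specified in the text, and all names and values are illustrative.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an empty vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def combined_distance(pmi_value, cosine_value, alpha=0.5):
    """Merge the two values with a linear weight (an assumed weighting formula)."""
    return alpha * pmi_value + (1.0 - alpha) * cosine_value


# Illustrative tf x IDF weights of words co-occurring with each candidate word.
motor_car = {"crash": 2.0, "rail": 1.0, "rescue": 0.5}
wenzhou = {"crash": 1.5, "rail": 1.0, "city": 0.8}
similarity = cosine_similarity(motor_car, wenzhou)
```

Because the two measures may lie on different scales, a real weighting formula would probably need to normalize them before combining; that step is omitted here.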
After hot-spot word aggregation, the output is a set of word clusters based on events or topics. However, without context and background knowledge, the output word clusters may still be hard to understand. Therefore, in this embodiment of the invention, step 402 is further performed.
Step 402: using the obtained word clusters as search keywords, trigger a search of the preset data sources, and show the word clusters and the search results to the user.
In this step, the candidate words produced by hot-spot word aggregation generally represent a particular hot-spot event. To present the user with a more intuitive, higher-quality overview of the event, the event matched by the hot-spot word aggregation method can be further supplemented with information, making it easier for the user to browse. The obtained word clusters can be searched for in the original text stream, and identical texts returned by the search are de-duplicated. In addition, the word clusters can be searched for in external data sources such as news and search logs, and the relevant search results are integrated to form a hot-spot event or topic, which is shown in the final output. In practice, further external data sources such as encyclopedia data and pictures can also be searched to expand the output, and microblog content can be integrated based on the hot-spot words, so that microblogs describing the relevant hot-spot event are found and shown. Compared with existing search techniques, this method offers higher search accuracy and stronger relevance.
For example, "Libya, truce" is a word cluster, aggregated in the first stage (step 401), that describes an event or topic. Next, in the second stage (step 402), "Libya, truce" is used as the search query in communities such as microblogs, and duplicate texts are filtered out. At the same time, a search of the search logs finds that words such as "Libyan war, opposition, U.S." have recently appeared together repeatedly. Then, a news search for "Libya truce" returns news about the Libya truce event. Finally, the filtered search results, the related search terms that repeatedly appear together, and information such as the headlines, summaries, and pictures of the news returned by the search are shown to the user together. In practice, it is also possible to perform only one of these searches and show its results.
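The search-and-merge flow of step 402 can be sketched in Python as below. The data sources are modeled as hypothetical callables that take a query string and return a list of texts; no real search API is implied. The text says identical texts are de-duplicated, and treating texts that differ only in case or whitespace as identical is an additional assumption.

```python
def search_and_merge(cluster, sources):
    """Query every data source with the word cluster and merge unique results.

    cluster: list of words forming one word cluster.
    sources: {source_name: callable(query) -> list of texts} (hypothetical).
    Returns a list of (source_name, text) pairs in the order found.
    """
    query = " ".join(cluster)
    seen = set()
    merged = []
    for name, search in sources.items():
        for text in search(query):
            # Normalize case and whitespace before comparing (an assumption).
            key = " ".join(text.split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append((name, text))
    return merged


# Stub sources with made-up texts, standing in for microblog and news search.
sources = {
    "microblog": lambda q: ["Libya truce announced", "libya  truce announced", "Talks resume"],
    "news": lambda q: ["Talks resume", "Libya ceasefire news"],
}
merged = search_and_merge(["Libya", "truce"], sources)
```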
Fig. 5 is a schematic structural diagram of the device for mining hot-spot words according to an embodiment of the present invention. Referring to Fig. 5, the device comprises a word segmentation module, a historical data storage module, and a frequency anomaly value processing module, wherein:
The word segmentation module is configured to obtain the input text stream and perform word segmentation on the text stream to obtain a candidate word set;
The historical data storage module is configured to store each historical frequency of the candidate words;
The frequency anomaly value processing module is configured to count the current frequency with which each candidate word in the candidate word set appears in the text stream, calculate the frequency anomaly value of the candidate word according to the current frequency of the candidate word and each historical frequency of the candidate word stored in the historical data storage module, output the current frequency information of the candidate word to the historical data storage module, and output the predetermined number of candidate words with abnormal frequency.
Preferably, the device further comprises:
A denoising module, configured to match a preset stop-word list against the candidate word set obtained by the word segmentation module, and to denoise the candidate word set by filtering out the words that match the stop-word list.
The device further comprises:
A candidate word aggregation module, configured to receive the predetermined number of candidate words with abnormal frequency output by the frequency anomaly value processing module, and to aggregate the obtained candidate words into word clusters, each describing an event or topic.
The device further comprises:
A search module, configured to use the obtained word clusters or candidate words as search keywords, trigger a search of the preset data sources, and show the user either the word clusters and the search results or the candidate words and the search results.
The frequency anomaly value processing module comprises a current frequency counting unit, a historical frequency mean calculation unit, a variance calculation unit, an anomaly value calculation unit, and a candidate word output judging unit (not shown), wherein:
The current frequency counting unit is configured to count the current frequency with which each candidate word in the candidate word set appears in the input text stream, and to output the current frequency to both the historical data storage module and the anomaly value calculation unit;
The historical frequency mean calculation unit is configured to read the historical frequencies of each candidate word stored in the historical data storage module, calculate the mean of the historical frequencies of each candidate word, and output the mean to the anomaly value calculation unit;
The variance calculation unit is configured to calculate the variance of each candidate word according to the historical frequencies of the candidate word stored in the historical data storage module and the mean of the historical frequencies of the candidate word calculated by the historical frequency mean calculation unit, and to output the variance to the anomaly value calculation unit;
The anomaly value calculation unit is configured to calculate the anomaly value of each candidate word according to the current frequency of the candidate word and the mean and variance of its historical frequencies;
The candidate word output judging unit is configured to output for display either the candidate words whose anomaly value is greater than a preset anomaly value threshold or the predetermined number of candidate words with the largest anomaly values.
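The cooperation of these units can be sketched in Python as follows. The anomaly value is taken from claim 5 (the absolute difference between the current frequency and the historical mean, divided by the variance). The use of the population variance, the handling of a zero variance, and all names are assumptions of this sketch.

```python
import heapq
from statistics import mean, pvariance


def anomaly_value(current, history):
    """|current - mean(history)| / variance(history), as described in claim 5."""
    mu = mean(history)
    var = pvariance(history, mu)  # population variance (an assumption)
    diff = abs(current - mu)
    if var == 0:
        # Assumption: with a constant history, any deviation is maximally abnormal.
        return float("inf") if diff > 0 else 0.0
    return diff / var


def top_anomalous(current_freqs, history, n):
    """Return the n candidate words with the largest anomaly values."""
    scores = {
        word: anomaly_value(freq, history[word])
        for word, freq in current_freqs.items()
        if word in history  # words without stored history are skipped here
    }
    return heapq.nlargest(n, scores, key=scores.get)


# Illustrative frequencies, not taken from any real data.
history = {
    "quake": [2, 4, 4, 4, 5, 5, 7, 9],
    "the": [5, 5, 5, 6, 4, 5, 5, 5],
}
hot_words = top_anomalous({"quake": 13, "the": 5}, history, n=1)
```

Selecting by a preset anomaly value threshold instead of the top n would only change the last step, for example by keeping every word whose score exceeds the threshold.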
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (13)

1. A method for mining hot-spot words, characterized in that the method comprises:
obtaining an input text stream, and performing word segmentation on the text stream to obtain a candidate word set;
counting the current frequency with which each candidate word in the candidate word set appears in the text stream, and obtaining each historical frequency of each candidate word in pre-stored historical data;
calculating the frequency anomaly value of the candidate word according to the current frequency and each historical frequency of the candidate word, storing the current frequency information of the candidate word in the historical data, and outputting a predetermined number of candidate words with abnormal frequency.
2. The method of claim 1, characterized in that, after the candidate word set is obtained, the method further comprises:
matching a preset stop-word list against the obtained candidate word set, and filtering out of the candidate word set the words that match the stop-word list.
3. The method of claim 2, characterized in that the stop-word list comprises meaningless words and/or words with a high document frequency.
4. The method of claim 1, characterized in that obtaining each historical frequency of each candidate word in pre-stored historical data comprises:
if the historical data stores historical frequencies of the candidate word, reading each historical frequency of the candidate word;
if the historical data stores no historical frequency of the candidate word, calculating the mean of each historical frequency of all candidate words stored in the historical data, and using it as each historical frequency of the candidate word.
5. The method of claim 1, characterized in that calculating the frequency anomaly value of the candidate word according to the current frequency and each historical frequency of the candidate word comprises:
obtaining the mean of the historical frequencies of the candidate word from each historical frequency of the candidate word;
calculating the variance of the candidate word according to each historical frequency of the candidate word and the obtained mean of the historical frequencies of the candidate word;
obtaining the absolute value of the difference between the current frequency of the candidate word and the mean of its historical frequencies, and calculating the quotient of this absolute value and the variance to obtain the frequency anomaly value of the candidate word.
6. The method of any one of claims 1 to 5, characterized in that outputting a predetermined number of candidate words with abnormal frequency comprises:
aggregating the predetermined number of candidate words with abnormal frequency into word clusters, each describing an event or topic, and outputting the word clusters.
7. The method of claim 6, characterized in that aggregating the predetermined number of candidate words with abnormal frequency into word clusters, each describing an event or topic, comprises:
based on the predetermined number of candidate words with abnormal frequency, counting the number of times the phrase formed by each pair of candidate words appears in the same text stream;
counting the number of times each of the two candidate words appears in the same text stream, and obtaining the product of these two counts;
obtaining the quotient of the number of times the phrase appears in the same text stream and the product, as the pointwise mutual information distance between the two candidate words;
if the obtained pointwise mutual information distance value is greater than a preset pointwise mutual information distance threshold, merging the two candidate words corresponding to the pointwise mutual information distance value into one word cluster.
8. The method of any one of claims 1 to 5, characterized in that the method further comprises:
based on the selected predetermined number of candidate words with abnormal frequency, or on the word clusters formed by aggregating the candidate words, triggering a search of preset external data sources, and showing the user the search results together with the predetermined number of candidate words with abnormal frequency or the word clusters.
9. A device for mining hot-spot words, characterized in that the device comprises a word segmentation module, a historical data storage module, and a frequency anomaly value processing module, wherein:
the word segmentation module is configured to obtain an input text stream and perform word segmentation on the text stream to obtain a candidate word set;
the historical data storage module is configured to store each historical frequency of the candidate words;
the frequency anomaly value processing module is configured to count the current frequency with which each candidate word in the candidate word set appears in the text stream, calculate the frequency anomaly value of the candidate word according to the current frequency of the candidate word and each historical frequency of the candidate word stored in the historical data storage module, output the current frequency information of the candidate word to the historical data storage module, and output a predetermined number of candidate words with abnormal frequency.
10. The device of claim 9, characterized in that the device further comprises:
a denoising module, configured to match a preset stop-word list against the candidate word set obtained by the word segmentation module, and to denoise the candidate word set by filtering out the words that match the stop-word list.
11. The device of claim 9 or 10, characterized in that the device further comprises:
a candidate word aggregation module, configured to receive the predetermined number of candidate words with abnormal frequency output by the frequency anomaly value processing module, and to aggregate the obtained candidate words into word clusters, each describing an event or topic.
12. The device of claim 11, characterized in that the device further comprises:
a search module, configured to use the obtained word clusters or candidate words as search keywords, trigger a search of the preset data sources, and show the user either the word clusters and the search results or the candidate words and the search results.
13. The device of claim 12, characterized in that the frequency anomaly value processing module comprises a current frequency counting unit, a historical frequency mean calculation unit, a variance calculation unit, an anomaly value calculation unit, and a candidate word output judging unit, wherein:
the current frequency counting unit is configured to count the current frequency with which each candidate word in the candidate word set appears in the input text stream, and to output the current frequency to both the historical data storage module and the anomaly value calculation unit;
the historical frequency mean calculation unit is configured to read the historical frequencies of each candidate word stored in the historical data storage module, calculate the mean of the historical frequencies of each candidate word, and output the mean to the anomaly value calculation unit;
the variance calculation unit is configured to calculate the variance of each candidate word according to the historical frequencies of the candidate word stored in the historical data storage module and the mean of the historical frequencies of the candidate word calculated by the historical frequency mean calculation unit, and to output the variance to the anomaly value calculation unit;
the anomaly value calculation unit is configured to calculate the anomaly value of each candidate word according to the current frequency of the candidate word and the mean and variance of its historical frequencies;
the candidate word output judging unit is configured to output for display either the candidate words whose anomaly value is greater than a preset anomaly value threshold or the predetermined number of candidate words with the largest anomaly values.
CN2011103078466A 2011-10-12 2011-10-12 Method and device for mining hot-spot words Pending CN103049443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103078466A CN103049443A (en) 2011-10-12 2011-10-12 Method and device for mining hot-spot words

Publications (1)

Publication Number Publication Date
CN103049443A true CN103049443A (en) 2013-04-17

Family

ID=48062087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103078466A Pending CN103049443A (en) 2011-10-12 2011-10-12 Method and device for mining hot-spot words

Country Status (1)

Country Link
CN (1) CN103049443A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455758A (en) * 2013-08-22 2013-12-18 北京奇虎科技有限公司 Method and device for identifying malicious website
CN105205048A (en) * 2015-10-21 2015-12-30 上海迪爱斯通信设备有限公司 Hot word analysis and statistic system and method
CN105740232A (en) * 2016-01-28 2016-07-06 百度在线网络技术(北京)有限公司 Method and device for automatically extracting feedback hotspots
CN106484671A (en) * 2015-08-25 2017-03-08 北京中搜网络技术股份有限公司 A kind of recognition methodss of ageing inquiry content
CN106528524A (en) * 2016-09-22 2017-03-22 中山大学 Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
CN106708796A (en) * 2015-07-15 2017-05-24 中国科学院计算技术研究所 Text-based key personal name extraction method and system
CN107330022A (en) * 2017-06-21 2017-11-07 腾讯科技(深圳)有限公司 A kind of method and device for obtaining much-talked-about topic
CN107515889A (en) * 2017-07-03 2017-12-26 国家计算机网络与信息安全管理中心 A kind of microblog topic method of real-time and device
CN107562843A (en) * 2017-08-25 2018-01-09 贵州耕云科技有限公司 A kind of hot news Phrase extraction method based on title high frequency cutting
CN107748802A (en) * 2017-11-17 2018-03-02 北京百度网讯科技有限公司 Polymerizable clc method and device
CN107885875A (en) * 2017-11-28 2018-04-06 北京百度网讯科技有限公司 Synonymous transform method, device and the server of term
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate
CN110909130A (en) * 2019-11-19 2020-03-24 招商局金融科技有限公司 Text theme extraction and analysis method and device and computer readable storage medium
CN111125484A (en) * 2019-12-17 2020-05-08 网易(杭州)网络有限公司 Topic discovery method and system and electronic device
CN111737555A (en) * 2020-06-18 2020-10-02 苏州朗动网络科技有限公司 Method and device for selecting hot keywords and storage medium
CN113010641A (en) * 2021-03-10 2021-06-22 北京三快在线科技有限公司 Data processing method and device
CN113537691A (en) * 2021-05-09 2021-10-22 武汉兴得科技有限公司 Big data public health event emergency command method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6397211B1 (en) * 2000-01-03 2002-05-28 International Business Machines Corporation System and method for identifying useless documents
CN101296128A (en) * 2007-04-24 2008-10-29 北京大学 Method for monitoring abnormal state of internet information
CN101727487A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Network criticism oriented viewpoint subject identifying method and system
CN102043845A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and equipment for extracting core keywords based on query sequence cluster

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455758A (en) * 2013-08-22 2013-12-18 北京奇虎科技有限公司 Method and device for identifying malicious website
CN106708796A (en) * 2015-07-15 2017-05-24 中国科学院计算技术研究所 Text-based key personal name extraction method and system
CN106484671A (en) * 2015-08-25 2017-03-08 北京中搜网络技术股份有限公司 A kind of recognition methodss of ageing inquiry content
CN106484671B (en) * 2015-08-25 2019-05-28 北京中搜云商网络技术有限公司 A kind of recognition methods of timeliness inquiry content
CN105205048A (en) * 2015-10-21 2015-12-30 上海迪爱斯通信设备有限公司 Hot word analysis and statistic system and method
CN105205048B (en) * 2015-10-21 2018-05-04 迪爱斯信息技术股份有限公司 A kind of hot word analytic statistics system and method
CN105740232A (en) * 2016-01-28 2016-07-06 百度在线网络技术(北京)有限公司 Method and device for automatically extracting feedback hotspots
CN106528524A (en) * 2016-09-22 2017-03-22 中山大学 Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
CN107330022A (en) * 2017-06-21 2017-11-07 腾讯科技(深圳)有限公司 A kind of method and device for obtaining much-talked-about topic
CN107330022B (en) * 2017-06-21 2023-03-24 腾讯科技(深圳)有限公司 Method and device for acquiring hot topics
CN107515889A (en) * 2017-07-03 2017-12-26 国家计算机网络与信息安全管理中心 A kind of microblog topic method of real-time and device
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate
CN108304371B (en) * 2017-07-14 2021-07-13 腾讯科技(深圳)有限公司 Method and device for mining hot content, computer equipment and storage medium
CN107562843A (en) * 2017-08-25 2018-01-09 贵州耕云科技有限公司 A kind of hot news Phrase extraction method based on title high frequency cutting
CN107748802A (en) * 2017-11-17 2018-03-02 北京百度网讯科技有限公司 Polymerizable clc method and device
CN107885875A (en) * 2017-11-28 2018-04-06 北京百度网讯科技有限公司 Synonymous transform method, device and the server of term
CN110909130A (en) * 2019-11-19 2020-03-24 招商局金融科技有限公司 Text theme extraction and analysis method and device and computer readable storage medium
CN111125484A (en) * 2019-12-17 2020-05-08 网易(杭州)网络有限公司 Topic discovery method and system and electronic device
CN111125484B (en) * 2019-12-17 2023-06-30 网易(杭州)网络有限公司 Topic discovery method, topic discovery system and electronic equipment
CN111737555A (en) * 2020-06-18 2020-10-02 苏州朗动网络科技有限公司 Method and device for selecting hot keywords and storage medium
CN113010641A (en) * 2021-03-10 2021-06-22 北京三快在线科技有限公司 Data processing method and device
CN113537691A (en) * 2021-05-09 2021-10-22 武汉兴得科技有限公司 Big data public health event emergency command method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130417