CN105138577A

CN105138577A - Big data based event evolution analysis method

Info

Publication number: CN105138577A
Application number: CN201510460661.7A
Authority: CN
Inventors: 张鹏
Original assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date: 2015-07-30
Filing date: 2015-07-30
Publication date: 2015-12-09
Anticipated expiration: 2035-07-30
Also published as: CN105138577B

Abstract

The present invention provides a big data based event evolution analysis method. The method comprises: step S100, data acquisition: performing data acquisition on network data based on a distributed cloud computing mode; step S200, data pre-processing; step S300, event extraction; step S400, hotspot event extraction; and step S500, hotspot event evolution analysis. By adopting the distributed cloud computing mode, the method is capable of mining and analyzing the various network data acquired on a large scale.

Description

An event evolution analysis method based on big data

Technical field

The present invention relates to the field of data processing, and in particular to an event evolution analysis method based on big data.

Background art

With the development of Web 2.0 technology, the internet has changed dramatically. It has been transformed from a collection of static web pages and information into a platform for displaying the "collective intelligence" in which everyone participates. Through blogs, microblogs, BBS, SNS, news comments and the like, netizens can freely publish their own viewpoints and comment on any event. The network provides people with an unprecedentedly open and convenient platform for sharing and distributing information, and more and more people express their opinions, ideas, moods and attitudes through it. This information includes both information that has a positive effect on the development of an event and information that is negative. At the same time, the openness, directness and concealment of the network platform make online public opinion an increasingly important influence on people's ideology. Therefore, timely and effective monitoring and analysis of large amounts of public opinion information is of great practical significance for maintaining social stability and promoting national development.

In daily life, emergencies occur frequently, and users are increasingly accustomed to using social networks (such as blogs, forums, Twitter and Facebook) to express their viewpoints and emotions. A user's emotion toward an event does not remain unchanged; it evolves with time or with the development of the event, gradually growing stronger or weaker, or even changing from one emotion into another. Detecting online and in real time how users' emotions toward an emergency evolve is therefore of great significance. An enterprise can continuously track consumers' emotions after they buy a product and thus discover the product's shortcomings in time. Public and government workers can, by analyzing how users' emotions toward an event change, respond to emergencies in time and even predict how an event will develop, so as to quickly discover signs of an adverse trend, provide correct guidance, and minimize the influence of harmful information.

In addition, with the rapid development of applications such as the mobile internet and the Internet of Things, the global volume of data has grown explosively, which indicates that the era of big data has arrived. In the prior art, big data is processed on platforms based on Hadoop. Hadoop is an open-source distributed computing platform whose core includes HDFS (Hadoop Distributed File System). The advantages of HDFS (mainly high fault tolerance and high scalability) allow users to deploy Hadoop on inexpensive hardware and build a distributed cluster that forms a distributed system. HBase (Hadoop Database) is a distributed database system built on top of HDFS that provides highly reliable, high-performance, column-oriented, scalable storage with real-time reads and writes; it is mainly used to store unstructured and semi-structured data.

Summary of the invention

To solve the problems in the prior art, the present invention proposes an event evolution analysis method based on big data.

The event evolution analysis method based on big data proposed by the present invention comprises:

Step S100, data acquisition: performing data acquisition on network data based on a distributed cloud computing mode;

Step S200, data pre-processing;

Step S300, event extraction: extracting events from the pre-processed network data.

Wherein, step S200 comprises:

The network data acquired in step S100 is pre-processed: word segmentation and part-of-speech tagging are first performed on the acquired network data; stop words are then filtered out of the segmentation result according to a stop-word list; finally, the feature items used to represent each document are obtained.

Wherein, step S200 further comprises:

High-quality word selection: each feature item obtained in step S200 carries an implicit quality value that reflects the contribution of the feature item within the documents. The quality Q(t) of a feature item t is expressed as:

Q(t) = l_t^{2} \left( \sum_{i=1}^{N} f_i^{2} - \frac{1}{N} \left( \sum_{i=1}^{N} f_i \right)^{2} \right),

Wherein, N is the total number of documents, f_i is the number of times the feature item t occurs in document i, and l_t is the length of the feature item t.

A threshold Q is set; feature items with Q(t) > Q are retained, and all others are deleted.

Wherein, step S300 comprises:

Document clustering is performed on the documents obtained by the pre-processing of step S200: the reports newly arriving each day are first clustered locally to obtain the local events of that day, which are referred to as the candidate event set;

Merge clustering: the candidate event set produced by local clustering is merged with the previous set of old events to produce the latest event set.

Wherein, the local clustering comprises:

(1) representing all pre-processed reports as text using the standard graph representation model;

(2) sorting the reports in chronological order;

(3) taking the first report as the first event;

(4) for each remaining report in turn, computing its similarity to the existing events using the similarity function based on the maximum common subgraph, and obtaining the most similar event and the corresponding function value;

(5) if the function value is greater than a threshold, inserting the report into the event corresponding to that function value and updating the center of that event;

(6) if the function value is less than the threshold, treating the report as a new event, the report itself being the center of that event;

(7) repeating (4) to (6) until all reports have been processed;

(8) retaining the result for the subsequent re-clustering.

Wherein, the merge clustering comprises:

Input: the set OldTopicSet of old events and the set NewReportSet of new reports,

Output: the event set TopicSet after clustering

(1) first, performing local clustering on the reports in NewReportSet and placing the clustering result in NewTopicSet;

(2) sorting the event set NewTopicSet by event start time;

(3) for each event in NewTopicSet in turn, computing its similarity to all events in OldTopicSet using the similarity function based on the maximum common subgraph, and obtaining the most similar event and the corresponding function value;

(4) if the function value is less than a threshold, treating the event in NewTopicSet as a new event;

(5) if the function value is greater than the threshold, removing the event from NewTopicSet and adding it to OldTopicSet;

(6) repeating (3) to (5) until all events in NewTopicSet have been processed;

(7) retaining the clustering result so that it can be used by the clustering of the next cycle.

The event evolution analysis method based on big data further comprises:

Step S400, hotspot event extraction: further extracting hotspot events from the events extracted in step S300;

Step S500, hotspot event evolution analysis: performing evolution analysis on the hotspot events extracted in step S400.

Wherein, step S400 comprises:

Determining hotspot events: the heat of each event obtained in step S300 is calculated by the following formula,

R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,

Wherein, R_i is the heat of event i; RF_i is the report frequency of event i; RT_i is the ratio of the number of days on which event i was reported to the total number of days within a predetermined period of N days; CN_i is the number of times netizens clicked to read reports of event i within the predetermined number of days; DN_i is the number of comments netizens made on event i within the predetermined number of days; and \alpha_1, \alpha_2, \alpha_3, \alpha_4 are weight coefficients. When R_i is greater than a given threshold R, event i is determined to be a hotspot event.
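The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the weights, the threshold and the two sample events are hypothetical, and the click and comment counts are assumed to have been scaled to the range [0, 1] beforehand, which the text does not specify.

```python
def heat(rf, rt, cn, dn, weights):
    """R_i = a1*RF_i + a2*RT_i + a3*CN_i + a4*DN_i."""
    a1, a2, a3, a4 = weights
    return a1 * rf + a2 * rt + a3 * cn + a4 * dn


# Hypothetical weight coefficients and threshold R (not taken from the source).
WEIGHTS = (0.4, 0.3, 0.2, 0.1)
THRESHOLD_R = 0.5

# Hypothetical events: (RF, RT, CN, DN), with CN and DN pre-scaled to [0, 1].
events = {
    "A": (0.30, 0.8, 0.9, 0.7),
    "B": (0.05, 0.2, 0.1, 0.1),
}

# An event is a hotspot event when its heat exceeds the threshold.
hot_events = [name for name, features in events.items()
              if heat(*features, WEIGHTS) > THRESHOLD_R]
```

With these sample values only event "A" exceeds the threshold; in practice the weights and the threshold would have to be tuned for the data at hand.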

The present invention adopts a distributed cloud computing mode and can mine and analyze the various network data acquired on a large scale. By separately computing and analyzing the data from different data sources, it obtains the hot topics of each data source and then further determines the heat of each topic, so that the current hot topics are obtained more objectively. The present invention provides automated, systematic and scientific information support for organizations such as Party and government offices and large enterprises to discover sensitive network information in time, grasp the hotspots of online public opinion, follow its trends, and respond to online public opinion crises. It effectively improves the accuracy of the judgments made by the online public opinion monitoring system and provides a more truthful and accurate basis for the subsequent processing of WeChat public opinion information on the network.

Brief description of the drawings

Fig. 1 is a flowchart of the event evolution analysis method based on big data according to the present invention;

Fig. 2 is an example of the graph-based text representation.

Detailed description of the embodiments

The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.

Referring to Fig. 1, the event evolution analysis method based on big data proposed by the present invention is as follows.

Step S100, data acquisition

Data acquisition is performed on network data based on a distributed cloud computing mode. The network data comprises data of several categories: blogs, microblogs, forums and news report webpages. The network data is labeled with these categories and stored separately by category. Here, news report webpages are webpages carrying news provided by portal websites such as Tencent News and Sina News and by news media websites such as People's Daily.

The data acquisition is implemented by web crawlers. The acquired network data is stored in a distributed storage device, which is implemented on the basis of HDFS.

Step S200, data pre-processing: the network data acquired in step S100 is pre-processed. Word segmentation and part-of-speech tagging are first performed on the acquired network data; stop words are then filtered out of the segmentation result according to a stop-word list; finally, the feature items used to represent each document are obtained.

The vocabulary remaining after this pre-processing is still very large, so a second step, high-quality word selection, is needed. Each feature item in a document carries an implicit quality value, which is based mainly on the word-frequency characteristics of the feature item and reflects its contribution within the text. The larger the quality, the larger the contribution, and the feature item is kept for text clustering; otherwise it is rejected.

The quality Q(t) of a feature item t is expressed as:

Q(t) = l_t^{2} \left( \sum_{i=1}^{N} f_i^{2} - \frac{1}{N} \left( \sum_{i=1}^{N} f_i \right)^{2} \right),

where N is the total number of documents, f_i is the number of times the feature item t occurs in document i, and l_t is the length of the feature item t.
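The quality formula and the threshold filter can be sketched as follows. This is a minimal illustration under stated assumptions: the length l_t is taken to be the number of characters of the feature item, and the per-document counts and the threshold value are hypothetical.

```python
def feature_quality(term_length, counts):
    """Q(t) = l_t^2 * (sum_i f_i^2 - (1/N) * (sum_i f_i)^2)."""
    n = len(counts)                      # N: number of documents
    sum_of_squares = sum(f * f for f in counts)
    square_of_sum = sum(counts) ** 2
    return term_length ** 2 * (sum_of_squares - square_of_sum / n)


def select_features(term_counts, threshold):
    """Keep the feature items whose quality exceeds the threshold."""
    return [term for term, counts in term_counts.items()
            if feature_quality(len(term), counts) > threshold]


# Hypothetical occurrence counts of three feature items over N = 4 documents.
term_counts = {
    "flood": [3, 0, 0, 1],
    "rain": [1, 1, 1, 1],
    "city": [2, 0, 2, 0],
}
kept = select_features(term_counts, threshold=50.0)
```

The bracketed term is N times the variance of the counts, so a feature item that occurs equally often in every document ("rain" here) scores 0 and is filtered out, while items concentrated in a few documents score higher.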

To carry out text processing such as comparison and clustering, a model is needed to represent the text. The most commonly used text representation model is the vector space model, which represents a text as a multidimensional space composed of feature items, each feature item being one dimension of the space. Such a text model can be expressed as follows:

D = \{t_1, t_2, \ldots, t_n\}, where n is the number of feature items.

Although the vector space model contains a good deal of information, it does not include the structural information of the document. Compared with the vector space model, a graph-based text representation model contains some structural information, which benefits text clustering. In the standard graph representation model, each sentence of a document is represented as a subgraph, and these subgraphs together represent the document. The standard graph representation is constructed as follows:

Each word (excluding stop words) occurring in a sentence of the document corresponds to one vertex of the corresponding subgraph, and the vertex is labeled with that word. Two immediately adjacent words in the sentence correspond to one edge, which is labeled "TI" or "TX0" according to whether the two words of its end vertices both appear in the title or both appear in the body. A word repeated in the document corresponds to only one vertex.

For example, referring to Fig. 2, a document D consists of the title "abcd" and the body "aefg", where the letters a, b, c, d, e, f and g denote 7 different words in document D. The corresponding graph therefore has 7 vertices, labeled a, b, c, d, e, f and g respectively, and six directed edges.
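The Fig. 2 example can be reproduced with a small sketch. It assumes, for simplicity, that the title and the body are each a single sentence and that the words are already segmented with stop words removed; the function name `build_graph` and the tuple layout of an edge are illustrative choices, not part of the described method.

```python
def build_graph(title_words, body_words):
    """Return (vertices, edges) for one document.

    Each distinct word becomes one vertex. Each pair of adjacent words
    becomes a directed edge labelled "TI" (title) or "TX0" (body).
    """
    vertices = set(title_words) | set(body_words)
    edges = set()
    for label, words in (("TI", title_words), ("TX0", body_words)):
        for first, second in zip(words, words[1:]):
            edges.add((first, second, label))
    return vertices, edges


# Document D from the example: title "abcd", body "aefg".
vertices, edges = build_graph(list("abcd"), list("aefg"))
```

For this document the sketch yields 7 vertices and 6 labelled directed edges, matching the counts stated above; the repeated word "a" appears as a single vertex shared by the title and body paths.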

The present invention uses the standard graph representation model to represent the network data that has undergone data pre-processing. This graph-based document representation records not only the words occurring in the document and the number of times each occurs (word frequency), but also the order in which the words occur.

With a representation of documents available, comparing two documents is converted into measuring the similarity of two graphs, which is also the basis of document clustering. The basic idea of graph similarity measurement is to express the similarity of two graphs by the value of a function whose range is [0, 1]. The size of the function value reflects how similar the two graphs are: the larger the value, the more similar they are. When the two graphs are identical the function value is 1, and when they have nothing in common it is 0. The main graph similarity functions are: the similarity function based on the maximum common subgraph; the similarity function based on graph union; the unnormalized similarity function based on graph union; the similarity function based on the maximum common subgraph and the minimum common supergraph; and the unnormalized similarity function based on the maximum common subgraph and the minimum common supergraph.

A typical one is the similarity function based on the maximum common subgraph (The Graph Similarity Measure Based on the Maximum Common Subgraph, MCS):

Sim_{MCS}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{\max(|G_1|, |G_2|)},

Wherein, G_1 and G_2 are the two graphs to be compared; mcs(G_1, G_2) is the maximum common subgraph of G_1 and G_2, that is, the graph formed by the vertices and edges that G_1 and G_2 have in common; |...| denotes the size of a graph, namely the sum of the number of its vertices and the number of its edges; and max(...) is the usual maximum operation.
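The MCS-based measure can be sketched as follows, with a graph given as a pair (set of vertices, set of edges). One observation should be treated with care: as printed, the expression evaluates to 0 for two identical graphs, whereas the preceding text expects a similarity of 1 in that case. The sketch therefore exposes both the overlap ratio |mcs| / max(|G_1|, |G_2|) and the printed expression, and does not decide which one is intended. The function names and the sample graphs are hypothetical.

```python
def graph_size(graph):
    """|G|: number of vertices plus number of edges."""
    vertices, edges = graph
    return len(vertices) + len(edges)


def max_common_subgraph(g1, g2):
    """Vertices and edges that both graphs share (labels must match)."""
    return g1[0] & g2[0], g1[1] & g2[1]


def mcs_ratio(g1, g2):
    """|mcs(G1, G2)| / max(|G1|, |G2|); 1 for identical graphs."""
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 1.0  # assumption: two empty graphs count as identical
    return graph_size(max_common_subgraph(g1, g2)) / largest


def mcs_as_printed(g1, g2):
    """The expression exactly as printed: 1 - |mcs| / max(|G1|, |G2|)."""
    return 1.0 - mcs_ratio(g1, g2)


g1 = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
g2 = ({"a", "b", "d"}, {("a", "b"), ("b", "d")})
ratio = mcs_ratio(g1, g2)
```

In this example both graphs have size 5 and share two vertices and one edge, so the overlap ratio is 3/5.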

The similarity function based on graph union (The Graph Similarity Measure Based on the Idea of Graph Union, WGU):

Sim_{WGU}(G_1, G_2) = 1 - \frac{|wgu(G_1, G_2)|}{|G_1| + |G_2| - |wgu(G_1, G_2)|},

"Based on graph union" means that the denominator of the fraction represents the size of the union of the two graphs in the set-theoretic sense: |G_1| + |G_2| is the sum of the sizes of the two graphs, and subtracting |wgu(G_1, G_2)| from it gives the size of their union.

Traditional event extraction algorithms simply compare each day's newly arriving reports with all previously discovered events. If the similarity between a new report and some event is greater than a threshold, the report is assigned to that event; otherwise the report becomes a new event. This is the basic pattern of event detection, but it makes no use of temporal information. Everyday news reading shows that news reports follow a regularity: reports of the same event tend to be published within a concentrated period of time (in particular, on a single day). This is a common phenomenon in news streams, referred to as the edge effect. In other words, in a news stream, reports published close together in time are more likely to discuss the same event than reports published far apart. How to use this regularity to improve the accuracy of event detection is a problem that needs to be considered. On this basis, the present invention proposes an event detection algorithm that takes temporal characteristics into account when clustering.

The basic idea of the algorithm is as follows. If the reports arriving each day (or in another defined unit of time, such as each minute, hour or month) are first clustered locally, related reports are more likely to be grouped together. On the basis of this local clustering, a second clustering is performed: the new events produced by local clustering and the old events produced by earlier clustering are clustered once more, the purpose being to merge events that are close to each other. The result of this second clustering is the final result.

The first step of the algorithm is to cluster locally the reports newly arriving each day (or in another defined unit of time, such as each minute, hour or month), thereby obtaining the local events of that day, referred to as the candidate event set. The algorithm is described as follows:

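Input: the set NewReportSet of new reports

Output: the event set NewTopicSet after clustering

(1) first, represent all pre-processed reports (each report is one document) as text using the standard graph representation model;

(2) sort the reports in chronological order;

(3) take the first report as the first event;

(4) for each remaining report in turn, compute its similarity to the existing events using the similarity function based on the maximum common subgraph, and obtain the most similar event and the corresponding function value;

(5) if the function value is greater than a threshold, insert the report into the event corresponding to that function value and update the center of that event;

(6) if the function value is less than the threshold, treat the report as a new event, the report itself being the center of that event;

(7) repeat (4) to (6) until all reports have been processed;

(8) retain the result for the subsequent re-clustering.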
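The local clustering steps can be sketched as a single-pass loop. Several points here are assumptions made only for illustration: the similarity function and the rule for updating an event center are passed in as parameters, the demo substitutes a simple Jaccard overlap of word sets for the graph-based similarity, the center is updated by taking a union, and a report whose best score exactly equals the threshold starts a new event.

```python
def local_clustering(reports, similarity, update_center, threshold):
    """Single-pass clustering of (timestamp, representation) reports."""
    events = []  # each event: {"center": ..., "reports": [...]}
    for report in sorted(reports, key=lambda r: r[0]):      # step (2)
        representation = report[1]
        best, best_score = None, float("-inf")
        for event in events:                                # step (4)
            score = similarity(representation, event["center"])
            if score > best_score:
                best, best_score = event, score
        if best is not None and best_score > threshold:     # step (5)
            best["reports"].append(report)
            best["center"] = update_center(best["center"], representation)
        else:                                               # steps (3), (6)
            events.append({"center": representation, "reports": [report]})
    return events                                           # step (8)


def jaccard(a, b):
    """Stand-in similarity on word sets (not the graph-based measure)."""
    return len(a & b) / len(a | b)


# Hypothetical reports: (publication time, set of feature items).
reports = [
    (3, {"flood", "rescue", "river"}),
    (1, {"flood", "river", "rain"}),
    (2, {"election", "vote"}),
]
events = local_clustering(reports, jaccard, lambda c, r: c | r, threshold=0.3)
```

With this toy input the two flood reports end up in one event and the election report forms its own event.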

After local clustering is finished, clustering must be performed again. This second clustering is called merge clustering; its purpose is to merge the candidate event set produced by local clustering with the previous set of old events to produce the latest event set. The whole algorithm is therefore called the event detection algorithm based on re-clustering.

The event detection algorithm based on re-clustering is described as follows:

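Input: the set OldTopicSet of old events and the set NewReportSet of new reports

Output: the event set TopicSet after clustering

(1) first, perform local clustering on the reports in NewReportSet and place the clustering result in NewTopicSet;

(2) sort the event set NewTopicSet by event start time;

(3) for each event in NewTopicSet in turn, compute its similarity to all events in OldTopicSet using the similarity function based on the maximum common subgraph, and obtain the most similar event and the corresponding function value;

(4) if the function value is less than a threshold, treat the event in NewTopicSet as a new event;

(5) if the function value is greater than the threshold, remove the event from NewTopicSet and add it to OldTopicSet;

(6) repeat (3) to (5) until all events in NewTopicSet have been processed;

(7) retain the clustering result so that it can be used by the clustering of the next cycle.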
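The merge clustering can be sketched in the same spirit, taking an already computed NewTopicSet as input. The following are assumptions for illustration only: an event is a dictionary with hypothetical keys "start", "center" and "reports"; adding a new event to OldTopicSet is interpreted as folding it into its most similar old event; new events are compared only with old events; and a Jaccard overlap of word sets again stands in for the graph-based similarity.

```python
def merge_clustering(old_events, new_events, similarity, update_center,
                     threshold):
    """Merge locally clustered new events into the set of old events."""
    merged = [dict(event) for event in old_events]   # copy of OldTopicSet
    fresh = []                                       # events that stay new
    for new in sorted(new_events, key=lambda e: e["start"]):   # step (2)
        best, best_score = None, float("-inf")
        for old in merged:                                     # step (3)
            score = similarity(new["center"], old["center"])
            if score > best_score:
                best, best_score = old, score
        if best is not None and best_score > threshold:        # step (5)
            best["center"] = update_center(best["center"], new["center"])
            best["reports"] = best["reports"] + new["reports"]
        else:                                                  # step (4)
            fresh.append(dict(new))
    return merged + fresh                                      # TopicSet


def jaccard(a, b):
    """Stand-in similarity on word sets (not the graph-based measure)."""
    return len(a & b) / len(a | b)


old_topic_set = [
    {"start": 1, "center": {"flood", "river", "rain"}, "reports": ["r1", "r2"]},
]
new_topic_set = [
    {"start": 5, "center": {"flood", "river", "dam"}, "reports": ["r3"]},
    {"start": 4, "center": {"election", "vote"}, "reports": ["r4"]},
]
topic_set = merge_clustering(old_topic_set, new_topic_set, jaccard,
                             lambda c, r: c | r, threshold=0.3)
```

Here the new flood event is absorbed by the old flood event, while the election event is kept as a new event; the input lists are left unmodified so that the result can be stored for the next cycle.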

The local clustering and re-clustering algorithms above both use the Single-pass algorithm. However, these algorithms are only exemplary, and those skilled in the art may use any suitable clustering algorithm to implement the clustering process.

Step S400, event sentiment analysis: sentiment analysis is performed on the events extracted in step S300.

Sentiment analysis, also known as opinion mining, is the process of processing and summarizing text that carries sentiment. Because of its great application value it has been widely studied, and it is now widely used in fields such as evaluating user satisfaction with products, predicting election results and predicting financial trends. A large amount of work exists on the sentiment orientation of articles, but most existing methods study text orientation from a static perspective: they focus on the sentiment of a single text and treat text sentiment analysis as a three-class classification process (for example, positive/neutral/negative), without dynamically linking articles together to study how emotion evolves. In addition, these methods analyze only the content of the text and do not discover, for an emergency, how group emotion on social networks changes over time.

Users' emotional attitudes toward emergencies are diverse and dynamic, and the traditional three-class model cannot capture this well. Moreover, with the rapid development of microblogging, text data streams are generated very quickly. Discovering quickly and accurately how users' emotions toward an emergency change, and monitoring the public's emotional state on the microblog stream in real time, is of great significance for guiding public opinion.

The present invention provides an emotion evolution analysis method, which mainly comprises: determining the emotion vector of each document based on an emotion model comprising multiple emotion types; and analyzing the evolution of document emotion based on the emotion vectors of the documents, that is, detecting whether public emotion toward a particular event has changed, and at what moment and for what reason the change occurred. The method may further comprise extracting multiple emotion words and emoticons capable of expressing user emotion, calculating the similarity between emotion words with an algorithm that combines HowNet semantic similarity with retrieval similarity, building an emotion word similarity matrix, and then using a clustering algorithm to group the extracted emotion words into multiple types, thereby building the emotion model comprising multiple emotion types.

Users' emotional attitudes toward emergencies are diverse and dynamic, and the traditional three-class sentiment classification model (positive/neutral/negative) cannot capture this well. For this reason, the present invention extracts emotion words capable of expressing user emotion together with the emoticons commonly used by users on the network, and clusters these emotion words to obtain an emotion model comprising multiple emotion types. This is done because many emotion words are semantically very close: for example, "glad" and "happy" both express a happy mood, while "indignant" and "angry" both express the user's grief and indignation. Such words are highly similar and can in effect be regarded as the same emotion word.

Emotion words capable of expressing user emotion can be extracted in a number of ways. For example, words that express emotion can be extracted from a dictionary. As another example, such words can be extracted from the "emotion detection table" developed in clinical psychology for detecting user emotion, which currently contains 212 adjectives. A clustering algorithm, such as the AGNES (Agglomerative Nesting) clustering algorithm, can then be applied to the extracted emotion words to aggregate them into multiple emotion types. The AGNES algorithm initially treats each object as its own cluster and then merges clusters step by step according to some criterion. For example, if the distance between an object in cluster A and an object in cluster B is the smallest among the distances between all objects belonging to different clusters, A and B may be merged. This is a single-linkage method: each cluster is represented by all the objects in it, and the similarity between two clusters is determined by the similarity of the closest pair of data points from the two clusters. In the embodiments of the present invention, each emotion word is initially regarded as a cluster, and clustering then proceeds according to the similarity between emotion words.
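The single-linkage merging described above can be sketched as follows. The stopping rule (a target number of clusters), the word list and the similarity table are hypothetical choices made for the example; the text does not state when merging stops.

```python
def agnes_single_link(words, similarity, num_clusters):
    """Agglomerative clustering with single linkage.

    Starts with one cluster per word and repeatedly merges the two
    clusters whose closest pair of members is most similar.
    """
    clusters = [[word] for word in words]
    while len(clusters) > max(num_clusters, 1):
        best_pair, best_link = None, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                link = max(similarity(a, b)
                           for a in clusters[i] for b in clusters[j])
                if link > best_link:
                    best_pair, best_link = (i, j), link
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
table = {
    frozenset(("happy", "glad")): 0.9,
    frozenset(("angry", "furious")): 0.8,
}


def toy_similarity(a, b):
    return table.get(frozenset((a, b)), 0.1)


clusters = agnes_single_link(["happy", "glad", "angry", "furious"],
                             toy_similarity, num_clusters=2)
```

With this toy table the four words collapse into two emotion types, one containing "happy" and "glad" and the other "angry" and "furious".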

The similarity between emotion words can be the HowNet semantic similarity between them. HowNet semantic similarity is mainly used to measure the degree to which words in a text can be substituted for one another. The HowNet semantic similarity between two emotion words w_1 and w_2 is calculated as follows:

Sim_H(w_1, w_2) = \frac{\alpha}{d + \alpha},

Wherein, d is the length of the path between the two emotion words w_1 and w_2 in the concept tree provided by HowNet. There is exactly one path between any two concepts in that concept tree, and the length of this path represents the semantic distance between the two concepts. \alpha is a positive adjustable parameter, generally taking a value between 0 and 1. As another example, the similarity between emotion words can also be calculated from retrieval similarity, because words that are close in emotion are more likely to occur together. Based on a large-scale corpus, the retrieval distance between two words can be expressed as:

Dis(w_1, w_2) = \frac{\max\{\log f(w_1), \log f(w_2)\} - \log f(w_1, w_2)}{\log N - \min\{\log f(w_1), \log f(w_2)\}},

Wherein, f(w_i) is the number of documents in the corpus that contain the emotion word w_i, and f(w_1, w_2) is the number of documents that contain both w_1 and w_2. The retrieval similarity between two emotion words w_1 and w_2 can therefore be expressed as:

Sim_R(w_1, w_2) = \frac{\alpha}{Dis(w_1, w_2) + \alpha}

As another example, the similarity of emotion words can also be calculated by a method that combines HowNet semantic similarity with retrieval similarity. For example, the similarity between two emotion words w_1 and w_2 can be expressed as:

Sim(w_1, w_2) = \beta \cdot Sim_H(w_1, w_2) + (1 - \beta) \cdot Sim_R(w_1, w_2), \quad 0 \le \beta \le 1.
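The three similarity formulas above can be combined in a short sketch. It assumes that the path length d has already been looked up in the HowNet concept tree (no HowNet library is called here), that N is the total number of documents in the corpus (N is not defined in this passage), and that the co-occurrence count f(w_1, w_2) is positive. All numeric values are hypothetical.

```python
import math


def hownet_similarity(path_length, alpha):
    """Sim_H = alpha / (d + alpha)."""
    return alpha / (path_length + alpha)


def retrieval_distance(f1, f2, f12, n):
    """Dis(w1, w2) computed from document counts."""
    log1, log2 = math.log(f1), math.log(f2)
    return (max(log1, log2) - math.log(f12)) / (math.log(n) - min(log1, log2))


def retrieval_similarity(f1, f2, f12, n, alpha):
    """Sim_R = alpha / (Dis + alpha)."""
    return alpha / (retrieval_distance(f1, f2, f12, n) + alpha)


def combined_similarity(sim_h, sim_r, beta):
    """Sim = beta * Sim_H + (1 - beta) * Sim_R, with 0 <= beta <= 1."""
    return beta * sim_h + (1 - beta) * sim_r


# Hypothetical inputs: path length 2, document counts 1000 and 100,
# 10 co-occurrences, a corpus of 100000 documents, alpha = beta = 0.5.
sim_h = hownet_similarity(2, alpha=0.5)
sim_r = retrieval_similarity(1000, 100, 10, 100000, alpha=0.5)
sim = combined_similarity(sim_h, sim_r, beta=0.5)
```

For these inputs the retrieval distance is 2/3, so Sim_H = 0.2 and Sim_R = 3/7; the parameter beta then balances the dictionary-based and the corpus-based evidence.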

The emotion words are clustered by the clustering algorithm described above to obtain multiple clusters, that is, the multiple emotion types after aggregation. An emotion model comprising multiple emotion types is thus obtained. Let E = <e_1, e_2, ..., e_i, ..., e_m> denote the emotion model, where e_i is one emotion type and m is the number of elements in the emotion model. For each document d, an emotion vector E_d is defined as follows: for the i-th element of the emotion model E, if document d has the emotion type e_i, which in practice means that d contains an emotion word belonging to that type, then the i-th element of E_d takes the value 1; otherwise it takes the value 0.

For each document d, the corresponding emotion pattern R_d can be extracted from its emotion vector E_d. R_d is the emotion pattern of the user who published the document, that is, the set of emotion types the user shows in the document: R_d = \cup e_i. For example, if the emotion vector of document d is <1, 0, 0, 1, 0, 0, ..., 0>, the corresponding emotion pattern is (e_1, e_4), meaning that the user had emotions e_1 and e_4 when publishing the document.
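The emotion vector E_d and the set R_d can be sketched as follows. The emotion model is represented here as an ordered list of (name, word set) pairs; the emotion names, the word lists and the sample document are hypothetical, and the document is assumed to be already segmented into words.

```python
def emotion_vector(document_words, emotion_model):
    """E_d: the i-th element is 1 if the document contains a word of type e_i."""
    words = set(document_words)
    return [1 if words & emotion_words else 0
            for _, emotion_words in emotion_model]


def emotions_present(vector, emotion_model):
    """R_d: the emotion types whose element in E_d is 1."""
    return [name for bit, (name, _) in zip(vector, emotion_model) if bit]


# Hypothetical emotion model with m = 4 emotion types.
emotion_model = [
    ("joy", {"happy", "glad"}),
    ("anger", {"angry", "furious"}),
    ("fear", {"afraid"}),
    ("sadness", {"sad"}),
]

document = ["so", "happy", "but", "also", "sad"]
e_d = emotion_vector(document, emotion_model)
r_d = emotions_present(e_d, emotion_model)
```

For the sample document the vector has ones in the first and fourth positions, which mirrors the <1, 0, 0, 1, ...> example in the text.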

The evolution of document emotion can be analyzed either from the perspective of the documents or from the perspective of the users.

In one embodiment of the present invention, whether the emotion toward an emergency has changed is detected from the perspective of the documents. Let D = {d_1, d_2, ..., d_i, ...} be the set of documents in a data stream, for example the set of documents related to a certain emergency. Each d_i is one document, which can be labeled with its publication time. For a given time period T, suppose T is divided into p sub-periods t_1, t_2, ..., t_i, ..., t_p. Then, according to the publication times of the documents, D can be divided into a series of disjoint subsets D(t_1), D(t_2), ..., D(t_i), ..., D(t_p) such that

D = \bigcup_{i=1}^{p} D(t_i),

where D(t_i) is the set of documents published within time period t_i. The time period T can be divided at various time granularities, for example in units of one day, one week or one month. For each subset D(t) of D, the emotion vector E(t) at time t can be defined as the sum of the emotion vectors of the documents published at t, namely

E(t) = \sum_{d \in D(t)} E_d

The problem of deciding whether the emotion toward an event has evolved can thus be stated as follows: in the data stream D, given times t_1 and t_2, study the relation between the emotion vectors E(t_1) and E(t_2). If there is a clear difference between the two vectors, or between particular elements of them, the emotion has evolved.

In addition, from the document perspective, changes in user emotion can also be discovered quickly by constructing an emotion evolution diagram for the emergency. First, the emotion vector of each document in the data stream to be analyzed is determined. The emotion vectors of the documents are then aggregated at time granularity t to obtain the emotion vector E(t), and the emotion types corresponding to the K largest elements of E(t), taken in descending order, are selected as the mainstream emotions from which the emotion evolution diagram is constructed. The time granularity t can be an hour, a day, a week and so on. For example, if aggregation is performed by day, the mainstream emotions of a given day are in effect selected according to the number of posts published that day that contain each emotion. The horizontal axis of the emotion evolution diagram represents time in units of the time granularity t, and the vertical axis shows the K mainstream emotions selected for each time period.

Step S500, hotspot event extraction: hotspot events are further extracted from the events extracted in step S300.
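The aggregation into E(t) and the selection of the K mainstream emotions can be sketched as follows. Documents are assumed to arrive as (period, emotion vector) pairs whose period label has already been derived from the publication time; the emotion names and the sample data are hypothetical, and ties between equal counts are broken by the order of the emotion model.

```python
from collections import defaultdict


def aggregate_by_period(documents, num_emotions):
    """E(t): element-wise sum of the emotion vectors published in period t."""
    totals = defaultdict(lambda: [0] * num_emotions)
    for period, vector in documents:
        totals[period] = [a + b for a, b in zip(totals[period], vector)]
    return dict(totals)


def mainstream_emotions(total_vector, emotion_names, k):
    """Names of the K emotion types with the largest elements of E(t)."""
    ranked = sorted(range(len(total_vector)),
                    key=lambda i: total_vector[i], reverse=True)
    return [emotion_names[i] for i in ranked[:k]]


emotion_names = ["joy", "anger", "fear"]
documents = [
    ("day1", [1, 0, 0]), ("day1", [1, 1, 0]), ("day1", [1, 1, 1]),
    ("day2", [0, 1, 1]), ("day2", [0, 1, 0]), ("day2", [1, 1, 1]),
]
e_t = aggregate_by_period(documents, num_emotions=3)
top_day1 = mainstream_emotions(e_t["day1"], emotion_names, k=2)
top_day2 = mainstream_emotions(e_t["day2"], emotion_names, k=2)
```

In this toy stream the mainstream emotions shift from joy and anger on the first day to anger and fear on the second, which is the kind of change the emotion evolution diagram is meant to make visible.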

A hotspot event is a collection of information that uses the internet as its communication medium, attracts wide public attention, can spread widely within a very short time and persist for some period, and reflects online public opinion; it also includes a semantic description of the internet hotspot event, the way the event propagates, and so on. An internet hotspot event is usually information that many netizens pay attention to, and related information appears in the network with high frequency. Hot words describe the general subject of a hotspot event fairly directly. A hotspot event must be described by several characteristic quantities, and there is a certain similarity among these quantities.

The characteristic quantities of a hotspot event are:

The report frequency of the event. For an important event, the media publish more related reports than usual, so report frequency affects the attention the hotspot event receives. It is the ratio of the number of reports about a given event within a period to the total number of reports in that period; the larger the ratio, the more attention the event receives.

The duration of the event. If the media report on a hotspot event for a long time and netizens likewise discuss it for a long time, the event receives more attention. Each event has its own attention time span: the start time of an event is defined as the moment the event begins, and the extinction time as the moment the number of reports on the event falls below a certain threshold. The time span of the event is therefore defined as the difference between its start time and its extinction time.

The read count of the event. Since most reports on a hotspot event come from web sites, the more netizens click to read reports related to the event, the more attention the event receives; the number of click-through reads of reports related to the hotspot event can therefore be used to record its attention.

The comment count of the event. The more comments netizens post on the internet about a hotspot event, the more attention it draws; the comment count is therefore also a factor in how much attention the event receives.

In the present application, every text is tagged with its report time (for example, the publication time of a news report, blog post, microblog post, or forum post), its number of clicks, and its number of comments. The report frequency and duration of an event obtained by clustering can be determined from the report times of its texts, the read count of the event from the click counts of its texts, and the comment count of the event from the comment counts of its texts.

RF_i: the report frequency of event i;

RT_i: the ratio, within a predetermined period of N days, of the number of effective reporting days for event i to the total number of days; a day counts as an effective reporting day for event i when the number of reports about event i on that day exceeds a certain threshold;

CN_i: the number of click-through reads of event i by netizens within the predetermined number of days;

DN_i: the number of comments on event i by netizens within the predetermined number of days;
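The four quantities defined above can be derived from tagged texts roughly as in the following Python sketch. It assumes every report already lies inside the N-day window and carries an event identifier assigned by clustering; the `Report` record, the function name `event_features`, and the sample values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Report:
    event_id: str   # event assigned by clustering (assumed given)
    day: date       # report or publication day
    clicks: int     # click-through reads of this text
    comments: int   # comments on this text


def event_features(reports, event_id, n_days, daily_threshold):
    """Compute RF, RT, CN and DN for one event over an N-day window."""
    mine = [r for r in reports if r.event_id == event_id]

    # RF: share of all reports in the window that concern this event.
    rf = len(mine) / len(reports) if reports else 0.0

    # RT: share of days whose report count exceeds the threshold.
    per_day = Counter(r.day for r in mine)
    effective_days = sum(1 for n in per_day.values() if n > daily_threshold)
    rt = effective_days / n_days

    # CN, DN: total clicks and comments over the window.
    cn = sum(r.clicks for r in mine)
    dn = sum(r.comments for r in mine)
    return {"RF": rf, "RT": rt, "CN": cn, "DN": dn}


# Hypothetical sample: three reports on event "a", one on event "b".
sample_reports = [
    Report("a", date(2015, 7, 1), 10, 2),
    Report("a", date(2015, 7, 1), 5, 1),
    Report("a", date(2015, 7, 2), 3, 0),
    Report("b", date(2015, 7, 1), 1, 1),
]
features_a = event_features(sample_reports, "a", n_days=4, daily_threshold=1)
```

Note that RF and RT are ratios while CN and DN are raw counts; the text does not say whether the counts are normalized, so any rescaling is left to the weight coefficients here.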

The event heat is computed as:

R_i = α₁·RF_i + α₂·RT_i + α₃·CN_i + α₄·DN_i,

where R_i denotes the heat of event i and α₁, α₂, α₃, α₄ are weight coefficients. When R_i exceeds a given threshold R, event i is determined to be a hotspot event.
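The weighted sum and threshold test for R_i can be sketched in Python as follows. The weight values, the threshold, and the function names `event_heat` and `select_hotspots` are illustrative assumptions; the text does not specify how the weight coefficients are chosen.

```python
def event_heat(features, weights):
    """Weighted sum R_i = a1*RF_i + a2*RT_i + a3*CN_i + a4*DN_i."""
    a1, a2, a3, a4 = weights
    return (a1 * features["RF"] + a2 * features["RT"]
            + a3 * features["CN"] + a4 * features["DN"])


def select_hotspots(feature_table, weights, threshold):
    """Return the ids of events whose score exceeds the threshold R."""
    return sorted(
        event_id
        for event_id, features in feature_table.items()
        if event_heat(features, weights) > threshold
    )


# Hypothetical feature values, weights and threshold.
feature_table = {
    "event_a": {"RF": 0.5, "RT": 0.5, "CN": 100, "DN": 10},
    "event_b": {"RF": 0.1, "RT": 0.1, "CN": 5, "DN": 1},
}
weights = (0.4, 0.4, 0.001, 0.01)
hotspots = select_hotspots(feature_table, weights, threshold=0.5)
```

The small weights on CN and DN in this example compensate for their being raw counts rather than ratios; other scalings are equally possible.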

The present invention performs event extraction independently on the data from blogs, microblogs, forums, and news report web pages. Let the hotspot event sets extracted from the blog, microblog, forum, and news report web page data be BLOG, M-BLOG, BBS, and NEWS, respectively. The intersection of BLOG, M-BLOG, BBS, and NEWS is defined as the first hotspot event set. The intersections of every three of the sets BLOG, M-BLOG, BBS, and NEWS are computed, and the union of all the results minus the first hotspot event set is defined as the second hotspot event set. The intersections of every two of the sets are computed, and the union of all the results minus the first and second hotspot event sets is defined as the third hotspot event set. The union of BLOG, M-BLOG, BBS, and NEWS minus the first, second, and third hotspot event sets is defined as the fourth hotspot event set.

Because the hotspots reflected by blogs, microblogs, forums, and news report web pages may differ, content followed by all four at the same time should have the highest heat; content followed by three of them at the same time ranks next; content followed by two of them ranks after that; and content followed by only one of them has the lowest relative heat.
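The four-tier partition described above maps directly onto set operations, as in the following Python sketch. It assumes that the same event carries the same identifier in every source, which the text does not spell out; the function name `tier_hotspots` and the sample identifiers are hypothetical.

```python
from itertools import combinations


def tier_hotspots(blog, mblog, bbs, news):
    """Split events into four tiers by how many sources report them."""
    sources = [set(blog), set(mblog), set(bbs), set(news)]

    # Tier 1: events present in all four sources.
    first = set.intersection(*sources)

    # Tier 2: events in at least three sources, minus tier 1.
    in_three = set().union(
        *(set.intersection(*combo) for combo in combinations(sources, 3))
    )
    second = in_three - first

    # Tier 3: events in at least two sources, minus tiers 1 and 2.
    in_two = set().union(*(a & b for a, b in combinations(sources, 2)))
    third = in_two - first - second

    # Tier 4: everything else, i.e. events seen in a single source.
    fourth = set().union(*sources) - first - second - third
    return first, second, third, fourth


# Hypothetical event identifiers per source.
tiers = tier_hotspots(
    blog={"e1", "e2", "e3", "e4"},
    mblog={"e1", "e2", "e3"},
    bbs={"e1", "e2"},
    news={"e1", "e5"},
)
```

In this example `e1` appears in all four sources, `e2` in three, `e3` in two, and `e4` and `e5` in one each, so each lands in a different tier.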

Step S600, hotspot event evolution analysis: evolution analysis is performed on the hotspot events extracted in step S500.

For the document set D = {d₁, d₂, ..., d_i, ...} of an event, the documents are clustered by publication time, which gives the number of documents for the event at each time point. The clustering result is presented to the user as a coordinate plot in which the horizontal axis represents time and the vertical axis represents the number of documents, so that the attention the event receives at different time points can be read from it.
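Grouping an event's documents by publication time amounts to counting documents per time bucket. The Python sketch below assumes one bucket per calendar day; the function name `attention_series` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def attention_series(publish_times):
    """Count an event's documents per day, in chronological order.

    Returns a list of (day, document_count) pairs: the x and y values
    of the attention plot.
    """
    counts = Counter(t.date() for t in publish_times)
    return sorted(counts.items())


# Hypothetical publication times of one event's documents.
sample_times = [
    datetime(2015, 7, 1, 8, 30),
    datetime(2015, 7, 1, 20, 0),
    datetime(2015, 7, 3, 9, 15),
]
series = attention_series(sample_times)
```

Days without any document are simply absent from `series`; a plotting step could fill them with zero if a continuous time axis is wanted.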

The present invention adopts a distributed cloud computing mode and can mine and analyze the various network data collected on a large scale. By analyzing the data from each data source separately, it obtains the hotspot events of each source and then further determines the heat of each event, so that current hotspot events are obtained more objectively. The present invention provides automated, systematic, and scientific information support that helps organizations such as Party and government bodies and large enterprises discover sensitive network information in time, grasp hotspots of online public opinion, follow its trends, and respond to public opinion crises. It effectively improves the judgment accuracy of the network public opinion monitoring system and provides a more truthful and accurate basis for the subsequent processing of online WeChat public opinion information.

After considering the specification and practicing the invention disclosed herein, those skilled in the art will readily conceive of other embodiments of the present invention. The present application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein.

It should be understood that the present invention is not limited to the precise structure described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. An event evolution analysis method based on big data, comprising:

Step S200, data preprocessing;

2. The event evolution analysis method based on big data as claimed in claim 1, wherein step S200 comprises:

Preprocessing the network data collected in step S100: first, performing word segmentation and part-of-speech tagging on the collected network data; then filtering stop words from the segmentation result according to a stop-word list; and finally obtaining the feature items used to represent a document.

3. The event evolution analysis method based on big data as claimed in claim 2, wherein step S200 further comprises:

4. The event evolution analysis method based on big data as claimed in claim 1, wherein step S300 comprises:

5. The event evolution analysis method based on big data as claimed in claim 4, wherein the local clustering comprises:

(2) sort the reports in chronological order;

(3) take the first report as the first event;

(7) repeat (4) to (6) until all reports have been processed;

(8) retain the result for subsequent re-clustering.

6. The event evolution analysis method based on big data as claimed in claim 4, wherein the merge clustering comprises:

Output: the event set TopicSet after clustering

(2) sort the event set NewTopicSet by event start time;

(6) repeat (3) to (5) until all events in NewTopicSet have been processed;

(7) retain the clustering result so that the clustering of the next cycle can use it.

7. The event evolution analysis method based on big data as claimed in claim 1, further comprising:

8. The event evolution analysis method based on big data as claimed in claim 7, wherein step S400 comprises:

R_i = α₁·RF_i + α₂·RT_i + α₃·CN_i + α₄·DN_i,