CN105138577A - Big data based event evolution analysis method - Google Patents

Info

Publication number
CN105138577A
Authority
CN
China
Prior art keywords
event
report
events
cluster
evolution analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510460661.7A
Other languages
Chinese (zh)
Other versions
CN105138577B (en)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510460661.7A
Publication of CN105138577A
Application granted
Publication of CN105138577B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The present invention provides a big data based event evolution analysis method. The method comprises: step S100, data acquisition: performing data acquisition on network data based on a distributed cloud computing mode; step S200, data pre-processing; step S300, event extraction; step S400, hotspot event extraction; and step S500, hotspot event evolution analysis. By adopting the distributed cloud computing mode, the method provided by the present invention can mine and analyze the many kinds of network data acquired at large scale.

Description

An event evolution analysis method based on big data
Technical field
The present invention relates to the field of data processing, and in particular to an event evolution analysis method based on big data.
Background art
With the development of Web 2.0 technology, the Internet has changed profoundly. It has been transformed from a collection of static web pages and information into a platform that displays the "collective intelligence" in which everyone participates. Through blogs, microblogs, BBS, SNS, news comments and the like, Internet users can freely publish their own views and comment on any event. The network gives people an unprecedentedly open and convenient platform for sharing and publishing information, and more and more people express their opinions, ideas, moods and attitudes online. This content includes information that has a positive effect on the development of an event as well as information that is negative. At the same time, the openness, directness and concealment of the network platform make online public opinion an increasingly important influence on people's thinking. Timely and effective monitoring and analysis of large volumes of public opinion information is therefore of practical significance for maintaining social stability and promoting national development.
In daily life, emergencies occur frequently, and users are increasingly accustomed to using social networks (such as blogs, forums, Twitter and Facebook) to express their views and emotions. A user's emotion toward an event does not stay fixed, however. It evolves as time passes or as the event develops: it may strengthen or weaken gradually, or even change from one emotion into another. Detecting online and in real time how users' emotions toward an emergency evolve is therefore of great significance. An enterprise can continuously track consumers' emotions after they buy a product and so discover the product's shortcomings and deficiencies in time. Social and government workers can analyze how users' emotions toward an event change, respond to emergencies in time, and even predict how an event will develop, so that adverse trends are found quickly, correct guidance is given, and the influence of harmful information is minimized.
In addition, with the rapid development of applications such as the mobile Internet and the Internet of Things, the global volume of data is growing explosively, which indicates that the era of big data has arrived. In the prior art, big data is processed on Hadoop-based platforms. Hadoop is an open-source distributed computing platform whose core includes HDFS (Hadoop Distributed File System). The many advantages of HDFS (chiefly high fault tolerance and high scalability) allow users to deploy Hadoop on inexpensive hardware and build a distributed cluster that forms a distributed system. HBase (Hadoop Database) is a distributed database system built on the distributed file system HDFS that provides highly reliable, high-performance, column-oriented, scalable storage with real-time reads and writes, and is mainly used to store unstructured and semi-structured data.
Summary of the invention
To solve the problems in the prior art, the present invention proposes an event evolution analysis method based on big data.
The event evolution analysis method based on big data proposed by the present invention comprises:
Step S100, data acquisition: acquiring network data based on a distributed cloud computing mode;
Step S200, data preprocessing;
Step S300, event extraction: extracting events from the preprocessed network data.
Wherein step S200 comprises:
Preprocessing the network data acquired in step S100: first, word segmentation and part-of-speech tagging are performed on the acquired network data; then stop words are filtered out of the segmentation result according to a stop-word list; finally, the feature items used to represent each document are obtained.
Wherein step S200 further comprises:
High-quality word selection: each feature item obtained in step S200 carries an implicit quality value that reflects the contribution of the feature item to the documents. The quality Q(t) of feature item t is expressed as:
$$Q(t) = l_t^{2}\left(\sum_{i=1}^{N} f_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} f_i\right)^{2}\right),$$
where N is the number of all documents, $f_i$ is the number of times feature item t occurs in document i, and $l_t$ is the length of feature item t.
A threshold Q is set; feature items with $Q(t) > Q$ are retained and all others are deleted.
Wherein step S300 comprises:
Clustering the documents obtained by the preprocessing of step S200: a local clustering is performed on the reports that newly arrive each day, which yields the local events of that day, referred to as the candidate event set;
Merge clustering: the candidate event set produced by the local clustering is merged with the existing set of old events to produce the up-to-date event set.
Wherein the local clustering comprises:
(1) first, all preprocessed reports are represented as text using the standard graph representation model;
(2) the reports are sorted in chronological order;
(3) the first report is taken as the first event;
(4) for each remaining report in turn, its similarity to every existing event is computed using the similarity function based on the maximum common subgraph, which yields the event most similar to the report and the corresponding function value;
(5) if the function value is greater than the threshold, the report is inserted into the event corresponding to that function value, and the center of that event is updated;
(6) if the function value is less than the threshold, the report becomes a new event and is itself the center of that event;
(7) steps (4) to (6) are repeated until all reports have been processed;
(8) the result is retained for the subsequent re-clustering.
Wherein the merge clustering comprises:
Input: the set of old events OldTopicSet, the set of new reports NewReportSet,
Output: the event set TopicSet after clustering
(1) first, local clustering is performed on the reports in NewReportSet, and the clustering result is placed in NewTopicSet;
(2) the event set NewTopicSet is sorted by event start time;
(3) for each event in NewTopicSet in turn, its similarity to every event in OldTopicSet is computed using the similarity function based on the maximum common subgraph, which yields the most similar old event and the corresponding function value;
(4) if the function value is less than the threshold, the event in NewTopicSet is treated as a new event;
(5) if the function value is greater than the threshold, the event is removed from NewTopicSet and merged into OldTopicSet;
(6) steps (3) to (5) are repeated until all events in NewTopicSet have been processed;
(7) the clustering result is retained for use by the clustering of the next cycle.
The event evolution analysis method based on big data further comprises:
Step S400, hotspot event extraction: extracting hotspot events from the events extracted in step S300;
Step S500, hotspot event evolution analysis: performing evolution analysis on the hotspot events extracted in step S400.
Wherein step S400 comprises:
Determining hotspot events: the heat of each event obtained in step S300 is computed by the following formula,
$$R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,$$
where $R_i$ is the heat of event i; $RF_i$ is the report frequency of event i; $RT_i$ is the ratio of the number of days on which event i was reported to the total number of days within a predetermined period of N days; $CN_i$ is the number of times Internet users clicked on and read event i within the predetermined number of days; $DN_i$ is the number of comments Internet users made on event i within the predetermined number of days; and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are weight coefficients. When $R_i$ is greater than a given threshold R, event i is determined to be a hotspot event.
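The heat formula can be illustrated with a minimal Python sketch. The function names, the equal default weights and the assumption that the four inputs have already been scaled to comparable ranges are all hypothetical; the text fixes neither the weights nor any normalization.

```python
def event_heat(rf, rt, cn, dn, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum R_i = a1*RF_i + a2*RT_i + a3*CN_i + a4*DN_i.

    The equal default weights are an assumption for illustration only.
    """
    a1, a2, a3, a4 = weights
    return a1 * rf + a2 * rt + a3 * cn + a4 * dn


def hotspot_events(events, threshold, weights=(0.25, 0.25, 0.25, 0.25)):
    """Return the ids of events whose heat exceeds the threshold R.

    `events` maps an event id to a tuple (RF, RT, CN, DN); the inputs are
    assumed to be pre-scaled to [0, 1], which the text does not specify.
    """
    return [eid for eid, feats in events.items()
            if event_heat(*feats, weights=weights) > threshold]


events = {"A": (0.9, 0.8, 0.7, 0.6), "B": (0.1, 0.2, 0.1, 0.0)}
hot = hotspot_events(events, threshold=0.5)
```

In practice the click and comment counts are much larger than the two ratio-like terms, so some scaling or a suitable choice of weights would be needed before the threshold is meaningful.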
The present invention adopts a distributed cloud computing mode and can mine and analyze the many kinds of network data acquired at large scale. By computing and analyzing the data of each data source separately, it obtains the hot topics of the different data sources and then further determines the heat of each topic, so that the current hot topics are obtained more objectively. The present invention provides automated, systematic and scientific information support for organizations such as Party and government offices and large enterprises to discover sensitive network information in time, grasp the focus of online public opinion, follow its trend and respond to public opinion crises. It effectively improves the accuracy of the judgments made by the network public opinion monitoring system and provides a more realistic and accurate basis for the subsequent processing of online WeChat public opinion information.
Brief description of the drawings
Fig. 1 is a flowchart of the event evolution analysis method based on big data of the present invention;
Fig. 2 is an example of the graph-based text representation.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
Referring to Fig. 1, the present invention proposes an event evolution analysis method based on big data.
Step S100, data acquisition
Network data is acquired based on a distributed cloud computing mode. The network data comprises data of several categories: blogs, microblogs, forums and news report webpages. The network data is labelled according to these categories and stored separately by category. Here, a news report webpage is a webpage of news provided by portal sites such as Tencent News and Sina News or by news media sites such as People's Daily.
The data acquisition is implemented by a web crawler. The acquired network data is stored by a distributed storage device, which is implemented on HDFS.
Step S200, data preprocessing: the network data acquired in step S100 is preprocessed. First, word segmentation and part-of-speech tagging are performed on the acquired network data; then stop words are filtered out of the segmentation result according to a stop-word list; finally, the feature items used to represent each document are obtained.
The vocabulary that remains after preprocessing is still very large, so a second step, high-quality word selection, is needed. Each feature item in a document carries an implicit quality value. The quality value is based mainly on the term-frequency characteristics of the feature item and reflects its contribution to the text. The larger the quality, the larger the contribution, and the feature item can be kept for text clustering; otherwise it is rejected.
The quality Q(t) of feature item t is expressed as:
$$Q(t) = l_t^{2}\left(\sum_{i=1}^{N} f_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} f_i\right)^{2}\right),$$
where N is the number of all documents, $f_i$ is the number of times feature item t occurs in document i, and $l_t$ is the length of feature item t.
A threshold Q is set; feature items with $Q(t) > Q$ are retained and all others are deleted.
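As an illustration of the quality filter, the following Python sketch computes Q(t) over a toy corpus of already segmented, stop-word-filtered documents. The function names are hypothetical, and measuring the length $l_t$ in characters is an assumption, since the text does not say how the length is measured.

```python
from collections import Counter


def term_quality(term, docs):
    """Q(t) = l_t^2 * (sum_i f_i^2 - (1/N) * (sum_i f_i)^2).

    `docs` is a list of token lists; l_t is taken as the character
    length of the term (an assumption).
    """
    n = len(docs)
    freqs = [Counter(doc)[term] for doc in docs]
    sum_sq = sum(f * f for f in freqs)
    total = sum(freqs)
    return len(term) ** 2 * (sum_sq - total * total / n)


def select_terms(docs, threshold):
    """Keep only the feature items whose quality exceeds the threshold."""
    vocabulary = {t for doc in docs for t in doc}
    return {t for t in vocabulary if term_quality(t, docs) > threshold}


docs = [["flood", "rescue", "flood"], ["flood", "city"], ["city", "vote"]]
kept = select_terms(docs, threshold=20)
```

The bracketed factor is N times the variance of the term's frequency across documents, so terms that are spread unevenly over the corpus score higher than terms that occur equally often everywhere.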
Step S300, event extraction: events are extracted from the preprocessed network data.
To carry out text processing such as comparison and clustering, a model is needed to represent the text. The most commonly used text representation model is the vector space model, which represents a text as a multidimensional space made up of feature items, with each feature item forming one dimension of the space. The text model can then be expressed as follows:
$D = \{t_1, t_2, \ldots, t_n\}$, where n is the number of feature items.
Although the vector space model captures a good deal of information, it does not capture the structural information of the document. A graph-based text representation model, by contrast, includes some structural information, which benefits text clustering. In the standard graph representation model, each sentence of a document is represented as a subgraph, and these subgraphs together represent the document. The standard graph representation is constructed as follows:
Each word (excluding stop words) that occurs in a sentence of the document corresponds to one vertex of the corresponding subgraph, and the vertex is labelled with that word. Two words that are immediately adjacent in the sentence correspond to an edge, which is labelled "TI" or "TX" according to whether the two words at its endpoints both appear in the title or both appear in the body. A word that is repeated in the document corresponds to only one vertex.
For example, referring to Fig. 2, a document D consists of the title "abcd" and the body text "aefg", where the letters a, b, c, d, e, f and g represent the 7 different words in document D. The corresponding graph therefore has 7 vertices labelled a, b, c, d, e, f and g, and six directed edges.
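The example with the title "abcd" and the body "aefg" can be reproduced with a minimal Python sketch. The function name, the input layout (lists of sentences, each a list of words) and the edge labels "TI" for title and "TX" for body are assumptions for illustration; when the same word pair occurs in both sections, the sketch keeps the first label seen, which the text does not specify.

```python
def build_graph(title_sentences, body_sentences):
    """Build the graph of one document.

    Returns (vertices, edges): vertices is a set of words, edges maps a
    directed pair (word, next_word) to its section label.
    """
    vertices = set()
    edges = {}
    for label, sentences in (("TI", title_sentences), ("TX", body_sentences)):
        for sentence in sentences:
            vertices.update(sentence)  # a repeated word maps to a single vertex
            for left, right in zip(sentence, sentence[1:]):
                edges.setdefault((left, right), label)  # adjacent words form an edge
    return vertices, edges


# Document D from the example: title "abcd", body "aefg".
vertices, edges = build_graph([list("abcd")], [list("aefg")])
```

For this document the sketch yields 7 vertices and 6 directed edges, three labelled as title edges and three as body edges.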
The present invention uses the standard graph representation model to represent the network data that has undergone data preprocessing. This graph-based document representation records not only the words that occur in a document and the number of times each word occurs (term frequency), but also the order in which the words occur.
Once documents have such a representation, comparing two documents becomes a matter of measuring the similarity of two graphs, which is also the basis of document clustering. The basic idea of graph similarity measurement is to express the similarity of two graphs as the value of a function with range [0, 1]. The size of the function value reflects how similar the two graphs are: the larger the value, the more similar the graphs. When two graphs are identical the function value is 1; when they have nothing in common it is 0. The main graph similarity functions are: the similarity function based on the maximum common subgraph, the similarity function based on graph union, the unnormalized similarity function based on graph union, the similarity function based on the maximum common subgraph and minimum common supergraph, and the unnormalized similarity function based on the maximum common subgraph and minimum common supergraph.
A typical one is the similarity function based on the maximum common subgraph (The Graph Similarity Measure Based on the Maximum Common Subgraph, MCS):
$$Sim_{MCS}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{\max(|G_1|, |G_2|)},$$
where $G_1$ and $G_2$ are the two graphs to be compared; $mcs(G_1, G_2)$ is the maximum common subgraph of $G_1$ and $G_2$, that is, the graph formed by the vertices and edges that $G_1$ and $G_2$ have in common; $|\ldots|$ denotes the size of a graph, that is, the number of its vertices plus the number of its edges; and $\max(\ldots)$ is the usual maximum operation.
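The quantities in this formula can be sketched in Python as follows. Because every word maps to exactly one vertex, the maximum common subgraph is taken here simply as the shared vertices and shared edges, and edge labels are ignored; both are simplifying assumptions, as are the function names. Note that the formula as printed evaluates to 0 for identical graphs, whereas the surrounding text expects a similarity of 1 in that case, so the sketch exposes both the printed value and the overlap ratio $|mcs| / \max(|G_1|, |G_2|)$; which of the two is intended is an assumption left to the reader.

```python
def graph_size(graph):
    """Size of a graph: number of vertices plus number of edges."""
    vertices, edges = graph
    return len(vertices) + len(edges)


def max_common_subgraph(g1, g2):
    """Shared vertices and shared edges (valid because labels are unique)."""
    return g1[0] & g2[0], g1[1] & g2[1]


def mcs_overlap(g1, g2):
    """|mcs(G1, G2)| / max(|G1|, |G2|); equals 1 for identical graphs."""
    return graph_size(max_common_subgraph(g1, g2)) / max(graph_size(g1), graph_size(g2))


def sim_mcs_as_printed(g1, g2):
    """1 - |mcs| / max(|G1|, |G2|), exactly as the formula is printed."""
    return 1 - mcs_overlap(g1, g2)


# Graphs are (vertex set, edge set) pairs and are assumed to be non-empty.
g1 = (set("abcdefg"),
      {("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "f"), ("f", "g")})
g2 = (set("abcx"), {("a", "b"), ("b", "c"), ("c", "x")})
overlap = mcs_overlap(g1, g2)
```

Here the two graphs share 3 vertices and 2 edges, and the larger graph has size 13, so the overlap ratio is 5/13.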
The similarity function based on graph union (The Graph Similarity Measure Based on the Idea of Graph Union, WGU):
$$Sim_{WGU}(G_1, G_2) = 1 - \frac{|wgu(G_1, G_2)|}{|G_1| + |G_2| - |wgu(G_1, G_2)|},$$
"Based on graph union" means that the denominator of the fraction in the formula represents the size of the union of the two graphs in the set-theoretic sense: $|G_1| + |G_2|$ is the sum of the sizes of the two graphs, and subtracting $|wgu(G_1, G_2)|$, the part that would otherwise be counted twice, gives the size of their union.
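A minimal Python sketch of the union-based measure follows. It reads $|wgu(G_1, G_2)|$ as the size of the part the two graphs share, which is what the explanation of the denominator implies; this reading, the function names and the (vertex set, edge set) graph layout are assumptions. As with the MCS formula, the printed expression gives 0 for identical graphs, so the sketch returns the shared-to-union ratio and the printed value separately.

```python
def graph_size(graph):
    """Size of a graph: number of vertices plus number of edges."""
    vertices, edges = graph
    return len(vertices) + len(edges)


def shared_size(g1, g2):
    """Number of vertices and edges that the two graphs have in common."""
    return len(g1[0] & g2[0]) + len(g1[1] & g2[1])


def union_overlap(g1, g2):
    """shared / (|G1| + |G2| - shared); equals 1 for identical graphs."""
    shared = shared_size(g1, g2)
    return shared / (graph_size(g1) + graph_size(g2) - shared)


def sim_wgu_as_printed(g1, g2):
    """1 - shared / (|G1| + |G2| - shared), as the formula is printed."""
    return 1 - union_overlap(g1, g2)


# Graphs are (vertex set, edge set) pairs and are assumed to be non-empty.
g1 = (set("abcdefg"),
      {("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "f"), ("f", "g")})
g2 = (set("abcx"), {("a", "b"), ("b", "c"), ("c", "x")})
ratio = union_overlap(g1, g2)
```

With sizes 13 and 7 and 5 shared elements, the union has size 15, so the ratio is 1/3.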
For the reports that newly arrive each day, a traditional event extraction algorithm simply compares each report with all previously discovered events. If the similarity between a new report and some event is greater than a threshold, the report is assigned to that event; otherwise the report becomes a new event. This is the basic model of event detection, but it makes no use of temporal information. Everyday news shows a regularity: reports on the same event tend to be published within a concentrated period of time (often within a single day). This is a common phenomenon in news streams, called the edge effect. In other words, in a news stream, reports published close together in time are more likely to discuss the same event than reports published far apart. How to exploit this regularity to improve the accuracy of event detection is a problem worth considering. On this basis, the present invention proposes an event detection algorithm that takes temporal characteristics into account when clustering.
The basic idea of the algorithm is as follows. If a local clustering is first performed on the reports that arrive each day (or in another defined time unit, such as each minute, hour or month), related reports are more likely to be grouped together. On the basis of this local clustering, a further clustering is performed: the new events produced by the local clustering and the old events produced by earlier clustering are clustered once more, with the aim of merging similar events. The result obtained at the end is the final result.
The first step of the algorithm is to perform a local clustering on the reports that newly arrive each day (or in another defined time unit, such as each minute, hour or month), which yields the local events of that day, referred to as the candidate event set. The algorithm is described as follows:
Input: the set of new reports NewReportSet
Output: the event set NewTopicSet after clustering
(1) first, all preprocessed reports (each report is one document) are represented as text using the standard graph representation model;
(2) the reports are sorted in chronological order;
(3) the first report is taken as the first event;
(4) for each remaining report in turn, its similarity to every existing event is computed using the similarity function based on the maximum common subgraph, which yields the event most similar to the report and the corresponding function value;
(5) if the function value is greater than the threshold, the report is inserted into the event corresponding to that function value, and the center of that event is updated;
(6) if the function value is less than the threshold, the report becomes a new event and is itself the center of that event;
(7) steps (4) to (6) are repeated until all reports have been processed;
(8) the result is retained for the subsequent re-clustering.
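Steps (2) to (7) of the local clustering can be sketched as a single pass in Python. Several details are assumptions made only for illustration: each report is reduced to a set of graph elements (vertices and edges), similarity is the overlap ratio of two such sets, the center of an event is the union of the elements of its reports, and a similarity exactly equal to the threshold starts a new event. The text specifies none of these choices.

```python
def overlap(a, b):
    """Shared elements divided by the size of the larger (non-empty) set."""
    return len(a & b) / max(len(a), len(b))


def local_clustering(reports, threshold, sim=overlap):
    """Single-pass clustering of (timestamp, element set) reports."""
    events = []
    for ts, elems in sorted(reports, key=lambda r: r[0]):  # step (2)
        # Step (4): find the most similar existing event, if any.
        best = max(events, key=lambda e: sim(elems, e["center"]), default=None)
        if best is not None and sim(elems, best["center"]) > threshold:
            best["reports"].append((ts, elems))            # step (5)
            best["center"] = best["center"] | elems        # update the center
        else:
            # Steps (3) and (6): the report starts a new event and is its center.
            events.append({"start": ts, "center": set(elems),
                           "reports": [(ts, elems)]})
    return events


reports = [
    (2, {"flood", "rescue", "army"}),
    (1, {"flood", "city", "rescue"}),
    (4, {"election", "vote", "result"}),
    (3, {"vote", "election"}),
]
events = local_clustering(reports, threshold=0.4)
```

On this toy input the two flood reports form one event and the two election reports form another.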
After the local clustering is finished, a further clustering is needed. This clustering is called merge clustering; its purpose is to merge the candidate event set produced by the local clustering with the existing set of old events to produce the up-to-date event set. The whole algorithm is therefore called the event detection algorithm based on re-clustering.
The event detection algorithm based on re-clustering is described as follows:
Input: the set of old events OldTopicSet, the set of new reports NewReportSet
Output: the event set TopicSet after clustering
(1) first, local clustering is performed on the reports in NewReportSet, and the clustering result is placed in NewTopicSet;
(2) the event set NewTopicSet is sorted by event start time;
(3) for each event in NewTopicSet in turn, its similarity to every event in OldTopicSet is computed using the similarity function based on the maximum common subgraph, which yields the most similar old event and the corresponding function value;
(4) if the function value is less than the threshold, the event in NewTopicSet is treated as a new event;
(5) if the function value is greater than the threshold, the event is removed from NewTopicSet and merged into OldTopicSet;
(6) steps (3) to (5) are repeated until all events in NewTopicSet have been processed;
(7) the clustering result is retained for use by the clustering of the next cycle.
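Steps (2) to (6) of the merge clustering can be sketched in Python as follows, taking an already locally clustered NewTopicSet as input. The event layout (dicts with "start", "center" and "reports"), the overlap similarity on element sets and the union update of the center are hypothetical choices, and "merged into OldTopicSet" is read here as merging the new event into its most similar old event.

```python
def overlap(a, b):
    """Shared elements divided by the size of the larger (non-empty) set."""
    return len(a & b) / max(len(a), len(b))


def merge_clustering(old_topics, new_topics, threshold, sim=overlap):
    """Merge new candidate events into old events or keep them as new events."""
    # Copy the old events so that the caller's OldTopicSet is left untouched.
    topics = [dict(t, center=set(t["center"]), reports=list(t["reports"]))
              for t in old_topics]
    n_old = len(topics)
    for new in sorted(new_topics, key=lambda t: t["start"]):      # step (2)
        # Step (3): compare only against the old events.
        best = max(topics[:n_old],
                   key=lambda t: sim(new["center"], t["center"]), default=None)
        if best is not None and sim(new["center"], best["center"]) > threshold:
            best["reports"].extend(new["reports"])                # step (5)
            best["center"] |= new["center"]
        else:
            topics.append(new)                                    # step (4)
    return topics


old = [{"start": 1, "center": {"flood", "rescue", "city"}, "reports": ["r1"]}]
new = [
    {"start": 6, "center": {"vote", "election"}, "reports": ["r3"]},
    {"start": 5, "center": {"flood", "rescue", "army"}, "reports": ["r2"]},
]
topic_set = merge_clustering(old, new, threshold=0.4)
```

The returned list plays the role of TopicSet and could be passed as the old event set of the next cycle.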
The local clustering and re-clustering algorithms above use the Single-pass algorithm, but they are only exemplary; those skilled in the art may use any suitable clustering algorithm to implement the clustering process.
Step S400, event sentiment analysis: sentiment analysis is performed on the events extracted in step S300.
Sentiment analysis, also known as opinion mining, is the process of processing and summarizing text that carries an emotional tendency. Because of its great practical value it has been widely studied, and it is now widely used in fields such as assessing user satisfaction with products, predicting election results and predicting financial trends. A large body of work exists on the orientation of articles, but most existing methods study text orientation from a static angle: they focus on the emotional tendency of a single text and treat text sentiment analysis as a three-class classification process (for example, positive/neutral/negative), without linking articles together dynamically to study how emotion evolves. In addition, these methods analyze only the content of the text and do not discover how group emotion on social networks changes dynamically over time with respect to an emergency.
Users' emotional attitudes toward emergencies are diverse and dynamic, and the traditional three-class model cannot capture this property well. Moreover, with the rapid development of microblogs, text data streams are produced very quickly. Discovering quickly and accurately how users' emotions toward an emergency change, and monitoring the emotional state of the public on the microblog stream in real time, is of great significance for guiding public opinion.
The invention provides an emotion evolution analysis method that mainly comprises: determining the emotion vector of each document based on an emotion model containing multiple emotion types; and analyzing the evolution of document emotion based on the emotion vectors of the documents, that is, detecting whether public emotion toward a particular event has changed, and at what moment and for what reason it changed. The method may also comprise extracting multiple emotion words and emoticons that can express user emotion, computing the similarity between emotion words with an algorithm that combines Hownet semantic similarity with retrieval similarity, building an emotion word similarity matrix, and then using a clustering algorithm to group the extracted emotion words into multiple types, thereby building the emotion model containing multiple emotion types.
Users' emotional attitudes toward emergencies are diverse and dynamic, and the traditional three-class sentiment classification model (positive/neutral/negative) cannot capture this property well. For this reason, the present invention extracts emotion words that can express user emotion together with the emoticons commonly used by Internet users, and clusters these emotion words to obtain an emotion model containing multiple emotion types. This is because many emotion words are semantically very close: for example, "glad" and "happy" both express a happy mood, and "anger" and "indignation" both express the user's outrage. Such words are in fact highly similar and can be regarded as the same emotion word.
Emotion words that can express user emotion can be extracted in several ways. For example, words that express emotion can be extracted from a dictionary. As another example, such words can be extracted from the "emotion detection table" that clinical psychology has developed for detecting user emotion; this table currently contains 212 adjectives. A clustering algorithm, such as the AGNES (Agglomerative Nesting) clustering algorithm, can then be used to cluster the extracted emotion words so that they are aggregated into multiple emotion types. The AGNES algorithm initially treats each object as a cluster and then merges these clusters step by step according to some criterion. For example, if the distance between an object in cluster A and an object in cluster B is the smallest among all pairs of objects belonging to different clusters, clusters A and B may be merged. This is a single-link method: each cluster can be represented by all the objects in it, and the similarity between two clusters is determined by the similarity of the closest pair of data points drawn from the two clusters. In the embodiments of the present invention, each emotion word can initially be regarded as a cluster, and clustering then proceeds according to the similarity between emotion words.
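The single-link merging described above can be illustrated with a small Python sketch. The stopping rule (stop when the best cross-cluster similarity falls below `min_sim`), the function name and the toy similarity table are assumptions; the text does not state when merging ends.

```python
def single_link_agnes(items, sim, min_sim):
    """Agglomerative clustering with single-link similarity.

    Starts with one cluster per item and repeatedly merges the two
    clusters whose closest pair of members is most similar.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair across clusters.
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < min_sim:  # assumed stopping rule
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


# Toy pairwise similarities between emotion words (illustrative values only).
pairs = {frozenset(("glad", "happy")): 0.9, frozenset(("angry", "furious")): 0.8}


def word_sim(a, b):
    return pairs.get(frozenset((a, b)), 0.1)


emotion_types = single_link_agnes(["glad", "happy", "angry", "furious"],
                                  word_sim, min_sim=0.5)
```

Each resulting cluster corresponds to one emotion type of the emotion model.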
The similarity between emotion words can be the Hownet semantic similarity between them. Hownet semantic similarity is mainly used to measure the degree to which words in a text can replace one another. The Hownet semantic similarity between two emotion words $w_1$ and $w_2$ is computed as follows:
$$Sim_H(w_1, w_2) = \frac{\alpha}{d + \alpha},$$
where d is the length of the path between the two emotion words $w_1$ and $w_2$ in the concept tree provided by Hownet. Between any two concepts in the Hownet concept tree there is exactly one path, and the length of this path represents the semantic distance between the two concepts. α is a positive adjustable parameter, generally set to a value between 0 and 1. As another example, the similarity between emotion words can be computed from retrieval similarity, because words that are close in emotion are more likely to occur together. Based on a large-scale corpus, the retrieval distance between two words can be expressed as:
$$Dis(w_1, w_2) = \frac{\max\{\log f(w_1), \log f(w_2)\} - \log f(w_1, w_2)}{\log N - \min\{\log f(w_1), \log f(w_2)\}},$$
where $f(w_i)$ is the number of documents in the corpus that contain emotion word $w_i$, and $f(w_1, w_2)$ is the number of documents that contain both $w_1$ and $w_2$. The retrieval similarity between two emotion words $w_1$ and $w_2$ can therefore be expressed as:
$$Sim_R(w_1, w_2) = \frac{\alpha}{Dis(w_1, w_2) + \alpha}$$
As a further example, the similarity of emotion words can be computed by combining Hownet semantic similarity with retrieval similarity. For example, the similarity between two emotion words $w_1$ and $w_2$ can be expressed as:
$$Sim(w_1, w_2) = \beta \cdot Sim_H(w_1, w_2) + (1 - \beta) \cdot Sim_R(w_1, w_2), \quad 0 \le \beta \le 1.$$
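The three similarity formulas can be combined in a short Python sketch. The path length d is passed in as a number because no particular Hownet interface is assumed, N is taken to be the total number of documents in the corpus (the text does not define it), the default values of α and β are arbitrary, and all document counts are assumed to be positive.

```python
import math


def sim_hownet(path_len, alpha=0.5):
    """Sim_H = alpha / (d + alpha), with d the concept-tree path length."""
    return alpha / (path_len + alpha)


def retrieval_distance(f1, f2, f12, n):
    """Dis = (max(log f1, log f2) - log f12) / (log N - min(log f1, log f2))."""
    low, high = sorted((math.log(f1), math.log(f2)))
    return (high - math.log(f12)) / (math.log(n) - low)


def sim_retrieval(f1, f2, f12, n, alpha=0.5):
    """Sim_R = alpha / (Dis + alpha)."""
    return alpha / (retrieval_distance(f1, f2, f12, n) + alpha)


def combined_similarity(path_len, f1, f2, f12, n, alpha=0.5, beta=0.5):
    """Sim = beta * Sim_H + (1 - beta) * Sim_R, with 0 <= beta <= 1."""
    return (beta * sim_hownet(path_len, alpha)
            + (1 - beta) * sim_retrieval(f1, f2, f12, n, alpha))


# Two words one step apart in the tree that co-occur in 80 of 10,000 documents.
score = combined_similarity(path_len=1, f1=100, f2=200, f12=80, n=10000)
```

Words that always co-occur have a retrieval distance of 0 and therefore a retrieval similarity of 1, which matches the intuition that words close in emotion tend to appear together.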
By above-mentioned clustering algorithm, cluster is carried out to these emotion word, thus obtain multiple class bunch, namely obtain the multiple affective style after polymerization.Obtain the emotion model comprising multiple affective style thus.Make E=<e 1, e 2..., e i... e m> represents emotion model, wherein e irepresent a kind of affective style, m represents the number of the element comprised in this emotion model.For every section of document d, the emotion vector of definition d is wherein, for the element of i-th in emotion model E, if document d possesses this affective style e i, be in fact exactly that document d comprises the emotion word belonging to this affective style, then correspondingly E din i-th element value be 1, otherwise value is 0, that is:
For each document d, the corresponding emotion model R_d can be extracted from its emotion vector E_d. R_d is the emotion model of the user who published the document, that is, the set of emotion types the user exhibits in the document: R_d = ∪ e_i, taken over the emotion types e_i whose element in E_d is 1. For example, if the emotion vector of document d is <1, 0, 0, 1, 0, 0, ..., 0>, the corresponding emotion model is (e_1, e_4), meaning that the user published the document with emotions e_1 and e_4.
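As a minimal sketch of the emotion vector and the per-document emotion model, assume each emotion type is represented by the set of emotion words in its cluster and each document by its list of segmented words. The function names and the sample word clusters are hypothetical.

```python
def emotion_vector(doc_words, emotion_model):
    """Return E_d: element i is 1 if the document contains a word of type e_i."""
    words = set(doc_words)
    return [1 if words & cluster else 0 for cluster in emotion_model]


def document_emotions(vector):
    """Return R_d as the set of 1-based indices i of the emotion types present."""
    return {i + 1 for i, value in enumerate(vector) if value == 1}


# Hypothetical emotion model with four emotion types e_1..e_4.
model = [{"happy", "glad"}, {"angry"}, {"sad"}, {"afraid"}]
vector = emotion_vector(["i", "am", "glad", "but", "afraid"], model)
present = document_emotions(vector)  # indices of e_1 and e_4
```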
The evolution of document emotion can be analyzed from the perspective of documents and from the perspective of users.
In one embodiment of the present invention, whether the emotion toward an emergency event has changed is detected from the perspective of documents. Let D = {d_1, d_2, ..., d_i, ...} be a data-stream document collection, for example the set of documents related to a particular emergency event. Each d_i denotes one document and can be labeled with its publication time. For a given time period T, suppose T is divided into p sub-periods t_1, t_2, ..., t_i, ..., t_p. According to the publication times of the documents, D can then be divided into a series of disjoint subsets D(t_1), D(t_2), ..., D(t_i), ..., D(t_p), such that
$$D = \bigcup_{i=1}^{p} D(t_i),$$
where D(t_i) denotes the set of documents published within time period t_i. The time period T can be divided at various time granularities, for example in units of one day, one week, or one month. For each subset D(t) of D, the emotion vector E(t) at time t can be defined as the sum of the emotion vectors of the documents published in t, namely
$$E(t) = \sum_{d \in D(t)} E_d$$
The problem of deciding whether the emotion toward an event has evolved can thus be stated as follows: in data stream D, given times t_1 and t_2, study the relationship between their emotion vectors E(t_1) and E(t_2). If the two vectors, or particular elements of the vectors, differ markedly, the emotion has evolved.
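The aggregation of document emotion vectors into E(t) and a comparison of two periods can be sketched as follows. The text does not specify how a "marked difference" is measured, so the criterion used here (a change in an emotion's share of the total larger than a threshold) is purely an illustrative assumption, as are the function names.

```python
from collections import defaultdict


def aggregate_by_period(labeled_vectors, m):
    """Sum emotion vectors per period: returns {period: E(t)}.

    labeled_vectors is an iterable of (period, emotion_vector) pairs,
    and m is the number of emotion types in the model.
    """
    totals = defaultdict(lambda: [0] * m)
    for period, vector in labeled_vectors:
        totals[period] = [a + b for a, b in zip(totals[period], vector)]
    return dict(totals)


def changed_emotions(e1, e2, threshold=0.2):
    """Indices whose share of the period total shifts by more than threshold.

    This share-based criterion is an assumption, not taken from the source.
    """
    s1, s2 = sum(e1) or 1, sum(e2) or 1
    return [i for i, (a, b) in enumerate(zip(e1, e2))
            if abs(a / s1 - b / s2) > threshold]
```

Comparing shares rather than raw counts keeps a period with many documents from looking different merely because of its volume.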
In addition, from the document perspective, trends in user emotion can be found quickly by constructing an emotion evolution graph for the emergency event. First, the emotion vector of each document in the data stream to be analyzed is determined. Then the document emotion vectors are aggregated at time granularity t to obtain the emotion vector E(t), the elements of E(t) are sorted in descending order, and the emotion types corresponding to the top K elements are selected as the mainstream emotions, from which the emotion evolution graph is constructed. The time granularity t can be an hour, a day, a week, and so on. For example, if aggregation is performed by day, the mainstream emotions of a given day are selected according to the number of blog posts published on that day that contain each emotion. The horizontal axis of the emotion evolution graph represents time in units of the time granularity t, and the vertical axis shows the K mainstream emotions selected for each time period.
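Selecting the K mainstream emotions for each period can be sketched as below, assuming the aggregated vectors E(t) are already available as a dictionary keyed by period. The function name, the emotion labels, and the sample counts are hypothetical; ties are broken by the order of the emotion types in the model.

```python
def mainstream_emotions(period_vectors, labels, k):
    """For each period, return the labels of the K largest elements of E(t)."""
    chart = {}
    for period in sorted(period_vectors):
        vector = period_vectors[period]
        # Rank emotion indices by count, largest first (stable for ties).
        ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
        chart[period] = [labels[i] for i in ranked[:k]]
    return chart


chart = mainstream_emotions(
    {"2015-07-01": [5, 1, 3], "2015-07-02": [0, 4, 2]},
    labels=["joy", "anger", "sadness"],
    k=2,
)
```

Each entry of the resulting dictionary corresponds to one column of the emotion evolution graph.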
Step S500, hot event extraction: hot events are further extracted from the events extracted in step S300.
A hot event is a collection of information that uses the Internet as its communication medium, attracts wide public attention, can spread widely within a very short time and persist for some period, and can reflect online public opinion; it also includes a semantic description of the Internet hot event and the way the event spreads. An Internet hot event is usually information followed by many netizens, and related information appears on the network with high frequency. Hot words can describe the general topic of a hot event fairly directly. A hot event must be described by multiple characteristic quantities, and these characteristic quantities have a certain similarity to one another.
The characteristic quantities of a hot event are:
Report frequency of the event. For an important event, the media publish more related reports than usual, so the report frequency affects the attention the hot event receives. The report frequency is the ratio of the number of reports about a given event within a period of time to the total number of reports; the larger the ratio, the more attention the event receives.
Duration of the event. For a hot event, the longer the media report on it and the longer netizens discuss it, the more attention the event receives. Each event has its own attention-time attribute: the start time of an event is defined as the time at which the event begins, and the extinction time is the moment at which the number of reports on the event falls below a certain threshold. The time span of an event is therefore defined as the difference between its start time and its extinction time.
Reading volume of the event. Because most reports on a hot event come from Web sites, the more netizens click through to read reports related to the event, the more attention the event receives. The number of click-reads of the reports related to a hot event can therefore be used to record the attention the event receives.
Comment count of the event. The more comments netizens post on the Internet about a hot event, the more attention it receives, so the comment count is also a factor affecting the attention paid to the event.
In this application, each text is labeled with its report time (for example, the publication time of a news report, blog post, microblog post, or forum post), its number of clicks, and its number of comments. The report frequency and duration of an event obtained by clustering can be determined from the report times of its texts; the reading volume of the event can be determined from the click counts of its texts; and the comment count of the event can be determined from the comment counts of its texts.
RF_i: the report frequency of event i;
RT_i: the ratio, within a predetermined period of N days, of the number of effective report days for event i to the total number of days; a day is regarded as an effective report day for event i when the number of reports about event i on that day exceeds a certain threshold;
CN_i: the number of click-reads of event i by netizens within the predetermined number of days;
DN_i: the number of comments on event i by netizens within the predetermined number of days;
The hotness of an event is calculated as:
$$R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,$$
where R_i denotes the hotness of event i and α_1, α_2, α_3, α_4 are weight coefficients. When R_i is greater than a given threshold R, event i is determined to be a hot event.
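The hotness computation can be sketched as a weighted sum of the four characteristic quantities. The function names, the default weights, and the sample numbers are illustrative assumptions; the text does not say how the raw counts CN_i and DN_i are scaled relative to the ratios RF_i and RT_i, so in practice the weights would have to absorb that difference.

```python
def effective_day_ratio(daily_report_counts, min_reports):
    """RT_i: share of days in the window whose report count exceeds min_reports."""
    if not daily_report_counts:
        return 0.0
    effective_days = sum(1 for count in daily_report_counts if count > min_reports)
    return effective_days / len(daily_report_counts)


def hotness(rf, rt, cn, dn, weights=(0.25, 0.25, 0.25, 0.25)):
    """R_i = a1*RF_i + a2*RT_i + a3*CN_i + a4*DN_i (weights are assumptions)."""
    a1, a2, a3, a4 = weights
    return a1 * rf + a2 * rt + a3 * cn + a4 * dn


def is_hot(score, threshold):
    """An event is a hot event when its hotness exceeds the threshold R."""
    return score > threshold


rt = effective_day_ratio([0, 5, 12, 7], min_reports=4)  # 3 effective days out of 4
score = hotness(rf=0.5, rt=rt, cn=0.0, dn=0.0, weights=(1.0, 2.0, 0.0, 0.0))
```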
The present invention extracts events independently from blog, microblog, forum, and news web page data. Suppose the hot event sets extracted from blog, microblog, forum, and news web page data are BLOG, M-BLOG, BBS, and NEWS, respectively. The intersection of BLOG, M-BLOG, BBS, and NEWS is defined as the first hot event set. The intersections of every three of the sets BLOG, M-BLOG, BBS, and NEWS are computed, and the union of all the results minus the first hot event set is defined as the second hot event set. The intersections of every two of the sets are computed, and the union of all the results minus the first and second hot event sets is defined as the third hot event set. The union of BLOG, M-BLOG, BBS, and NEWS minus the first, second, and third hot event sets is defined as the fourth hot event set.
Because the hot topics reflected by blogs, microblogs, forums, and news web pages may differ, content that receives attention in all four of them at the same time should be the hottest; content that receives attention in three of them is next; content that receives attention in two of them is next again; and content that receives attention in only one of them is relatively the least hot.
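The four-tier partition of hot events by the number of sources that report them can be sketched with set operations. The sketch assumes that the same event carries the same identifier in every source, which the text does not address; the function name and the sample identifiers are hypothetical.

```python
from itertools import combinations


def tier_hot_events(blog, mblog, bbs, news):
    """Split hot events into four tiers by how many sources contain them."""
    sources = [blog, mblog, bbs, news]

    def shared_by(k):
        # Union of the intersections of every k of the four sets.
        result = set()
        for combo in combinations(sources, k):
            result |= set.intersection(*combo)
        return result

    first = shared_by(4)
    second = shared_by(3) - first
    third = shared_by(2) - first - second
    fourth = shared_by(1) - first - second - third
    return first, second, third, fourth


tiers = tier_hot_events(
    blog={"a", "b", "c", "d"},
    mblog={"a", "b", "c"},
    bbs={"a", "b", "e"},
    news={"a", "f"},
)
```

The four returned sets are disjoint, and together they cover every event that is hot in at least one source.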
Step S600, hot event evolution analysis: evolution analysis is performed on the hot events extracted in step S500.
For the document collection D = {d_1, d_2, ..., d_i, ...} of an event, the documents are clustered by publication time, giving the number of documents for the event at each time point. The clustering result is presented to the user as a coordinate graph in which the horizontal axis represents time and the vertical axis represents the number of documents, from which the attention paid to the event at different time points can be seen.
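Counting the documents of an event per time point can be sketched as below, assuming each document carries a publication timestamp and that one day is the chosen granularity. The function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def attention_curve(publish_times, bucket_format="%Y-%m-%d"):
    """Return (time bucket, document count) pairs sorted by time bucket."""
    counts = Counter(t.strftime(bucket_format) for t in publish_times)
    return sorted(counts.items())


curve = attention_curve([
    datetime(2015, 7, 30, 9, 0),
    datetime(2015, 7, 30, 18, 30),
    datetime(2015, 7, 31, 8, 15),
])
```

The resulting pairs can be passed directly to any plotting tool, with the first component on the horizontal axis and the second on the vertical axis.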
The present invention adopts a distributed cloud computing approach and can mine and analyze the various kinds of network data collected at large scale. By analyzing the data of different data sources separately, it obtains the hot events of each data source and then further determines the hotness of events, so that current hot events can be obtained more objectively. The present invention provides automated, systematic, and scientific information support that helps Party and government organs, large enterprises, and other organizations discover sensitive online information in time, grasp hot topics in online public opinion, follow public opinion trends, and respond to public opinion crises. It effectively improves the judgment accuracy of the online public opinion monitoring system and provides a more truthful and accurate basis for the subsequent processing of online WeChat public opinion information.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein.
It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (8)

1. An event evolution analysis method based on big data, comprising:
Step S100, data acquisition: network data is collected based on a distributed cloud computing approach;
Step S200, data preprocessing;
Step S300, event extraction: events are extracted from the preprocessed network data.
2. The event evolution analysis method based on big data as claimed in claim 1, wherein step S200 comprises:
Preprocessing the network data collected in step S100: first, word segmentation and part-of-speech tagging are performed on the collected network data; then, stop words are filtered from the segmentation result according to a stop-word list; finally, the feature items used to represent each document are obtained.
3. The event evolution analysis method based on big data as claimed in claim 2, wherein step S200 further comprises:
High-quality word retrieval: each feature item obtained in step S200 implies a quality value that reflects the contribution of the feature item in a document, and the quality Q(t) of feature item t is expressed as:
where N denotes the number of all documents, f_i denotes the number of times feature item t occurs in document i, and l_t denotes the length of feature item t;
A threshold Q is set; feature items with Q(t) > Q are retained, and the others are deleted.
4. The event evolution analysis method based on big data as claimed in claim 1, wherein step S300 comprises:
Clustering the documents obtained by the preprocessing of step S200: the reports newly arrived each day are clustered locally to obtain the local events of that day, referred to as the candidate event set;
Merge clustering: the candidate event set produced by local clustering is merged with the previous old event set to produce the latest event set.
5. The event evolution analysis method based on big data as claimed in claim 4, wherein the local clustering comprises:
(1) first, all preprocessed reports are represented as text using a standard graph representation model;
(2) the reports are sorted in chronological order;
(3) the first report is taken as the first event;
(4) for each remaining report in turn, its similarity to the existing events is computed using a similarity function based on the maximum common subgraph, yielding the event most similar to the report and the corresponding function value;
(5) if the function value is greater than a threshold, the report is inserted into the event corresponding to that function value, and the center of that event is updated;
(6) if the function value is less than the threshold, the report is treated as a new event and is itself the center of that event;
(7) steps (4) to (6) are repeated until all reports have been processed;
(8) the result is retained for subsequent clustering.
6. The event evolution analysis method based on big data as claimed in claim 4, wherein the merge clustering comprises:
Input: the set of old events OldTopicSet and the set of new reports NewReportSet;
Output: the event set TopicSet after clustering;
(1) first, local clustering is performed on the reports in NewReportSet, and the clustering result is placed in NewTopicSet;
(2) the event set NewTopicSet is sorted by event start time;
(3) for each event in NewTopicSet in turn, its similarity to all events in OldTopicSet is computed using a similarity function based on the maximum common subgraph, yielding the most similar event and the corresponding function value;
(4) if the function value is less than a threshold, the event in NewTopicSet is treated as a new event;
(5) if the function value is greater than the threshold, the event is removed from NewTopicSet and added to OldTopicSet;
(6) steps (3) to (5) are repeated until all events in NewTopicSet have been processed;
(7) the clustering result is retained for use by the clustering of the next cycle.
7. The event evolution analysis method based on big data as claimed in claim 1, further comprising:
Step S400, hot event extraction: hot events are further extracted from the events extracted in step S300;
Step S500, hot event evolution analysis: evolution analysis is performed on the hot events extracted in step S400.
8. The event evolution analysis method based on big data as claimed in claim 7, wherein step S400 comprises:
Determining hot events: the hotness of each event obtained in step S300 is calculated by the following formula,
$$R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,$$
where R_i denotes the hotness of event i; RF_i denotes the report frequency of event i; RT_i denotes the ratio, within a predetermined period of N days, of the number of days on which event i is reported to the total number of days; CN_i denotes the number of click-reads of event i by netizens within the predetermined number of days; DN_i denotes the number of comments on event i by netizens within the predetermined number of days; α_1, α_2, α_3, α_4 are weight coefficients; and when R_i is greater than a given threshold R, event i is determined to be a hot event.
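For illustration only, the single-pass local clustering recited in claim 5 can be sketched as follows. The claim specifies a similarity function based on the maximum common subgraph of graph-represented reports; this sketch substitutes a much simpler Jaccard similarity over term sets as a stand-in, and it updates an event center by taking the union of terms, which is also an assumption. A function value exactly equal to the threshold, a case the claim leaves open, is treated here as starting a new event.

```python
def jaccard(a, b):
    """Stand-in similarity between two term sets (not the claimed subgraph measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def local_clustering(reports, threshold, similarity=jaccard):
    """Single-pass clustering of (timestamp, term_set) reports into events."""
    events = []
    for timestamp, terms in sorted(reports, key=lambda report: report[0]):
        best_event, best_score = None, -1.0
        for event in events:
            score = similarity(terms, event["center"])
            if score > best_score:
                best_event, best_score = event, score
        if best_event is not None and best_score > threshold:
            best_event["reports"].append((timestamp, terms))
            best_event["center"] = best_event["center"] | terms  # naive center update
        else:
            events.append({"center": set(terms), "reports": [(timestamp, terms)]})
    return events


events = local_clustering(
    [(2, {"flood", "rescue"}), (1, {"flood", "city"}), (3, {"election", "vote"})],
    threshold=0.3,
)
```

Because the similarity function is passed in as a parameter, a maximum-common-subgraph measure could replace the stand-in without changing the clustering loop.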
CN201510460661.7A 2015-07-30 2015-07-30 Big data based event evolution analysis method Active CN105138577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510460661.7A CN105138577B (en) 2015-07-30 2015-07-30 Big data based event evolution analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510460661.7A CN105138577B (en) 2015-07-30 2015-07-30 Big data based event evolution analysis method

Publications (2)

Publication Number Publication Date
CN105138577A true CN105138577A (en) 2015-12-09
CN105138577B CN105138577B (en) 2017-02-22

Family

ID=54723926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510460661.7A Active CN105138577B (en) 2015-07-30 2015-07-30 Big data based event evolution analysis method

Country Status (1)

Country Link
CN (1) CN105138577B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090019085A1 (en) * 2007-07-10 2009-01-15 Fatdoor, Inc. Hot news neighborhood banter in a geo-spatial social network
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN103020251A (en) * 2012-12-20 2013-04-03 人民搜索网络股份公司 Automatic mining system and method of news events in large-scale data
CN104035960A (en) * 2014-05-08 2014-09-10 东莞市巨细信息科技有限公司 Internet information hotspot predicting method
CN104462041A (en) * 2014-11-28 2015-03-25 上海埃帕信息科技有限公司 Method for completely detecting hot event from beginning to end

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980692A (en) * 2016-05-30 2017-07-25 国家计算机网络与信息安全管理中心 A kind of influence power computational methods based on microblogging particular event
CN106980692B (en) * 2016-05-30 2020-12-08 国家计算机网络与信息安全管理中心 Influence calculation method based on microblog specific events
CN106331085A (en) * 2016-08-22 2017-01-11 成都天地网络科技有限公司 Operation-based big-data processing system
CN106354769A (en) * 2016-08-22 2017-01-25 成都天地网络科技有限公司 Large data cleaning processing system
CN106446224A (en) * 2016-09-30 2017-02-22 广州特道信息科技有限公司 Information discovery method based on large data
CN106874419A (en) * 2017-01-22 2017-06-20 北京航空航天大学 A kind of real-time focus polymerization of many granularities
CN106874419B (en) * 2017-01-22 2019-09-10 北京航空航天大学 A kind of real-time hot spot polymerization of more granularities
KR20180118393A (en) * 2017-04-21 2018-10-31 에스케이텔레콤 주식회사 Distributed cloud computing system, apparatus and control method thereof using the system
KR102198995B1 (en) * 2017-04-21 2021-01-06 에스케이텔레콤 주식회사 Distributed cloud computing system, apparatus and control method thereof using the system
CN109325524A (en) * 2018-08-31 2019-02-12 中国科学院自动化研究所 Track of issues and changes phase division methods, system and relevant device
CN110704717A (en) * 2019-09-04 2020-01-17 中国科学院计算技术研究所 Network emergency detection method and system based on dynamic model
CN112612895A (en) * 2020-12-29 2021-04-06 中科院计算技术研究所大数据研究院 Method for calculating attitude index of main topic

Also Published As

Publication number Publication date
CN105138577B (en) 2017-02-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant