CN103020482A - Relation-based spam comment detection method

Info

Publication number: CN103020482A
Application number: CN2013100025837A
Authority: CN
Inventors: 张卫丰; 王云; 周国强; 张迎周; 王子元; 周国富; 钱小燕; 许碧欢; 陆柳敏
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2013-01-05
Filing date: 2013-01-05
Publication date: 2013-04-03

Abstract

The invention relates to a relation-based spam comment detection method built on the relationships among reviewers, comments, and shop owners in online shopping. The method introduces the concepts of reviewer credibility, comment honesty, and shop-owner reliability, and derives their mutual relationships: the more honest the comments a reviewer writes, the higher the reviewer's credibility; the more honest comments from credible reviewers a shop owner has, the higher the shop owner's reliability; and the more other honest comments support a comment, the higher its honesty. This iterative relationship is proposed for the first time and is applied to actual detection work. A model is built from this relational feature and combined with models derived from other features of the three entities, giving an improved model for spam comment detection.

Description

A relation-based spam comment detection method

Technical field

The present invention relates to a relation-based method for detecting spam comments. It analyzes the mutual relationships among reviewers, comments, and shop owners, proposes a model based on these relationships, and combines that model with models derived from other features of the three entities in order to detect spam comments. It mainly addresses the narrowness and limitations of the models that current techniques propose for spam comment detection, and belongs to the field of machine learning and data mining.

Background art

Online shopping comments give customers valuable information for comparing product quality, shop-owner service, and many other aspects. However, spam reviewers have now appeared, whose aim is to mislead ordinary customers' impressions of a product or shop owner by posting false or unfair comments. One example is the professional negative reviewer: as the name suggests, a person who makes a living by posting negative comments about others, an emerging occupation spawned by Taobao.

More broadly, most research on spam activity focuses on web pages and e-mail. Web spam falls into two broad classes: content spam and link spam. Link spam is spam carried by hyperlinks; since comments generally contain no links, link spam does not occur in spam comments. Content spam means adding irrelevant text to a web page in order to deceive search engines; reviewers do not add irrelevant text to their comments. E-mail spam usually means sending unsolicited commercial advertisements. Advertisements can appear in comments, but they are rare.

Early spam comment detection algorithms all use reviewer behavior to identify spam reviewers, for example the similarity of comment texts, the similarity and deviation of ratings, and the number of products that received spam comments. According to existing research, these behaviors are effective against particular types of spam activity. For example, a reviewer who uses large amounts of similar text across comments on the same product, or who frequently gives unusually high or low ratings to different products within a short time, is very likely a spam reviewer.

Nitin Jindal and Bing Liu first raised the spam comment detection problem in 2008. They divided spam comments into three types: untruthful comments, comments on the brand only, and non-comments that say nothing about the product. They detected spam comments with supervised learning: first, a feature set covering comments, reviewers, and products is extracted; then spam comments are labeled, mainly using text similarity and some manual effort. A classifier built from these features and the training data is then used to detect spam comments. The method depends heavily on text similarity and is effective only against that kind of spam behavior.

In 2010 Jindal proposed detecting spam comments with an unexpected-rule mining algorithm. Each comment is treated as a record associated with an evaluation class: positive, negative, or neutral. The unexpected-rule mining algorithm generates a list of unexpected rules. However, this method cannot single out actual spam reviewers; it can only find unusual behaviors that show up as unexpected rules.

In 2010 Lim proposed another spam comment detection method based on reviewer behavior. The authors identified many characteristic spam behaviors, such as multiple ratings or comments, and the effort expended, on a single product or a group of products. Each reviewer receives a score on each of these features, the scores are combined linearly, and the resulting total is the reviewer's degree of suspicion. The method is unsupervised, which saves much of the cost of manual labeling. According to their study, however, it still essentially depends on text similarity, so it too can only detect certain specific types of spam comments.

The shortcoming shared by all of the above methods is that they study and use only the text or rating features of spam comments, which is limiting. A new method for detecting spam comments is therefore urgently needed. In online shopping, the reviewer, the comment, and the shop owner cannot be treated in isolation: many intrinsic relationships connect the three. Identifying these relationships, applying them to spam comment detection, and then identifying how this relational feature depends on other behavioral features will greatly improve detection accuracy.

Summary of the invention

Technical problem: the purpose of the invention is to provide a novel relation-based method for detecting spam comments. The relational features among reviewers, comments, and shop owners are used for modeling, and the resulting model is combined with models derived from the intrinsic features of the three, giving three interrelated models that represent the reviewer, the comment, and the shop owner respectively. Finally, these models yield the reviewer credibility, the comment honesty, and the shop-owner reliability, from which spam comments are detected according to a given criterion.

Technical solution: the relation-based spam comment detection method proposed by the invention is a detection method based on the relational features of reviewers, comments, and shop owners in online shopping. It introduces the concepts of reviewer credibility, comment honesty, and shop-owner reliability, and derives their mutual relationships: the more honest the comments a reviewer writes, the higher the reviewer's credibility; the more honest comments from credible reviewers a shop owner has, the higher the shop owner's reliability; and the more other honest comments support a comment, the higher its honesty. Among current spam comment detection methods, this iterative relationship is proposed here for the first time and is applied to actual detection. The relational feature is used for modeling, and the resulting model is combined with models derived from other features of the three, giving an improved model for spam comment detection.

The relation-based spam comment detection method mainly comprises the following steps:

Step 1) Compute the honesty score of each comment:

Step 1.1) Input the comment set;

Step 1.2) Obtain the rating and the posting time of every comment;

Step 1.3) Compute the mean rating and the earliest posting time;

Step 1.4) Fetch one comment record;

Step 1.5) Check whether the comment record is empty; if it is not empty, go to step 1.6), otherwise go to step 1.10);

Step 1.6) Compute the honesty score of the comment:

Step 1.6.1) Obtain the rating given by this comment;

Step 1.6.2) Compute the rating deviation from the mean obtained in step 1.3);

Step 1.6.3) Obtain the posting time of this comment;

Step 1.6.4) Compute the time difference from the earliest posting time obtained in step 1.3);

Step 1.6.5) Obtain the text of this comment;

Step 1.6.6) Compute the text similarity of the comment text using cosine similarity;

Step 1.6.7) From the rating deviation IRD of step 1.6.2), the time difference IETF of step 1.6.4), and the similarity ICS of step 1.6.6), compute the honesty score A of the comment:

A = \beta_1 \, \mathrm{IRD} + \beta_2 \, \mathrm{ICS} + \beta_3 \, \mathrm{IETF} \qquad (1)

where β₁, β₂, β₃ are constants satisfying β₁ + β₂ + β₃ = 1;

Step 1.7) Update the honesty attribute of the comment;

Step 1.8) Fetch the next comment record;

Step 1.9) Check whether this comment record is empty; if it is empty, go to step 1.10), otherwise go to step 1.2);

Step 1.10) Output the honesty scores of the comments;

Step 2) Compute the shop owner's reliability:

Step 2.1) Set the variable h = 1;

Step 2.2) Obtain the information of the h-th shop owner;

Step 2.3) Check whether the shop owner is empty; if it is not empty, go to step 2.4), otherwise go to step 2.8);

Step 2.4) Compute the shop owner's reliability score:

Step 2.4.1) Obtain this shop owner's quantified information on product conformity with the description, seller service, products and services, product price, and delivery;

Step 2.4.2) Compute the "S"-shaped score:

S(x) = \begin{cases} \alpha \sqrt[3]{x - \beta} + \gamma, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (2)

where α, β, γ are constants and x is the quantified information of the shop owner;

Step 2.4.3) Generate the weight vector for the scores;

Step 2.4.4) Multiply the "S"-shaped scores of step 2.4.2) by the weight vector to obtain the reliability score;

Step 2.5) Update the shop owner's reliability attribute;

Step 2.6) Set h = h + 1 and go to step 2.2);

Step 2.8) Output the shop owners' reliability scores;

Step 3) Compute the reviewer's credibility:

Step 3.1) Obtain the information of all reviewers;

Step 3.2) Fetch one reviewer record;

Step 3.3) Check whether the reviewer record is empty; if it is not empty, go to step 3.4), otherwise go to step 3.8);

Step 3.4) Compute the reviewer's credibility score:

Step 3.4.1) Obtain this reviewer's transaction amount and credit information;

Step 3.4.2) Obtain the corresponding score values;

Step 3.4.3) Generate the weight vector for the score values;

Step 3.4.4) Multiply the score values of step 3.4.2) by the weight vector to obtain the reviewer's credibility score;

Step 3.5) Update the reviewer's credibility attribute;

Step 3.6) Fetch the next reviewer record and go to step 3.3);

Step 3.8) Output the reviewers' credibility scores;

Step 4) Initialize the iteration count to 0;

Step 5) Update the honesty score of each comment:

Step 5.1) Obtain the relational model:

H(r) = R(s) \left( \frac{2}{1 + e^{T(r)}} - 1 \right) \qquad (3)

where R(s) is the reliability score of shop owner s and T(r) is the credibility score of reviewer r;

Step 5.2) Compute the honesty score of the comment:

Step 5.2.1) Obtain the credibility score of the reviewer who posted this comment;

Step 5.2.2) Obtain the reliability score of the shop owner that the comment is about;

Step 5.3.3) Compute the honesty score with the model of step 5.1);

Step 5.4) Update the honesty attribute of the comment;

Step 5.5) Output the updated honesty scores of the comments;

Step 6) Update the reviewer's credibility score:

Step 6.1) Obtain the relational model:

T(r) = \frac{2}{1 + e^{H(r)}} - 1 \qquad (4)

where H(r) is the honesty score of comment r;

Step 6.2) Compute the reviewer's credibility score:

Step 6.2.1) Obtain the honesty of all comments posted by this reviewer;

Step 6.2.2) Compute the reviewer's credibility score with the model of step 6.1);

Step 6.3) Update the reviewer's credibility attribute;

Step 6.4) Output the updated credibility scores of the reviewers;

Step 7) Update the shop owner's reliability score:

Step 7.1) Obtain the relational model:

R(s) = \frac{2}{1 + e^{-\theta}} - 1 \qquad (5)

\theta = \sum_{v \in U_s,\; T(k_v) > 0} T(k_v) \, (\Psi_v - \mu) \qquad (6)

where T(k_v) is the credibility of the reviewer k_v who posted comment v, Ψ_v is the rating given by comment v, and μ is the mean rating over the comments in the system;

Step 7.2) Compute the shop owner's reliability score:

Step 7.2.1) Obtain the credibility scores of this shop owner's reviewers;

Step 7.2.2) Obtain the ratings of all comments by these reviewers;

Step 7.2.3) Compute the shop owner's reliability score with the model of step 7.1);

Step 7.3) Update the shop owner's reliability attribute;

Step 8) Increment the iteration count by 1;

Step 9) Check whether the iteration count is less than 5; if so, go to step 5), otherwise go to step 10);

Step 10) Output the shop-owner reliability scores, the comment honesty scores, and the reviewer credibility scores;

Step 11) Output the detection result: normal comments and spam comments; normal reviewers and spam reviewers.
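
The iteration of steps 4) to 9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `squash`, `iterate_scores`, `comments`, `credibility`, and `reliability` are hypothetical; the initial T(r) and R(s) values are assumed to come from steps 3) and 2); and summing H over a reviewer's comments before applying equation (4) is an assumption, since the text does not state how several comments are aggregated. The exponent signs follow equations (3) to (5) exactly as printed.

```python
import math

def squash(x):
    """Return 2 / (1 + e^x) - 1, which always lies in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(x)) - 1.0

def iterate_scores(comments, credibility, reliability, mean_rating, rounds=5):
    """comments: list of dicts with keys "reviewer", "shop", "rating".
    credibility: reviewer id -> initial T(r); reliability: shop id -> initial R(s)."""
    honesty = [0.0] * len(comments)
    for _ in range(rounds):  # steps 4), 8), 9): repeat while the count is below 5
        # Step 5, equation (3): H(r) = R(s) * (2 / (1 + e^T(r)) - 1)
        for i, c in enumerate(comments):
            honesty[i] = reliability[c["shop"]] * squash(credibility[c["reviewer"]])
        # Step 6, equation (4); ASSUMPTION: H is summed over the reviewer's comments
        totals = {}
        for i, c in enumerate(comments):
            totals[c["reviewer"]] = totals.get(c["reviewer"], 0.0) + honesty[i]
        credibility = {r: squash(totals.get(r, 0.0)) for r in credibility}
        # Step 7, equations (5) and (6): only reviewers with T > 0 contribute
        theta = {}
        for c in comments:
            t = credibility[c["reviewer"]]
            if t > 0:
                theta[c["shop"]] = theta.get(c["shop"], 0.0) + t * (c["rating"] - mean_rating)
        reliability = {s: squash(-theta.get(s, 0.0)) for s in reliability}
    return honesty, credibility, reliability
```

A call such as `iterate_scores(comments, {"a": 0.5}, {"s": 0.8}, 3.0)` returns the three score collections after five rounds; the thresholds that turn these scores into the labels of step 11) are not given in the text and are left out of the sketch.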

Beneficial effects: compared with the prior art, the invention has the following innovation:

A model based on the intrinsic dependencies among reviewers, comments, and shop owners is proposed and combined with models derived from other features of the three.

In short, the method yields results with good reference and decision value and improves the precision and recall of spam comment detection.

Description of drawings

Fig. 1 is the flow chart of spam comment detection;

Fig. 2 is the flow chart for computing shop-owner reliability;

Fig. 3 is the flow chart for computing comment honesty;

Fig. 4 is the flow chart for computing reviewer credibility.

Embodiment

The relation-based spam comment detection method is implemented with Eclipse as the development tool; MATLAB together with the yaahp analytic hierarchy process software is used for data analysis. The detailed steps are as follows (see Fig. 1).

The relation-based spam comment detection method mainly comprises the following steps:

Step 1) Build the comment honesty model from three aspects: the rating given by the comment, the text similarity between the comment and other comments, and the time at which the comment was posted, as shown in Fig. 3.

Step 1.1) From the information of all comments, compute the mean rating and the earliest posting time;

Step 1.2) From the rating of the comment, compute the deviation of the rating from the mean rating:

D(p) = \frac{| r_p - \bar{r}_p |}{4} \qquad (1)

where r_p is the rating this comment gives to product p, \bar{r}_p is the average rating over the comments on product p, and the maximum rating gap is 4. D(p) measures how far the comment's rating deviates from the product's average rating.
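
Equation (1) can be illustrated with a short Python sketch. It assumes ratings on a 1 to 5 scale, which is what makes the largest possible gap 4; the function name `rating_deviation` and its parameters are hypothetical.

```python
def rating_deviation(rating, product_ratings, max_gap=4.0):
    """D(p): distance of one rating from the product's mean rating, scaled to [0, 1]."""
    mean = sum(product_ratings) / len(product_ratings)
    return abs(rating - mean) / max_gap
```

For a product rated 5, 3, and 1, the mean is 3, so the 5-star comment has D(p) = 2/4 = 0.5 and the 3-star comment has D(p) = 0.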

Step 1.3) From the posting time of the comment, compute the difference between the posting time and the earliest posting time:

To maximize their impact, spam reviewers often post misleading information early. The closer a comment's posting time is to that of the earliest comment on the product, the more likely the comment is spam.

GTF(p) = \begin{cases} 0, & \text{if } T(p) - A(p) > \beta \\ 1 - \frac{T(p) - A(p)}{\beta}, & \text{otherwise} \end{cases} \qquad (2)

where T(p) is the time at which the comment was posted, A(p) is the time of the earliest comment on product p, and β is a time threshold; if the time difference exceeds this threshold, the likelihood of the comment being spam is taken to be 0. GTF(p) measures the posting-time difference.
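
A minimal Python sketch of equation (2) follows. The function name `time_factor` and its parameters are hypothetical, and times are assumed to be plain numbers in a common unit (for example days), since the text does not fix a unit for the threshold β.

```python
def time_factor(comment_time, earliest_time, beta):
    """GTF(p): 1 for the earliest comment, falling linearly to 0 at the threshold beta."""
    gap = comment_time - earliest_time
    if gap > beta:
        return 0.0
    return 1.0 - gap / beta
```

With β = 10, a comment posted together with the earliest one scores 1.0, one posted 5 units later scores 0.5, and one posted 20 units later scores 0.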

Step 1.4) From the text of the comment, compute its text similarity:

A spam reviewer may comment on the same product repeatedly. Because writing different content each time is tiring, the comment text is often copied, or nearly copied, from other comments. The higher the text similarity, the more likely the comment is spam.

ICS = \mathrm{avg}(\mathrm{cosine}(c(p))) \qquad (3)

where c(p) is the comment text for product p, and cosine(c(p)) is its text similarity to other comments, computed with the vector-space cosine similarity algorithm. ICS is the mean of these text similarities.
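
The averaging in equation (3) might look like the following Python sketch. It is an illustration only: the names `cosine` and `average_similarity` are hypothetical, texts are represented as bag-of-words term counts, and splitting on whitespace is an assumption (Chinese comment text would first need word segmentation, which the text does not describe).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two texts represented as term-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def average_similarity(text, other_texts):
    """ICS: mean cosine similarity between one comment and the other comments."""
    if not other_texts:
        return 0.0
    return sum(cosine(text, o) for o in other_texts) / len(other_texts)
```

Identical texts give a similarity of about 1, and texts with no shared words give 0, so a copied comment pushes ICS towards 1.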

Step 1.5) Compute the honesty score of the comment as a linear combination of the rating deviation, the time difference, and the text similarity:

A(r) = \beta_1 \, \mathrm{IRD}(p) + \beta_2 \, \mathrm{ICS}(p) + \beta_3 \, \mathrm{IETF}(p) \qquad (4)

where

\beta_1 = \frac{1}{5}, \quad \beta_2 = \frac{2}{5}, \quad \beta_3 = \frac{2}{5};
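
As a small worked illustration of equation (4) with the weights 1/5, 2/5, 2/5, the following Python sketch combines three component values that are assumed to have been computed already; the function name `honesty_score` and the parameter names are hypothetical.

```python
def honesty_score(ird, ics, ietf, betas=(0.2, 0.4, 0.4)):
    """A = beta1 * IRD + beta2 * ICS + beta3 * IETF, with the weights summing to 1."""
    b1, b2, b3 = betas
    if abs(b1 + b2 + b3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return b1 * ird + b2 * ics + b3 * ietf
```

For example, IRD = 0.5, ICS = 1.0, and IETF = 0.0 give A = 0.1 + 0.4 + 0.0 = 0.5.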

Step 2) Compute the shop owner's reliability: after a transaction closes, the buyer scores their satisfaction on five aspects (product conformity with the description, seller service, products and services, product price, and delivery); the model is built from these scores combined with their respective weights, as shown in Fig. 2.

Step 2.1) Construct the score function from the shop owner's information on product conformity with the description, seller service, products and services, product price, and delivery:

When user satisfaction moves from very good to good, the score should change slowly; when it moves from good to very poor, the score should change sharply, because user satisfaction has changed qualitatively. The poorer the satisfaction, the lower the score. The score function is therefore:

S(x) = \begin{cases} \alpha \sqrt[3]{x - \beta} + \gamma, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (5)

where α, β, γ are constants and x is the quantified value of the shop owner's information.

Step 2.2) Compute the score value of each item of information with the score function;

Step 2.3) Combine the score values linearly to obtain the shop owner's reliability score;
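
Steps 2.1) to 2.3) can be sketched in Python as below. This is a hedged illustration: the names `s_score` and `weighted_reliability` are hypothetical, and the text gives no values for the constants of equation (5) or for the weights, so the caller must supply them. The cube root is taken with its sign preserved, because a plain fractional power of a negative number in Python yields a complex result.

```python
import math

def s_score(x, alpha, beta, gamma):
    """S(x) of equation (5): alpha * cbrt(x - beta) + gamma for x >= 0, else 0."""
    if x < 0:
        return 0.0
    d = x - beta
    # Signed real cube root, valid for negative d as well.
    return alpha * math.copysign(abs(d) ** (1.0 / 3.0), d) + gamma

def weighted_reliability(values, weights, alpha, beta, gamma):
    """Linear combination of the S-scores of the quantified items (step 2.3)."""
    return sum(w * s_score(x, alpha, beta, gamma) for x, w in zip(values, weights))
```

The cube root is steep near x = β and flat far from it, which is one way to obtain the slow change at high satisfaction and the sharp change at low satisfaction described in step 2.1).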

Step 3) Compute the reviewer's credibility: build the score function model from two aspects, the transaction amount and the buyer's credit rating, as shown in Fig. 4.

Step 3.1) Obtain the information of all reviewers;

Step 3.2) From the reviewer's transaction amount and credit information, compute the corresponding score values and the weight vector for the score values;

Step 3.3) Compute the reviewer's credibility score from the score values and the weight vector;

Step 4) Update the comment honesty: even if a comment disagrees with the other comments around it, if it was posted by a credible reviewer while the surrounding comments were posted by non-credible reviewers, it is still an honest comment:

Step 4.1) Compute the honesty score with the comment honesty relational model:

H(r) = R(s) \left( \frac{2}{1 + e^{T(r)}} - 1 \right) \qquad (9)

where R(s) is the shop owner's reliability score and T(r) is the reviewer's credibility score.

Step 4.3) Update the honesty attribute of the comment;

Step 5) Update the reviewer's credibility: a reviewer's credibility depends on how many positive and negative comments the reviewer has posted. The higher the sum of the honesty scores of the comments a reviewer has posted, the higher that reviewer's credibility;

Step 5.1) Compute the credibility score with the reviewer credibility relational model:

T(r) = \frac{2}{1 + e^{H(r)}} - 1 \qquad (10)

where H(r) is the honesty score of the comment.

Step 5.3) Update the reviewer's credibility attribute;

Step 6) Update the shop owner's reliability: a shop owner's reliability depends mainly on the comments made by all credible reviewers. The more positive comments from credible reviewers a shop owner has, the higher the shop owner's reliability;

Step 6.1) Compute the reliability score with the shop-owner reliability relational model:

R(s) = \frac{2}{1 + e^{-\theta}} - 1 \qquad (11)

\theta = \sum_{v \in U_s,\; T(k_v) > 0} T(k_v) \, (\Psi_v - \mu) \qquad (12)

where T(k_v) is the reviewer's credibility, Ψ_v is the rating given in that reviewer's comment, and μ is the mean rating over the comments in the system.

Step 6.3) Update the shop owner's reliability attribute;
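
Equations (11) and (12) can be worked through with a short Python sketch. The function name `shop_reliability` is hypothetical, and each comment on the shop is assumed to be given as a pair of T(k_v) and Ψ_v. Algebraically, 2 / (1 + e^{-θ}) - 1 equals tanh(θ/2), so R(s) always lies strictly between -1 and 1.

```python
import math

def shop_reliability(comments, mean_rating):
    """comments: iterable of (reviewer_credibility, rating) pairs for one shop.
    Only comments from reviewers with positive credibility contribute to theta."""
    theta = sum(t * (rating - mean_rating) for t, rating in comments if t > 0)
    return 2.0 / (1.0 + math.exp(-theta)) - 1.0
```

With μ = 3, a 5-star comment from a reviewer with T = 0.5 contributes 0.5 × 2 = 1 to θ, while a 1-star comment from a reviewer with T = -0.9 is ignored, giving R(s) = tanh(0.5), roughly 0.46.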

Step 7) Output the shop-owner reliability scores, the comment honesty scores, and the reviewer credibility scores;

Step 8) Output the detection result: normal comments and spam comments; normal reviewers and spam reviewers.

Claims

1. A relation-based spam comment detection method, characterized in that the method mainly comprises the following steps:

Step 1) Compute the honesty score of each comment:

Step 1.1) Input the comment set;

Step 1.2) Obtain the rating and the posting time of every comment;

Step 1.4) Fetch one comment record;

Step 1.6) Compute the honesty score of the comment:

Step 1.6.1) Obtain the rating given by this comment;

Step 1.6.2) Compute the rating deviation from the mean obtained in step 1.3);

Step 1.6.3) Obtain the posting time of this comment;

Step 1.6.5) Obtain the text of this comment;

A = \beta_1 \, \mathrm{IRD} + \beta_2 \, \mathrm{ICS} + \beta_3 \, \mathrm{IETF} \qquad (1)

Step 1.7) Update the honesty attribute of the comment;

Step 1.8) Fetch the next comment record;

Step 1.10) Output the honesty scores of the comments;

Step 2) Compute the shop owner's reliability:

Step 2.1) Set the variable h = 1;

Step 2.2) Obtain the information of the h-th shop owner;

Step 2.4) Compute the shop owner's reliability score:

Step 2.4.2) Compute the "S"-shaped score:

S(x) = \begin{cases} \alpha \sqrt[3]{x - \beta} + \gamma, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (2)

Step 2.4.3) Generate the weight vector for the scores;

Step 2.5) Update the shop owner's reliability attribute;

Step 2.6) Set h = h + 1 and go to step 2.2);

Step 2.8) Output the shop owners' reliability scores;

Step 3) Compute the reviewer's credibility:

Step 3.1) Obtain the information of all reviewers;

Step 3.2) Fetch one reviewer record;

Step 3.4) Compute the reviewer's credibility score:

Step 3.4.1) Obtain this reviewer's transaction amount and credit information;

Step 3.4.2) Obtain the corresponding score values;

Step 3.4.3) Generate the weight vector for the score values;

Step 3.5) Update the reviewer's credibility attribute;

Step 3.6) Fetch the next reviewer record and go to step 3.3);

Step 3.8) Output the reviewers' credibility scores;

Step 4) Initialize the iteration count to 0;

Step 5) Update the honesty score of each comment:

Step 5.1) Obtain the relational model:

H(r) = R(s) \left( \frac{2}{1 + e^{T(r)}} - 1 \right) \qquad (3)

Step 5.2) Compute the honesty score of the comment:

Step 5.3.3) Compute the honesty score with the model of step 5.1);

Step 5.4) Update the honesty attribute of the comment;

Step 5.5) Output the updated honesty scores of the comments;

Step 6) Update the reviewer's credibility score:

Step 6.1) Obtain the relational model:

T(r) = \frac{2}{1 + e^{H(r)}} - 1 \qquad (4)

where H(r) is the honesty score of comment r;

Step 6.2) Compute the reviewer's credibility score:

Step 6.3) Update the reviewer's credibility attribute;

Step 6.4) Output the updated credibility scores of the reviewers;

Step 7) Update the shop owner's reliability score:

Step 7.1) Obtain the relational model:

R(s) = \frac{2}{1 + e^{-\theta}} - 1 \qquad (5)

\theta = \sum_{v \in U_s,\; T(k_v) > 0} T(k_v) \, (\Psi_v - \mu) \qquad (6)

Step 7.2) Compute the shop owner's reliability score:

Step 7.2.1) Obtain the credibility scores of this shop owner's reviewers;

Step 7.2.2) Obtain the ratings of all comments by these reviewers;

Step 7.3) Update the shop owner's reliability attribute;

Step 8) Increment the iteration count by 1;