Summary of the Invention
The present invention addresses the deficiencies of the prior art by providing an all-weighted pattern mining method for discovering association rules between text terms. It extends existing results on association rule mining based on item weights and solves the technical problem of mining all-weighted positive and negative association rules. The method has significant theoretical value and broad application prospects in fields such as text mining and document information retrieval.
To achieve the above object, the present invention adopts the following technical scheme: an all-weighted pattern mining method for discovering association rules between text terms, comprising the following steps:
(1) All-weighted data preprocessing stage:
The real world contains massive amounts of all-weighted data, such as text data. The preprocessing method depends on the specific data object. For Chinese text, preprocessing consists of word segmentation, stop-word removal, feature word extraction and weight computation. For English text, it consists of stemming, stop-word removal, lexical analysis, feature word extraction and weight computation. The result of preprocessing is an all-weighted database and an item library;
The feature word weight for text data is computed as:
$$w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max_j(tf_{ij})}\right) \times idf_i$$
where $w_{ij}$ is the weight of the i-th feature word in the j-th document, $tf_{ij}$ is the term frequency of the i-th feature word in the j-th document, and $idf_i$ is the inverse document frequency of the i-th feature word, $idf_i = \log(N/df_i)$, with N the total number of documents in the document collection and $df_i$ the number of documents containing the i-th feature word.
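The weighting formula can be illustrated with a minimal Python sketch. It reads max_j(tf_ij) literally, as the largest frequency of feature word i over all documents j, and uses the natural logarithm. Both choices are assumptions, since the source fixes neither the normalization scope nor the log base. The function name `feature_word_weights` is hypothetical.

```python
import math
from collections import Counter

def feature_word_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    tf = [Counter(tokens) for tokens in docs]                # tf[j][term] = tf_ij
    df = Counter(term for counts in tf for term in counts)   # df_i: documents containing term i
    # Assumption: max_j(tf_ij) is the maximum frequency of term i over all documents j
    max_tf = {term: max(counts[term] for counts in tf) for term in df}
    return [
        {term: (0.5 + 0.5 * c / max_tf[term]) * math.log(n_docs / df[term])
         for term, c in counts.items()}
        for counts in tf
    ]

# Hypothetical three-document corpus, already segmented and stop-word filtered
weights = feature_word_weights([["a", "a", "b"], ["a", "c"], ["b"]])
```

Each resulting dictionary is one transaction record of the all-weighted database, with the feature words as items.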
(2) All-weighted frequent itemset and negative itemset mining stage, comprising steps 2.1 and 2.2:
2.1 Extract the all-weighted candidate 1-itemsets awC_1 from the item library and mine the all-weighted frequent 1-itemsets awL_1, according to steps 2.1.1 to 2.1.3:
2.1.1 Extract the all-weighted candidate 1-itemsets awC_1 from the item library;
2.1.2 Accumulate the weight sum of each all-weighted candidate 1-itemset awC_1 in the all-weighted database (All-Weighted Database, abbreviated AWD) and compute its support;
The support of awC_1 is computed as:
$$awsup(awC_1) = \frac{\sum_{T_i \in AWD} \sum_{i_j \in awC_1} w[T_i][i_j]}{n \times k}$$
where the double sum is the total weight of the items i_j in the transaction records T_i, n is the total number of transaction records in the all-weighted database AWD, and k is the length of the itemset awC_1 (that is, the number of items in awC_1).
2.1.3 Add each all-weighted frequent 1-itemset awL_1, that is, each candidate in awC_1 whose support is greater than or equal to the minimum support threshold minsup, to the frequent itemset collection awPIS;
2.2 Starting from the all-weighted candidate 2-itemsets, proceed according to steps 2.2.1 to 2.2.4:
2.2.1 Apply the Apriori join to the all-weighted frequent (i-1)-itemsets awL_{i-1} to generate the all-weighted candidate i-itemsets awC_i, where i ≥ 2;
2.2.2 Accumulate the weight sum of each all-weighted candidate i-itemset awC_i in the all-weighted database AWD and compute its support awsup(awC_i) as follows:
$$awsup(awC_i) = \frac{\sum_{T_i \in AWD} \sum_{i_j \in awC_i} w[T_i][i_j]}{n \times k}$$
where the double sum is the total weight of the items i_j in the transaction records T_i, n is the total number of transaction records in the all-weighted database AWD, and k is the length of the itemset awC_i.
2.2.3 Take the frequent i-itemsets awL_i, that is, the candidates in awC_i whose support is not less than the support threshold minsup, and store them in the all-weighted frequent itemset collection awPIS; at the same time, store the all-weighted negative i-itemsets awN_i, whose support is less than the support threshold, in the all-weighted negative itemset collection awNIS.
2.2.4 Increase i by 1. If the frequent (i-1)-itemset collection awL_{i-1} is empty (that is, its length is 0), go to step (3); otherwise, repeat steps 2.2.1 to 2.2.3;
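Steps 2.1 and 2.2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: a transaction record is modelled as a dictionary from item to weight, a record is assumed to contribute to the support of an itemset only when it contains every item of that itemset, and the Apriori join is the plain pairwise union without subset-based candidate pruning. The names `awsup` and `mine_itemsets` and the toy database are hypothetical.

```python
from itertools import combinations

def awsup(itemset, awd):
    """All-weighted support: total weight of the itemset's items / (n * k)."""
    items = list(itemset)
    total = 0.0
    for record in awd:  # record maps item -> weight; absent items weigh 0
        # Assumption: a record contributes only if it contains every item.
        if all(record.get(item, 0) > 0 for item in items):
            total += sum(record[item] for item in items)
    return total / (len(awd) * len(items))

def mine_itemsets(awd, minsup):
    """Return (awPIS, awNIS): frequent itemsets and negative itemsets."""
    items = sorted({item for record in awd for item in record})
    level = [frozenset([i]) for i in items if awsup([i], awd) >= minsup]
    frequent, negative = list(level), []
    while level:  # stop once the previous level is empty (step 2.2.4)
        size = len(level[0]) + 1
        # Apriori join: merge pairs of frequent (i-1)-itemsets into i-itemsets
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = []
        for cand in sorted(candidates, key=sorted):
            (level if awsup(cand, awd) >= minsup else negative).append(cand)
        frequent.extend(level)
    return frequent, negative

# Hypothetical toy database (not Table 4 of the source)
awd = [{"a": 0.9, "b": 0.8}, {"a": 0.7, "b": 0.6}, {"a": 0.5, "c": 0.4}, {"c": 0.9}]
aw_pis, aw_nis = mine_itemsets(awd, minsup=0.3)
```

Candidates that fail the threshold are kept in the negative collection rather than discarded, because step (5) later mines negative association rules from them.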
(3) Pruning stage: obtain the interesting all-weighted frequent itemsets and negative itemsets by pruning.
3.1 For each frequent i-itemset awL_i in the frequent itemset collection awPIS, compute the value IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false, obtaining the interesting all-weighted frequent itemset collection awPIS after pruning;
IAWFI(awL_i) is computed as follows:
where
$$awItemsetInt(I_1 \cup I_2) = awsup(I_1) \times awsup(I_1 \cup I_2) \times (1 - awsup(I_2)),$$
$$awItemsetInt(\neg I_1 \cup \neg I_2) = awsup(I_2) \times (1 - awsup(I_1)) \times (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2)),$$
minInt is the minimum interest threshold and minsup is the minimum support threshold.
3.2 For each negative i-itemset awN_i in the negative itemset collection awNIS, compute the value IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false, obtaining the interesting all-weighted negative itemset collection awNIS after pruning;
IAWNI(awN_i) is computed as follows:
where
$$awItemsetInt(I_1 \cup I_2) = awsup(I_1) \times awsup(I_1 \cup I_2) \times (1 - awsup(I_2))$$
$$awItemsetInt(I_1 \cup \neg I_2) = awsup(I_1) \times awsup(I_2) \times (awsup(I_1) - awsup(I_1 \cup I_2))$$
$$awItemsetInt(\neg I_1 \cup I_2) = (1 - awsup(I_1)) \times (1 - awsup(I_2)) \times (awsup(I_2) - awsup(I_1 \cup I_2))$$
$$awItemsetInt(\neg I_1 \cup \neg I_2) = awsup(I_2) \times (1 - awsup(I_1)) \times (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2))$$
(4) Mine the valid all-weighted positive and negative association rules from the interesting all-weighted frequent itemset collection awPIS, comprising the following steps:
4.1 Take a frequent itemset awL_i from the interesting all-weighted frequent itemset collection awPIS, find all proper subsets of awL_i, build the proper subset collection of awL_i, and then perform the following operations:
4.2.1 Take any two proper subsets I_1 and I_2 from the proper subset collection of awL_i. When I_1 and I_2 are disjoint (I_1 ∩ I_2 = ∅), the sum of their item counts equals the item count of the original frequent itemset (I_1 ∪ I_2 = awL_i), and the supports of I_1 and I_2 are both not less than the support threshold (awsup(I_1) ≥ minsup, awsup(I_2) ≥ minsup), compute the intra-itemset weight ratio awIWR(I_1, I_2) and the intra-itemset dimension ratio awIDR(I_1, I_2) of the frequent itemset (I_1 ∪ I_2) as follows:
$$awIWR(I_1, I_2) = \frac{w_{12}}{w_1 \times w_2}, \qquad awIDR(I_1, I_2) = \frac{k_{12}}{k_1 \times k_2}$$
where w_12, w_1 and w_2 are the weight sums in the all-weighted database AWD of the all-weighted itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2 respectively; k_12, k_1 and k_2 are the item counts of the itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2 respectively; and n is the total number of transaction records in the database.
4.2.2 When the product of the total number of transaction records n in the database and the intra-itemset weight ratio awIWR(I_1, I_2) from step 4.2.1 is greater than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) > awIDR(I_1, I_2), perform the following operations:
4.2.2.1 If the awCPIR value of I_1→I_2 is not less than the confidence threshold (awCPIR(I_1→I_2) ≥ minconf), mine the all-weighted association rule I_1→I_2; if the awCPIR value of I_2→I_1 is not less than the confidence threshold (awCPIR(I_2→I_1) ≥ minconf), mine the all-weighted association rule I_2→I_1. awCPIR(I_1→I_2) and awCPIR(I_2→I_1) are computed as follows:
4.2.2.2 If the support of (¬I_1 ∪ ¬I_2) is not less than the support threshold (awsup(¬I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1→¬I_2 is not less than the confidence threshold (awCPIR(¬I_1→¬I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1→¬I_2; ② if the awCPIR value of ¬I_2→¬I_1 is not less than the confidence threshold (awCPIR(¬I_2→¬I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2→¬I_1. awsup(¬I_1 ∪ ¬I_2), awCPIR(¬I_1→¬I_2) and awCPIR(¬I_2→¬I_1) are computed as follows:
$$awsup(\neg I_1 \to \neg I_2) = awsup(\neg I_1 \cup \neg I_2) = 1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2)$$
4.2.3 When the product of the total number of transaction records n in the database and the intra-itemset weight ratio awIWR(I_1, I_2) from step 4.2.1 is less than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) < awIDR(I_1, I_2), perform the following operations:
4.2.3.1 If the support of (I_1 ∪ ¬I_2) is not less than the support threshold (awsup(I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of I_1→¬I_2 is not less than the confidence threshold (awCPIR(I_1→¬I_2) ≥ minconf), mine the all-weighted negative association rule I_1→¬I_2; ② if the awCPIR value of ¬I_2→I_1 is not less than the confidence threshold (awCPIR(¬I_2→I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2→I_1. awsup(I_1 ∪ ¬I_2), awCPIR(I_1→¬I_2) and awCPIR(¬I_2→I_1) are computed as follows:
$$awsup(I_1 \to \neg I_2) = awsup(I_1 \cup \neg I_2) = awsup(I_1) - awsup(I_1 \cup I_2)$$
4.2.3.2 If the support of (¬I_1 ∪ I_2) is not less than the support threshold (awsup(¬I_1 ∪ I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1→I_2 is not less than the confidence threshold (awCPIR(¬I_1→I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1→I_2; ② if the awCPIR value of I_2→¬I_1 is not less than the confidence threshold (awCPIR(I_2→¬I_1) ≥ minconf), mine the all-weighted negative association rule I_2→¬I_1. awsup(¬I_1 ∪ I_2), awCPIR(¬I_1→I_2) and awCPIR(I_2→¬I_1) are computed as follows:
$$awsup(\neg I_1 \to I_2) = awsup(\neg I_1 \cup I_2) = awsup(I_2) - awsup(I_1 \cup I_2)$$
4.2.4 Repeat steps 4.2.1 to 4.2.3; once every proper subset in the proper subset collection of awL_i has been taken out exactly once, go to step 4.2.5;
4.2.5 Repeat step 4.1; once every frequent itemset awL_i in the interesting all-weighted frequent itemset collection awPIS has been taken out exactly once, go to step (5);
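Steps 4.1 and 4.2.1 amount to enumerating every way of splitting an itemset into two disjoint, non-empty parts I_1 and I_2. The Python sketch below shows one way to do this. It pins the smallest item to I_1 so that each unordered pair is produced once. The function name `subset_pairs` is hypothetical, and the support check on I_1 and I_2 is left to the caller.

```python
from itertools import combinations

def subset_pairs(itemset):
    """Yield each unordered pair (I1, I2) of disjoint non-empty subsets
    whose union is the whole itemset."""
    items = sorted(itemset)
    first, rest = items[0], items[1:]
    # I1 always contains `first`; choosing fewer than len(rest) extra items
    # guarantees that I2 is non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            i1 = frozenset((first,) + extra)
            yield i1, frozenset(items) - i1

pairs = list(subset_pairs({"i1", "i2", "i3"}))
for i1, i2 in pairs:
    print(sorted(i1), sorted(i2))
```

An itemset of k items yields 2^(k-1) - 1 such pairs, and the rule directions I_1→I_2 and I_2→I_1 are both examined for each pair.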
(5) Mine the valid all-weighted negative association rules from the interesting all-weighted negative itemset collection awNIS, comprising the following steps:
5.1 Take a negative itemset awN_i from the interesting all-weighted negative itemset collection awNIS, find all proper subsets of awN_i, build the proper subset collection of awN_i, and then perform the following operations:
5.2.1 Take any two proper subsets I_1 and I_2 from the proper subset collection of awN_i. When I_1 and I_2 are disjoint (I_1 ∩ I_2 = ∅), the sum of their item counts equals the item count of the original negative itemset (I_1 ∪ I_2 = awN_i), and the supports of I_1 and I_2 are both greater than or equal to the support threshold (awsup(I_1) ≥ minsup, awsup(I_2) ≥ minsup), compute the intra-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2) of the negative itemset (I_1 ∪ I_2); the formulas for awIWR(I_1, I_2) and awIDR(I_1, I_2) are the same as in step 4.2.1.
5.2.2 When the product of the total number of transaction records n in the database and the intra-itemset weight ratio awIWR(I_1, I_2) from step 5.2.1 is greater than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) > awIDR(I_1, I_2), perform the following operation:
5.2.2.1 If the support of (¬I_1 ∪ ¬I_2) is greater than or equal to the support threshold (awsup(¬I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1→¬I_2 is greater than or equal to the confidence threshold (awCPIR(¬I_1→¬I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1→¬I_2; ② if the awCPIR value of ¬I_2→¬I_1 is greater than or equal to the confidence threshold (awCPIR(¬I_2→¬I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2→¬I_1. The formulas for awsup(¬I_1 ∪ ¬I_2), awCPIR(¬I_1→¬I_2) and awCPIR(¬I_2→¬I_1) are the same as in step 4.2.2.2.
5.2.3 When the product of the total number of transaction records n in the database and the intra-itemset weight ratio awIWR(I_1, I_2) from step 5.2.1 is less than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) < awIDR(I_1, I_2), perform the following operations:
5.2.3.1 If the support of (I_1 ∪ ¬I_2) is greater than or equal to the support threshold (awsup(I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of I_1→¬I_2 is greater than or equal to the confidence threshold (awCPIR(I_1→¬I_2) ≥ minconf), mine the all-weighted negative association rule I_1→¬I_2; ② if the awCPIR value of ¬I_2→I_1 is greater than or equal to the confidence threshold (awCPIR(¬I_2→I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2→I_1. The formulas for awsup(I_1 ∪ ¬I_2), awCPIR(I_1→¬I_2) and awCPIR(¬I_2→I_1) are the same as in step 4.2.3.1;
5.2.3.2 If the support of (¬I_1 ∪ I_2) is greater than or equal to the support threshold (awsup(¬I_1 ∪ I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1→I_2 is greater than or equal to the confidence threshold (awCPIR(¬I_1→I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1→I_2; ② if the awCPIR value of I_2→¬I_1 is greater than or equal to the confidence threshold (awCPIR(I_2→¬I_1) ≥ minconf), mine the all-weighted negative association rule I_2→¬I_1. The formulas for awsup(¬I_1 ∪ I_2), awCPIR(¬I_1→I_2) and awCPIR(I_2→¬I_1) are the same as in step 4.2.3.2;
5.2.4 Repeat steps 5.2.1 to 5.2.3; once every proper subset in the proper subset collection of awN_i has been taken out exactly once, go to step 5.2.5;
5.2.5 Repeat step 5.1; once every negative itemset awN_i in the interesting all-weighted negative itemset collection awNIS has been taken out exactly once, the mining of all-weighted positive and negative association rules terminates.
This completes the mining of all-weighted positive and negative association rules.
Compared with the prior art, the present invention has the following beneficial effects:
(1) Addressing the defects of existing weighted positive and negative association rule mining, the present invention constructs an evaluation framework, support - CPIR model (Conditional Probability Increment Ratio) - correlation - interest (SCPIRCI), together with pruning strategies for frequent itemsets and negative itemsets, and proposes a new all-weighted positive and negative association rule mining method based on the SCPIRCI evaluation framework, which effectively solves the problem of mining all-weighted positive and negative association rules. The present invention takes into account the characteristic of all-weighted data that item weights change from one database record to another and adopts a new itemset pruning strategy, so that mining time is greatly reduced and mining efficiency is greatly improved.
(2) The concepts of the intra-itemset weight ratio and the intra-itemset dimension ratio of all-weighted itemsets are proposed, enriching the theory of all-weighted data mining.
(3) The present invention was compared experimentally with a traditional unweighted positive and negative association rule mining method through extensive and rigorous experiments. With the Chinese Web test collection CWT200g as the test document collection, the mining performance of the present technique was analyzed under varying support, varying confidence, varying number of items and varying document collection size. The experimental results show that, compared with the control method, the present technique achieves good mining performance and a large gain in mining efficiency. Whether the support threshold or the confidence threshold is varied, the numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules mined by the present technique are far smaller than those mined by the control method. The present invention also shows good scalability as the number of items and the number of transaction documents vary. The main reasons are as follows. The control method is an unweighted positive and negative association rule mining method based on item frequency. It does not take itemset weights into account and cannot fully reflect the inherent characteristics of all-weighted data, so it produces many invalid and spurious itemsets and positive and negative association rule patterns. The numbers of itemsets and rules are therefore much larger, and mining efficiency is much lower. The present invention is an all-weighted positive and negative association rule mining method based on weights. It overcomes the inherent defects of the control method by incorporating the characteristic of the all-weighted data model (that is, item weights are objectively distributed over the transaction records and change from record to record) into the whole mining process, so that the mined association rules are more reasonable and closer to reality. In addition, a new pruning strategy is adopted, which greatly reduces the number of invalid and uninteresting frequent itemsets and negative itemsets, reduces the occurrence of uninteresting rules, and greatly improves mining efficiency.
Detailed Description of the Embodiments
To better explain the technical scheme of the present invention, the all-weighted data model involved in the invention and the related concepts are described below:
1. The difference between weighted association rule mining and all-weighted association rule mining
The main difference between weighted association rule mining and all-weighted association rule mining lies in the source of the item weights and in the data model being mined. In the former, item weights are set subjectively by the user, are independent of the transaction database and, once set, remain constant throughout the mining process. For example, consider copy paper and fax machines in a shop. Copy paper is cheaper than a fax machine and its profit per unit is lower, so, to reflect how much each product contributes to profit, the user assigns a higher weight to the fax machine and a lower weight to copy paper. Once set, these weights are fixed and independent of the transaction database. In the latter, item weights are not set by the user; they come from the individual transaction records in the database and vary from record to record. For example, in a massive text database, the weight of each feature word comes from each document in the database and changes from document to document, that is, the same feature word item has different weights in different documents.
The item-weighted data model and the all-weighted item data model are the data models of weighted association rule mining and all-weighted association rule mining respectively. They are two entirely different data models, as shown in Table 1 and Table 2, where {i_1, i_2, ..., i_m} is the item set and {T_1, T_2, ..., T_n} is the transaction set. In the item-weighted data model, {w_1, w_2, ..., w_m} are the item weights, and in "1/0" the "1" means that the item occurs in the transaction record and the "0" that it does not. In the all-weighted data model, "w[T_i][i_j]/0 (1 ≤ i ≤ n, 1 ≤ j ≤ m)" denotes the item weight: if the item occurs in the transaction record, its weight is w[T_i][i_j]; otherwise it is 0.
Table 1: Item-weighted data model. Table 2: All-weighted item data model.
Example: Table 3 has 5 items and 5 transaction records, where the item set is {i_1, i_2, i_3, i_4, i_5} = {Apple, Orange, Banana, Milk, Coca-cola}. As Table 3 shows, i_1 does not appear in transaction record T_3. Table 4 is an example of all-weighted item data with the same number of items and transaction records as Table 3, where the weights of item i_1 in transaction records T_1, T_2, T_3 and T_5 are 0.85, 0.93, 0.65 and 0.75 respectively; i_1 does not appear in transaction record T_4, so its weight there is 0.
Table 3: Item-weighted data example. Table 4: All-weighted item data example.
2. Basic concepts of all-weighted data mining
Let the all-weighted database be AWD = {T_1, T_2, ..., T_n} with n transactions, where T_i (1 ≤ i ≤ n) is the i-th transaction in AWD. Let the itemset I = {i_1, i_2, ..., i_m} be the set of all items in AWD, with m items, where i_j (1 ≤ j ≤ m) is the j-th item in AWD, and let w[T_i][i_j] (1 ≤ i ≤ n, 1 ≤ j ≤ m) be the weight of item i_j in transaction record T_i; see the all-weighted item data model in Table 2. Let I_1 and I_2 be sub-itemsets of the itemset I. The following basic definitions are given:
Definition 1 (all-weighted support, abbreviated awsup): the all-weighted support awsup(I) is computed as shown in formula (1).
$$awsup(I) = \frac{\sum_{T_i \in AWD} \sum_{i_j \in I} w[T_i][i_j]}{n \times k} \qquad (1)$$
where n is the total number of transaction records in the all-weighted database AWD and k is the length of the itemset I (that is, the number of items in I).
The supports of all-weighted negative itemsets and negative association rules are given by formulas (2) to (5).
$$awsup(\neg I) = 1 - awsup(I) \qquad (2)$$
$$awsup(I_1 \to \neg I_2) = awsup(I_1 \cup \neg I_2) = awsup(I_1) - awsup(I_1 \cup I_2) \qquad (3)$$
$$awsup(\neg I_1 \to I_2) = awsup(\neg I_1 \cup I_2) = awsup(I_2) - awsup(I_1 \cup I_2) \qquad (4)$$
$$awsup(\neg I_1 \to \neg I_2) = awsup(\neg I_1 \cup \neg I_2) = 1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2) \qquad (5)$$
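Formulas (3) to (5) derive every negative support from three positive supports, so no extra database scan is needed. A minimal Python sketch follows. The function name `negative_supports` and the dictionary keys are hypothetical. The sample values are the supports of i_2, i_4 and (i_2 ∪ i_4) quoted in the examples of this description.

```python
def negative_supports(sup1, sup2, sup12):
    """Supports of the three negative forms of (I1, I2), from awsup(I1),
    awsup(I2) and awsup(I1 ∪ I2)."""
    return {
        "pos_neg": sup1 - sup12,              # awsup(I1 ∪ ¬I2), formula (3)
        "neg_pos": sup2 - sup12,              # awsup(¬I1 ∪ I2), formula (4)
        "neg_neg": 1 - sup1 - sup2 + sup12,   # awsup(¬I1 ∪ ¬I2), formula (5)
    }

# awsup(i2) = 0.122, awsup(i4) = 0.192, awsup(i2 ∪ i4) = 0.06
sups = negative_supports(0.122, 0.192, 0.06)
```

With these inputs the first two values agree with the supports 0.062 and 0.132 stated for (i_2 ∪ ¬i_4) and (¬i_2 ∪ i_4) in the example under Definition 8.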
Definition 2 (all-weighted frequent itemset and negative itemset): let the minimum support threshold be minsup. An all-weighted itemset I is called an all-weighted frequent itemset if awsup(I) ≥ minsup. For an all-weighted itemset (I_1 ∪ I_2) in which I_1 and I_2 are both frequent itemsets, if awsup(I_1 ∪ I_2) < minsup, the itemset (I_1 ∪ I_2) is called an all-weighted negative itemset.
Example: let minsup = 0.1. In the data of Table 4, awsup(i_2) = (0.21 + 0.35 + 0.05)/(5 × 1) = 0.122 > minsup, awsup(i_4) = 0.192 > minsup and awsup(i_2 ∪ i_4) = 0.06 < minsup, so the itemset (i_2 ∪ i_4) is an all-weighted negative itemset.
Definition 3 (all-weighted itemset interest, awItemsetInt): interest measures how much attention the user pays to a mined association pattern. The higher its value, the more novel the pattern and the more attention the user pays to it. Based on the interest model for unweighted data mining (Cheng Jihua, Guo Jiansheng, Shi Pengfei. Research on multi-strategy methods for mining interesting rules [J]. Chinese Journal of Computers, 2000, 23(1): 47-51.), the following are given:
$$awItemsetInt(I_1 \cup I_2) = awsup(I_1) \times awsup(I_1 \cup I_2) \times (1 - awsup(I_2)) \qquad (6)$$
$$awItemsetInt(I_1 \cup \neg I_2) = awsup(I_1) \times awsup(I_2) \times (awsup(I_1) - awsup(I_1 \cup I_2)) \qquad (7)$$
$$awItemsetInt(\neg I_1 \cup I_2) = (1 - awsup(I_1)) \times (1 - awsup(I_2)) \times (awsup(I_2) - awsup(I_1 \cup I_2)) \qquad (8)$$
$$awItemsetInt(\neg I_1 \cup \neg I_2) = awsup(I_2) \times (1 - awsup(I_1)) \times (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2)) \qquad (9)$$
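Formulas (6) to (9) share one pattern, awsup(X) × awsup(X ∪ Y) × (1 - awsup(Y)), with X and Y replaced by I_1 or ¬I_1 and I_2 or ¬I_2 and the negative supports expanded through formulas (2) to (5). The Python sketch below makes that pattern explicit. The function names and dictionary keys are hypothetical, and the sample supports are those of i_2, i_4 and (i_2 ∪ i_4) from the example under Definition 2.

```python
def interest(sup_x, sup_xy, sup_y):
    """Common pattern of formulas (6)-(9): awsup(X) * awsup(X ∪ Y) * (1 - awsup(Y))."""
    return sup_x * sup_xy * (1 - sup_y)

def itemset_interests(s1, s2, s12):
    """Interest of the four forms of (I1, I2) from awsup(I1), awsup(I2), awsup(I1 ∪ I2)."""
    return {
        "pos_pos": interest(s1, s12, s2),                        # (6): X = I1,  Y = I2
        "pos_neg": interest(s1, s1 - s12, 1 - s2),               # (7): X = I1,  Y = ¬I2
        "neg_pos": interest(1 - s1, s2 - s12, s2),               # (8): X = ¬I1, Y = I2
        "neg_neg": interest(1 - s1, 1 - s1 - s2 + s12, 1 - s2),  # (9): X = ¬I1, Y = ¬I2
    }

ints = itemset_interests(0.122, 0.192, 0.06)
```

Each value would then be compared with the minimum interest threshold minInt during pruning; how IAWFI and IAWNI combine these comparisons is not reproduced here.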
Definition 4 (all-weighted CPIR value: All-weighted Conditional Probability Increment Ratio, abbreviated awCPIR): the CPIR model uses the ratio of conditional probability to prior probability to express the degree of increase of p(I_2|I_1) relative to p(I_2). Its formula is given in the literature as CPIR(I_2|I_1) = (p(I_2|I_1) - p(I_2))/(1 - p(I_2)). Based on the formula of the CPIR model and the needs of all-weighted data mining, the awCPIR formulas for all-weighted positive and negative association rules are given in formulas (10) to (13):
The awCPIR value is used as the confidence of an all-weighted association rule. The larger its value, the more credible the association rule and the more attention it receives from the user.
Example: in the all-weighted data of Table 4, awsup(i_1) = 0.636, awsup(¬i_1) = 1 - 0.636 = 0.364, awsup(i_2) = 0.122 and awsup(i_1 ∪ i_2) = 0.294, so awCPIR(i_1→i_2) = |0.294 - 0.636 × 0.122| / (0.636 × (1 - 0.122)) = 0.39, awCPIR(i_1→¬i_2) = 2.79, awCPIR(¬i_1→i_2) = 0.68 and awCPIR(¬i_1→¬i_2) = 4.86.
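The arithmetic of the positive-rule example can be reproduced in a few lines of Python. The expression below is inferred from the worked example alone, |awsup(I_1 ∪ I_2) - awsup(I_1) × awsup(I_2)| / (awsup(I_1) × (1 - awsup(I_2))). It is an assumption about the positive-rule case only and is not a statement of formulas (10) to (13). The function name `aw_cpir_positive` is hypothetical.

```python
def aw_cpir_positive(sup1, sup2, sup12):
    """awCPIR(I1 -> I2) as it is evaluated in the worked example (assumed form)."""
    return abs(sup12 - sup1 * sup2) / (sup1 * (1 - sup2))

# awsup(i1) = 0.636, awsup(i2) = 0.122, awsup(i1 ∪ i2) = 0.294
cpir_i1_i2 = aw_cpir_positive(0.636, 0.122, 0.294)
print(round(cpir_i1_i2, 2))
```

A rule I_1→I_2 is accepted when this value is not less than the confidence threshold minconf.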
Definition 5 (intra-itemset weight ratio of an all-weighted itemset: All-weighted Weight Ratio from Itemset, abbreviated awIWR): let w_12, w_1 and w_2 be the weight sums in the all-weighted database AWD of the all-weighted itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2 respectively. The ratio of w_12 to (w_1 × w_2) is called the intra-itemset weight ratio awIWR(I_1, I_2) of the all-weighted itemset, as shown in formula (14).
$$awIWR(I_1, I_2) = \frac{w_{12}}{w_1 \times w_2} \qquad (14)$$
Definition 6 (intra-itemset dimension ratio of an all-weighted itemset: All-weighted Dimension Ratio from Itemset, abbreviated awIDR): let k_12, k_1 and k_2 be the item counts of the itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2 respectively. The ratio of k_12 to (k_1 × k_2) is called the intra-itemset dimension ratio awIDR(I_1, I_2) of the all-weighted itemset, as shown in formula (15).
$$awIDR(I_1, I_2) = \frac{k_{12}}{k_1 \times k_2} \qquad (15)$$
Definition 7 (all-weighted itemset correlation, abbreviated awISCorr): based on the traditional definition of itemset correlation (Chengqi Zhang, Shichao Zhang. Association rule mining: models and algorithms [M]. Springer-Verlag Berlin, Heidelberg, 2002: 47-84, ISBN: 3-540-43533-6.), the correlation awISCorr(I_1, I_2) of the all-weighted itemset (I_1, I_2) is computed as shown in formula (16).
$$awISCorr(I_1, I_2) = \frac{awsup(I_1 \cup I_2)}{awsup(I_1) \times awsup(I_2)} \qquad (16)$$
According to the properties of correlation, in the all-weighted data mining environment the correlation of the itemset (I_1, I_2) has the following properties:
Property 1:
Property 2:
Property 3:
Property 4: ② awISCorr(¬I_1, I_2) < 1; ③ awISCorr(¬I_1, ¬I_2) > 1.
Property 5: ② awISCorr(¬I_1, I_2) > 1; ③ awISCorr(¬I_1, ¬I_2) < 1.
Inference: in the all-weighted data mining environment, given the itemset (I_1, I_2): ① if n × awIWR(I_1, I_2) > awIDR(I_1, I_2), the all-weighted sub-itemsets I_1 and I_2 are positively correlated, and the all-weighted positive association rule pattern I_1→I_2 and the negative association rule pattern ¬I_1→¬I_2 can be mined; ② if n × awIWR(I_1, I_2) < awIDR(I_1, I_2), the all-weighted sub-itemsets I_1 and I_2 are negatively correlated, and the all-weighted negative association rule patterns I_1→¬I_2 and ¬I_1→I_2 can be mined.
According to this inference, when mining all-weighted association rules it suffices to compute the intra-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2). The itemset correlation need not be computed, and the all-weighted positive and negative association rules can be mined directly from the frequent itemsets and negative itemsets.
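The inference can be checked with a short derivation from Definitions 1 and 5 to 7. It assumes that the correlation of formula (16) has the ratio form awsup(I_1 ∪ I_2)/(awsup(I_1) × awsup(I_2)), which is the form evaluated in the worked examples of this description:

```latex
\begin{aligned}
awISCorr(I_1, I_2)
  &= \frac{awsup(I_1 \cup I_2)}{awsup(I_1)\, awsup(I_2)}
   = \frac{w_{12}/(n\,k_{12})}{\bigl(w_1/(n\,k_1)\bigr)\bigl(w_2/(n\,k_2)\bigr)} \\
  &= n \cdot \frac{w_{12}}{w_1 w_2} \cdot \frac{k_1 k_2}{k_{12}}
   = \frac{n \times awIWR(I_1, I_2)}{awIDR(I_1, I_2)}
\end{aligned}
```

Under that assumption, awISCorr(I_1, I_2) > 1 holds exactly when n × awIWR(I_1, I_2) > awIDR(I_1, I_2), and awISCorr(I_1, I_2) < 1 exactly when n × awIWR(I_1, I_2) < awIDR(I_1, I_2), which is why the comparison of the two ratios can stand in for the correlation test.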
Example: for (i_1, i_2, i_3), let I_1 = (i_1, i_2) and I_2 = (i_3). Then awIWR(I_1, I_2) = 3.34/(2.94 × 2.85) = 0.399, awIDR(I_1, I_2) = 3/(2 × 1) = 1.5, and n × awIWR(I_1, I_2) = 5 × 0.399 = 1.995 > 1.5 = awIDR(I_1, I_2). According to the above inference, I_1 and I_2 are positively correlated, and the association rule pattern I_1→I_2 and the negative association rule pattern ¬I_1→¬I_2 can be mined. Verification with formula (16): awsup(i_1 ∪ i_2) = 0.294, awsup(i_3) = 0.57, awsup(i_1 ∪ i_2 ∪ i_3) = 0.223, so awISCorr(I_1, I_2) = 0.223/(0.294 × 0.57) = 1.33 > 1. By Property 1 and Property 4, I_1 and I_2 are positively correlated, and the association rule pattern I_1→I_2 and the negative association rule pattern ¬I_1→¬I_2 can be mined, so the conclusions agree.
Similarly, for the all-weighted itemset (i_2, i_4), awIWR(i_2, i_4) = 0.102, awIDR(i_2, i_4) = 2, and n × awIWR(i_2, i_4) = 0.51 < 2 = awIDR(i_2, i_4). According to the inference, i_2 and i_4 are negatively correlated, and the patterns i_2→¬i_4 and ¬i_2→i_4 can be mined.
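The two examples above can be replayed with a small Python sketch of the ratio test. The function names are hypothetical. The source does not say what to do when n × awIWR equals awIDR, so the sketch returns "independent" in that case as an assumption.

```python
def aw_iwr(w12, w1, w2):
    """Intra-itemset weight ratio, formula (14): w12 / (w1 * w2)."""
    return w12 / (w1 * w2)

def aw_idr(k12, k1, k2):
    """Intra-itemset dimension ratio, formula (15): k12 / (k1 * k2)."""
    return k12 / (k1 * k2)

def correlation(n, iwr, idr):
    """Classify (I1, I2) by comparing n * awIWR with awIDR, as in the inference."""
    if n * iwr > idr:
        return "positive"      # mine I1 -> I2 and ¬I1 -> ¬I2
    if n * iwr < idr:
        return "negative"      # mine I1 -> ¬I2 and ¬I1 -> I2
    return "independent"       # assumption: equality is not covered by the source

n = 5  # transaction records in Table 4
# I1 = (i1, i2), I2 = (i3): w12 = 3.34, w1 = 2.94, w2 = 2.85
sign_123 = correlation(n, aw_iwr(3.34, 2.94, 2.85), aw_idr(3, 2, 1))
# (i2, i4): awIWR = 0.102 as stated in the example
sign_24 = correlation(n, 0.102, aw_idr(2, 1, 1))
print(sign_123, sign_24)
```

Only weight sums and item counts are needed, so the test reuses values already accumulated during support counting.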
Definition 8 (valid all-weighted positive and negative association rules): let minconf be the minimum confidence threshold. When the all-weighted itemsets I_1 and I_2 satisfy the following three conditions, the association rules I_1→I_2, I_1→¬I_2, ¬I_1→¬I_2 and ¬I_1→I_2 are called valid all-weighted positive and negative association rules: ① I_1 and I_2 are all-weighted frequent itemsets and I_1 ∩ I_2 = ∅; ② the supports of I_1→I_2, I_1→¬I_2, ¬I_1→¬I_2 and ¬I_1→I_2 are greater than or equal to minsup; ③ the awCPIR values of I_1→I_2, I_1→¬I_2, ¬I_1→¬I_2 and ¬I_1→I_2 are not less than minconf.
Example: assume minsup = 0.1 and minconf = 0.3. From the previous example, the supports of the all-weighted itemsets (i_1, i_2), (i_3) and (i_1, i_2, i_3) are all greater than minsup, and (i_1, i_2) and (i_3) are positively correlated. Since awCPIR((i_1, i_2)→(i_3)) = |0.223 - 0.294 × 0.57| / (0.294 × (1 - 0.57)) = 0.438 > minconf and awCPIR(¬(i_1, i_2)→¬(i_3)) = 0.138 < minconf, according to Property 4 and Definition 8, (i_1, i_2)→(i_3) is a valid all-weighted positive association rule, while the negative rule ¬(i_1, i_2)→¬(i_3) is not valid. Similarly, for the all-weighted itemset (i_2, i_4), since awsup(i_2) = 0.122 > minsup, awsup(i_4) = 0.192 > minsup, awsup(i_2 ∪ ¬i_4) = 0.062 < minsup, awsup(¬i_2 ∪ i_4) = 0.132 > minsup and awCPIR(¬i_2→i_4) = 0.052 < minconf, according to Definition 8 the negative association rules i_2→¬i_4 and ¬i_2→i_4 are not valid all-weighted negative association rules.
The technical scheme is further described below through a specific embodiment.
The process by which the present invention mines all-weighted association rules from the all-weighted data example of Table 4 is as follows (where minsup = 0.1, minInt = 0.1, minconf = 0.4, w denotes the itemset weight and s denotes the itemset support):
Step1:awPIS={φ};awNIS={φ};
Step2:
Step3:① ② ③
Step4: Pruning: prune the itemsets in the frequent itemset collection awPIS. The pruned frequent itemsets are (i_2, i_3), (i_3, i_4), (i_1, i_2, i_5) and (i_1, i_3, i_5); after pruning, awPIS = {(i_1, i_2), (i_1, i_3), (i_1, i_5), (i_1, i_2, i_3)}.
Step5: Similarly, in the negative itemset collection awNIS, the pruned negative itemset is (i_3, i_5); after pruning, awNIS = {(i_1, i_4), (i_2, i_4), (i_2, i_5), (i_4, i_5)}.
Step6: Mine the all-weighted positive and negative association rules from the frequent itemset collection awPIS and the negative itemset collection awNIS. Taking the frequent itemset (i_1, i_2, i_3) and the negative itemset (i_4, i_5) as examples, the mining process is as follows:
For the frequent itemset (i_1, i_2, i_3), take its subsets I_1 = (i_1) and I_2 = (i_2, i_3) as an example. From the previous examples, awsup(i_1) and awsup(i_2, i_3) are both greater than minsup, awIDR(I_1, I_2) = 1.5, n × awIWR(I_1, I_2) = 2.98 > awIDR(I_1, I_2), awsup(I_1 ∪ I_2) = 0.223 > minsup, awCPIR(I_1→I_2) = 0.212 < minconf and awCPIR(I_2→I_1) = 1.73 > minconf; awsup(¬I_1 ∪ ¬I_2) = 0.411 > minsup, awCPIR(¬I_1→¬I_2) = 1.73 > minconf and awCPIR(¬I_2→¬I_1) = 0.212 < minconf. Therefore I_2→I_1 and ¬I_1→¬I_2 (that is, (i_2, i_3)→(i_1) and ¬(i_1)→¬(i_2, i_3)) are valid all-weighted positive and negative association rules.
For the negative itemset (i_4, i_5), with subsets I_1 = (i_4) and I_2 = (i_5): from the previous examples, awsup(i_4) and awsup(i_5) are both greater than minsup, awIDR(I_1, I_2) = 2, n × awIWR(I_1, I_2) = 1.03 < awIDR(I_1, I_2), awsup(I_1 ∪ ¬I_2) = 0.101 > minsup, awsup(¬I_1 ∪ I_2) = 0.093 < minsup, awCPIR(I_1→¬I_2) = 1.577 > minconf and awCPIR(¬I_2→I_1) = 0.084 < minconf. Therefore I_1→¬I_2 (that is, (i_4)→¬(i_5)) is a valid all-weighted negative association rule.
The beneficial effects of the present invention are further illustrated below through experiments.
To verify the effectiveness, correctness and scalability of the present invention, we selected part of the corpus of the Chinese Web test collection CWT200g (Chinese Web Test Collection with 200GB web pages), provided by the Network Laboratory of Peking University, as the experimental data set. The experiments were run on an Intel(R) Core(TM) i7-3770 CPU @ 3.4GHz with 4.0 GB of memory under Windows 7; the implementation language was Delphi 2006 and the database system was SQL Server 2008. A typical unweighted positive and negative association rule mining method (Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative Association Rules, ACM Transactions on Information Systems, 22 (2004), 3: 381-405.), denoted the PNAR-Mining method, was chosen as the control method.
The Chinese Web test collection CWT200g has a capacity of 197 GB and contains 37,482,913 web pages, each compressed and arranged according to the Tianwang storage format. We extracted 12024 plain-text documents from the CWT200g test collection as the experimental document test set. The Chinese lexical analysis system ICTCLAS (developed by the Institute of Computing Technology, Chinese Academy of Sciences) was used to segment the test documents. The feature word weight w_ij is computed as w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i. The experimental documents were preprocessed by word segmentation, stop-word removal, and feature word extraction and weight computation, and a text database and a feature dictionary based on the vector space model were built. After preprocessing of the experimental document test set, 8751 feature words were obtained, with document frequency df (that is, the number of documents containing the feature word) ranging from 51 to 11258. As required for mining, feature words with very low or very high df values were removed, and the feature words with df values from 1500 to 5838 (400 feature words in total) were extracted to build the feature word item library. The feature words occur 1019494 times in total in the 12024 experimental test documents, an average of 85 times per document. The experimental parameters are shown in Table 5.
Table 5 Experimental parameters
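The weighting and df-based selection steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the experiments (which was written in Delphi 2006). It assumes that maxj(tfij) denotes the largest term frequency within document j, and that the logarithm in idfi is the natural logarithm; neither choice is stated in the text. The function names feature_weights and select_by_df and the toy documents are hypothetical.

```python
import math
from collections import Counter


def feature_weights(docs):
    """Compute w_ij = (0.5 + 0.5 * tf_ij / max_tf_j) * idf_i for every document.

    docs is a list of token lists (already segmented, stop words removed).
    Assumptions: max_tf_j is the largest term frequency inside document j,
    and idf_i = ln(N / df_i) with the natural logarithm.
    """
    n_docs = len(docs)
    tf = [Counter(doc) for doc in docs]

    # df_i: number of documents that contain feature word i
    df = Counter()
    for counts in tf:
        df.update(counts.keys())

    weights = []
    for counts in tf:
        max_tf = max(counts.values()) if counts else 0
        weights.append({
            term: (0.5 + 0.5 * freq / max_tf) * math.log(n_docs / df[term])
            for term, freq in counts.items()
        })
    return weights, df


def select_by_df(df, low, high):
    """Keep feature words whose document frequency lies in [low, high]."""
    return sorted(term for term, count in df.items() if low <= count <= high)


# Toy corpus of four tiny "documents" (hypothetical data)
docs = [["a", "a", "b"], ["a", "c"], ["b", "c", "c"], ["a"]]
weights, df = feature_weights(docs)
items = select_by_df(df, 2, 2)
```

In the experiment described above, the equivalent of select_by_df would be called with low = 1500 and high = 5838 to obtain the 400 feature words of the item library.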
Experiment 1: Comparison of mining performance under varying support thresholds
Under different support thresholds, the numbers of itemsets (i.e. candidate itemsets (Candidate Itemset,
CI), frequent itemsets (Frequent Itemset, FI) and negative itemsets (Negative Itemset, NI)) and of positive
and negative association rules (Positive and Negative Association Rule, PNAR) mined from the experimental
document test set by the AWPNAR-Mining method of the present invention and by the comparison method
PNAR-Mining are compared in Figures 3 to 8 (ItemNum=50, minconf=0.0002, minInt=0.0002,
TRecordNum=12024).
Experiment 2: Comparison of mining performance under varying confidence thresholds
Under varying confidence thresholds, the numbers of positive and negative association rules (A → B,
A → ¬B, ¬A → B and ¬A → ¬B) mined from the experimental document test set by the AWPNAR-Mining method of
the present invention and by the comparison method PNAR-Mining are compared in Table 6
(minsup=0.03, minInt=0.0002, ItemNum=50, TRecordNum=12024).
Table 6 Comparison of the numbers of positive and negative association rules mined under different confidence thresholds
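The four rule forms counted in Experiment 2 pair an antecedent A or its negation with a consequent B or its negation. For the unweighted setting of the comparison method, the supports of all four forms follow from sup(A), sup(B) and sup(A ∪ B) by elementary probability, as the Python sketch below illustrates. It uses plain conditional confidence for illustration only, not the CPIR-based measure of either method, and these identities are not claimed to carry over to the completely weighted supports of the present invention. The function name rule_form_measures and the sample values are hypothetical.

```python
def rule_form_measures(sup_a, sup_b, sup_ab):
    """Return {rule form: (support, confidence)} for the four rule forms.

    Unweighted setting only: sup_a, sup_b and sup_ab are the fractions of
    transactions containing A, B, and both A and B respectively.
    Confidence here is the plain conditional probability of the consequent
    given the antecedent (an illustrative choice, not CPIR).
    """
    if not 0.0 < sup_a < 1.0:
        raise ValueError("sup_a must lie strictly between 0 and 1")
    if not 0.0 <= sup_ab <= min(sup_a, sup_b):
        raise ValueError("sup_ab must not exceed sup_a or sup_b")
    if sup_a + sup_b - sup_ab > 1.0:
        raise ValueError("inconsistent supports")

    sup_a_notb = sup_a - sup_ab                  # A present, B absent
    sup_nota_b = sup_b - sup_ab                  # A absent, B present
    sup_nota_notb = 1.0 - sup_a - sup_b + sup_ab  # both absent

    return {
        "A -> B": (sup_ab, sup_ab / sup_a),
        "A -> not B": (sup_a_notb, sup_a_notb / sup_a),
        "not A -> B": (sup_nota_b, sup_nota_b / (1.0 - sup_a)),
        "not A -> not B": (sup_nota_notb, sup_nota_notb / (1.0 - sup_a)),
    }


# Hypothetical sample values, not taken from the experiments
measures = rule_form_measures(sup_a=0.4, sup_b=0.5, sup_ab=0.1)
```

A rule form would then be retained when its support and its rule measure meet the chosen thresholds, which is why the rule counts in Table 6 vary with the confidence threshold.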
Experiment 3: Comparison of mining time efficiency
To compare the mining time efficiency of the two methods, we recorded the mining time of the AWPNAR-Mining
method of the present invention and of the comparison method PNAR-Mining under varying support thresholds
and under varying confidence thresholds. The results are shown in Table 7 and Table 8 (minInt=0.0002,
ItemNum=50, TRecordNum=12024). Table 7 compares the time the two methods take to mine itemsets and
association rules from the experimental document test set under varying support thresholds
(minconf=0.0002), and Table 8 compares the time taken to mine positive and negative association rules under
varying confidence thresholds (minsup=0.03).
Table 7 Comparison of the time (unit: seconds) for mining itemsets and association rules under different support thresholds
Table 8 Comparison of the time (unit: seconds) for mining positive and negative association rules under different confidence thresholds
Experiment 4: Scalability analysis
We test and analyze the scalability of the method of the present invention under two scenarios: varying
the number of items and varying the size of the data test set.
To test the scalability of the present invention, the experimental parameters are set to ItemNum=50,
TRecordNum=12024, minsup=0.05, minconf=0.07 and minInt=0.001. As the number of items and the size of the
data test set are varied in turn, the changes in the numbers of patterns mined by the AWPNAR-Mining method
of the present invention from data test set 1, namely frequent itemsets (FI), negative itemsets (NI) and
positive and negative association rules (PNAR), are shown in Figures 9 to 14.
In summary, the above experimental results show that, compared with the comparison method PNAR-Mining, the
AWPNAR-Mining method of the present invention achieves good mining performance and greatly improved mining
efficiency. Whether the support threshold or the confidence threshold is varied, the numbers of candidate
itemsets, frequent itemsets, negative itemsets, and positive and negative association rules mined by the
present invention are much smaller than those mined by the comparison method.