CN103838854B

CN103838854B - Completely-weighted mode mining method for discovering association rules among texts

Info

Publication number: CN103838854B
Application number: CN201410096985.2A
Authority: CN
Inventors: 黄名选
Original assignee: Guangxi University of Finance and Economics
Current assignee: Guangxi University of Finance and Economics
Priority date: 2014-03-14
Filing date: 2014-03-14
Publication date: 2017-03-22
Anticipated expiration: 2034-03-14
Also published as: CN103838854A

Abstract

The invention discloses a completely-weighted pattern mining method for discovering association rules among text words. The completely-weighted data to be processed are pre-processed, and a completely-weighted database and an item library are built; completely-weighted frequent itemsets and negative itemsets are mined, and interesting completely-weighted frequent itemsets and negative itemsets are obtained through pruning; effective completely-weighted positive and negative association rules are then mined using a support-CPIR model-correlation-interestingness evaluation framework. The method overcomes defects of existing weighted mining techniques. It takes into account the defining characteristic of completely-weighted data, namely that item weights are objectively distributed across the transaction records and change from record to record, so the positive and negative association patterns it obtains are more realistic and reasonable, and invalid and uninteresting patterns are avoided. The numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rule patterns it mines are smaller than those mined by the prior-art methods. Mining efficiency is greatly improved, and the method scales well.

Description

All-weighted pattern mining method for discovering association rules between text words

Technical field

The invention belongs to the field of data mining. Specifically, it is an all-weighted (completely weighted) positive and negative pattern mining method for discovering association rules between text words, applicable to fields such as feature-word association pattern discovery in text mining and query expansion in text information retrieval.

Background technology

Over the past 20 years or so, association rule mining has attracted great interest from many scholars and has become one of the hot topics of data mining research. The research discussed here follows two lines of work: mining based on item frequency and mining based on item weights.

Positive and negative association pattern mining based on item frequency is characterized by treating all items in the database equally and uniformly, and by mining association patterns using the probability that an itemset occurs in the database as its support. The defect of frequency-based association rule mining is that it considers only item frequency and ignores item weights, which often increases the number of redundant, uninteresting and invalid association rules.

To overcome this defect, positive and negative association rule mining based on item weights has received attention and study. It introduces item weights to reflect the fact that items differ in importance and carry different weights in the database. Positive and negative association rule mining based on item weights divides into weighted positive and negative association rule mining and all-weighted positive and negative association rule mining. The main characteristic of weighted positive and negative association rule mining is that item weights reflect the different importance of the items. As research has deepened, the role of weighted negative association rules has become increasingly prominent: alongside favorable factors, one often also wishes to find unfavorable ones, and analyzing negative association rules serves this purpose. The defect of weighted association rule mining is that it ignores the case where an item has a different weight in each transaction record of the database. Data in which item weights are objectively distributed across the transaction records and change from record to record are called all-weighted data. Existing weighted association rule mining methods are not suitable for mining all-weighted data. For this reason, all-weighted association rule mining has received attention and study since 2003, and all-weighted positive and negative association rule mining now has important theoretical and practical value in fields such as text mining and information retrieval. All-weighted association rule mining methods effectively overcome the defect of weighted association rule mining, but they cannot yet solve the technical problem of mining all-weighted negative association rules. To address these problems, the present invention studies all-weighted positive and negative association rule mining in depth and proposes a new all-weighted positive and negative association rule mining method based on the itemset weight ratio and the itemset dimension ratio. Applied to query expansion in text information retrieval, it can improve retrieval performance; applied to text mining, it can discover more realistic and reasonable positive and negative feature-word association patterns.

Summary of the invention

The object of the present invention is to address the deficiencies of the prior art by providing an all-weighted pattern mining method for discovering association rules between text words, enriching the body of association rule mining results based on item weights and solving the technical problem of all-weighted positive and negative association rule mining. The method has important theoretical value and broad application prospects in fields such as text mining and text information retrieval.

To achieve the above object, the present invention adopts the following technical solution: an all-weighted pattern mining method for discovering association rules between text words, comprising the following steps:

(1) All-weighted data preprocessing stage:

The real world contains massive amounts of all-weighted data, such as text data. The preprocessing method depends on the specific data object. For Chinese text, preprocessing consists of word segmentation, stop-word removal, feature-word extraction and weight computation; for English text, it consists of stemming, stop-word removal, lexical analysis, feature-word extraction and weight computation. The result of preprocessing is an all-weighted database and an item library;

The feature-word weight for text data is computed as: w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i,

where w_ij is the weight of the i-th feature word in the j-th document, tf_ij is the term frequency of the i-th feature word in the j-th document, and idf_i is the inverse document frequency of the i-th feature word, idf_i = log(N/df_i), in which N is the total number of documents in the collection and df_i is the number of documents containing the i-th feature word.
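
The weighting formula above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `feature_weights` and the input layout are hypothetical, the logarithm base (not stated in the text) is assumed to be natural, and the normalizer max(tf) is read as the largest term frequency within the same document, which is one possible reading of max_j(tf_ij).

```python
import math


def feature_weights(term_freqs):
    """term_freqs: one dict per document, mapping feature word -> raw count.

    Returns one dict per document, mapping feature word -> weight w_ij.
    """
    n_docs = len(term_freqs)
    # df_i: number of documents that contain feature word i
    doc_freq = {}
    for doc in term_freqs:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weights = []
    for doc in term_freqs:
        # Assumption: normalize by the largest count inside this document
        max_tf = max(doc.values()) if doc else 0
        row = {}
        for term, tf in doc.items():
            idf = math.log(n_docs / doc_freq[term])  # assumed natural log
            row[term] = (0.5 + 0.5 * tf / max_tf) * idf
        weights.append(row)
    return weights


docs = [{"a": 2, "b": 1}, {"a": 1}]
w = feature_weights(docs)
```

Each row of the result is one transaction record of the all-weighted database: the same feature word can receive a different weight in every document. A word that occurs in every document gets idf = 0 and therefore weight 0 under this formula.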

(2) Mining all-weighted frequent itemsets and negative itemsets, comprising steps 2.1 and 2.2:

2.1 Extract the all-weighted candidate 1-itemsets awC₁ from the item library and mine the all-weighted frequent 1-itemsets awL₁, following steps 2.1.1 to 2.1.3:

2.1.1 Extract the all-weighted candidate 1-itemsets awC₁ from the item library;

2.1.2 Accumulate the weight sum of each all-weighted candidate 1-itemset awC₁ in the all-weighted database (All-Weighted Database, AWD for short) and compute its support;

The support of awC₁ is computed as follows:

awsup(awC₁) = ( Σ_{T_i∈AWD} Σ_{i_j∈awC₁} w[T_i][i_j] ) / (n × k)

where the numerator is the sum of the weights of the items i_j of awC₁ in the transaction records T_i, n is the total number of transaction records in the all-weighted database AWD, and k is the length of the itemset awC₁ (that is, its number of items).

2.1.3 Add the all-weighted frequent 1-itemsets awL₁, that is, the candidate 1-itemsets in awC₁ whose support is greater than or equal to the minimum support threshold minsup, to the frequent itemset collection awPIS;

2.2 Starting from the all-weighted candidate 2-itemsets, proceed according to steps 2.2.1 to 2.2.4:

2.2.1 Perform an Apriori join on the all-weighted frequent (i-1)-itemsets awL_(i-1) to generate the all-weighted candidate i-itemsets awC_i, where i ≥ 2;

2.2.2 Accumulate the weight sum of each all-weighted candidate i-itemset awC_i in the all-weighted database AWD and compute its support awsup(awC_i), as follows:

awsup(awC_i) = ( Σ_{T_i∈AWD} Σ_{i_j∈awC_i} w[T_i][i_j] ) / (n × k)

where the numerator is the sum of the weights of the items i_j of awC_i in the transaction records T_i that contain the itemset, n is the total number of transaction records in the all-weighted database AWD, and k is the length of the itemset awC_i.

2.2.3 From the all-weighted candidate i-itemsets awC_i, take the frequent i-itemsets awL_i whose support is not less than the support threshold minsup and store them in the all-weighted frequent itemset collection awPIS; at the same time, store the all-weighted negative i-itemsets awN_i, whose support is less than the support threshold, in the all-weighted negative itemset collection awNIS.

2.2.4 Increase i by 1. If the collection of frequent (i-1)-itemsets awL_(i-1) is empty (that is, its length is 0), go to step (3); otherwise, repeat steps 2.2.1 to 2.2.3;
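
Steps 2.1 and 2.2 can be sketched as a level-wise loop. The sketch below is an illustration under stated assumptions, not the patent's code: the names `awsup` and `mine_itemsets` and the tiny database `db` are hypothetical, records are dicts that omit absent items, the weight sum of an itemset is taken over the records that contain all of its items, and the Apriori join is simplified to "union two frequent (i-1)-itemsets that differ in exactly one item".

```python
from itertools import combinations


def awsup(itemset, db):
    """Weight sum over records containing every item, divided by n * k."""
    items = list(itemset)
    total = sum(
        sum(record[i] for i in items)
        for record in db
        if all(record.get(i, 0) > 0 for i in items)
    )
    return total / (len(db) * len(items))


def mine_itemsets(db, minsup):
    """Return (frequent, negative): dicts mapping frozenset -> support."""
    frequent, negative = {}, {}
    all_items = sorted({i for record in db for i in record})

    # Step 2.1: frequent 1-itemsets
    level = []
    for item in all_items:
        s = awsup([item], db)
        if s >= minsup:
            frequent[frozenset([item])] = s
            level.append(frozenset([item]))

    # Step 2.2: join frequent (i-1)-itemsets into candidate i-itemsets
    while level:
        size = len(level[0]) + 1
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        level = []
        for cand in candidates:
            s = awsup(cand, db)
            if s >= minsup:          # goes to awPIS
                frequent[cand] = s
                level.append(cand)
            else:                    # goes to awNIS
                negative[cand] = s
    return frequent, negative


# Hypothetical all-weighted database with four records
db = [
    {"a": 0.8, "b": 0.6},
    {"a": 0.9, "b": 0.7},
    {"a": 0.5, "c": 0.4},
    {"b": 0.3, "c": 0.2},
]
frequent, negative = mine_itemsets(db, minsup=0.1)
```

With this data, {a, b} and {a, c} are frequent, while {b, c} and {a, b, c} fall below minsup and are kept as negative itemsets. The loop stops as soon as a level yields no frequent itemset.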

(3) Pruning stage: interesting all-weighted frequent itemsets and negative itemsets are obtained by pruning.

3.1 For each frequent i-itemset awL_i in the frequent itemset collection awPIS, compute IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false; the result is the pruned collection of interesting all-weighted frequent itemsets awPIS. IAWFI(awL_i) is computed as follows:

[formula not reproduced in this text]

where awItemsetInt(I₁∪I₂) = awsup(I₁)×awsup(I₁∪I₂)×(1–awsup(I₂)), awItemsetInt(¬I₁∪¬I₂) = awsup(I₂)×(1–awsup(I₁))×(1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂)), minInt is the minimum interestingness threshold and minsup is the minimum support threshold.

3.2 For each negative i-itemset awN_i in the negative itemset collection awNIS, compute IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false; the result is the pruned collection of interesting all-weighted negative itemsets awNIS. IAWNI(awN_i) is computed as follows:

[formula not reproduced in this text]

where

awItemsetInt(I₁∪I₂) = awsup(I₁)×awsup(I₁∪I₂)×(1–awsup(I₂))

awItemsetInt(I₁∪¬I₂) = awsup(I₁)×awsup(I₂)×(awsup(I₁)–awsup(I₁∪I₂))

awItemsetInt(¬I₁∪I₂) = (1–awsup(I₁))×(1–awsup(I₂))×(awsup(I₂)–awsup(I₁∪I₂))

awItemsetInt(¬I₁∪¬I₂) = awsup(I₂)×(1–awsup(I₁))×(1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂))

(4) Mine effective all-weighted positive and negative association rules from the interesting all-weighted frequent itemset collection awPIS, comprising the following steps:

4.1 Take a frequent itemset awL_i from the interesting all-weighted frequent itemset collection awPIS, find all proper subsets of awL_i, build the set of proper subsets of awL_i, and then perform the following operations:

4.2.1 Take any two proper subsets I₁ and I₂ from the set of proper subsets of awL_i. When I₁ and I₂ are disjoint (I₁∩I₂=∅), the sum of their item counts equals the item count of the original frequent itemset (I₁∪I₂=awL_i), and the supports of both are not less than the support threshold (awsup(I₁)≥minsup, awsup(I₂)≥minsup), compute the itemset weight ratio awIWR(I₁,I₂) and the itemset dimension ratio awIDR(I₁,I₂) of the frequent itemset (I₁∪I₂) as follows:

awIWR(I₁,I₂) = w₁₂/(w₁×w₂)

awIDR(I₁,I₂) = k₁₂/(k₁×k₂)

where w₁₂, w₁ and w₂ are the weight sums in the all-weighted database AWD of the all-weighted itemset (I₁,I₂) and of its sub-itemsets I₁ and I₂, respectively; k₁₂, k₁ and k₂ are the numbers of items in (I₁,I₂), I₁ and I₂, respectively; and n is the total number of transaction records in the database.

4.2.2 When the product of the total number of transaction records n and the weight ratio awIWR(I₁,I₂) from step 4.2.1 is greater than the dimension ratio awIDR(I₁,I₂), that is, n×awIWR(I₁,I₂) > awIDR(I₁,I₂), proceed as follows:

4.2.2.1 If awCPIR(I₁→I₂) is not less than the confidence threshold minconf, mine the all-weighted association rule I₁→I₂; if awCPIR(I₂→I₁) ≥ minconf, mine the all-weighted association rule I₂→I₁. The formulas for awCPIR(I₁→I₂) and awCPIR(I₂→I₁) are as follows:

[formulas not reproduced in this text]

4.2.2.2 If the support of (¬I₁∪¬I₂) is not less than the support threshold (awsup(¬I₁∪¬I₂) ≥ minsup), then: ① if awCPIR(¬I₁→¬I₂) ≥ minconf, mine the all-weighted negative association rule ¬I₁→¬I₂; ② if awCPIR(¬I₂→¬I₁) ≥ minconf, mine the all-weighted negative association rule ¬I₂→¬I₁. The formulas for awsup(¬I₁∪¬I₂), awCPIR(¬I₁→¬I₂) and awCPIR(¬I₂→¬I₁) are as follows (the awCPIR formulas are not reproduced in this text):

awsup(¬I₁→¬I₂) = awsup(¬I₁∪¬I₂) = 1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂)

4.2.3 When the product of the total number of transaction records n and the weight ratio awIWR(I₁,I₂) from step 4.2.1 is less than the dimension ratio awIDR(I₁,I₂), that is, n×awIWR(I₁,I₂) < awIDR(I₁,I₂), proceed as follows:

4.2.3.1 If the support of (I₁∪¬I₂) is not less than the support threshold (awsup(I₁∪¬I₂) ≥ minsup), then: ① if awCPIR(I₁→¬I₂) ≥ minconf, mine the all-weighted negative association rule I₁→¬I₂; ② if awCPIR(¬I₂→I₁) ≥ minconf, mine the all-weighted negative association rule ¬I₂→I₁. The formulas for awsup(I₁∪¬I₂), awCPIR(I₁→¬I₂) and awCPIR(¬I₂→I₁) are as follows (the awCPIR formulas are not reproduced in this text):

awsup(I₁→¬I₂) = awsup(I₁∪¬I₂) = awsup(I₁)–awsup(I₁∪I₂)

4.2.3.2 If the support of (¬I₁∪I₂) is not less than the support threshold (awsup(¬I₁∪I₂) ≥ minsup), then: ① if awCPIR(¬I₁→I₂) ≥ minconf, mine the all-weighted negative association rule ¬I₁→I₂; ② if awCPIR(I₂→¬I₁) ≥ minconf, mine the all-weighted negative association rule I₂→¬I₁. The formulas for awsup(¬I₁∪I₂), awCPIR(¬I₁→I₂) and awCPIR(I₂→¬I₁) are as follows (the awCPIR formulas are not reproduced in this text):

awsup(¬I₁→I₂) = awsup(¬I₁∪I₂) = awsup(I₂)–awsup(I₁∪I₂)

4.2.4 Repeat steps 4.2.1 to 4.2.3 until every proper subset in the set of proper subsets of awL_i has been taken exactly once, then go to step 4.2.5;

4.2.5 Repeat step 4.1 until every frequent itemset awL_i in the interesting all-weighted frequent itemset collection awPIS has been taken exactly once, then go to step (5);
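
The branching in steps 4.2.2 and 4.2.3 for one split (I₁, I₂) of a frequent itemset can be sketched as below. This is a hedged illustration: `candidate_rules` is a hypothetical name, supports are derived as weight sum / (n × item count), and the confidence function `cpir` is supplied by the caller because the awCPIR formulas are not reproduced in this text. Negation is written as `~` in the rule labels.

```python
def candidate_rules(n, w1, w2, w12, k1, k2, minsup, minconf, cpir):
    """Return the rule labels that pass the checks of steps 4.2.2 and 4.2.3.

    cpir(antecedent, consequent) is a caller-supplied confidence function
    (hypothetical interface); labels look like "I1" or "~I2".
    """
    k12 = k1 + k2                      # I1 and I2 are disjoint
    iwr = w12 / (w1 * w2)              # itemset weight ratio awIWR
    idr = k12 / (k1 * k2)              # itemset dimension ratio awIDR
    s1 = w1 / (n * k1)                 # awsup(I1)
    s2 = w2 / (n * k2)                 # awsup(I2)
    s12 = w12 / (n * k12)              # awsup(I1 ∪ I2)

    rules = []

    def keep(antecedent, consequent):
        if cpir(antecedent, consequent) >= minconf:
            rules.append(f"{antecedent} -> {consequent}")

    if n * iwr > idr:                  # step 4.2.2
        keep("I1", "I2")
        keep("I2", "I1")
        if 1 - s1 - s2 + s12 >= minsup:    # awsup(~I1 ∪ ~I2)
            keep("~I1", "~I2")
            keep("~I2", "~I1")
    elif n * iwr < idr:                # step 4.2.3
        if s1 - s12 >= minsup:             # awsup(I1 ∪ ~I2)
            keep("I1", "~I2")
            keep("~I2", "I1")
        if s2 - s12 >= minsup:             # awsup(~I1 ∪ I2)
            keep("~I1", "I2")
            keep("I2", "~I1")
    return rules


# Stub confidence that accepts every rule, to show the branching only
accept_all = lambda antecedent, consequent: 1.0
pos = candidate_rules(5, 2.94, 2.85, 3.34, 2, 1, 0.1, 0.5, accept_all)
neg = candidate_rules(5, 2.0, 2.0, 0.5, 1, 1, 0.1, 0.5, accept_all)
```

The first call uses the weight sums 2.94, 2.85 and 3.34 that appear in the description's own example and lands in the positive branch; the second uses hypothetical values chosen to land in the negative branch.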

(5) Mine effective all-weighted negative association rules from the interesting all-weighted negative itemset collection awNIS, comprising the following steps:

5.1 Take a negative itemset awN_i from the interesting all-weighted negative itemset collection awNIS, find all proper subsets of awN_i, build the set of proper subsets of awN_i, and then perform the following operations:

5.2.1 Take any two proper subsets I₁ and I₂ from the set of proper subsets of awN_i. When I₁ and I₂ are disjoint (I₁∩I₂=∅), the sum of their item counts equals the item count of the original negative itemset (I₁∪I₂=awN_i), and the supports of both are greater than or equal to the support threshold (awsup(I₁)≥minsup, awsup(I₂)≥minsup), compute the itemset weight ratio awIWR(I₁,I₂) and the itemset dimension ratio awIDR(I₁,I₂) of the negative itemset (I₁∪I₂), using the same formulas as in step 4.2.1.

5.2.2 When the product of the total number of transaction records n and the weight ratio awIWR(I₁,I₂) from step 5.2.1 is greater than the dimension ratio awIDR(I₁,I₂), that is, n×awIWR(I₁,I₂) > awIDR(I₁,I₂), proceed as follows:

5.2.2.1 If the support of (¬I₁∪¬I₂) is greater than or equal to the support threshold (awsup(¬I₁∪¬I₂) ≥ minsup), then: ① if awCPIR(¬I₁→¬I₂) ≥ minconf, mine the all-weighted negative association rule ¬I₁→¬I₂; ② if awCPIR(¬I₂→¬I₁) ≥ minconf, mine the all-weighted negative association rule ¬I₂→¬I₁. The formulas for awsup(¬I₁∪¬I₂), awCPIR(¬I₁→¬I₂) and awCPIR(¬I₂→¬I₁) are the same as in step 4.2.2.2.

5.2.3 When the product of the total number of transaction records n and the weight ratio awIWR(I₁,I₂) from step 5.2.1 is less than the dimension ratio awIDR(I₁,I₂), that is, n×awIWR(I₁,I₂) < awIDR(I₁,I₂):

5.2.3.1 If the support of (I₁∪¬I₂) is greater than or equal to the support threshold (awsup(I₁∪¬I₂) ≥ minsup), then: ① if awCPIR(I₁→¬I₂) ≥ minconf, mine the all-weighted negative association rule I₁→¬I₂; ② if awCPIR(¬I₂→I₁) ≥ minconf, mine the all-weighted negative association rule ¬I₂→I₁. The formulas for awsup(I₁∪¬I₂), awCPIR(I₁→¬I₂) and awCPIR(¬I₂→I₁) are the same as in step 4.2.3.1;

5.2.3.2 If the support of (¬I₁∪I₂) is greater than or equal to the support threshold (awsup(¬I₁∪I₂) ≥ minsup), then: ① if awCPIR(¬I₁→I₂) ≥ minconf, mine the all-weighted negative association rule ¬I₁→I₂; ② if awCPIR(I₂→¬I₁) ≥ minconf, mine the all-weighted negative association rule I₂→¬I₁. The formulas for awsup(¬I₁∪I₂), awCPIR(¬I₁→I₂) and awCPIR(I₂→¬I₁) are the same as in step 4.2.3.2;

5.2.4 Repeat steps 5.2.1 to 5.2.3 until every proper subset in the set of proper subsets of awN_i has been taken exactly once, then go to step 5.2.5;

5.2.5 Repeat step 5.1 until every negative itemset awN_i in the interesting all-weighted negative itemset collection awNIS has been taken exactly once.

At this point, all-weighted positive and negative association rule mining is complete.

Compared with the prior art, the present invention has the following beneficial effects:

(1) To address the defects of existing weighted positive and negative association rule mining, the present invention constructs an evaluation framework for association patterns, support-CPIR model (Conditional Probability Increment Ratio)-correlation-interestingness (SCPIRCI), together with pruning strategies for frequent itemsets and negative itemsets, and proposes a new all-weighted positive and negative association rule mining method based on the SCPIRCI evaluation framework. The invention takes into account the characteristic of all-weighted data that item weights change with the database records, and it uses a new itemset pruning strategy, so that mining time is greatly reduced and mining efficiency is greatly improved.

(2) The concepts of the weight ratio and the dimension ratio within an all-weighted itemset are proposed, enriching the theory of all-weighted data mining.

(3) In a large number of rigorous and careful experiments, the present invention was compared with a traditional unweighted positive and negative association rule mining method. Using the Chinese Web test collection CWT200g as the experimental document collection, the mining performance of the invention was analyzed with respect to changes in support, confidence, number of items and document collection size. The results show that, compared with the baseline method, the invention achieves good mining performance and greatly improved mining efficiency. Whether the support threshold or the confidence threshold is varied, the invention mines far fewer candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules than the baseline; it also scales well as the number of items and the number of transaction documents change. The main reasons are as follows. The baseline is an unweighted positive and negative association rule mining method that mines on item frequency; it does not consider itemset weights and cannot fully reflect the inherent characteristics of all-weighted data, so it produces many invalid and spurious itemsets and positive and negative association rule patterns, which inflates the number of itemsets and rules and greatly lowers mining efficiency. The present invention is an all-weighted positive and negative association rule mining method that mines on weights. It overcomes the inherent defect of the baseline by incorporating the distinguishing feature of the all-weighted data model (item weights are objectively distributed across the transaction records and change from record to record) into the whole mining process, so that the mined association rules are more reasonable and closer to reality. It also adopts a new pruning strategy that greatly reduces the number of invalid and uninteresting frequent itemsets and negative itemsets, effectively reduces the number of uninteresting rules, and greatly improves mining efficiency.

Description of the drawings

Fig. 1 is a block diagram of the all-weighted pattern mining method for discovering association rules between text words according to the present invention.

Fig. 2 is an overall flow chart of the all-weighted pattern mining method for discovering association rules between text words according to the present invention.

Fig. 3 compares the numbers of candidate itemsets mined at different support thresholds in Experiment 1.

Fig. 4 compares the numbers of frequent itemsets mined at different support thresholds in Experiment 1.

Fig. 5 compares the numbers of rules (A → B) mined at different support thresholds in Experiment 1.

Fig. 6 compares the numbers of negative rules (A → B) mined at different support thresholds in Experiment 1.

Fig. 7 compares the numbers of negative rules (A → B) mined at different support thresholds in Experiment 1.

Fig. 8 compares the numbers of negative rules (A → B) mined at different support thresholds in Experiment 1.

Fig. 9 shows how the numbers of candidate itemsets, frequent itemsets and negative itemsets change with the number of items in Experiment 2.

Fig. 10 shows how the number of positive and negative association rules changes with the number of items in Experiment 2.

Fig. 11 shows how the number of negative association rules changes with the number of items in Experiment 2.

Fig. 12 shows how the numbers of candidate itemsets, frequent itemsets and negative itemsets change with the document collection size in Experiment 2.

Fig. 13 shows how the number of negative association rules changes with the document collection size in Experiment 2.

Fig. 14 shows how the number of positive and negative association rules changes with the document collection size in Experiment 2.

Detailed description of the embodiments

To better explain the technical solution of the present invention, the all-weighted data model and the related concepts involved are introduced below:

1. Difference between weighted association rule mining and all-weighted association rule mining

The main difference between weighted association rule mining and all-weighted association rule mining lies in the source of the item weights and the data model mined. In the former, item weights are set subjectively by the user and are independent of the transaction database; once set, they remain constant throughout the mining process. Take copy paper and fax machines in a shop: copy paper is cheaper than a fax machine and its per-unit profit is lower, so to reflect the different contributions of the goods to profit, the user assigns a higher weight to the fax machine, which has the higher per-unit profit, and a lower weight to copy paper. Once set, these weights stay fixed and are independent of the transaction database. In the latter, item weights are not set by the user but come from the individual transaction records of the database and change from record to record. For example, in a massive text database the weight of each feature-word item comes from each document in the database and changes with the document; that is, the same feature-word item has different weights in different documents.

The item weighted data model and the all-weighted item data model are the data models of weighted association rule mining and all-weighted association rule mining, respectively. They are two entirely different kinds of data model, shown in Table 1 and Table 2, where {i₁,i₂,...,i_m} is the item set and {T₁,T₂,...,T_n} is the transaction set. In the weighted data model, {w₁,w₂,...,w_m} are the item weights; in "1/0", "1" means that the item occurs in the transaction record and "0" means that it does not. In the all-weighted data model, "w[T_i][i_j]/0 (1≤i≤n, 1≤j≤m)" denotes the weight of the item: if the item occurs in the transaction record, its weight is w[T_i][i_j]; otherwise it is 0.

Table 1: Item weighted data model. Table 2: All-weighted item data model.

Example: Table 3 has 5 items and 5 transaction records, with item set {i₁,i₂,i₃,i₄,i₅} = {Apple, Orange, Banana, Milk, Coca-cola}. As Table 3 shows, i₁ does not appear in transaction record T₃. Table 4 is an example of all-weighted item data with the same items and the same number of transaction records as Table 3. In Table 4, item i₁ has weights 0.85, 0.93, 0.65 and 0.75 in transaction records T₁, T₂, T₃ and T₅ respectively; it does not appear in T₄, so its weight there is 0.

Table 3: Item weighted data example. Table 4: All-weighted item data example.
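
The contrast between the two data models can be sketched with plain Python structures. This is an illustration only: the per-item weights in `item_weights`, the Milk column and the lookup function names are hypothetical, while the Apple (i₁) values follow the description of Tables 3 and 4 above.

```python
# Weighted model: one user-assigned weight per item, fixed for all records.
item_weights = {"Apple": 0.3, "Milk": 0.6}          # hypothetical weights
weighted_db = [                                      # 1/0 occurrence only
    {"Apple", "Milk"},   # T1
    {"Apple"},           # T2
    {"Milk"},            # T3 (Apple absent, as in Table 3)
    {"Apple"},           # T4
    {"Apple", "Milk"},   # T5
]

# All-weighted model: the weight is stored per (record, item) pair.
all_weighted_db = [
    {"Apple": 0.85},     # T1
    {"Apple": 0.93},     # T2
    {"Apple": 0.65},     # T3
    {},                  # T4 (Apple absent, weight 0, as in Table 4)
    {"Apple": 0.75},     # T5
]


def weighted_lookup(record_index, item):
    """Weight of an item in the weighted model: global weight or 0."""
    return item_weights[item] if item in weighted_db[record_index] else 0.0


def all_weighted_lookup(record_index, item):
    """Weight of an item in the all-weighted model: w[T_i][i_j] or 0."""
    return all_weighted_db[record_index].get(item, 0.0)
```

In the weighted model every occurrence of Apple carries the same weight, whereas in the all-weighted model the weight of Apple differs from record to record.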

2. Basic concepts of all-weighted data mining

Let the all-weighted database be AWD = {T₁,T₂,...,T_n}, with n transactions, where T_i (1≤i≤n) denotes the i-th transaction in AWD. Let the itemset I = {i₁,i₂,...,i_m} denote the set of all items in AWD, with m items, where i_j (1≤j≤m) denotes the j-th item in AWD, and let w[T_i][i_j] (1≤i≤n, 1≤j≤m) denote the weight of item i_j in transaction record T_i; see the all-weighted item data model in Table 2. Let I₁ and I₂ be sub-itemsets of I with I₁∩I₂=∅. The following basic definitions are given:

Definition 1 (all-weighted support: All-weighted support, awsup for short): The all-weighted support awsup(I) is computed as shown in formula (1).

awsup(I) = w_I / (n × k) (1)

where w_I is the weight sum of itemset I in AWD, that is, the sum of the weights w[T_i][i_j] of the items i_j of I over the transaction records T_i that contain I; n is the total number of transaction records in the all-weighted database AWD; and k is the length of itemset I (that is, the number of items in I).

The supports of all-weighted negative itemsets and negative association rules are given by formulas (2) to (5).

awsup(¬I) = 1 – awsup(I) (2)

awsup(I₁→¬I₂) = awsup(I₁∪¬I₂) = awsup(I₁) – awsup(I₁∪I₂) (3)

awsup(¬I₁→I₂) = awsup(¬I₁∪I₂) = awsup(I₂) – awsup(I₁∪I₂) (4)

awsup(¬I₁→¬I₂) = awsup(¬I₁∪¬I₂) = 1 – awsup(I₁) – awsup(I₂) + awsup(I₁∪I₂) (5)
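
Formulas (3) to (5) derive every negative support from three positive supports, so no extra database scan is needed. A minimal sketch follows; the function name `negative_supports` and the input values are hypothetical, and negation is written as `~`.

```python
def negative_supports(s1, s2, s12):
    """s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2)."""
    return {
        "I1 -> ~I2": s1 - s12,               # formula (3)
        "~I1 -> I2": s2 - s12,               # formula (4)
        "~I1 -> ~I2": 1 - s1 - s2 + s12,     # formula (5)
    }


# Hypothetical supports chosen for illustration
sup = negative_supports(s1=0.5, s2=0.4, s12=0.3)
total = 0.3 + sum(sup.values())
```

As a sanity check on the algebra, the positive support plus the three negative supports always sum to 1.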

Definition 2 (all-weighted frequent itemset and negative itemset): Let the minimum support threshold be minsup. An all-weighted itemset I is called an all-weighted frequent itemset if awsup(I) ≥ minsup. For an all-weighted itemset (I₁∪I₂) in which I₁ and I₂ are both frequent itemsets, if awsup(I₁∪I₂) < minsup, then (I₁∪I₂) is called an all-weighted negative itemset.

Example: Let minsup = 0.1. In the data of Table 4, awsup(i₂) = (0.21+0.35+0.05)/(5×1) = 0.122 > minsup, awsup(i₄) = 0.192 > minsup and awsup(i₂∪i₄) = 0.06 < minsup, so the itemset (i₂∪i₄) is an all-weighted negative itemset.
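
The support computation can be checked against the figures quoted in the examples. The sketch below is an assumption-laden illustration: Table 4 itself is not reproduced in this text, so only the i₁ weights and the three i₂ weights come from the description, and the placement of the i₂ weights in T₂, T₃ and T₅ is an assumption chosen so that the weight sum of (i₁, i₂) equals the 2.94 quoted later in the description. The weight sum is taken over the records that contain every item of the itemset.

```python
def awsup(itemset, db):
    """Weight sum over records containing every item, divided by n * k."""
    items = list(itemset)
    total = sum(
        sum(record[i] for i in items)
        for record in db
        if all(record.get(i, 0) > 0 for i in items)
    )
    return total / (len(db) * len(items))


# Partial, partly assumed view of Table 4 (items i3 to i5 omitted)
table4 = [
    {"i1": 0.85},                # T1
    {"i1": 0.93, "i2": 0.21},    # T2 (i2 placement assumed)
    {"i1": 0.65, "i2": 0.35},    # T3 (i2 placement assumed)
    {},                          # T4: i1 absent
    {"i1": 0.75, "i2": 0.05},    # T5 (i2 placement assumed)
]

s_i1 = awsup(["i1"], table4)
s_i2 = awsup(["i2"], table4)
s_i1_i2 = awsup(["i1", "i2"], table4)
```

Under these assumptions the three supports come out as 0.636, 0.122 and 0.294, matching the values used in the examples. Note that the support of the pair exceeds the support of i₂ alone, so all-weighted support does not shrink monotonically as itemsets grow.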

Definition 3 (all-weighted itemset interest: All-weighted Itemset Interest, awItemsetInt): Interest measures how much attention a user pays to a mined association pattern; the higher its value, the more novel the pattern and the more attention the user pays to it. Based on the interest model for unweighted data mining (Cheng Jihua, Guo Jiansheng, Shi Pengfei. Research on multi-strategy methods for mining interesting rules [J]. Chinese Journal of Computers, 2000, 23(1): 47-51.), the following formulas are given:

awItemsetInt(I₁∪I₂) = awsup(I₁)×awsup(I₁∪I₂)×(1–awsup(I₂)) (6)

awItemsetInt(I₁∪¬I₂) = awsup(I₁)×awsup(I₂)×(awsup(I₁)–awsup(I₁∪I₂)) (7)

awItemsetInt(¬I₁∪I₂) = (1–awsup(I₁))×(1–awsup(I₂))×(awsup(I₂)–awsup(I₁∪I₂)) (8)

awItemsetInt(¬I₁∪¬I₂) = awsup(I₂)×(1–awsup(I₁))×(1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂)) (9)
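
The four interest measures depend only on awsup(I₁), awsup(I₂) and awsup(I₁∪I₂). A minimal sketch, with negation restored as in formulas (2) to (5) and written as `~`; the function name `itemset_interest` and the input values are hypothetical.

```python
def itemset_interest(s1, s2, s12):
    """Interest of the four combinations of I1, I2 and their negations.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    """
    return {
        "I1,I2": s1 * s12 * (1 - s2),                      # formula (6)
        "I1,~I2": s1 * s2 * (s1 - s12),                    # formula (7)
        "~I1,I2": (1 - s1) * (1 - s2) * (s2 - s12),        # formula (8)
        "~I1,~I2": s2 * (1 - s1) * (1 - s1 - s2 + s12),    # formula (9)
    }


# Hypothetical supports chosen for illustration
interest = itemset_interest(s1=0.5, s2=0.4, s12=0.3)
min_int = 0.05  # hypothetical minimum interest threshold minInt
interesting = sorted(k for k, v in interest.items() if v >= min_int)
```

With these inputs only the combinations (I₁, I₂) and (¬I₁, ¬I₂) reach the hypothetical threshold.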

Definition 4 (all-weighted CPIR value: All-weighted Conditional Probability Increment Ratio, awCPIR for short): The CPIR model uses the ratio of conditional probability to prior probability to express the degree to which p(I₂/I₁) is incremented relative to p(I₂). The literature gives its formula as CPIR(I₂/I₁) = (p(I₂/I₁) – p(I₂))/(1 – p(I₂)). Based on the CPIR formula and the needs of all-weighted data mining, the awCPIR formulas for all-weighted positive and negative association rules are given as formulas (10) to (13):

[formulas (10) to (13) not reproduced in this text]

The awCPIR value serves as the confidence of an all-weighted association rule: the larger the value, the more credible the rule and the more attention it receives from users.

Example: In the all-weighted data of Table 4, awsup(i₁) = 0.636, awsup(¬i₁) = 1 – 0.636 = 0.364, awsup(i₂) = 0.122, awsup(i₁∪i₂) = 0.294, awCPIR(i₁→i₂) = |0.294 – 0.636×0.122| / (0.636×(1 – 0.122)) = 0.39, awCPIR(i₁→¬i₂) = 2.79, awCPIR(¬i₁→i₂) = 0.68, and awCPIR(¬i₁→¬i₂) = 4.86.
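
The worked example spells out the arithmetic for the positive rule i₁→i₂. The sketch below reproduces only that case, assuming the form used in the example, |awsup(I₁∪I₂) – awsup(I₁)×awsup(I₂)| / (awsup(I₁)×(1 – awsup(I₂))); the function name `awcpir_positive` is hypothetical, and the forms for the three negative rules are not attempted here.

```python
def awcpir_positive(s1, s2, s12):
    """awCPIR(I1 -> I2) in the form used by the worked example.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    """
    return abs(s12 - s1 * s2) / (s1 * (1 - s2))


value = awcpir_positive(s1=0.636, s2=0.122, s12=0.294)
rounded = round(value, 2)
```

Rounded to two decimals this gives the 0.39 quoted in the example.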

Definition 5 (weight ratio within an all-weighted itemset: All-weighted Weight Ratio from Itemset, awIWR for short): Let w₁₂, w₁ and w₂ be the weight sums in the all-weighted database AWD of the all-weighted itemset (I₁,I₂) and of its sub-itemsets I₁ and I₂, respectively. The ratio of w₁₂ to (w₁×w₂) is called the weight ratio within the all-weighted itemset, or itemset weight ratio awIWR(I₁,I₂) for short, as shown in formula (14).

awIWR(I₁,I₂) = w₁₂/(w₁×w₂) (14)

Definition 6 (dimension ratio within an all-weighted itemset: All-weighted Dimension Ratio from Itemset, awIDR for short): Let k₁₂, k₁ and k₂ be the numbers of items in the itemset (I₁,I₂) and in its sub-itemsets I₁ and I₂, respectively. The ratio of k₁₂ to (k₁×k₂) is called the dimension ratio within the all-weighted itemset, or itemset dimension ratio awIDR(I₁,I₂) for short, as shown in formula (15).

awIDR(I₁,I₂) = k₁₂/(k₁×k₂) (15)

Define 7 and (weight item collection correlation completely:All-weighted itemset correlation, referred to as awISCorr)：(Chengqi Zhang, Shichao Zhang.Association is defined based on traditional item collection correlation rule mining:models and algorithms[M].Springer-Verlag Berlin,Heidelberg,2002: 47-84,ISBN:3-540-43533-6.), provide weighting item collection (I completely₁,I₂) correlation (awISCorr (I₁,I₂),) computing formula such as formula (16) shown in.

According to the property of correlation, excavate under environment in complete weighted data, item collection (I₁,I₂) correlation has following property Matter：

Property 1：

Property 2：

Property 3：

Property 4: ② awISCorr(¬I₁,I₂)<1; ③ awISCorr(¬I₁,¬I₂)>1.

Property 5: ② awISCorr(¬I₁,I₂)>1; ③ awISCorr(¬I₁,¬I₂)<1.

Corollary: In the all-weighted data mining environment, given the itemset (I₁,I₂): ① if n×awIWR(I₁,I₂)>awIDR(I₁,I₂), the all-weighted sub-itemsets I₁ and I₂ are positively correlated, and the all-weighted positive association rule I₁→I₂ and the negative association rule ¬I₁→¬I₂ can be mined; ② if n×awIWR(I₁,I₂)<awIDR(I₁,I₂), the all-weighted sub-itemsets I₁ and I₂ are negatively correlated, and the all-weighted negative association rules I₁→¬I₂ and ¬I₁→I₂ can be mined.

According to the above corollary, when mining all-weighted association rules it is only necessary to calculate the intra-itemset weight ratio awIWR(I₁,I₂) and the dimension ratio awIDR(I₁,I₂); the itemset correlation need not be calculated, and all-weighted positive and negative association rules can be mined directly from the frequent itemsets and negative itemsets.

Example: For (i₁,i₂,i₃), if I₁=(i₁,i₂) and I₂=(i₃), then awIWR(I₁,I₂)=3.34/(2.94×2.85)=0.399, awIDR(I₁,I₂)=3/(2×1)=1.5, and n×awIWR(I₁,I₂)=5×0.399=1.995>1.5=awIDR(I₁,I₂). According to the above corollary, I₁ and I₂ are positively correlated, and the association rule I₁→I₂ and the negative association rule ¬I₁→¬I₂ can be mined. Verification using formula (16): awsup(i₁∪i₂)=0.294, awsup(i₃)=0.57, awsup(i₁∪i₂∪i₃)=0.223, so awISCorr(I₁,I₂)=0.223/(0.294×0.57)=1.33>1. By Property 1 and Property 4, I₁ and I₂ are positively correlated, and the association rule I₁→I₂ and the negative association rule ¬I₁→¬I₂ can be mined, which agrees with the conclusion above.

Similarly, for the all-weighted itemset (i₂,i₄), awIWR(i₂,i₄)=0.102, awIDR(i₂,i₄)=2, and n×awIWR(i₂,i₄)=0.51<2=awIDR(i₂,i₄). According to the corollary, i₂ and i₄ are negatively correlated, and the patterns i₂→¬i₄ and ¬i₂→i₄ can be mined.
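The comparison of n×awIWR with awIDR in these examples can be sketched in Python as follows. The function names and the "undetermined" label for the equality case are illustrative assumptions. The final cross-check additionally assumes that awsup(I)=w/(n×k), which is not stated in this passage but is consistent with the supports quoted in the example (for instance 2.94/(5×2)=0.294).

```python
def aw_iwr(w12, w1, w2):
    """Intra-itemset weight ratio: w12 / (w1 * w2)."""
    return w12 / (w1 * w2)


def aw_idr(k12, k1, k2):
    """Intra-itemset dimension ratio: k12 / (k1 * k2)."""
    return k12 / (k1 * k2)


def correlation_sign(n, w12, w1, w2, k12, k1, k2):
    """Classify I1 and I2 by comparing n * awIWR with awIDR."""
    lhs = n * aw_iwr(w12, w1, w2)
    rhs = aw_idr(k12, k1, k2)
    if lhs > rhs:
        return "positive"
    if lhs < rhs:
        return "negative"
    return "undetermined"  # equality case; label chosen for this sketch


# I1 = (i1, i2), I2 = (i3), with n = 5 transactions as in the example.
print(correlation_sign(5, 3.34, 2.94, 2.85, 3, 2, 1))  # positive

# Under the assumption awsup(I) = w / (n * k), the ratio of the two
# quantities equals awISCorr, matching the value 1.33 in the example.
corr = 5 * aw_iwr(3.34, 2.94, 2.85) / aw_idr(3, 2, 1)
print(round(corr, 2))  # 1.33
```

For (i₂,i₄) only the ratios are quoted, so the same test reduces to comparing 5×0.102=0.51 with 2, which yields a negative correlation.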

Definition 8 (effective all-weighted positive and negative association rules): Let minconf be the minimum confidence threshold. When the all-weighted itemsets I₁ and I₂ satisfy the following three conditions, the association rules I₁→I₂, I₁→¬I₂, ¬I₁→I₂ and ¬I₁→¬I₂ are called effective all-weighted positive and negative association rules: ① I₁ and I₂ are all-weighted frequent itemsets and I₁∩I₂=φ; ② the supports of I₁→I₂, I₁→¬I₂, ¬I₁→I₂ and ¬I₁→¬I₂ are greater than or equal to minsup; ③ the awCPIR values of I₁→I₂, I₁→¬I₂, ¬I₁→I₂ and ¬I₁→¬I₂ are not less than minconf.

Example: Assume minsup=0.1 and minconf=0.3. From the previous example, the supports of the all-weighted itemsets (i₁,i₂), (i₃) and (i₁,i₂,i₃) are all greater than minsup, and (i₁,i₂) and (i₃) are positively correlated. Because awCPIR((i₁,i₂)→(i₃))=|0.223–0.294×0.57|/(0.294×(1–0.57))=0.438>minconf and awCPIR(¬(i₁,i₂)→¬(i₃))=0.138<minconf, according to Property 4 and Definition 8, (i₁,i₂)→(i₃) is an effective all-weighted positive association rule, while the negative rule ¬(i₁,i₂)→¬(i₃) is not effective. Similarly, for the all-weighted itemset (i₂,i₄), since awsup(i₂)=0.122>minsup, awsup(i₄)=0.192>minsup, awsup(i₂∪¬i₄)=0.062<minsup, awsup(¬i₂∪i₄)=0.132>minsup and awCPIR(¬i₂→i₄)=0.052<minconf, according to Definition 8 the negative association rules i₂→¬i₄ and ¬i₂→i₄ are not effective all-weighted negative association rules.
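The supports of the negative itemsets used in this example follow from the positive supports through the identities given in the claims. The sketch below is illustrative: the function name and key names are hypothetical, and the value awsup(i₂∪i₄)=0.060 is not quoted in the text but back-computed from awsup(i₂)=0.122 and awsup(i₂∪¬i₄)=0.062.

```python
def negative_supports(sup_a, sup_b, sup_ab):
    """Derive supports of negative itemsets from positive supports.

    awsup(A ∪ ¬B)  = awsup(A) - awsup(A ∪ B)
    awsup(¬A ∪ B)  = awsup(B) - awsup(A ∪ B)
    awsup(¬A ∪ ¬B) = 1 - awsup(A) - awsup(B) + awsup(A ∪ B)
    """
    return {
        "a_notb": sup_a - sup_ab,
        "nota_b": sup_b - sup_ab,
        "nota_notb": 1 - sup_a - sup_b + sup_ab,
    }


minsup = 0.1
# i2 and i4; sup_ab = 0.060 is an inferred value, see the note above.
sup = negative_supports(sup_a=0.122, sup_b=0.192, sup_ab=0.060)
for form, value in sup.items():
    print(form, round(value, 3), value >= minsup)
```

With these inputs the i₂∪¬i₄ form falls below minsup while the ¬i₂∪i₄ form passes it, which matches the example.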

The technical scheme is further described below through a specific embodiment.

The process by which the present invention mines all-weighted association rules from the all-weighted data example of Table 4 is as follows (where minsup=0.1, minInt=0.1, minconf=0.4, w denotes the itemset weight, and s denotes the itemset support):

Step1:awPIS={φ}；awNIS={φ}；

Step2:

Step3：① ② ③

Step4: Pruning: prune the itemsets in the frequent itemset set awPIS. The pruned frequent itemsets are (i₂,i₃), (i₃,i₄), (i₁,i₂,i₅) and (i₁,i₃,i₅); after pruning, awPIS={(i₁,i₂),(i₁,i₃),(i₁,i₅),(i₁,i₂,i₃)}.

Step5: Similarly, in the negative itemset set awNIS, the pruned negative itemset is (i₃,i₅); after pruning, awNIS={(i₁,i₄),(i₂,i₄),(i₂,i₅),(i₄,i₅)}.

Step6: Mine all-weighted positive and negative association rules from the frequent itemset set awPIS and the negative itemset set awNIS. Taking the frequent itemset (i₁,i₂,i₃) and the negative itemset (i₄,i₅) as examples, the mining process is as follows:

For the frequent itemset (i₁,i₂,i₃), take its subsets I₁=(i₁) and I₂=(i₂,i₃) as an example. From the previous examples, awsup(i₁) and awsup(i₂,i₃) are both greater than minsup, awIDR(I₁,I₂)=1.5, n×awIWR(I₁,I₂)=2.98>awIDR(I₁,I₂), awsup(I₁∪I₂)=0.223>minsup, awCPIR(I₁→I₂)=0.212<minconf and awCPIR(I₂→I₁)=1.73>minconf; awsup(¬I₁∪¬I₂)=0.411>minsup, awCPIR(¬I₁→¬I₂)=1.73>minconf and awCPIR(¬I₂→¬I₁)=0.212<minconf. Therefore, I₂→I₁ and ¬I₁→¬I₂ (i.e. (i₂,i₃)→(i₁) and ¬(i₁)→¬(i₂,i₃)) are effective all-weighted positive and negative association rules.

For the negative itemset (i₄,i₅), take its subsets I₁=(i₄) and I₂=(i₅). From the previous examples, awsup(i₄) and awsup(i₅) are both greater than minsup, awIDR(I₁,I₂)=2, n×awIWR(I₁,I₂)=1.03<awIDR(I₁,I₂), awsup(I₁∪¬I₂)=0.101>minsup, awsup(¬I₁∪I₂)=0.093<minsup, awCPIR(I₁→¬I₂)=1.577>minconf and awCPIR(¬I₂→I₁)=0.084<minconf. Therefore, I₁→¬I₂ (i.e. (i₄)→¬(i₅)) is an effective all-weighted negative association rule.
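The positive-rule part of Step6 can be sketched as follows, under the assumption that the pair (I₁,I₂) has already passed the n×awIWR>awIDR test. The function names are hypothetical. The support awsup(i₂,i₃)=0.176 is not quoted in the text; it is back-computed from awsup(¬I₁∪¬I₂)=0.411, awsup(i₁)=0.636 and awsup(I₁∪I₂)=0.223.

```python
def aw_cpir(sup_x, sup_y, sup_xy):
    """awCPIR(X -> Y) = (awsup(X ∪ Y) - awsup(X) awsup(Y)) / (awsup(X) (1 - awsup(Y)))."""
    return (sup_xy - sup_x * sup_y) / (sup_x * (1 - sup_y))


def positive_rules(sup_1, sup_2, sup_12, minsup, minconf):
    """Return the positive rules between I1 and I2 that pass both thresholds."""
    rules = []
    if sup_12 < minsup:
        return rules
    if aw_cpir(sup_1, sup_2, sup_12) >= minconf:
        rules.append("I1 -> I2")
    if aw_cpir(sup_2, sup_1, sup_12) >= minconf:
        rules.append("I2 -> I1")
    return rules


# I1 = (i1), I2 = (i2, i3); 0.176 is an inferred support, see the note above.
print(round(aw_cpir(0.636, 0.176, 0.223), 3))  # 0.212
print(round(aw_cpir(0.176, 0.636, 0.223), 2))  # 1.73
print(positive_rules(0.636, 0.176, 0.223, minsup=0.1, minconf=0.4))  # ['I2 -> I1']
```

The two printed confidences reproduce the values 0.212 and 1.73 from the embodiment, so only I₂→I₁ is retained at minconf=0.4.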

The beneficial effects of the present invention are further described below through experiments.

To verify the effectiveness, correctness and scalability of the present invention, part of the corpus of the Chinese Web test set CWT200g (Chinese Web Test Collection with 200GB web pages), provided by the Network Laboratory of Peking University, was selected as the experimental test data. The experiments ran on an Intel(R) Core(TM) i7-3770 CPU @ 3.4GHz with 4.0 GB of memory under Windows 7; the method was implemented in Delphi 2006, and the database system was SQL Server 2008. The classic unweighted positive and negative association rule mining method (Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative Association Rules, ACM Transactions on Information Systems, 22 (2004), 3: 381-405), denoted the PNAR-Mining method, was selected as the comparison method.

The Chinese Web test set CWT200g has a capacity of 197GB and contains 37,482,913 web pages, each compressed and stored in the Tianwang storage format. 12024 plain-text documents were extracted from the CWT200g test set as the experimental document test set. The Chinese lexical analysis system ICTCLAS (developed by the Institute of Computing Technology, Chinese Academy of Sciences) was used to segment the test documents into words. The feature-word weight w_ij is computed as w_ij=(0.5+0.5×tf_ij/max_j(tf_ij))×idf_i. The test documents are preprocessed as follows: word segmentation, stop-word removal, feature-word extraction and weight calculation, and construction of the text database and feature dictionary based on the vector space model. After preprocessing the experimental document test set, 8751 feature words were obtained, with document frequency df (the number of documents containing the feature word) ranging from 51 to 11258. As required for mining, feature words with relatively low or relatively high df values were removed, and the feature words with df values from 1500 to 5838 (400 feature words) were extracted to build the feature-word item library. The feature words occur 1019494 times in total in the 12024 test documents, an average of 85 occurrences per document. The experimental parameters are shown in Table 5.
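The feature-word weight formula above can be illustrated with a short Python sketch. The function names are hypothetical, and the definition idf=log(N/df) is an assumption made only for this illustration, since the text does not state how idf_i is computed.

```python
import math


def idf(n_docs, df):
    """Assumed inverse document frequency, log(N / df); not defined in the source."""
    return math.log(n_docs / df)


def term_weight(tf, max_tf, idf_value):
    """w = (0.5 + 0.5 * tf / max_tf) * idf, the augmented-tf weighting in the text."""
    return (0.5 + 0.5 * tf / max_tf) * idf_value


# A feature word occurring 2 times where the normalising maximum frequency
# is 4, with df = 1500 in a collection of 12024 documents.
w = term_weight(tf=2, max_tf=4, idf_value=idf(12024, 1500))
print(round(w, 3))
```

The 0.5 offset keeps the term-frequency factor between 0.5 and 1, so the weight of any feature word that occurs in a document stays at least half of its idf value.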

Table 5: Experimental parameters

Experiment 1: Mining performance comparison under varying support thresholds

Under different support thresholds, the numbers of itemsets (candidate itemsets (Candidate Itemset, CI), frequent itemsets (Frequent Itemset, FI) and negative itemsets (Negative Itemset, NI)) and of positive and negative association rules (Positive and Negative Association Rule, PNAR) mined from the experimental document test set by the AWPNAR-Mining method of the present invention and by the comparison method PNAR-Mining are compared in Figures 3 to 8 (ItemNum=50, minconf=0.0002, minInt=0.0002, TRecordNum=12024).

Experiment 2: Mining performance comparison under varying confidence thresholds

Under varying confidence thresholds, the numbers of positive and negative association rules (A→B, A→¬B, ¬A→B and ¬A→¬B) mined from the experimental document test set by the AWPNAR-Mining method of the present invention and by the comparison method PNAR-Mining are compared in Table 6 (minsup=0.03, minInt=0.0002, ItemNum=50, TRecordNum=12024).

Table 6: Comparison of the numbers of positive and negative association rules mined under different confidence thresholds

Experiment 3: Mining time efficiency comparison

To compare the time efficiency of the two methods, the mining times of the AWPNAR-Mining method of the present invention and of the comparison method PNAR-Mining were measured under varying support thresholds and under varying confidence thresholds. The results are shown in Table 7 and Table 8 (minInt=0.0002, ItemNum=50, TRecordNum=12024). Table 7 compares the time the two methods take to mine itemsets and association rules from the experimental document test set under varying support thresholds (minconf=0.0002), and Table 8 compares the time taken to mine positive and negative association rules under varying confidence thresholds (minsup=0.03).

Table 7: Comparison of the time (unit: seconds) to mine itemsets and association rules under different support thresholds

Table 8: Comparison of the time (unit: seconds) to mine positive and negative association rules under different confidence thresholds

Experiment 4: Scalability analysis

The scalability of the method of the present invention was tested and analyzed in two cases: varying the number of items and varying the size of the data test set.

To test the scalability of the present invention, the experimental parameters were set to ItemNum=50, TRecordNum=12024, minsup=0.05, minconf=0.07 and minInt=0.001. With the number of items and the size of the data test set varied respectively, the changes in the numbers of frequent itemsets (FI), negative itemsets (NI) and positive and negative association rules (PNAR) mined from the data test set by the AWPNAR-Mining method of the present invention are shown in Figures 9 to 14.

In summary, the above experimental results show that, compared with the comparison method PNAR-Mining, the AWPNAR-Mining method of the present invention achieves good mining performance and considerably higher mining efficiency. Whether the support threshold or the confidence threshold is varied, the numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules mined by the present invention are much smaller than those of the comparison method.

Claims

1. An all-weighted pattern mining method for discovering association rules between text words, characterized by comprising the following steps:

(1) All-weighted data preprocessing stage: the all-weighted data to be processed is preprocessed, and an all-weighted database and an item library are built;

(2) All-weighted frequent itemset and negative itemset mining stage, comprising the following steps 2.1 and 2.2:

2.1 Extract all-weighted candidate 1-itemsets from the item library and mine all-weighted frequent 1-itemsets; the specific steps are carried out according to 2.1.1 to 2.1.3:

2.1.1 Extract all-weighted candidate 1-itemsets from the item library;

2.1.2 Accumulate the weight sum of each all-weighted candidate 1-itemset in the all-weighted database and calculate its support;

2.1.3 Add the all-weighted candidate 1-itemsets whose support is greater than or equal to the minimum support threshold to the all-weighted frequent itemset set as all-weighted frequent 1-itemsets;

2.2.1 Perform the Apriori join on the all-weighted frequent (i-1)-itemsets to generate all-weighted candidate i-itemsets, where i≥2;

2.2.2 Accumulate the weight sum of each all-weighted candidate i-itemset in the all-weighted database and calculate its support;

2.2.3 From the all-weighted candidate i-itemsets, take out the all-weighted frequent i-itemsets whose support is not less than the support threshold and store them in the all-weighted frequent itemset set; meanwhile, store the all-weighted negative i-itemsets whose support is less than the support threshold in the all-weighted negative itemset set;

2.2.4 Increase the value of i by 1; if the set of frequent (i-1)-itemsets is empty, go to step (3); otherwise, continue with steps 2.2.1 to 2.2.3;

(3) Pruning stage: interesting all-weighted frequent itemsets and negative itemsets are obtained through pruning:

3.1 For each frequent i-itemset awL_i in the frequent itemset set, calculate IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false, obtaining the interesting all-weighted frequent itemset set after pruning; the formula for IAWFI(awL_i) is as follows:

where awItemsetInt(I₁∪I₂)=awsup(I₁)×awsup(I₁∪I₂)×(1–awsup(I₂)), awItemsetInt(¬I₁∪¬I₂)=awsup(I₂)×(1–awsup(I₁))×(1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂)), minInt is the minimum interest threshold, and minsup is the minimum support threshold;

3.2 For each all-weighted negative i-itemset awN_i in the negative itemset set, calculate IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false, obtaining the interesting all-weighted negative itemset set after pruning; the formula for IAWNI(awN_i) is as follows:

where awItemsetInt(I₁∪I₂)=awsup(I₁)×awsup(I₁∪I₂)×(1–awsup(I₂));

awItemsetInt(I₁∪¬I₂)=awsup(I₁)×awsup(I₂)×(awsup(I₁)–awsup(I₁∪I₂));

awItemsetInt(¬I₁∪I₂)=(1–awsup(I₁))×(1–awsup(I₂))×(awsup(I₂)–awsup(I₁∪I₂));

awItemsetInt(¬I₁∪¬I₂)=awsup(I₂)×(1–awsup(I₁))×(1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂));

(4) Mine effective all-weighted positive and negative association rules from the interesting all-weighted frequent itemset set, comprising the following steps:

4.1 Take a frequent itemset awL_i from the interesting all-weighted frequent itemset set, obtain all proper subsets of awL_i, build the proper subset set of awL_i, and then perform the following operations:

4.2.1 Take any two proper subsets I₁ and I₂ from the proper subset set of awL_i; when the intersection of I₁ and I₂ is the empty set, the sum of the numbers of items of I₁ and I₂ equals the number of items of the original frequent itemset, and the supports of I₁ and I₂ are both not less than the support threshold, calculate the intra-itemset weight ratio awIWR(I₁,I₂) and the dimension ratio awIDR(I₁,I₂) of the frequent itemset (I₁∪I₂); the formulas for awIWR(I₁,I₂) and awIDR(I₁,I₂) are as follows:

awIWR(I_{1}, I_{2}) = \frac{w_{12}}{w_{1} \times w_{2}};

awIDR(I_{1}, I_{2}) = \frac{k_{12}}{k_{1} \times k_{2}};

w₁₂, w₁ and w₂ are the weight sums in the all-weighted database AWD of the all-weighted itemset (I₁,I₂) and of its sub-itemsets I₁ and I₂, respectively; k₁₂, k₁ and k₂ are the numbers of items of the itemset (I₁,I₂) and of its sub-itemsets I₁ and I₂, respectively;

4.2.2 When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I₁,I₂) from step 4.2.1 is greater than the dimension ratio awIDR(I₁,I₂), i.e. n×awIWR(I₁,I₂)>awIDR(I₁,I₂), proceed as follows:

4.2.2.1 If the awCPIR value of I₁→I₂, awCPIR(I₁→I₂), is not less than the confidence threshold minconf, the all-weighted association rule I₁→I₂ is mined; if the awCPIR value of I₂→I₁, awCPIR(I₂→I₁), is not less than the confidence threshold minconf, the all-weighted association rule I₂→I₁ is mined; the formulas for awCPIR(I₁→I₂) and awCPIR(I₂→I₁) are as follows:

awCPIR(I_{1} \rightarrow I_{2}) = \frac{awsup(I_{2} \cup I_{1}) - awsup(I_{1}) awsup(I_{2})}{awsup(I_{1}) (1 - awsup(I_{2}))};

awCPIR(I_{2} \rightarrow I_{1}) = \frac{awsup(I_{2} \cup I_{1}) - awsup(I_{1}) awsup(I_{2})}{awsup(I_{2}) (1 - awsup(I_{1}))};

4.2.2.2 If the support awsup(¬I₁∪¬I₂) of ¬I₁∪¬I₂ is not less than the support threshold minsup, then: ① if the awCPIR value of ¬I₁→¬I₂, awCPIR(¬I₁→¬I₂), is not less than the confidence threshold minconf, the all-weighted negative association rule ¬I₁→¬I₂ is mined; ② if the awCPIR value of ¬I₂→¬I₁, awCPIR(¬I₂→¬I₁), is not less than the confidence threshold minconf, the all-weighted negative association rule ¬I₂→¬I₁ is mined; the formulas for awsup(¬I₁∪¬I₂), awCPIR(¬I₁→¬I₂) and awCPIR(¬I₂→¬I₁) are as follows:

awsup(¬I₁→¬I₂)=awsup(¬I₁∪¬I₂)=1–awsup(I₁)–awsup(I₂)+awsup(I₁∪I₂);

4.2.3 When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I₁,I₂) from step 4.2.1 is less than the dimension ratio awIDR(I₁,I₂), i.e. n×awIWR(I₁,I₂)<awIDR(I₁,I₂), proceed as follows:

4.2.3.1 If the support awsup(I₁∪¬I₂) of I₁∪¬I₂ is not less than the support threshold minsup, then: ① if the awCPIR value of I₁→¬I₂, awCPIR(I₁→¬I₂), is not less than the confidence threshold minconf, the all-weighted negative association rule I₁→¬I₂ is mined; ② if the awCPIR value of ¬I₂→I₁, awCPIR(¬I₂→I₁), is not less than the confidence threshold minconf, the all-weighted negative association rule ¬I₂→I₁ is mined; the formulas for awsup(I₁∪¬I₂), awCPIR(I₁→¬I₂) and awCPIR(¬I₂→I₁) are as follows:

awsup(I₁→¬I₂)=awsup(I₁∪¬I₂)=awsup(I₁)–awsup(I₁∪I₂);

4.2.3.2 If the support awsup(¬I₁∪I₂) of ¬I₁∪I₂ is not less than the support threshold minsup, then: ① if the awCPIR value of ¬I₁→I₂, awCPIR(¬I₁→I₂), is not less than the confidence threshold minconf, the all-weighted negative association rule ¬I₁→I₂ is mined; ② if the awCPIR value of I₂→¬I₁, awCPIR(I₂→¬I₁), is not less than the confidence threshold minconf, the all-weighted negative association rule I₂→¬I₁ is mined; the formulas for awsup(¬I₁∪I₂), awCPIR(¬I₁→I₂) and awCPIR(I₂→¬I₁) are as follows:

awsup(¬I₁→I₂)=awsup(¬I₁∪I₂)=awsup(I₂)–awsup(I₁∪I₂);

4.2.4 Continue with steps 4.2.1 to 4.2.3; when every proper subset in the proper subset set of awL_i has been taken out once and only once, go to step 4.2.5;

4.2.5 Continue with step 4.1; when every frequent itemset awL_i in the interesting all-weighted frequent itemset set has been taken out once and only once, go to step (5);

(5) Mine effective all-weighted negative association rules from the interesting all-weighted negative itemset set, comprising the following steps:

5.1 Take a negative itemset awN_i from the interesting all-weighted negative itemset set, obtain all proper subsets of awN_i, build the proper subset set of awN_i, and then perform the following operations:

5.2.1 Take any two proper subsets I₁ and I₂ from the proper subset set of awN_i; when the intersection of I₁ and I₂ is the empty set, the sum of the numbers of items of I₁ and I₂ equals the number of items of the original itemset, and the supports of I₁ and I₂ are both greater than or equal to the support threshold, calculate the intra-itemset weight ratio awIWR(I₁,I₂) and the dimension ratio awIDR(I₁,I₂) of the negative itemset I₁∪I₂;

5.2.2 When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I₁,I₂) from step 5.2.1 is greater than the dimension ratio awIDR(I₁,I₂), i.e. n×awIWR(I₁,I₂)>awIDR(I₁,I₂), proceed as follows:

5.2.2.1 If the support of ¬I₁∪¬I₂ is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of ¬I₁→¬I₂, awCPIR(¬I₁→¬I₂), is greater than or equal to the confidence threshold minconf, the all-weighted negative association rule ¬I₁→¬I₂ is mined; ② if the awCPIR value of ¬I₂→¬I₁, awCPIR(¬I₂→¬I₁), is greater than or equal to the confidence threshold minconf, the all-weighted negative association rule ¬I₂→¬I₁ is mined;

5.2.3 When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I₁,I₂) from step 5.2.1 is less than the dimension ratio awIDR(I₁,I₂), i.e. n×awIWR(I₁,I₂)<awIDR(I₁,I₂), proceed as follows:

5.2.3.1 If the support of I₁∪¬I₂ is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of I₁→¬I₂, awCPIR(I₁→¬I₂), is greater than or equal to the confidence threshold minconf, the all-weighted negative association rule I₁→¬I₂ is mined; ② if the awCPIR value of ¬I₂→I₁, awCPIR(¬I₂→I₁), is greater than or equal to the confidence threshold minconf, the all-weighted negative association rule ¬I₂→I₁ is mined;

5.2.3.2 If the support of ¬I₁∪I₂ is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of ¬I₁→I₂, awCPIR(¬I₁→I₂), is greater than or equal to the confidence threshold minconf, the all-weighted negative association rule ¬I₁→I₂ is mined; ② if the awCPIR value of I₂→¬I₁, awCPIR(I₂→¬I₁), is greater than or equal to the confidence threshold minconf, the all-weighted negative association rule I₂→¬I₁ is mined;

5.2.4 Continue with steps 5.2.1 to 5.2.3; when every proper subset in the proper subset set of awN_i has been taken out once and only once, go to step 5.2.5;

5.2.5 Continue with step 5.1; when every negative itemset awN_i in the interesting all-weighted negative itemset set has been taken out once and only once, the mining of all-weighted positive and negative association rules ends;

"¬" is the negation symbol; ¬I₁ denotes the event that I₁ does not occur in a transaction and is called the negative itemset ¬I₁; I₁∪¬I₂ denotes an itemset that has the sub-itemset I₁ and the negative sub-itemset ¬I₂; the association rule I₁→¬I₂ means that if the event of subset I₁ occurs, then the event of subset I₂ does not occur.

2. The all-weighted pattern mining method for discovering association rules between text words according to claim 1, characterized in that the specific steps of preprocessing the all-weighted data to be processed are: when the all-weighted data to be processed is Chinese text data, perform word segmentation, remove stop words, extract feature words and calculate their weights; when the all-weighted data to be processed is English text data, perform stemming, remove stop words, perform lexical analysis, extract feature words and calculate their weights.