CN103838854B - Completely-weighted mode mining method for discovering association rules among texts - Google Patents

Publication number
CN103838854B
Authority
CN
China
Prior art keywords
awsup
completely
awcpir
item
negative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410096985.2A
Other languages
Chinese (zh)
Other versions
CN103838854A (en
Inventor
黄名选
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics
Priority to CN201410096985.2A
Publication of CN103838854A
Application granted
Publication of CN103838854B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis

Abstract

The invention discloses a completely-weighted mode mining method for discovering association rules among texts. The completely-weighted data to be processed are pre-processed, and a completely-weighted database and an item library are established; completely-weighted frequent item sets and negative item sets are mined, and interesting completely-weighted frequent item sets and interesting negative item sets are obtained through pruning; valid completely-weighted positive and negative association rules are then mined through a support-CPIR model-correlation-interestingness evaluation framework. The method overcomes defects of existing weighted mining techniques. Because it takes into account the characteristic of completely-weighted data, namely that item weights are objectively distributed over the transaction records of the database and change from record to record, it obtains completely-weighted positive and negative association modes that are closer to reality and more reasonable, and it avoids invalid and uninteresting association modes. The numbers of mined candidate item sets, frequent item sets, negative item sets, and positive and negative association rule modes are all smaller than those produced by the prior art. The mining efficiency is greatly improved, and the method has good extensibility.

Description

Completely weighted pattern mining method for discovering association rules among text words
Technical field
The invention belongs to the field of data mining. Specifically, it is a completely weighted positive and negative pattern mining method for discovering association rules among text words, applicable to fields such as the discovery of feature word association patterns in text mining and query expansion in text information retrieval.
Background art
Over the past 20 years, association rule mining has attracted great interest and study from many scholars and has become one of the hot topics of data mining research. Its research is concentrated mainly on the following aspects.
Positive and negative association pattern mining based on item frequency treats every item in the database equally and uniformly, and mines association patterns using the probability with which an itemset occurs in the database as its support. The defect of association rule mining based on item frequency is that it considers only item frequency and ignores item weights, which often leads to an increase in redundant, uninteresting, and invalid association rules.
To overcome this defect, positive and negative association rule mining based on item weights has received attention and study. It introduces item weights to reflect that items differ in importance and carry different weights in the database. Positive and negative association rule mining based on item weights is divided into weighted positive and negative association rule mining and completely weighted positive and negative association rule mining. The main characteristic of weighted positive and negative association rule mining is that item weights reflect the different importance of items. As research has deepened, the role of weighted negative association rules has become increasingly prominent: while mining favorable factors, one also hopes to find unfavorable ones, and this can be achieved by analyzing negative association rules. The defect of weighted association rule mining is that it ignores the case in which an item has a different weight in each transaction record of the database. Data whose item weights are objectively distributed over the transaction records and change from record to record are called completely weighted data. Existing weighted association rule mining methods are not suitable for mining completely weighted data. For this reason, completely weighted association rule mining has received attention and study since 2003, and completely weighted positive and negative association rule mining now has important theoretical and application value in fields such as text mining and information retrieval. Completely weighted association rule mining methods can effectively overcome the defects of weighted association rule mining, but they still cannot solve the technical problem of mining completely weighted negative association rules. To address these problems, the invention studies completely weighted positive and negative association rule mining in depth and proposes a new completely weighted positive and negative association rule mining method based on the intra-itemset weight ratio and dimension ratio. Applied to query expansion in text information retrieval, it can improve retrieval performance; applied to text mining, it can discover positive and negative feature word association patterns that are closer to reality and more reasonable.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a completely weighted pattern mining method for discovering association rules among text words, enriching the results of association rule mining based on item weights and solving the technical difficulty of completely weighted positive and negative association rule mining. The method has important theoretical value and broad application prospects in fields such as text mining and text information retrieval.
To achieve this object, the invention adopts the following technical scheme: a completely weighted pattern mining method for discovering association rules among text words, comprising the following steps:
(1) Completely weighted data preprocessing stage:
Massive amounts of completely weighted data exist in the real world, for example text information data. The preprocessing method depends on the specific data object. For Chinese text data, preprocessing consists of word segmentation, stop word removal, feature word extraction, and feature word weight computation. For English text data, it consists of stemming, stop word removal, lexical analysis, feature word extraction, and feature word weight computation. The result of preprocessing is a completely weighted database and an item library;
The feature word weight for text data is computed as:
$$w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max_j(tf_{ij})}\right) \times idf_i$$
where $w_{ij}$ is the weight of the $i$-th feature word in the $j$-th document, $tf_{ij}$ is the term frequency of the $i$-th feature word in the $j$-th document, and $idf_i$ is the inverse document frequency of the $i$-th feature word, $idf_i = \log(N/df_i)$, with $N$ the total number of documents in the document collection and $df_i$ the number of documents containing the $i$-th feature word.
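The weight formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `feature_weights` is a hypothetical helper name, documents are assumed to be already tokenized, $\max_j(tf_{ij})$ is read as the largest term frequency within the document (the usual augmented term frequency normalization), and the natural logarithm is used because the text does not state a base.

```python
import math
from collections import Counter

def feature_weights(docs):
    """Return one {term: weight} dict per document (docs: list of token lists)."""
    n_docs = len(docs)
    # df_i: number of documents that contain term i
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: normalize by the largest term frequency in this document
        max_tf = max(tf.values()) if tf else 1
        weights.append({
            term: (0.5 + 0.5 * count / max_tf) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

weights = feature_weights([["a", "a", "b"], ["b", "c"]])
```

Note that a term occurring in every document gets weight 0 under this formula, because its inverse document frequency is $\log(1) = 0$.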
(2) Completely weighted frequent itemset and negative itemset mining stage, comprising steps 2.1 and 2.2:
2.1 Extract the completely weighted candidate 1-itemsets $awC_1$ from the item library and mine the completely weighted frequent 1-itemsets $awL_1$, according to steps 2.1.1 to 2.1.3:
2.1.1 Extract the completely weighted candidate 1-itemsets $awC_1$ from the item library;
2.1.2 Accumulate the weight sum of each completely weighted candidate 1-itemset $awC_1$ in the completely weighted database (All-Weighted Database, AWD for short) and compute its support;
The support of $awC_1$ is computed as:
$$\mathrm{awsup}(awC_1) = \frac{\sum_{T_i \in AWD} \sum_{i_j \in awC_1} w[T_i][i_j]}{n \times k}$$
where the numerator is the weight sum of the items $i_j$ over the transaction records $T_i$, $n$ is the total number of transaction records in the completely weighted database AWD, and $k$ is the length of the itemset $awC_1$ (that is, its number of items).
2.1.3 Add to the frequent itemset collection awPIS the completely weighted frequent 1-itemsets $awL_1$, that is, the candidate 1-itemsets in $awC_1$ whose support is greater than or equal to the minimum support threshold minsup;
2.2 Starting from the completely weighted candidate 2-itemsets, proceed according to steps 2.2.1 to 2.2.4:
2.2.1 Perform the Apriori join on the completely weighted frequent (i-1)-itemsets $awL_{i-1}$ to generate the completely weighted candidate i-itemsets $awC_i$, where $i \geq 2$;
2.2.2 Accumulate the weight sum of each completely weighted candidate i-itemset $awC_i$ in the completely weighted database AWD and compute its support $\mathrm{awsup}(awC_i)$:
$$\mathrm{awsup}(awC_i) = \frac{\sum_{T_i \in AWD} \sum_{i_j \in awC_i} w[T_i][i_j]}{n \times k}$$
where the numerator is the weight sum of the items $i_j$ over the transaction records $T_i$, $n$ is the total number of transaction records in AWD, and $k$ is the length of the itemset $awC_i$.
2.2.3 From the completely weighted candidate i-itemsets $awC_i$, take out the frequent i-itemsets $awL_i$ whose support is not less than the support threshold minsup and store them in the completely weighted frequent itemset collection awPIS; at the same time, store the completely weighted negative i-itemsets $awN_i$, whose support is less than the support threshold, in the completely weighted negative itemset collection awNIS.
2.2.4 Increase $i$ by 1. If the frequent (i-1)-itemset collection $awL_{i-1}$ is empty (that is, its length is 0), go to step (3); otherwise, repeat steps 2.2.1 to 2.2.3;
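Steps 2.1 and 2.2 can be sketched as a level-wise loop. The code below is a minimal illustration, not the patented implementation. It assumes that each transaction record is a dict mapping an item to its weight, that the weight sum of a multi-item itemset is taken over the records containing every item of the itemset (an assumption; the text does not spell this out), and that the Apriori join can be simplified to unions of two frequent (i-1)-itemsets that have exactly i items. The names `awsup` and `mine_itemsets` are hypothetical.

```python
from itertools import combinations

def awsup(itemset, awd):
    """Completely weighted support: weight sum of the itemset divided by n * k.

    Assumption: only records that contain every item of the itemset contribute.
    """
    items = tuple(itemset)
    total = sum(
        sum(rec[i] for i in items)
        for rec in awd
        if all(rec.get(i, 0) > 0 for i in items)
    )
    return total / (len(awd) * len(items))

def mine_itemsets(awd, minsup):
    """Return (awPIS, awNIS) as dicts mapping frozenset -> awsup."""
    items = sorted({i for rec in awd for i in rec})
    aw_pis, aw_nis = {}, {}
    # Step 2.1: frequent 1-itemsets
    level = []
    for i in items:
        s = awsup((i,), awd)
        if s >= minsup:
            aw_pis[frozenset((i,))] = s
            level.append(frozenset((i,)))
    # Step 2.2: join frequent (i-1)-itemsets until no frequent itemset remains
    size = 2
    while level:
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = []
        for c in candidates:
            s = awsup(sorted(c), awd)
            if s >= minsup:
                aw_pis[c] = s          # frequent i-itemset
                level.append(c)
            else:
                aw_nis[c] = s          # negative i-itemset
        size += 1
    return aw_pis, aw_nis

# Hypothetical toy database with four transaction records
awd = [{"a": 0.8, "b": 0.6}, {"a": 0.9}, {"b": 0.5, "c": 0.1}, {"a": 0.7, "b": 0.4}]
pis, nis = mine_itemsets(awd, 0.35)
```

Because candidates are built only from frequent (i-1)-itemsets, every itemset stored in `aw_nis` has frequent sub-itemsets, which matches the definition of a negative itemset used in the text.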
(3) Pruning stage: obtain the interesting completely weighted frequent itemsets and negative itemsets by pruning.
3.1 For each frequent i-itemset $awL_i$ in the frequent itemset collection awPIS, compute the value of $IAWFI(awL_i)$ and prune the frequent itemsets whose $IAWFI(awL_i)$ value is false; after pruning, awPIS is the interesting completely weighted frequent itemset collection. $IAWFI(awL_i)$ is computed from the following quantities:
$\mathrm{awItemsetInt}(I_1 \cup I_2) = \mathrm{awsup}(I_1) \times \mathrm{awsup}(I_1 \cup I_2) \times (1 - \mathrm{awsup}(I_2))$,
$\mathrm{awItemsetInt}(\neg I_1 \cup \neg I_2) = \mathrm{awsup}(I_2) \times (1 - \mathrm{awsup}(I_1)) \times (1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2))$,
where minInt is the minimum interestingness threshold and minsup is the minimum support threshold.
3.2 For each negative i-itemset $awN_i$ in the negative itemset collection awNIS, compute the value of $IAWNI(awN_i)$ and prune the negative itemsets whose $IAWNI(awN_i)$ value is false; after pruning, awNIS is the interesting completely weighted negative itemset collection. $IAWNI(awN_i)$ is computed from the following quantities:
$\mathrm{awItemsetInt}(I_1 \cup I_2) = \mathrm{awsup}(I_1) \times \mathrm{awsup}(I_1 \cup I_2) \times (1 - \mathrm{awsup}(I_2))$
$\mathrm{awItemsetInt}(I_1 \cup \neg I_2) = \mathrm{awsup}(I_1) \times \mathrm{awsup}(I_2) \times (\mathrm{awsup}(I_1) - \mathrm{awsup}(I_1 \cup I_2))$
$\mathrm{awItemsetInt}(\neg I_1 \cup I_2) = (1 - \mathrm{awsup}(I_1)) \times (1 - \mathrm{awsup}(I_2)) \times (\mathrm{awsup}(I_2) - \mathrm{awsup}(I_1 \cup I_2))$
$\mathrm{awItemsetInt}(\neg I_1 \cup \neg I_2) = \mathrm{awsup}(I_2) \times (1 - \mathrm{awsup}(I_1)) \times (1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2))$
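The four interest values used in pruning depend only on three supports, so they are easy to compute together. The sketch below assumes the negated forms read as $I_1 \cup \neg I_2$, $\neg I_1 \cup I_2$, and $\neg I_1 \cup \neg I_2$; the function name `itemset_interest` and the dictionary keys are hypothetical. How the values are combined with minInt and minsup into the IAWFI and IAWNI tests is not reproduced here.

```python
def itemset_interest(s1, s2, s12):
    """Return the four awItemsetInt values.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    Each value has the shape sup(A) * sup(A ∪ B) * (1 - sup(B)).
    """
    return {
        "I1,I2": s1 * s12 * (1 - s2),
        "I1,notI2": s1 * s2 * (s1 - s12),
        "notI1,I2": (1 - s1) * (1 - s2) * (s2 - s12),
        "notI1,notI2": s2 * (1 - s1) * (1 - s1 - s2 + s12),
    }

# Hypothetical supports chosen only for illustration
interest = itemset_interest(0.5, 0.4, 0.3)
```

Returning all four values at once lets a caller apply whatever threshold test it needs without recomputing the supports.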
(4) Mine valid completely weighted positive and negative association rules from the interesting completely weighted frequent itemset collection awPIS, comprising the following steps:
4.1 Take a frequent itemset $awL_i$ out of the interesting completely weighted frequent itemset collection awPIS, find all proper subsets of $awL_i$, build the proper subset collection of $awL_i$, and then perform the following operations:
4.2.1 Take any two proper subsets $I_1$ and $I_2$ from the proper subset collection of $awL_i$. When the intersection of $I_1$ and $I_2$ is empty ($I_1 \cap I_2 = \emptyset$), the sum of their item counts equals the item count of the original frequent itemset ($I_1 \cup I_2 = awL_i$), and the supports of $I_1$ and $I_2$ are both not less than the support threshold ($\mathrm{awsup}(I_1) \geq minsup$, $\mathrm{awsup}(I_2) \geq minsup$), compute the intra-itemset weight ratio $awIWR(I_1, I_2)$ of the frequent itemset $(I_1 \cup I_2)$ and its dimension ratio $awIDR(I_1, I_2)$:
$$awIWR(I_1, I_2) = \frac{w_{12}}{w_1 \times w_2}, \qquad awIDR(I_1, I_2) = \frac{k_{12}}{k_1 \times k_2}$$
where $w_{12}$, $w_1$, and $w_2$ are the weight sums in the completely weighted database AWD of the completely weighted itemset $(I_1, I_2)$ and of its sub-itemsets $I_1$ and $I_2$ respectively; $k_{12}$, $k_1$, and $k_2$ are the item counts of $(I_1, I_2)$, $I_1$, and $I_2$ respectively; and $n$ is the total number of transaction records in the database.
4.2.2 When the product of the total number of transaction records $n$ and the intra-itemset weight ratio from step 4.2.1 is greater than the dimension ratio (that is, $n \times awIWR(I_1, I_2) > awIDR(I_1, I_2)$), perform the following operations:
4.2.2.1 If the awCPIR value of $I_1 \to I_2$ is not less than the confidence threshold ($\mathrm{awCPIR}(I_1 \to I_2) \geq minconf$), mine the completely weighted association rule $I_1 \to I_2$; if the awCPIR value of $I_2 \to I_1$ is not less than the confidence threshold ($\mathrm{awCPIR}(I_2 \to I_1) \geq minconf$), mine the completely weighted association rule $I_2 \to I_1$. The awCPIR values are computed by the awCPIR formulas of Definition 4.
4.2.2.2 If the support of $(\neg I_1 \cup \neg I_2)$ is not less than the support threshold ($\mathrm{awsup}(\neg I_1 \cup \neg I_2) \geq minsup$), then: ① if $\mathrm{awCPIR}(\neg I_1 \to \neg I_2) \geq minconf$, mine the completely weighted negative association rule $\neg I_1 \to \neg I_2$; ② if $\mathrm{awCPIR}(\neg I_2 \to \neg I_1) \geq minconf$, mine the completely weighted negative association rule $\neg I_2 \to \neg I_1$. The support is computed as:
$$\mathrm{awsup}(\neg I_1 \to \neg I_2) = \mathrm{awsup}(\neg I_1 \cup \neg I_2) = 1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2)$$
4.2.3 When the product of the total number of transaction records $n$ and the intra-itemset weight ratio from step 4.2.1 is less than the dimension ratio (that is, $n \times awIWR(I_1, I_2) < awIDR(I_1, I_2)$), perform the following operations:
4.2.3.1 If the support of $(I_1 \cup \neg I_2)$ is not less than the support threshold ($\mathrm{awsup}(I_1 \cup \neg I_2) \geq minsup$), then: ① if $\mathrm{awCPIR}(I_1 \to \neg I_2) \geq minconf$, mine the completely weighted negative association rule $I_1 \to \neg I_2$; ② if $\mathrm{awCPIR}(\neg I_2 \to I_1) \geq minconf$, mine the completely weighted negative association rule $\neg I_2 \to I_1$. The support is computed as:
$$\mathrm{awsup}(I_1 \to \neg I_2) = \mathrm{awsup}(I_1 \cup \neg I_2) = \mathrm{awsup}(I_1) - \mathrm{awsup}(I_1 \cup I_2)$$
4.2.3.2 If the support of $(\neg I_1 \cup I_2)$ is not less than the support threshold ($\mathrm{awsup}(\neg I_1 \cup I_2) \geq minsup$), then: ① if $\mathrm{awCPIR}(\neg I_1 \to I_2) \geq minconf$, mine the completely weighted negative association rule $\neg I_1 \to I_2$; ② if $\mathrm{awCPIR}(I_2 \to \neg I_1) \geq minconf$, mine the completely weighted negative association rule $I_2 \to \neg I_1$. The support is computed as:
$$\mathrm{awsup}(\neg I_1 \to I_2) = \mathrm{awsup}(\neg I_1 \cup I_2) = \mathrm{awsup}(I_2) - \mathrm{awsup}(I_1 \cup I_2)$$
4.2.4 Repeat steps 4.2.1 to 4.2.3 until every proper subset in the proper subset collection of $awL_i$ has been taken out exactly once, then go to step 4.2.5;
4.2.5 Repeat step 4.1 until every frequent itemset $awL_i$ in the interesting completely weighted frequent itemset collection awPIS has been taken out exactly once, then go to step (5);
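The branching in steps 4.2.2 and 4.2.3 for one split $(I_1, I_2)$ can be sketched as follows. Several points are assumptions rather than statements of the source: the awCPIR formulas are not reproduced in this text, so the helper `cpir` uses a form inferred from the worked example under Definition 4; the test $n \times awIWR > awIDR$ is replaced by the equivalent comparison of $\mathrm{awsup}(I_1 \cup I_2)$ with $\mathrm{awsup}(I_1) \times \mathrm{awsup}(I_2)$, which follows from $\mathrm{awsup} = w/(n \times k)$; and all supports are assumed to lie strictly between 0 and 1. The function name `rules_for_pair` and the rule labels (with `~` for negation) are hypothetical.

```python
def rules_for_pair(s1, s2, s12, minsup, minconf):
    """Return the rule labels mined for one split (I1, I2) of a frequent itemset.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    """
    def cpir(sa, sb, sab):
        # Assumed awCPIR form: |sup(A ∪ B) - sup(A) sup(B)| / (sup(A) (1 - sup(B)))
        return abs(sab - sa * sb) / (sa * (1 - sb))

    n1, n2 = 1 - s1, 1 - s2          # supports of the negated itemsets
    rules = []

    def emit(label, sa, sb, sab):
        if cpir(sa, sb, sab) >= minconf:
            rules.append(label)

    corr = s12 / (s1 * s2)           # equals n * awIWR / awIDR
    if corr > 1:                     # step 4.2.2
        emit("I1 -> I2", s1, s2, s12)
        emit("I2 -> I1", s2, s1, s12)
        both_neg = 1 - s1 - s2 + s12
        if both_neg >= minsup:
            emit("~I1 -> ~I2", n1, n2, both_neg)
            emit("~I2 -> ~I1", n2, n1, both_neg)
    elif corr < 1:                   # step 4.2.3
        if s1 - s12 >= minsup:
            emit("I1 -> ~I2", s1, n2, s1 - s12)
            emit("~I2 -> I1", n2, s1, s1 - s12)
        if s2 - s12 >= minsup:
            emit("~I1 -> I2", n1, s2, s2 - s12)
            emit("I2 -> ~I1", s2, n1, s2 - s12)
    return rules

# Hypothetical supports: the first pair is positively correlated, the second negatively
positive_case = rules_for_pair(0.5, 0.4, 0.3, minsup=0.1, minconf=0.4)
negative_case = rules_for_pair(0.5, 0.4, 0.1, minsup=0.1, minconf=0.4)
```

When the ratio equals 1 exactly, neither branch applies and no rule is produced, which mirrors the text: it defines only the greater-than and less-than cases.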
(5) Mine valid completely weighted negative association rules from the interesting completely weighted negative itemset collection awNIS, comprising the following steps:
5.1 Take a negative itemset $awN_i$ out of the interesting completely weighted negative itemset collection awNIS, find all proper subsets of $awN_i$, build the proper subset collection of $awN_i$, and then perform the following operations:
5.2.1 Take any two proper subsets $I_1$ and $I_2$ from the proper subset collection of $awN_i$. When the intersection of $I_1$ and $I_2$ is empty ($I_1 \cap I_2 = \emptyset$), the sum of their item counts equals the item count of the original negative itemset ($I_1 \cup I_2 = awN_i$), and the supports of $I_1$ and $I_2$ are both greater than or equal to the support threshold ($\mathrm{awsup}(I_1) \geq minsup$, $\mathrm{awsup}(I_2) \geq minsup$), compute the intra-itemset weight ratio $awIWR(I_1, I_2)$ of the negative itemset $(I_1 \cup I_2)$ and its dimension ratio $awIDR(I_1, I_2)$, using the same formulas as in step 4.2.1.
5.2.2 When the product of the total number of transaction records $n$ and the intra-itemset weight ratio from step 5.2.1 is greater than the dimension ratio (that is, $n \times awIWR(I_1, I_2) > awIDR(I_1, I_2)$), perform the following operations:
5.2.2.1 If the support of $(\neg I_1 \cup \neg I_2)$ is greater than or equal to the support threshold ($\mathrm{awsup}(\neg I_1 \cup \neg I_2) \geq minsup$), then: ① if $\mathrm{awCPIR}(\neg I_1 \to \neg I_2) \geq minconf$, mine the completely weighted negative association rule $\neg I_1 \to \neg I_2$; ② if $\mathrm{awCPIR}(\neg I_2 \to \neg I_1) \geq minconf$, mine the completely weighted negative association rule $\neg I_2 \to \neg I_1$. The formulas for $\mathrm{awsup}(\neg I_1 \cup \neg I_2)$, $\mathrm{awCPIR}(\neg I_1 \to \neg I_2)$, and $\mathrm{awCPIR}(\neg I_2 \to \neg I_1)$ are the same as in step 4.2.2.2.
5.2.3 When the product of the total number of transaction records $n$ and the intra-itemset weight ratio from step 5.2.1 is less than the dimension ratio (that is, $n \times awIWR(I_1, I_2) < awIDR(I_1, I_2)$):
5.2.3.1 If the support of $(I_1 \cup \neg I_2)$ is greater than or equal to the support threshold ($\mathrm{awsup}(I_1 \cup \neg I_2) \geq minsup$), then: ① if $\mathrm{awCPIR}(I_1 \to \neg I_2) \geq minconf$, mine the completely weighted negative association rule $I_1 \to \neg I_2$; ② if $\mathrm{awCPIR}(\neg I_2 \to I_1) \geq minconf$, mine the completely weighted negative association rule $\neg I_2 \to I_1$. The formulas are the same as in step 4.2.3.1;
5.2.3.2 If the support of $(\neg I_1 \cup I_2)$ is greater than or equal to the support threshold ($\mathrm{awsup}(\neg I_1 \cup I_2) \geq minsup$), then: ① if $\mathrm{awCPIR}(\neg I_1 \to I_2) \geq minconf$, mine the completely weighted negative association rule $\neg I_1 \to I_2$; ② if $\mathrm{awCPIR}(I_2 \to \neg I_1) \geq minconf$, mine the completely weighted negative association rule $I_2 \to \neg I_1$. The formulas are the same as in step 4.2.3.2;
5.2.4 Repeat steps 5.2.1 to 5.2.3 until every proper subset in the proper subset collection of $awN_i$ has been taken out exactly once, then go to step 5.2.5;
5.2.5 Repeat step 5.1 until every negative itemset $awN_i$ in the interesting completely weighted negative itemset collection awNIS has been taken out exactly once;
This completes the mining of completely weighted positive and negative association rules.
Compared with the prior art, the invention has the following beneficial effects:
(1) To address the defects of existing weighted positive and negative association rule mining, the invention constructs an evaluation framework consisting of support, the CPIR (Conditional Probability Increment Ratio) model, correlation, and interestingness, together with pruning strategies for frequent itemsets and negative itemsets, and proposes a new completely weighted positive and negative association rule mining method based on this SCPIRCI evaluation framework, which effectively solves the problem of mining completely weighted positive and negative association rules. The invention takes into account the characteristic of completely weighted data that item weights change with the database records, and it adopts a new itemset pruning strategy, so that mining time is greatly reduced and mining efficiency is greatly improved.
(2) The invention proposes the concepts of the intra-itemset weight ratio and the dimension ratio of completely weighted itemsets, enriching the theory of completely weighted data mining.
(3) Through a large number of rigorous and careful experiments, the invention was compared with a traditional unweighted positive and negative association rule mining method. Using the Chinese Web test collection CWT200g as the experimental document collection, the mining performance of the invention was analyzed with respect to changes in support, confidence, number of items, and document collection size. The experimental results show that, compared with the baseline method, the invention achieves good mining performance and greatly improved mining efficiency. Whether the support threshold or the confidence threshold is varied, the invention mines far fewer candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rules than the baseline method. When the number of items and the transaction document scale are varied, the invention also shows good scalability. The main reasons are as follows. The baseline method is an unweighted positive and negative association rule mining method based on item frequency. It does not take itemset weights into account and cannot fully reflect the inherent characteristics of completely weighted data, so it produces many invalid and spurious itemsets and positive and negative association rule patterns; the numbers of itemsets and rules are much larger, and its mining efficiency is much lower. The invention is a completely weighted positive and negative association rule mining method based on weights. It effectively overcomes the inherent defects of the baseline method by incorporating the characteristic of the completely weighted data model (namely, that item weights are objectively distributed over the transaction records and change from record to record) into the whole mining process, so that the mined association rules are more reasonable and closer to reality. At the same time, it adopts a new pruning strategy that greatly reduces the number of invalid and uninteresting frequent itemsets and negative itemsets, effectively reduces the occurrence of uninteresting rules, and greatly improves mining efficiency.
Description of the drawings
Fig. 1 is a block diagram of the completely weighted pattern mining method for discovering association rules among text words according to the invention.
Fig. 2 is the overall flow chart of the completely weighted pattern mining method for discovering association rules among text words according to the invention.
Fig. 3 compares the numbers of candidate itemsets mined under different support thresholds in Experiment 1.
Fig. 4 compares the numbers of frequent itemsets mined under different support thresholds in Experiment 1.
Fig. 5 compares the numbers of rules (A → B) mined under different support thresholds in Experiment 1.
Fig. 6 compares the numbers of negative rules (A → B) mined under different support thresholds in Experiment 1.
Fig. 7 compares the numbers of negative rules (A → B) mined under different support thresholds in Experiment 1.
Fig. 8 compares the numbers of negative rules (A → B) mined under different support thresholds in Experiment 1.
Fig. 9 shows how the numbers of candidate, frequent, and negative itemsets change with the number of items in Experiment 2.
Fig. 10 shows how the number of positive and negative association rules changes with the number of items in Experiment 2.
Fig. 11 shows how the number of negative association rules changes with the number of items in Experiment 2.
Fig. 12 shows how the numbers of candidate, frequent, and negative itemsets change with the document scale in Experiment 2.
Fig. 13 shows how the number of negative association rules changes with the document scale in Experiment 2.
Fig. 14 shows how the number of positive and negative association rules changes with the document scale in Experiment 2.
Detailed description of the embodiments
To better explain the technical scheme of the invention, the completely weighted data model and the related concepts involved in the invention are introduced below:
1. The difference between weighted association rule mining and completely weighted association rule mining
The main difference between weighted association rule mining and completely weighted association rule mining lies in the source of the item weights and in the data model being mined. In the former, item weights are set subjectively by the user, are independent of the transaction database, and once set remain constant throughout the mining process. Consider copy paper and fax machines in a shop: because copy paper is priced lower than a fax machine, its per-unit profit is lower. To reflect the different contributions of the goods to profit, the user gives the fax machine, with its higher per-unit profit, a higher weight and gives copy paper a lower weight; once set, these weights stay fixed and are independent of the transaction database. In the latter, item weights are not set by the user but come from each transaction record in the database and change from record to record. For example, in a massive text database the weight of each feature word comes from each document and changes with the document; that is, a feature word item has different weights in different documents.
The item-weighted data model and the completely weighted item data model are the data models of weighted association rule mining and completely weighted association rule mining respectively, and they are two entirely different kinds of data model, as shown in Table 1 and Table 2, where $\{i_1, i_2, ..., i_m\}$ is the item set and $\{T_1, T_2, ..., T_n\}$ is the transaction set. In the weighted data model, $\{w_1, w_2, ..., w_m\}$ are the item weights, and in "1/0" the "1" indicates that the item occurs in the transaction record and the "0" that it does not. In the completely weighted data model, "$w[T_i][i_j]/0$ ($1 \leq i \leq n$, $1 \leq j \leq m$)" denotes the weight of the item: if the item occurs in the transaction record, its weight is $w[T_i][i_j]$; otherwise it is 0.
Table 1: Item-weighted data model. Table 2: Completely weighted item data model.
Example: Table 3 has 5 items and 5 transaction records, with item set $\{i_1, i_2, i_3, i_4, i_5\}$ = {Apple, Orange, Banana, Milk, Coca-cola}; Table 3 shows that $i_1$ does not occur in transaction record $T_3$. Table 4 is an example of completely weighted item data with the same items and the same number of transaction records as Table 3, in which item $i_1$ has weights 0.85, 0.93, 0.65, and 0.75 in transaction records $T_1$, $T_2$, $T_3$, and $T_5$ respectively and does not occur in transaction record $T_4$, so its weight there is 0.
Table 3: Item-weighted data example. Table 4: Completely weighted item data example.
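The contrast between the two data models can be made concrete with plain Python containers. This is only an illustrative layout: the names `item_weights`, `weighted_db`, and `awd_i1` and the values in the item-weighted part are hypothetical, while the four weights of $i_1$ are the ones quoted from Table 4 above.

```python
# Item-weighted model: one user-assigned weight per item;
# transaction records only mark presence (1) or absence (0).
item_weights = {"i1": 0.9, "i2": 0.4}        # hypothetical values
weighted_db = [{"i1", "i2"}, {"i1"}]         # hypothetical records

# Completely weighted model: the weight is stored in each transaction record.
# Column i1 with the weights quoted from Table 4; T4 has no entry (weight 0).
awd_i1 = {"T1": 0.85, "T2": 0.93, "T3": 0.65, "T5": 0.75}
n = 5                                        # total number of transaction records
k = 1                                        # i1 is a 1-itemset
awsup_i1 = sum(awd_i1.values()) / (n * k)
```

With these weights the weight sum is 3.18, so `awsup_i1` comes out at about 0.636, the value the example under Definition 4 states for $\mathrm{awsup}(i_1)$.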
2. Basic concepts of completely weighted data mining
Let the completely weighted database be $AWD = \{T_1, T_2, ..., T_n\}$ with $n$ transactions, where $T_i$ ($1 \leq i \leq n$) denotes the $i$-th transaction in AWD; let the itemset $I = \{i_1, i_2, ..., i_m\}$ denote all items in AWD, with $m$ items, where $i_j$ ($1 \leq j \leq m$) denotes the $j$-th item in AWD; and let $w[T_i][i_j]$ ($1 \leq i \leq n$, $1 \leq j \leq m$) denote the weight of item $i_j$ in transaction record $T_i$ (see the completely weighted item data model in Table 2). Let $I_1$ and $I_2$ be sub-itemsets of $I$, with $I_1 \subset I$, $I_2 \subset I$, and $I_1 \cap I_2 = \emptyset$. The following basic definitions are given:
Definition 1 (completely weighted support: All-weighted support, awsup for short): The completely weighted support $\mathrm{awsup}(I)$ is computed as in formula (1).
$$\mathrm{awsup}(I) = \frac{\sum_{T_i \in AWD} \sum_{i_j \in I} w[T_i][i_j]}{n \times k} \quad (1)$$
where the numerator is the weight sum of the items of $I$ over the transaction records, $n$ is the total number of transaction records in the completely weighted database AWD, and $k$ is the length of the itemset $I$ (that is, the number of items in $I$).
The supports of completely weighted negative itemsets and negative association rules are given by formulas (2) to (5).
$\mathrm{awsup}(\neg I) = 1 - \mathrm{awsup}(I)$ (2)
$\mathrm{awsup}(I_1 \to \neg I_2) = \mathrm{awsup}(I_1 \cup \neg I_2) = \mathrm{awsup}(I_1) - \mathrm{awsup}(I_1 \cup I_2)$ (3)
$\mathrm{awsup}(\neg I_1 \to I_2) = \mathrm{awsup}(\neg I_1 \cup I_2) = \mathrm{awsup}(I_2) - \mathrm{awsup}(I_1 \cup I_2)$ (4)
$\mathrm{awsup}(\neg I_1 \to \neg I_2) = \mathrm{awsup}(\neg I_1 \cup \neg I_2) = 1 - \mathrm{awsup}(I_1) - \mathrm{awsup}(I_2) + \mathrm{awsup}(I_1 \cup I_2)$ (5)
Definition 2 (completely weighted frequent itemsets and negative itemsets): Let the minimum support threshold be minsup. A completely weighted itemset $I$ is called a completely weighted frequent itemset if $\mathrm{awsup}(I) \geq minsup$. For a completely weighted itemset $(I_1 \cup I_2)$ in which $I_1$ and $I_2$ are both frequent itemsets, if $\mathrm{awsup}(I_1 \cup I_2) < minsup$, the itemset $(I_1 \cup I_2)$ is called a completely weighted negative itemset.
Example: Let minsup = 0.1. In the data of Table 4, $\mathrm{awsup}(i_2) = (0.21 + 0.35 + 0.05)/(5 \times 1) = 0.122 > minsup$, $\mathrm{awsup}(i_4) = 0.192 > minsup$, and $\mathrm{awsup}(i_2 \cup i_4) = 0.06 < minsup$, so the itemset $(i_2 \cup i_4)$ is a completely weighted negative itemset.
Definition 3 (all-weighted itemset interest, All-weighted Itemset Interest, awItemsetInt): interest measures how much attention a user pays to a mined association pattern; the higher the value, the more novel the pattern and the more attention the user pays to it. Based on the interest model for mining unweighted data (Cheng Jihua, Guo Jiansheng, Shi Pengfei. Research on multi-strategy methods for mining rules of interest [J]. Chinese Journal of Computers, 2000, 23(1): 47-51.), the following formulas are given:
awItemsetInt(I1∪I2) = awsup(I1) × awsup(I1∪I2) × (1 - awsup(I2)) (6)
awItemsetInt(I1∪¬I2) = awsup(I1) × awsup(I2) × (awsup(I1) - awsup(I1∪I2)) (7)
awItemsetInt(¬I1∪I2) = (1 - awsup(I1)) × (1 - awsup(I2)) × (awsup(I2) - awsup(I1∪I2)) (8)
awItemsetInt(¬I1∪¬I2) = awsup(I2) × (1 - awsup(I1)) × (1 - awsup(I1) - awsup(I2) + awsup(I1∪I2)) (9)
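The four interest formulas can be sketched as one small Python function. This is an illustration only: it assumes the parenthesization (1 - awsup(I1)) × (1 - awsup(I2)) × (awsup(I2) - awsup(I1∪I2)) for formula (8), and the input supports are hypothetical values chosen to exercise the formulas.

```python
def aw_itemset_interest(s1, s2, s12):
    """Return the four interest values of formulas (6) to (9).

    s1, s2 and s12 are awsup(I1), awsup(I2) and awsup(I1 u I2).
    """
    return {
        "I1 u I2": s1 * s12 * (1 - s2),
        "I1 u not I2": s1 * s2 * (s1 - s12),
        "not I1 u I2": (1 - s1) * (1 - s2) * (s2 - s12),
        "not I1 u not I2": s2 * (1 - s1) * (1 - s1 - s2 + s12),
    }


# Hypothetical supports, not taken from Table 4.
values = aw_itemset_interest(s1=0.5, s2=0.4, s12=0.3)
for pattern, interest in values.items():
    print(pattern, round(interest, 4))
```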
Definition 4 (all-weighted CPIR value, All-weighted Conditional_Probability Increment Ratio, abbreviated awCPIR): the CPIR model uses the ratio of conditional probability to prior probability to express the increment of p(I2/I1) relative to p(I2); the literature gives its formula as CPIR(I2/I1) = (p(I2/I1) - p(I2))/(1 - p(I2)). Based on the CPIR formula and the needs of all-weighted data mining, the awCPIR of all-weighted positive and negative association rules is computed by formulas (10) to (13):
awCPIR(I1→I2) = |awsup(I1∪I2) - awsup(I1) × awsup(I2)| / (awsup(I1) × (1 - awsup(I2))) (10)
awCPIR(I1→¬I2) = |awsup(I1∪¬I2) - awsup(I1) × awsup(¬I2)| / (awsup(I1) × awsup(I2)) (11)
awCPIR(¬I1→I2) = |awsup(¬I1∪I2) - awsup(¬I1) × awsup(I2)| / ((1 - awsup(I1)) × (1 - awsup(I2))) (12)
awCPIR(¬I1→¬I2) = |awsup(¬I1∪¬I2) - awsup(¬I1) × awsup(¬I2)| / ((1 - awsup(I1)) × awsup(I2)) (13)
The awCPIR value is used as the confidence of an all-weighted association rule: the larger the value, the more credible the rule and the more attention it receives from users.
Example: for the all-weighted data of Table 4, awsup(i1) = 0.636, awsup(¬i1) = 1 - 0.636 = 0.364, awsup(i2) = 0.122 and awsup(i1∪i2) = 0.294, so awCPIR(i1→i2) = |0.294 - 0.636×0.122| / (0.636×(1 - 0.122)) = 0.39, awCPIR(i1→¬i2) = 2.79, awCPIR(¬i1→i2) = 0.68 and awCPIR(¬i1→¬i2) = 4.86.
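The worked example can be checked with a short Python sketch. It assumes that all four awCPIR variants share one shape, |sup(A∪B) - sup(A)×sup(B)| / (sup(A)×(1 - sup(B))), with A and B replaced by the negated itemsets where needed; this shape is inferred from the numbers in the example, and the function names are illustrative.

```python
def awcpir(sup_a, sup_b, sup_ab):
    """CPIR-style confidence of A -> B from the supports of A, B and A u B."""
    return abs(sup_ab - sup_a * sup_b) / (sup_a * (1 - sup_b))


def awcpir_all(s1, s2, s12):
    """awCPIR of the four rule forms built from itemsets I1 and I2."""
    n1, n2 = 1 - s1, 1 - s2  # supports of not I1 and not I2, formula (2)
    return {
        "I1 -> I2": awcpir(s1, s2, s12),
        "I1 -> not I2": awcpir(s1, n2, s1 - s12),
        "not I1 -> I2": awcpir(n1, s2, s2 - s12),
        "not I1 -> not I2": awcpir(n1, n2, 1 - s1 - s2 + s12),
    }


# Supports of i1, i2 and (i1 u i2) from the Table 4 example.
result = awcpir_all(s1=0.636, s2=0.122, s12=0.294)
for rule, value in result.items():
    print(rule, round(value, 2))
```

With these inputs the sketch gives 0.39, 2.79 and 0.68 for the first three rules and about 4.87 for the last, which agrees with the reported 4.86 up to rounding of the inputs.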
Definition 5 (intra-itemset weight ratio of an all-weighted itemset, All-weighted Weight Ratio from Itemset, abbreviated awIWR): let w12, w1 and w2 be the weight sums, in the all-weighted database AWD, of the all-weighted itemset (I1,I2) and of its sub-itemsets I1 and I2, respectively. The ratio of w12 to (w1×w2) is called the intra-itemset weight ratio of the all-weighted itemset, written awIWR(I1,I2), as shown in formula (14).
awIWR(I1,I2) = w12 / (w1 × w2) (14)
Definition 6 (intra-itemset dimension ratio of an all-weighted itemset, All-weighted Dimension Ratio from Itemset, abbreviated awIDR): let k12, k1 and k2 be the numbers of items of the itemset (I1,I2) and of its sub-itemsets I1 and I2, respectively. The ratio of k12 to (k1×k2) is called the intra-itemset dimension ratio of the all-weighted itemset, written awIDR(I1,I2), as shown in formula (15).
awIDR(I1,I2) = k12 / (k1 × k2) (15)
Definition 7 (all-weighted itemset correlation, All-weighted itemset correlation, abbreviated awISCorr): based on the traditional definition of itemset correlation (Chengqi Zhang, Shichao Zhang. Association rule mining: models and algorithms [M]. Springer-Verlag Berlin, Heidelberg, 2002: 47-84, ISBN: 3-540-43533-6.), the correlation awISCorr(I1,I2) of the all-weighted itemset (I1,I2) is computed by formula (16).
awISCorr(I1,I2) = awsup(I1∪I2) / (awsup(I1) × awsup(I2)) (16)
By the nature of correlation, in an all-weighted data mining environment the correlation of the itemset (I1,I2) has the following properties:
Property 1:
Property 2:
Property 3:
Property 4: ② awISCorr(¬I1,I2) < 1; ③ awISCorr(¬I1,¬I2) > 1.
Property 5: ② awISCorr(¬I1,I2) > 1; ③ awISCorr(¬I1,¬I2) < 1.
Corollary: in an all-weighted data mining environment, given an itemset (I1,I2): ① if n × awIWR(I1,I2) > awIDR(I1,I2), then the all-weighted sub-itemsets I1 and I2 are positively correlated, and the patterns of the all-weighted positive association rule I1→I2 and the negative association rule ¬I1→¬I2 can be mined; ② if n × awIWR(I1,I2) < awIDR(I1,I2), then the all-weighted sub-itemsets I1 and I2 are negatively correlated, and the patterns of the all-weighted negative association rules I1→¬I2 and ¬I1→I2 can be mined.
By this corollary, when mining all-weighted association rules it is only necessary to compute the intra-itemset weight ratio awIWR(I1,I2) and the dimension ratio awIDR(I1,I2); the itemset correlation need not be computed, and the all-weighted positive and negative association rules can be mined directly from the frequent itemsets and negative itemsets.
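The reason this shortcut works can be written out in one line of algebra. The derivation below is a sketch that assumes formula (1) has the form awsup(I) = w/(n × k) and that formula (16) is the ratio awsup(I1∪I2) / (awsup(I1) × awsup(I2)), both of which agree with the worked examples in the text:

```latex
\begin{align*}
\mathrm{awISCorr}(I_1,I_2)
  &= \frac{\mathrm{awsup}(I_1\cup I_2)}{\mathrm{awsup}(I_1)\,\mathrm{awsup}(I_2)}
   = \frac{w_{12}/(n\,k_{12})}{\bigl(w_1/(n\,k_1)\bigr)\bigl(w_2/(n\,k_2)\bigr)} \\
  &= n\cdot\frac{w_{12}}{w_1\,w_2}\cdot\frac{k_1\,k_2}{k_{12}}
   = \frac{n\cdot\mathrm{awIWR}(I_1,I_2)}{\mathrm{awIDR}(I_1,I_2)}
\end{align*}
```

Under these assumptions, awISCorr(I1,I2) > 1 holds exactly when n × awIWR(I1,I2) > awIDR(I1,I2), and awISCorr(I1,I2) < 1 exactly when n × awIWR(I1,I2) < awIDR(I1,I2).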
Example: for (i1,i2,i3), let I1 = (i1,i2) and I2 = (i3). Then awIWR(I1,I2) = 3.34/(2.94×2.85) = 0.399, awIDR(I1,I2) = 3/(2×1) = 1.5 and n × awIWR(I1,I2) = 5×0.399 = 1.995 > 1.5 = awIDR(I1,I2). By the corollary above, I1 and I2 are positively correlated, and the patterns of the association rule I1→I2 and the negative association rule ¬I1→¬I2 can be mined. Verification with formula (16): awsup(i1∪i2) = 0.294, awsup(i3) = 0.57, awsup(i1∪i2∪i3) = 0.223 and awISCorr(I1,I2) = 0.223/(0.294×0.57) = 1.33 > 1; by property 1 and property 4, I1 and I2 are positively correlated and the patterns of the association rule I1→I2 and the negative association rule ¬I1→¬I2 can be mined, so the two conclusions agree.
Similarly, for the all-weighted itemset (i2,i4), awIWR(i2,i4) = 0.102, awIDR(i2,i4) = 2 and n × awIWR(i2,i4) = 0.51 < 2 = awIDR(i2,i4); by the corollary, i2 and i4 are negatively correlated, and the patterns i2→¬i4 and ¬i2→i4 can be mined.
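The correlation test of the corollary can be sketched in a few lines of Python, using the weight sums and item counts of the (i1,i2),(i3) example. The function names are illustrative, and returning "independent" when the two sides are equal is an assumption, since the corollary only states the two strict cases.

```python
def aw_iwr(w12, w1, w2):
    """Intra-itemset weight ratio, formula (14)."""
    return w12 / (w1 * w2)


def aw_idr(k12, k1, k2):
    """Intra-itemset dimension ratio, formula (15)."""
    return k12 / (k1 * k2)


def correlation_sign(n, w12, w1, w2, k12, k1, k2):
    """Compare n x awIWR with awIDR to decide the correlation of I1 and I2."""
    lhs = n * aw_iwr(w12, w1, w2)
    rhs = aw_idr(k12, k1, k2)
    if lhs > rhs:
        return "positive"
    if lhs < rhs:
        return "negative"
    return "independent"  # assumed handling of the equality case


# I1 = (i1,i2), I2 = (i3): w12 = 3.34, w1 = 2.94, w2 = 2.85, n = 5 records.
print(round(aw_iwr(3.34, 2.94, 2.85), 3))               # 0.399
print(correlation_sign(5, 3.34, 2.94, 2.85, 3, 2, 1))   # positive
```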
Definition 8 (valid all-weighted positive and negative association rules): let minconf be the minimum confidence threshold. When the all-weighted itemsets I1 and I2 satisfy the following three conditions, the association rules I1→I2, I1→¬I2, ¬I1→I2 and ¬I1→¬I2 are called valid all-weighted positive and negative association rules: ① I1 and I2 are all-weighted frequent itemsets and I1∩I2 = φ; ② the supports of I1→I2, I1→¬I2, ¬I1→I2 and ¬I1→¬I2 are greater than or equal to minsup; ③ the awCPIR values of I1→I2, I1→¬I2, ¬I1→I2 and ¬I1→¬I2 are not less than minconf.
Example: assume minsup = 0.1 and minconf = 0.3. From the example above, the supports of the all-weighted itemsets (i1,i2), (i3) and (i1,i2,i3) are all greater than minsup, and (i1,i2) and (i3) are positively correlated. Since awCPIR((i1,i2)→(i3)) = |0.223 - 0.294×0.57| / (0.294×(1 - 0.57)) = 0.438 > minconf and awCPIR(¬(i1,i2)→¬(i3)) = 0.138 < minconf, by property 4 and Definition 8, (i1,i2)→(i3) is a valid all-weighted positive association rule, while the negative rule ¬(i1,i2)→¬(i3) is not valid. Similarly, for the all-weighted itemset (i2,i4), awsup(i2) = 0.122 > minsup, awsup(i4) = 0.192 > minsup, awsup(i2∪¬i4) = 0.062 < minsup, awsup(¬i2∪i4) = 0.132 > minsup and awCPIR(¬i2→i4) = 0.052 < minconf; by Definition 8, the negative association rules i2→¬i4 and ¬i2→i4 are not valid all-weighted negative association rules.
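Definition 8 can be sketched as a filter over the rule forms permitted by the correlation sign. The Python below is an illustration under assumptions: it decides the sign with awsup(I1∪I2) versus awsup(I1)×awsup(I2), which is equivalent to the n × awIWR test if formula (1) is w/(n × k), and it uses the same CPIR shape as the worked example. The names are hypothetical.

```python
def awcpir(sup_a, sup_b, sup_ab):
    """CPIR-style confidence of A -> B from the supports of A, B and A u B."""
    return abs(sup_ab - sup_a * sup_b) / (sup_a * (1 - sup_b))


def valid_rules(s1, s2, s12, minsup, minconf):
    """Return the rule forms that pass the three conditions of Definition 8."""
    if s1 < minsup or s2 < minsup:           # condition 1: both parts frequent
        return []
    n1, n2 = 1 - s1, 1 - s2
    if s12 > s1 * s2:                        # positively correlated
        candidates = {
            "I1 -> I2": (s1, s2, s12),
            "not I1 -> not I2": (n1, n2, 1 - s1 - s2 + s12),
        }
    else:                                    # negatively correlated (or independent)
        candidates = {
            "I1 -> not I2": (s1, n2, s1 - s12),
            "not I1 -> I2": (n1, s2, s2 - s12),
        }
    # conditions 2 and 3: rule support and awCPIR thresholds
    return [rule for rule, (a, b, ab) in candidates.items()
            if ab >= minsup and awcpir(a, b, ab) >= minconf]


# I1 = (i1,i2), I2 = (i3) with the supports quoted in the example.
print(valid_rules(0.294, 0.57, 0.223, minsup=0.1, minconf=0.3))  # ['I1 -> I2']
```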
The technical scheme is further described below through a specific embodiment.
The process by which the invention mines all-weighted association rules from the all-weighted data instance of Table 4 is as follows (where minsup = 0.1, minInt = 0.1, minconf = 0.4, w denotes the itemset weight and s denotes the itemset support):
Step1:awPIS={φ};awNIS={φ};
Step2:
Step3:①
Step4: Pruning: prune the itemsets in the frequent itemset set awPIS. The pruned frequent itemsets are (i2,i3), (i3,i4), (i1,i2,i5) and (i1,i3,i5); after pruning, awPIS = {(i1,i2), (i1,i3), (i1,i5), (i1,i2,i3)}.
Step5: Similarly, in the negative itemset set awNIS, the pruned negative itemset is (i3,i5); after pruning, awNIS = {(i1,i4), (i2,i4), (i2,i5), (i4,i5)}.
Step6: Mine the all-weighted positive and negative association rules from the frequent itemset set awPIS and the negative itemset set awNIS. Taking the frequent itemset (i1,i2,i3) and the negative itemset (i4,i5) as examples, the mining process is as follows:
For the frequent itemset (i1,i2,i3), take its subsets I1 = (i1) and I2 = (i2,i3) as an example. From the example above, awsup(i1) and awsup(i2,i3) are both greater than minsup, awIDR(I1,I2) = 1.5, n × awIWR(I1,I2) = 2.98 > awIDR(I1,I2), awsup(I1∪I2) = 0.223 > minsup, awCPIR(I1→I2) = 0.212 < minconf and awCPIR(I2→I1) = 1.73 > minconf; awsup(¬I1∪¬I2) = 0.411 > minsup, awCPIR(¬I1→¬I2) = 1.73 > minconf and awCPIR(¬I2→¬I1) = 0.212 < minconf. Therefore I2→I1 and ¬I1→¬I2 (that is, (i2,i3)→(i1) and ¬(i1)→¬(i2,i3)) are valid all-weighted positive and negative association rules.
For the negative itemset (i4,i5), with subsets I1 = (i4) and I2 = (i5), from the example above awsup(i4) and awsup(i5) are both greater than minsup, awIDR(I1,I2) = 2, n × awIWR(I1,I2) = 1.03 < awIDR(I1,I2), awsup(I1∪¬I2) = 0.101 > minsup, awsup(¬I1∪I2) = 0.093 < minsup, awCPIR(I1→¬I2) = 1.577 > minconf and awCPIR(¬I2→I1) = 0.084 < minconf. Therefore I1→¬I2 (that is, (i4)→¬(i5)) is a valid all-weighted negative association rule.
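The itemset-mining stage that precedes the pruning and rule extraction above can be sketched as a level-wise search. The Python below is a minimal illustration, not the patent's implementation: it assumes that the weight of an itemset in a record is the sum of its item weights when the record contains every item, that awsup is that total divided by n × k, and it uses a small hypothetical database rather than Table 4.

```python
from itertools import combinations


def awsup(itemset, db):
    """All-weighted support of an itemset over a list of {item: weight} records."""
    total = sum(sum(rec[i] for i in itemset)
                for rec in db if all(i in rec for i in itemset))
    return total / (len(db) * len(itemset))


def apriori_join(prev_level):
    """Join sorted (i-1)-itemsets that share their first i-2 items."""
    joined = set()
    for a, b in combinations(sorted(prev_level), 2):
        if a[:-1] == b[:-1]:
            joined.add(a + (b[-1],))
    return joined


def mine_itemsets(db, minsup):
    """Return (frequent itemsets, negative itemsets) found level by level."""
    items = sorted({i for rec in db for i in rec})
    level = [(i,) for i in items if awsup((i,), db) >= minsup]
    frequent, negative = list(level), []
    while level:
        next_level = []
        for cand in sorted(apriori_join(level)):
            if awsup(cand, db) >= minsup:
                next_level.append(cand)     # frequent i-itemset
            else:
                negative.append(cand)       # negative i-itemset
        frequent.extend(next_level)
        level = next_level
    return frequent, negative


# Hypothetical all-weighted database with four records.
db = [
    {"a": 0.5, "b": 0.4},
    {"a": 0.6, "c": 0.8},
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"b": 0.9},
]
frequent, negative = mine_itemsets(db, minsup=0.2)
print(frequent)   # [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c')]
print(negative)   # [('b', 'c'), ('a', 'b', 'c')]
```

The interest-based pruning of Step4 and Step5 would then be applied to both returned lists before any rules are extracted.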
The beneficial effects of the invention are further illustrated below through experiments.
To verify the effectiveness, correctness and scalability of the invention, part of the corpus of the Chinese Web test collection CWT200g (Chinese Web Test Collection with 200GB web pages), provided by the Network Laboratory of Peking University, was selected as the experimental test data set. The experiments were run on an Intel(R) Core(TM) i7-3770 CPU @ 3.4GHz with 4.0 GB of memory under Windows 7; the programs were implemented in Delphi 2006, and the database system was SQL Server 2008. The typical unweighted positive and negative association rule mining method (Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative Association Rules, ACM Transactions on Information Systems, 22(2004), 3: 381-405.), denoted PNAR-Mining, was selected as the comparison method.
The Chinese Web test collection CWT200g has a capacity of 197GB and contains 37,482,913 web pages, each compressed and arranged in the Tianwang storage format. From the CWT200g collection, 12024 plain-text documents were extracted as the experimental document test set. The Chinese lexical analysis system ICTCLAS (developed by the Institute of Computing Technology, Chinese Academy of Sciences) was used to segment the test documents. The feature word weight (wij) is computed as wij = (0.5 + 0.5×tfij/maxj(tfij))×idfi. The test documents were preprocessed as follows: word segmentation, stop-word removal, feature word extraction and weight calculation, and construction of a text database and feature dictionary based on the vector space model. After preprocessing, 8751 feature words were obtained, with document frequency df (the number of documents containing the feature word) ranging from 51 to 11258. As required for mining, feature words with relatively low or relatively high df were removed, and the feature words with df from 1500 to 5838 (400 feature words in total) were extracted to build the feature word item library. The feature words occur 1019494 times in total in the 12024 test documents, an average of 85 times per document. The experiment parameters are shown in Table 5.
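The weighting and feature-selection steps of this preprocessing can be sketched as follows. The weight function follows the formula quoted above; the logarithmic idf is only one common variant and is an assumption here, because the text does not define idf, and the feature words in the usage example are hypothetical.

```python
import math


def feature_weight(tf, max_tf, idf):
    """wij = (0.5 + 0.5 * tf / max_tf) * idf, as quoted in the text."""
    return (0.5 + 0.5 * tf / max_tf) * idf


def idf_log(total_docs, doc_freq):
    """One common idf variant (assumed; the source leaves idf undefined)."""
    return math.log(total_docs / doc_freq)


def select_features(doc_freqs, low, high):
    """Keep feature words whose document frequency lies in [low, high]."""
    return sorted(word for word, df in doc_freqs.items() if low <= df <= high)


# Hypothetical feature words with document frequencies out of 12024 documents.
doc_freqs = {"w1": 51, "w2": 1500, "w3": 5838, "w4": 11258}
print(select_features(doc_freqs, 1500, 5838))           # ['w2', 'w3']
print(feature_weight(tf=3, max_tf=6, idf=2.0))          # 1.5
```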
Table 5 Experiment parameters
Experiment 1: comparison of mining performance under varying support thresholds
Under different support thresholds, the numbers of itemsets (candidate itemsets (Candidate Itemset, CI), frequent itemsets (Frequent Itemset, FI) and negative itemsets (Negative Itemset, NI)) and of positive and negative association rules (Positive and Negative Association Rule, PNAR) mined from the experimental document test set by the AWPNAR-Mining method of the invention and by the comparison method PNAR-Mining are compared in Figures 3 to 8 (ItemNum = 50, minconf = 0.0002, minInt = 0.0002, TRecordNum = 12024).
Experiment 2: comparison of mining performance under varying confidence thresholds
Under varying confidence thresholds, the numbers of positive and negative association rules (A→B, A→¬B, ¬A→B and ¬A→¬B) mined from the experimental document test set by the AWPNAR-Mining method of the invention and by the comparison method PNAR-Mining are compared in Table 6 (minsup = 0.03, minInt = 0.0002, ItemNum = 50, TRecordNum = 12024).
Table 6 Comparison of the numbers of positive and negative association rules mined under different confidence thresholds
Experiment 3: comparison of mining time efficiency
To compare the mining time efficiency of the two methods, the mining times of the AWPNAR-Mining method of the invention and of the comparison method PNAR-Mining were recorded under varying support thresholds and under varying confidence thresholds; the results are shown in Table 7 and Table 8 (minInt = 0.0002, ItemNum = 50, TRecordNum = 12024). Table 7 compares the time the two methods take to mine itemsets and association rules from the experimental document test set under varying support thresholds (minconf = 0.0002), and Table 8 compares the time taken to mine positive and negative association rules under varying confidence thresholds (minsup = 0.03).
Table 7 Comparison of the time (unit: seconds) for mining itemsets and association rules under different support thresholds
Table 8 Comparison of the time (unit: seconds) for mining positive and negative association rules under different confidence thresholds
Experiment 4: scalability analysis
The scalability of the method of the invention was tested and analyzed under two conditions: a changing number of items and a changing test data set size.
To test the scalability of the invention, the experiment parameters were set to ItemNum = 50, TRecordNum = 12024, minsup = 0.05, minconf = 0.07 and minInt = 0.001. As the number of items and the size of the test data set change, the numbers of frequent itemsets (FI), negative itemsets (NI) and positive and negative association rules (PNAR) mined from the test data set by the AWPNAR-Mining method of the invention change as shown in Figures 9 to 14.
In summary, the above experimental results show that, compared with the comparison method PNAR-Mining, the AWPNAR-Mining method of the invention achieves good mining performance and greatly improved mining efficiency; whether the support threshold or the confidence threshold varies, the numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules mined by the invention are far smaller than those of the comparison method.

Claims (2)

1. An all-weighted pattern mining method for discovering association rules between text words, characterized by comprising the following steps:
(1) all-weighted data preprocessing stage: preprocess the pending all-weighted data and build the all-weighted database and the item library;
(2) all-weighted frequent itemset and negative itemset mining stage, comprising the following steps 2.1 and 2.2:
2.1. Extract the all-weighted candidate 1-itemsets from the item library and mine the all-weighted frequent 1-itemsets, according to steps 2.1.1 to 2.1.3:
2.1.1. Extract the all-weighted candidate 1-itemsets from the item library;
2.1.2. Accumulate the weight sum of each all-weighted candidate 1-itemset in the all-weighted database and compute its support;
2.1.3. Add the all-weighted frequent 1-itemsets, that is, the all-weighted candidate 1-itemsets whose support is greater than or equal to the minimum support threshold, to the all-weighted frequent itemset set;
2.2. Starting from the all-weighted candidate 2-itemsets, operate according to steps 2.2.1 to 2.2.4:
2.2.1. Perform an Apriori join on the all-weighted frequent (i-1)-itemsets to generate the all-weighted candidate i-itemsets, where i ≥ 2;
2.2.2. Accumulate the weight sum of each all-weighted candidate i-itemset in the all-weighted database and compute its support;
2.2.3. From the all-weighted candidate i-itemsets, take the all-weighted frequent i-itemsets whose support is not less than the support threshold and store them in the all-weighted frequent itemset set; at the same time, store the all-weighted negative i-itemsets whose support is less than the support threshold in the all-weighted negative itemset set;
2.2.4. Increase the value of i by 1; if the set of frequent (i-1)-itemsets is empty, go to step (3); otherwise, continue with steps 2.2.1 to 2.2.3;
(3) pruning stage: obtain the interesting all-weighted frequent itemsets and negative itemsets by pruning:
3.1. For each frequent i-itemset awLi in the frequent itemset set, compute the value of IAWFI(awLi) and prune the frequent itemsets whose IAWFI(awLi) value is false, obtaining the set of interesting all-weighted frequent itemsets after pruning; IAWFI(awLi) is computed as follows:
where awItemsetInt(I1∪I2) = awsup(I1) × awsup(I1∪I2) × (1 - awsup(I2)), awItemsetInt(¬I1∪¬I2) = awsup(I2) × (1 - awsup(I1)) × (1 - awsup(I1) - awsup(I2) + awsup(I1∪I2)), minInt is the minimum interest threshold and minsup is the minimum support threshold;
3.2. For each negative i-itemset awNi in the all-weighted negative itemset set, compute the value of IAWNI(awNi) and prune the negative itemsets whose IAWNI(awNi) value is false, obtaining the set of interesting all-weighted negative itemsets after pruning; IAWNI(awNi) is computed as follows:
where awItemsetInt(I1∪I2) = awsup(I1) × awsup(I1∪I2) × (1 - awsup(I2));
awItemsetInt(I1∪¬I2) = awsup(I1) × awsup(I2) × (awsup(I1) - awsup(I1∪I2));
awItemsetInt(¬I1∪I2) = (1 - awsup(I1)) × (1 - awsup(I2)) × (awsup(I2) - awsup(I1∪I2));
awItemsetInt(¬I1∪¬I2) = awsup(I2) × (1 - awsup(I1)) × (1 - awsup(I1) - awsup(I2) + awsup(I1∪I2));
(4) mine the valid all-weighted positive and negative association rules from the set of interesting all-weighted frequent itemsets, comprising the following steps:
4.1. Take a frequent itemset awLi from the set of interesting all-weighted frequent itemsets, find all proper subsets of awLi, build the proper subset set of awLi, and then perform the following operations:
4.2.1. Take any two proper subsets I1 and I2 from the proper subset set of awLi; when the intersection of I1 and I2 is empty, the sum of the numbers of items of I1 and I2 equals the number of items of the original frequent itemset, and the supports of I1 and I2 are both not less than the support threshold, compute the intra-itemset weight ratio awIWR(I1,I2) and the dimension ratio awIDR(I1,I2) of the frequent itemset (I1∪I2); awIWR(I1,I2) and awIDR(I1,I2) are computed as follows:
awIWR(I1,I2) = w12 / (w1 × w2);
awIDR(I1,I2) = k12 / (k1 × k2);
where w12, w1 and w2 are the weight sums, in the all-weighted database AWD, of the all-weighted itemset (I1,I2) and of its sub-itemsets I1 and I2, respectively, and k12, k1 and k2 are the numbers of items of the itemset (I1,I2) and of its sub-itemsets I1 and I2, respectively;
4.2.2. When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I1,I2) from step 4.2.1 is greater than the dimension ratio awIDR(I1,I2), that is, when n × awIWR(I1,I2) > awIDR(I1,I2), perform the following operations:
4.2.2.1. If the awCPIR value awCPIR(I1→I2) of I1→I2 is not less than the confidence threshold minconf, mine the all-weighted association rule I1→I2; if the awCPIR value awCPIR(I2→I1) of I2→I1 is not less than the confidence threshold minconf, mine the all-weighted association rule I2→I1; awCPIR(I1→I2) and awCPIR(I2→I1) are computed as follows:
awCPIR(I1→I2) = (awsup(I1∪I2) - awsup(I1) × awsup(I2)) / (awsup(I1) × (1 - awsup(I2)));
awCPIR(I2→I1) = (awsup(I1∪I2) - awsup(I1) × awsup(I2)) / (awsup(I2) × (1 - awsup(I1)));
4.2.2.2. If the support awsup(¬I1∪¬I2) of ¬I1∪¬I2 is not less than the support threshold minsup, then: ① if the awCPIR value awCPIR(¬I1→¬I2) of ¬I1→¬I2 is not less than the confidence threshold minconf, mine the all-weighted negative association rule ¬I1→¬I2; ② if the awCPIR value awCPIR(¬I2→¬I1) of ¬I2→¬I1 is not less than the confidence threshold minconf, mine the all-weighted negative association rule ¬I2→¬I1; awsup(¬I1∪¬I2), awCPIR(¬I1→¬I2) and awCPIR(¬I2→¬I1) are computed as follows:
awsup(¬I1→¬I2) = awsup(¬I1∪¬I2) = 1 - awsup(I1) - awsup(I2) + awsup(I1∪I2);
4.2.3. When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I1,I2) from step 4.2.1 is less than the dimension ratio awIDR(I1,I2), that is, when n × awIWR(I1,I2) < awIDR(I1,I2), perform the following operations:
4.2.3.1. If the support awsup(I1∪¬I2) of I1∪¬I2 is not less than the support threshold minsup, then: ① if the awCPIR value awCPIR(I1→¬I2) of I1→¬I2 is not less than the confidence threshold minconf, mine the all-weighted negative association rule I1→¬I2; ② if the awCPIR value awCPIR(¬I2→I1) of ¬I2→I1 is not less than the confidence threshold minconf, mine the all-weighted negative association rule ¬I2→I1; awsup(I1∪¬I2), awCPIR(I1→¬I2) and awCPIR(¬I2→I1) are computed as follows:
awsup(I1→¬I2) = awsup(I1∪¬I2) = awsup(I1) - awsup(I1∪I2);
4.2.3.2. If the support awsup(¬I1∪I2) of ¬I1∪I2 is not less than the support threshold minsup, then: ① if the awCPIR value awCPIR(¬I1→I2) of ¬I1→I2 is not less than the confidence threshold minconf, mine the all-weighted negative association rule ¬I1→I2; ② if the awCPIR value awCPIR(I2→¬I1) of I2→¬I1 is not less than the confidence threshold minconf, mine the all-weighted negative association rule I2→¬I1; awsup(¬I1∪I2), awCPIR(¬I1→I2) and awCPIR(I2→¬I1) are computed as follows:
awsup(¬I1→I2) = awsup(¬I1∪I2) = awsup(I2) - awsup(I1∪I2);
4.2.4. Continue with steps 4.2.1 to 4.2.3; when every proper subset in the proper subset set of awLi has been taken out exactly once, go to step 4.2.5;
4.2.5. Continue with step 4.1; when every frequent itemset awLi in the set of interesting all-weighted frequent itemsets has been taken out exactly once, go to step (5);
(5) mine the valid all-weighted negative association rules from the set of interesting all-weighted negative itemsets, comprising the following steps:
5.1. Take a negative itemset awNi from the set of interesting all-weighted negative itemsets, find all proper subsets of awNi, build the proper subset set of awNi, and then perform the following operations:
5.2.1. Take any two proper subsets I1 and I2 from the proper subset set of awNi; when the intersection of I1 and I2 is empty, the sum of the numbers of items of I1 and I2 equals the number of items of the original itemset, and the supports of I1 and I2 are both greater than or equal to the support threshold, compute the intra-itemset weight ratio awIWR(I1,I2) and the dimension ratio awIDR(I1,I2) of the negative itemset I1∪I2;
5.2.2. When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I1,I2) from step 5.2.1 is greater than the dimension ratio awIDR(I1,I2), that is, when n × awIWR(I1,I2) > awIDR(I1,I2), perform the following operations:
5.2.2.1. If the support of ¬I1∪¬I2 is greater than or equal to the support threshold minsup, then: ① if the awCPIR value awCPIR(¬I1→¬I2) of ¬I1→¬I2 is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ¬I1→¬I2; ② if the awCPIR value awCPIR(¬I2→¬I1) of ¬I2→¬I1 is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ¬I2→¬I1;
5.2.3. When the product of the total number n of transaction records in the database and the intra-itemset weight ratio awIWR(I1,I2) from step 5.2.1 is less than the dimension ratio awIDR(I1,I2), that is, when n × awIWR(I1,I2) < awIDR(I1,I2), perform the following operations:
5.2.3.1. If the support of I1∪¬I2 is greater than or equal to the support threshold minsup, then: ① if the awCPIR value awCPIR(I1→¬I2) of I1→¬I2 is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule I1→¬I2; ② if the awCPIR value awCPIR(¬I2→I1) of ¬I2→I1 is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ¬I2→I1;
5.2.3.2. If the support of ¬I1∪I2 is greater than or equal to the support threshold minsup, then: ① if the awCPIR value awCPIR(¬I1→I2) of ¬I1→I2 is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule ¬I1→I2; ② if the awCPIR value awCPIR(I2→¬I1) of I2→¬I1 is greater than or equal to the confidence threshold minconf, mine the all-weighted negative association rule I2→¬I1;
5.2.4. Continue with steps 5.2.1 to 5.2.3; when every proper subset in the proper subset set of awNi has been taken out exactly once, go to step 5.2.5;
5.2.5. Continue with step 5.1; when every negative itemset awNi in the set of interesting all-weighted negative itemsets has been taken out exactly once, the mining of all-weighted positive and negative association rules ends;
"¬" is the negation symbol; ¬I1 denotes the event that I1 does not occur in a transaction and is called the negative itemset ¬I1; I1∪¬I2 denotes an itemset that has the sub-itemset I1 and the negative sub-itemset ¬I2; the association rule I1→¬I2 means that if the event of subset I1 occurs, then the event of subset I2 does not occur.
2. The all-weighted pattern mining method for discovering association rules between text words according to claim 1, characterized in that the pending all-weighted data is preprocessed as follows: when the pending all-weighted data is Chinese text data, perform word segmentation, remove stop words, extract feature words and compute their weights; when the pending all-weighted data is English text data, perform stemming, remove stop words, perform lexical analysis, extract feature words and compute their weights.
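Step (4) of claim 1 can be sketched as the following Python driver, which splits a frequent itemset into disjoint proper-subset pairs and tests both rule directions for each pair. It is an illustration under assumptions: supports are supplied in a dictionary keyed by frozenset, the correlation sign is decided by awsup(I1∪I2) versus awsup(I1)×awsup(I2) (equivalent to the n × awIWR test when awsup = w/(n × k)), the CPIR shape is the one implied by the worked examples, and the supports in the usage example are hypothetical.

```python
from itertools import combinations


def awcpir(sup_a, sup_b, sup_ab):
    """CPIR-style confidence of A -> B from the supports of A, B and A u B."""
    return abs(sup_ab - sup_a * sup_b) / (sup_a * (1 - sup_b))


def split_pairs(itemset):
    """Yield each unordered split (I1, I2) of the itemset into two non-empty parts once."""
    items = sorted(itemset)
    first, rest = items[0], items[1:]
    for r in range(len(rest)):                  # I1 always holds the first item
        for extra in combinations(rest, r):
            i1 = frozenset((first,) + extra)
            yield i1, frozenset(items) - i1


def rules_for_split(s1, s2, s12, minsup, minconf):
    """Candidate rules for one split, both directions, filtered by minsup and minconf."""
    n1, n2 = 1 - s1, 1 - s2
    both_neg = 1 - s1 - s2 + s12
    if s12 > s1 * s2:                           # positively correlated
        forms = [("I1", "I2", s1, s2, s12), ("I2", "I1", s2, s1, s12),
                 ("not I1", "not I2", n1, n2, both_neg),
                 ("not I2", "not I1", n2, n1, both_neg)]
    elif s12 < s1 * s2:                         # negatively correlated
        forms = [("I1", "not I2", s1, n2, s1 - s12), ("not I2", "I1", n2, s1, s1 - s12),
                 ("not I1", "I2", n1, s2, s2 - s12), ("I2", "not I1", s2, n1, s2 - s12)]
    else:
        forms = []
    return [f"{a} -> {b}" for a, b, sa, sb, sab in forms
            if sab >= minsup and awcpir(sa, sb, sab) >= minconf]


def mine_rules(itemset, sup, minsup, minconf):
    """Apply rules_for_split to every split whose two parts are both frequent."""
    found = []
    for i1, i2 in split_pairs(itemset):
        s1, s2 = sup[i1], sup[i2]
        if s1 >= minsup and s2 >= minsup:
            for rule in rules_for_split(s1, s2, sup[i1 | i2], minsup, minconf):
                found.append((sorted(i1), sorted(i2), rule))
    return found


# Hypothetical supports for the 2-itemset {a, b}.
sup = {frozenset("a"): 0.5, frozenset("b"): 0.4, frozenset("ab"): 0.3}
print(mine_rules({"a", "b"}, sup, minsup=0.1, minconf=0.4))
```

Step (5) differs mainly in that it starts from negative itemsets and, in the positively correlated case, mines only the rules between the two negated parts.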
CN201410096985.2A 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts Expired - Fee Related CN103838854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410096985.2A CN103838854B (en) 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts

Publications (2)

Publication Number Publication Date
CN103838854A CN103838854A (en) 2014-06-04
CN103838854B true CN103838854B (en) 2017-03-22

Family

ID=50802351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410096985.2A Expired - Fee Related CN103838854B (en) 2014-03-14 2014-03-14 Completely-weighted mode mining method for discovering association rules among texts

Country Status (1)

Country Link
CN (1) CN103838854B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182527B (en) * 2014-08-27 2017-07-18 广西财经学院 Association rule mining method and its system between Sino-British text word based on partial order item collection
CN104239430B (en) * 2014-08-27 2017-04-12 广西教育学院 Item weight change based method and system for mining education data association rules
CN104239536A (en) * 2014-09-22 2014-12-24 广西教育学院 Completely-weighted course positive and negative association pattern mining method and system based on mutual information
CN104217013B (en) * 2014-09-22 2017-06-13 广西教育学院 The positive and negative mode excavation method and system of course based on the item weighted sum item collection degree of association
CN109471885B (en) * 2018-09-30 2022-05-31 齐鲁工业大学 Data analysis method and system based on weighted positive and negative sequence mode

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5809499A (en) * 1995-10-20 1998-09-15 Pattern Discovery Software Systems, Ltd. Computational method for discovering patterns in data sets
CN101650730A (en) * 2009-09-08 2010-02-17 中国科学院计算技术研究所 Method and system for discovering weighted-value frequent-item in data flow
CN102306183A (en) * 2011-08-30 2012-01-04 王洁 Transaction data stream closed weighted frequent pattern (DS_CWFP) mining method
CN103279570A (en) * 2013-06-19 2013-09-04 广西教育学院 Text database oriented matrix weighting negative pattern mining method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
CB03 Change of inventor or designer information

Inventor after: Huang Mingxuan

Inventor before: Huang Mingxuan

Inventor before: Yuan Changan

COR Change of bibliographic data
TA01 Transfer of patent application right

Effective date of registration: 20160317

Address after: Nanning City, 530003 West Road Mingxiu the Guangxi Zhuang Autonomous Region No. 100

Applicant after: Guangxi Finance and Economics Institute

Address before: Nanning City, the Guangxi Zhuang Autonomous Region Qingxiu District JianZheng Road No. 37 530023

Applicant before: Guangxi College of Education

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170322

Termination date: 20180314