Remerciez-le!

Remerciez @Admin pour avoir partagé cet document gratuitement, de la manière la plus simple, en partageant sur les réseaux sociaux.

Association Rules Outline Goal: Provide an overview of basic Association Rule m

Association Rules Outline Goal: Provide an overview of basic Association Rule mining techniques • Association Rules Problem Overview – Large itemsets • Association Rules Algorithms – Apriori – Eclat Example: Market Basket Data • Items frequently purchased together: Bread PeanutButter • Uses: – Placement – Advertising – Sales – Coupons • Objective: increase sales and reduce costs Association Rule Definitions • Set of items: I={I1,I2,…,Im} • Transactions: D={t1,t2, …, tn}, tj I • Itemset: {Ii1,Ii2, …, Iik}  I • Support of an itemset: Percentage of transactions which contain that itemset. • Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold. Association Rules Example I = { Beer, Bread, Jelly, Milk, PeanutButter} Support of {Bread,PeanutButter} is 60% Association Rule Definitions • Association Rule (AR): implication X  Y where X,Y  I and X  Y = ; • Support of AR (s) X  Y: Percentage of transactions that contain X Y • Confidence of AR () X  Y: Ratio of number of transactions that contain X  Y to the number that contain X Association Rules Ex (cont’d) Association Rule Problem • Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2, …, tn} where ti={Ii1,Ii2, …, Iik} and Iij  I, the Association Rule Problem is to identify all association rules X  Y with a minimum support and confidence. • Link Analysis • NOTE: Support of X  Y is same as support of X  Y. Association Rule Techniques 1. Find Large Itemsets. 2. Generate rules from frequent itemsets. Algorithm to Generate ARs Apriori • Large Itemset Property: Any subset of a large itemset is large. • Contrapositive: If an itemset is not large, none of its supersets are large. Large Itemset Property Apriori Ex (cont’d) s=30%  = 50% Apriori Algorithm 1. C1 = Itemsets of size one in I; 2. Determine all large itemsets of size 1, L1; 3. i = 1; 4. Repeat 5. i = i + 1; 6. Ci = Apriori-Gen(Li-1); 7. Count Ci to determine Li; 8. until no more large itemsets found; Apriori-Gen • Generate candidates of size i+1 from large itemsets of size i. • Approach used: join large itemsets of size i if they agree on i-1 • May also prune candidates who have subsets that are not large. Apriori-Gen Example Apriori-Gen Example (cont’d) Apriori Adv/Disadv • Advantages: – Uses large itemset property. – Easily parallelized – Easy to implement. • Disadvantages: – Assumes transaction database is memory resident. – Requires up to m database scans. Classification based on Association Rules (CBA) • Why? – Can effectively uncover the correlation structure in data – AR are typically quite scalable in practice – Rules are often very intuitive • Hence classifier built on intuitive rules is easier to interpret • When to use? – On large dynamic datasets where class labels are available and the correlation structure is unknown. – Multi-class categorization problems – E.g. Web/Text Categorization, Network Intrusion Detection Example: Text categorization • Input – <feature vector> <class label(s)> – <feature vector> = w1,…,wN – <class label(s)> = c1,…,cM • Run AR with minsup and minconf – Prune rules of form • w1  w2, [w1,c2]  c3 etc. – Keep only rules satisfying the constraing • W  C (LHS only composed of w1,…wN and RHS only composed of c1,…cM) CBA: Text Categorization (cont.) • Order remaining rules – By confidence • 100% – R1: W1  C1 (support 40%) – R2: W4  C2 (support 60%) • 95% – R3: W3  C2 (support 30%) – R4: W5  C4 (support 70%) – And within each confidence level by support • Ordering R2, R1, R4, R3 CBA: contd • Take training data and evaluate the predictive ability of each rule, prune away rules that are subsumed by superior rules – T1: W1 W5 C1,C4 – T2: W2 W4 C2 Note: only subset – T3: W3 W4 C2 of transactions – T4: W5 W8 C4 in training data – T5: W9 C2 • Rule R3 would be pruned in this example if it is always subsumed by Rule R2 • For remaining transactions pick most dominant class as default – T5 is not covered, so C2 is picked in this example Formal Concepts of Model • Given two rules ri and rj, define: ri  rj if The confidence of ri is greater than that of rj, or Their confidences are the same, but the support of ri is greater than that of rj, or Both the confidences and supports are the same, but ri is generated earlier than rj. • Our classifier model is of the following format: <r1, r2, …, rn, default_class>, where ri R, ra  rb if b>a • Other models possible – Sort by length of antecedent Using the CBA model to classify • For a new transaction – W1, W3, W5 – Pick the k-most confident rules that apply (using the precedence ordering established in the baseline model) – The resulting classes are the predictions for this transaction • If k = 1 you would pick C1 • If k = 2 you would pick C1, C2 (multi-class) – Similarly if W9, W10 you would pick C2 (default) – Accuracy measurements as before (Classification Error) CBA: Procedural Steps • Preprocessing, Training and Testing data split • Compute AR on Training data – Keep only rules of form X C • C is class label itemset and X is feature itemset • Order AR – According to confidence – According to support (at each confidence level) • Prune away rules that lack sufficient predictive ability on Training data (starting top-down) – Rule subsumption • For data that is not predictable pick most dominant class as default class • Test on testing data and report accuracy Association Rules: Advanced Topics Apriori Adv/Disadv • Advantages: – Uses large itemset property. – Easily parallelized – Easy to implement. • Disadvantages: – Assumes transaction database is memory resident. – Requires up to m database scans. Vertical Layout • Rather than have – Transaction ID – list of items (Transactional) • We have – Item – List of transactions (TID-list) • Now to count itemset AB – Intersect TID-list of itemA with TID-list of itemB • All data for a particular item is available Eclat Algorithm • Dynamically process each transaction online maintaining 2-itemset counts. • Transform – Partition L2 using 1-item prefix • Equivalence classes - {AB, AC, AD}, {BC, BD}, {CD} – Transform database to vertical form • Asynchronous Phase – For each equivalence class E • Compute frequent (E) Asynchronous Phase • Compute Frequent (E_k-1) – For all itemsets I1 and I2 in E_k-1 • If (I1 ∩ I2 >= minsup) add I1 and I2 to L_k – Partition L_k into equivalence classes – For each equivalence class E_k in L_k • Compute_frequent (E_k) • Properties of ECLAT – Locality enhancing approach – Easy and efficient to parallelize – Few scans of database (best case 2) Max-patterns • Frequent pattern {a1, …, a100}  (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 frequent sub-patterns! • Max-pattern: frequent patterns without proper frequent super pattern – BCDE, ACD are max-patterns – BCD is not a max-pattern Tid Items 10 A,B,C,D, E 20 B,C,D,E, 30 A,C,D,F Min_sup=2 Frequent Closed Patterns • Conf(acd)=100%  record acd only • For frequent itemset X, if there exists no item y s.t. every transaction containing X also contains y, then X is a frequent closed pattern – “acd” is a frequent closed pattern • Concise rep. of freq pats • Reduce # of patterns and rules • N. Pasquier et al. In ICDT’99 TID Items 10 a, c, d, e, f 20 a, b, e 30 c, e, f 40 a, c, d, f 50 c, e, f Min_sup=2 Mining Various Kinds of Rules or Regularities • Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity • Classification, clustering, iceberg cubes, etc. Multiple-level Association Rules • Items often form hierarchy • Flexible support settings: Items at the lower level are expected to have lower support. • Transaction database can be encoded based on dimensions and levels • explore shared multi-level mining uniform support Milk [support = 10%] 2% Milk [support = 6%] Skim Milk [support = 4%] Level 1 min_sup = 5% Level 2 min_sup = 5% Level 1 min_sup = 5% Level 2 min_sup = 3% reduced support ML/MD Associations with Flexible Support Constraints • Why flexible support constraints? – Real life occurrence frequencies vary greatly • Diamond, watch, pens in a shopping basket – Uniform support may not be an interesting model • A flexible model – The lower-level, the more dimension combination, and the long pattern length, usually the smaller support – General rules should be easy to specify and understand – Special items and special group of items may be specified individually and have higher priority Multi-dimensional Association uploads/Finance/ audit-guide 2 .pdf