Font Size: a A A

High-utility Association Rule Mining

Posted on:2009-04-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:G Z YuFull Text:PDF
GTID:1118360275954984Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
The association rule mining(ARM) aims at extracting interesting correlations between items,and is one of the most important tasks of data mining.ARM has been widely used in many fields such as scientific research,telecommunication networks, market and risk management,customer relationship management,inventory control, military and so on.However,traditional ARM,which uses support to measure the significance of a rule,can not discover some rules that have a low support but a high utility,and these rules maybe very important to users.High-utility association rule mining(HUARM),which is the main content of this paper,can overcome the shortcomings of traditional ARM in reflecting the semantic importance of an itemset.It can reflect users' preference,providing a better support to decision making.In this paper,we studied utility-based association rule mining in large high dimensional data, and our approach solved the problem that existing algorithms for utility mining can not handle large high dimensional datasets.By integrating the advantages of support and utility,we also proposed a new task of utility and support based association rule mining (USBARM),which can find more interesting rules.The main contributions of the paper include:(1) Proposed a novel algorithm,Inter-transaction,to mine high-utility long itemsets from large high dimensional data.The algorithm is based on row enumeration, which can find high-utility long itemsets directly by intersecting long transactions, without extending short itemsets step by step.In high dimensional data,there are few common items between long transactions,and the intersection of multiple long transactions usually leads to a very short itemset(intersection transaction),so this kind of row enumeration method has a good convergence.By adopting the partition method, the algorithm needs to scan the database only twice,and can work well in large high dimensional data.In the meantime,new pruning strategies are used,and large number of candidates can be filtered off. (2) Proposed a hybrid algorithm which search high utility itemsets from two directions.Existing algorithms for utility mining adopt an Apriori-like bottom/up search and need to scan the database multiple times.If the data is very large and contains many long patterns,the I/O burden will be too high.This paper proposed a hybrid algoritm which searches high utility itemsets both from the top and from the bottom.The hybrid algorithm firstly decomposes the mining task into two easy parts (mining short high utility itemsets and mining long high utility itemsets),and then choose different algorithms to finish the two subtasks separately,without the expensive process of extending short itemsets step by step to find long itemsets.(3) Proposed a method to optimize the intersection operation of long transactions. Our algorithm stores a transaction in both HIV format and HIL format,and uses the information contained in HIL format to reduce the number of bit-wise "AND" operation.In this way,the number of "AND" operation in HIV format is equal to the number of non-zero figures in HIL format,having nothing to do with the length of the data in HIV format.The optimization method can solve the problem that the performance of the intersection of long transactions decreases rapidly with the increase of the length of bit-vectors,and can guarantee a high performance in high dimensional data.The optimization method of time-space trade-offs can also improve the the efficiency of vertical algorithms for frequent itemsets mining.(4) Proposed the task of utility and support based association rule mining (USBARM).Support and utility can reflect the statistical importance and semantic importance of an itemset respectively.However,whether an itemset is interesting to a person not only depends on its objective factor such as the support of the itemset,but also on person's subjective factor such as the preference of the user.In order to overcome the shortcomings of individual measure(support or utility),we proposed a new measure,motivation,to measure the importance of an itemset.Motivation is defined as the product of utility and support,and it shows the possibility of a user's obtaining a certain utility or the utility a user can obtain in a certain possibility.In USBARM,all high motivation itemsets can be found,without missing itemsets whose motivatin is high but the support(or utility) is less than a user-defined threshold. USBARM can find more interesting patterns.(5) The paper also proved that motivation has two important properties:upper bound property and transaction-weighted downward closure property.Based on the properties,we designed two algorithms,i.e.,HM-Miner and HM-Two-Phase-Miner,to discover all high utility frequent itemsets which satisfy the support threshold,utility threshold and motivation threshold respectively.Two algorithms adopt Apriori-like bottom/top search,and are suitable for short pattern datasets.HM-Miner uses the upper bound property to cut down search space,whereas HM-Two-Phase-Miner uses transaction-weighted downward closure property to achieve the same goal.(6) Developed a real application system to mine high utility association rules.The application system can output the support,utility and motivation of every association rule(itemset) discovered so that users can compare the difference between support-based association rule mining and high utility association rule mining.We applied our application system to shopping basket data and the mining result show that high utility ARM can discover some interesting patterns missed by traditional ARM, and can help trade people find combinations of some high utility commodity, promoting the sales of high margin productions.By proper data preprocessing,the application system can also be used in other fields.For example,in webpage analysis, the frequency of visits of a page and the time spent on the page can be used to measure the popularity of the page.If we view the time as the utility of one page,we can convert the web mining task into high utility itemsets mining task.
Keywords/Search Tags:Data mining, Association rules, Row enumeration, Column enumeration, Hybrid algorithm, Frequent itemsets, High utility itemsets
PDF Full Text Request
Related items