Font Size: a A A

Research On Algorithms And Their Performance For Frequent Itemset And High Utility Itemset Mining

Posted on:2014-08-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:J F QuFull Text:PDF
GTID:1318330398455128Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Computer science is a discipline related to information, which is stored, dissemi-nated and used by computers in the form of data. Most data are in databases, and therefore the management and application techniques of databases are always the hot spots of research. Recently, data mining and knowledge discovery as a database appli-cation receives considerable attention. The reason is that when confronted with massive data from real world, users may only need particular data.Mining sets of items meeting specified conditions from relational databases, name-ly itemset mining, is a fundamental problem in data mining due to two reasons. One reason is that researchers can define various itemset mining problems based on differ-ent conditions. These problems are associated with one another, which means that a solution to a problem may be revised and used in other problems. The other reason is mined itemsets can be used immediately and in many applications such as associa-tion mining, classification, and clustering. Generally, an itemset mining problem has a simple definition and several solutions but a huge search space. For a database with n items,2" itemsets need to be checked. Therefore, researchers are very interested in the efficiency of a solution, namely, how fast a related algorithm performs.This dissertation studies the problem of mining frequent itemsets and that of min-ing high utility itemsets. The former mines itemsets that are frequently contained in the records in a database, and the latter mines itemsets with high utilities measured by item attributes and records. Given a database and a minimum frequency (or utility) threshold, the solution to the former (or latter) is the complete set of itemsets with frequencies (or utilities) exceeding the threshold. Focusing on performance improve-ment, the dissertation introduces four algorithms to solve the two problems. The main contributions are as follows.·FP-growth is a classic algorithm for frequent itemset mining, in which three major costs are prefix-tree traversal, frequency counting, and prefix-tree construction. We proposed the BFP-growth algorithm of which the core idea is to batch. BFP-growth reduces the costs of both prefix-tree traversal and prefix-tree construction. Moreover, the frequency counting cost of BFP-growth remains unchanged, compared with the FPgrowth*algorithm. FPgrowth*is the previous fastest variant of FP-growth, and it reduces traversal cost but increases counting cost. Experimental results show that BFP-growth outperforms FP-growth and FPgrowth*.·With a sound theoretical basis, the NS algorithm is proposed to mine frequent itemsets. In the process of mining frequent itemsets, FP-growth employs prefix-tree, and the significant Eclat algorithm uses tid-list. Prefix-tree is a compressed structure and tid-list is a contiguous structure. The data structure used in NS is node-set, which is a compressed and contiguous structure. By experiment, NS proves to be faster than FPgrowth*, dEclat (the fastest variant of Eclat), and LCMv2(the fastest algorithm for frequent itemset mining).·Current high utility itemset mining algorithms use a similar procedure:generating candidate iternsets and subsequently computing exact utility for candidates. These algorithms incur a problem that most of candidates are not high utility and dis-carded finally in most cases. The dissertation introduces the HUI-Miner algorithm, which uses a utility-list structure to mine high utility iternsets without candidate generation. Therefore, HUI-Miner is several orders of magnitude better than the state-of-the-art algorithms, such as UP-Growth+, in terms of running time and memory consumption.·Simultaneously, we study the exact utility computation, namely the second step in previous high utility itemset mining algorithms. Previous researchers did not pay adequate attention to the step, which spends more running time than the candi-date generation step in factual mining tasks, however. In the dissertation, a fast algorithm for exact utility computation, called FIA, is proposed. In our experi-ments, an improved UP-Growth+algorithm integrating FIA can match HUI-Miner at performance.
Keywords/Search Tags:data mining, algorithm, frequent iteinset, high utility iteinset
PDF Full Text Request
Related items