The Research On Algorithm For Association Rules Mining Based On Vertical Data Presentation

Posted on: 2010-03-09  Degree: Master  Type: Thesis
Country: China  Candidate: Z C Sun  Full Text: PDF
GTID: 2198360278963266  Subject: Computer application technology
Abstract/Summary:
The ability to collect data has improved dramatically over the last ten years. With the growing use of database and network technology, every industry has accumulated massive amounts of data. This explosive growth has created an urgent need for new techniques and automated tools that can intelligently help transform vast amounts of data into useful information and knowledge. Data mining emerged against this background, and association rule mining, which has been applied in many domains, is one of its important branches.
We first introduce the basic theory and techniques of data mining, and then discuss the basic theory and common algorithms of association rule mining in particular.
A number of vertical mining algorithms have recently been proposed for association mining and have been shown to be very effective. Their advantage stems from the fact that frequent patterns can be counted via tidset or diffset intersections instead of complex internal data structures. However, these approaches need a large amount of memory to hold the lists of transaction ids while enumerating all frequent itemsets. Limited memory capacity can become the main bottleneck, especially when the user-specified support is low or the dataset is very large. When the size of the intermediate results exceeds the capacity of main memory, vertical approaches suffer dramatically and may even fail to complete the mining task. The Diffsets algorithm resolves these problems to some extent through the "diffset" idea. This paper presents an improved Diffsets algorithm aimed at better main memory utilization. The improved algorithm further reduces the size of intermediate results by ranking the transaction-id counts in descending order during the computation.
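To make the tidset and diffset idea concrete, the following is a minimal sketch of support counting in the vertical format. The toy database, the item names, and the function names are illustrative assumptions and are not taken from the thesis. It relies only on the standard identity that the support of an itemset XY equals the support of X minus the size of the diffset t(X) − t(Y).

```python
# Toy vertical database (assumed data): item -> set of transaction ids (tidset).
tidsets = {
    "A": {1, 3, 4, 5},
    "B": {1, 2, 3, 4, 5, 6},
    "C": {2, 4, 5, 6},
}


def support_by_tidset(x, y):
    """Support of itemset {x, y}: size of the intersection of the two tidsets."""
    return len(tidsets[x] & tidsets[y])


def support_by_diffset(x, y):
    """Support of {x, y} via the diffset t(x) - t(y).

    Returns the support and the diffset itself, so the size of the
    stored intermediate result can be compared with the tidset approach.
    """
    diffset = tidsets[x] - tidsets[y]
    return len(tidsets[x]) - len(diffset), diffset


for pair in (("A", "B"), ("A", "C")):
    support, diffset = support_by_diffset(*pair)
    # Both methods must agree on the support value.
    assert support == support_by_tidset(*pair)
    print(pair, "support =", support, "diffset =", sorted(diffset))
```

In this toy data the pair (A, B) needs a four-element tidset but an empty diffset, which illustrates why diffsets can shrink intermediate results when items co-occur often.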
Analyses and examples show that the improved algorithm not only uses less memory during operation but also converges faster.
The Diffsets algorithm represents transaction ids as positive integers, most commonly stored as 32-bit strings. To improve main memory utilization further, the Dif-bits algorithm uses a binary compression technique: it replaces each positive integer with a shorter bit vector, which can reduce main memory usage dramatically. However, many real-world datasets are dense or skewed, and using the dif-bits format alone still wastes a lot of main memory. To solve this problem, we present a new hybrid compression algorithm for data mining, the HC-DM algorithm. It identifies which parts of a dataset are dense and which are sparse and stores each part in a different format, which improves main memory utilization and makes the algorithm more scalable. Performance evaluations indicate that integrating HC-DM with the improved Diffsets algorithm can dramatically reduce the memory required to enumerate all frequent itemsets.
Finally, we briefly discuss and give an outlook on distributed frequent itemset mining based on vertical data representation.
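The abstract does not specify how HC-DM decides between formats, so the following is only a hypothetical sketch of the general idea of choosing a storage format per transaction-id list. It compares three estimated sizes in bits: plain 32-bit integers, a gap encoding in the spirit of dif-bits, and a bitmap over all transactions. The 5-bit length prefix, the cost model, and all function names are assumptions made for illustration.

```python
def gap_encoded_bits(tids, length_prefix_bits=5):
    """Estimated bits when each gap between consecutive sorted tids is stored
    in just enough bits, plus a fixed-size length prefix (an assumed layout)."""
    total = 0
    previous = 0
    for tid in sorted(tids):
        gap = tid - previous  # tids are positive and distinct, so gap >= 1
        total += length_prefix_bits + gap.bit_length()
        previous = tid
    return total


def choose_format(tids, n_transactions):
    """Return the cheapest format name and the estimated cost of each format."""
    costs = {
        "int32": 32 * len(tids),        # one 32-bit integer per tid
        "gaps": gap_encoded_bits(tids),  # compressed differences between tids
        "bitmap": n_transactions,        # one bit per transaction in the dataset
    }
    best = min(costs, key=costs.get)
    return best, costs


dense = set(range(1, 61))  # 60 of 64 transactions contain the item
sparse = {3, 40}           # only 2 of 64 transactions contain the item

print(choose_format(dense, 64))
print(choose_format(sparse, 64))
```

Under this cost model the dense list is cheapest as a bitmap and the sparse list is cheapest as encoded gaps, which mirrors the motivation for storing dense and sparse parts in different formats.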
Keywords/Search Tags:data mining, association rule mining, frequent itemset mining, diffsets, HC-DM algorithm