Font Size: a A A

Research Of Parallel Algorithm For Effectively Mining Association Rules

Posted on:2009-03-10Degree:MasterType:Thesis
Country:ChinaCandidate:Y YuFull Text:PDF
GTID:2178360242481569Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The rapid progress of data collection and data storage technology makes it possible to accumulate massive data. However, extracting useful information from these huge volumes of data has become a great challenge. In general, the databases are always very large, so it is difficult to use traditional data analysis tools and techniques to deal with them. In this way, when new methods are needed, Data Mining comes into being.Data Mining is also known as knowledge discovery in database (KDD). Its main purpose is to extract knowledge that is implied, unknown prior to the mining, user-interested and has potential value for decision-making. After more than 10 years of hard work, data mining has brought about many new ideas and methods. Especially in recent years, some of its basic concepts and methods have become clear. Its study is moving on in the direction of greater depth.Data Mining System can mine various types of models, in which association model that describes the interesting links between items in given data sets is a very important research direction. Association rule is first put forward by Rakesh Agrawal, Tomasz Imielinski et al. from the United States of the IBM Almaden Research Center in the 1993 ACM SIGMOD Data Management Branch. It was originally derived from the analysis of data on the supermarkets, which is used in the supermarket to find the implied relationships between the goods that users purchase and provide the basis for decision-making marketing.The task of Association Rules Mining is to find the rules whose support and confidence are greater than (or equal to) the user-specified minimum support and confidence. It can be divided into two steps. The first step is to find all the frequent sets that meet the user-specified minimum support. It is the key step in association rules mining because it concentrates almost all of the calculation. At present most of research has focused on this step. The second step is to generate all the association rules that meet the minimum confidence from the frequent sets. This step is direct and efficient because it does not require scanning the database. Association rules usually have the following form: A?B.Data mining directly faces the massive data. Such data usually have more than 100 attributes and millions of records, and inevitably lead to the proliferation of search dimension and search space in data mining process. The enormous size of the database, the remote distribution of data and the calculation complexity of data mining methods themselves call for parallel data mining, the basic idea of which is to divide a big calculation problem into many smaller sub-problems based on certain rules, and schedule the sub-problems to different nodes with high performance for calculating.In this paper, I analyze some of the existing parallel algorithms for mining association rules, and sum up several effective strategies for mining association rules in parallel. Overall, the existing methods can be summed up to the following aspects: The traversing direction is top-down, bottom-up, or both; the strategy used for searching is breadth-first or depth-first priority strategy; whether it is needed to generate candidate itemsets for frequent itemsets'generation; whether it is needed to scan the database for several times when testing a candidate itemset; using horizontal or vertical database layout to express the transactions and so on. Traversal methods, search strategies and database layout may have different combination ways which have a great impact on the efficiency of the algorithm.Following that, on the basis of Apriori algorithm, a matrix-based algorithm for mining association rules, Matrix_Apriori algorithm, is designed. Different from Apriori algorithm, MA algorithm converts a transaction database to bit strings that can be stored into memory, which makes the calculation of support much simpler than Apriori algorithm. Furthermore the database is stored in memory, so we don't have to scan database for many times. This can reduce disk I/O and the time for scanning the database, and improve the response time of the algorithm.I design a parallel algorithm for mining association rules - Parallel Matrix_Apriori algorithm based on MA algorithm. At first, the algorithm changes the database into a bit string at one node, divides it into uniform parts, and distributes them to all the other nodes. Then each node mines their own local database to compute the support counts and get the frequent itemsets whose support counts are greater than the local minimum support.These frequent itemsets are broadcasted to all the other nodes in order to get the support counts on those nodes. At last, one of the nodes reduces the global support counts and broadcast them to all the other nodes. Each node gets these counts, and finally gets the global frequent itemsets through removing the itemsets whose support counts are less than the global minimum support. Repeat the above operations, until no new global frequent itemsets are produced.In the Linux cluster environment, I use Message Passing Interface (MPI) to realize the parallel algorithm described in this paper. In order to provide some experimental data for the serial and parallel algorithms, I achieve the synthetic data generation program that Agrawal and his workmates put forward in their paper. Then, perform some experiments on the serial and parallel algorithms using these synthetic data. Analyze the scaleup, speedup, sizeup of the algorithm. Experimental results show that our algorithm has a good efficiency.At present, the research of mining association rules in parallel has made a great progress, but there are still insufficiencies in some respects and need further study to propose better solutions:1. At present, the normal way to improve the parallel algorithm for mining association rules is to reduce the calculation load of each node through data partition or task partition. Most of the existing parallel algorithms achieve this in a static manner. PMA algorithm is no exception. But this usually leads to unbalanced load and impacts the response time of algorithm because of its more synchronization and wait. So, achieving dynamic load balance is the research direction of next phase;2. With the decrease of minimum support, the number of itemsets will increase greatly. Since our algorithm has to transfer the sets among the nodes in the clusters, if the minimum support is very low, the algorithm will have to endure enormous communication pressure, thus affect the response time of the algorithm. As a result of that, pruning effectively in the mining process to reduce inter-node communication is one of the important directions in the future work;3. If we can use some parallel-special performance show software or debugging software, such as TotalView, while programming parallel algorithms, the programming efficiency of the parallel algorithms will be greatly enhanced. This will also be conducive to the development of high performance parallel procedures.
Keywords/Search Tags:Effectively
PDF Full Text Request
Related items