
Research For Association Rules Algorithm On Big Data

Posted on: 2017-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhou
Full Text: PDF
GTID: 2348330509962617
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet has led to rapid growth in data: many enterprises now generate terabytes or even petabytes of data every day. Faced with such huge data sets, the data mining process is constrained by algorithmic complexity, the operating platform, and other issues, so the desired results often cannot be achieved. Association rule mining is an important branch of data mining with a wide range of applications. The rise of Hadoop, an open-source platform with strong fault tolerance and scalability, offers a new direction for data mining algorithms: it can relieve the heavy computational and I/O burden incurred when association rule algorithms are executed.

This thesis focuses on two classical association rule algorithms, Apriori, which operates on a horizontal data layout, and Eclat, which operates on a vertical data layout, and presents improved versions of both on the Hadoop platform. Apriori requires iterative computation and repeated scans of the database, which does not suit the Hadoop platform well; we therefore propose a strategy that reduces the data set, determines the highest order Km, and uses pruning based on the 2-frequent itemsets to generate the 3- to Km-itemsets. For Eclat, we present two parallel algorithms on Hadoop: D-MREclat and A-MREclat. In the first, the data is partitioned into small blocks according to the range of the data collection, which reduces the number of intersection operations in the next phase and improves efficiency. The second introduces the Apriori idea: the search space is divided by prefix to parallelize Eclat, a method that is more efficient on large data sets. Eclat generates a large number of candidate sets during execution and therefore consumes a great deal of memory.
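To make the vertical layout and prefix-based search-space division concrete, the following is a minimal sketch of sequential Eclat on a toy transaction database. The function names (`vertical`, `eclat`) and the toy data are illustrative assumptions, not the thesis implementation; in A-MREclat each prefix-rooted subtree of this recursion would be assigned to a different worker.

```python
def vertical(db):
    """Convert horizontal transactions to item -> tidset (vertical layout)."""
    tidsets = {}
    for tid, items in enumerate(db):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def eclat(prefix, items, min_support, out):
    """Depth-first Eclat: extend `prefix` with each item, intersecting tidsets.

    `items` is a list of (item, tidset) pairs sharing `prefix`; support of an
    extended itemset is simply the size of the intersected tidset.
    """
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_support:
            itemset = prefix + (item,)
            out[itemset] = len(tids)
            # Candidates extending this itemset: intersect with later tidsets.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(itemset, suffix, min_support, out)
    return out

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
frequent = eclat((), sorted(vertical(db).items()), min_support=3, out={})
# e.g. frequent[('a', 'b')] == 3
```

Because every itemset beginning with, say, prefix `('a',)` is generated entirely inside the recursive call rooted at that prefix, the prefix classes can be mined independently, which is what makes the search-space division parallelizable.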
In this thesis, the computed results are compressed before being stored. This reduces network traffic and improves efficiency. Finally, the three improved parallel association rule algorithms are implemented, and their performance is evaluated on a Hadoop cluster using data sets of different types and sizes. Experimental results show that the improved algorithms exhibit better performance.
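The support-counting step that these algorithms distribute can be sketched as a single simulated MapReduce round: mappers emit `(candidate, 1)` for each candidate itemset contained in a transaction, and reducers sum the counts per key. This is a toy, in-process simulation assuming a given candidate set; the names `map_phase` and `reduce_phase` are illustrative, not the thesis code or the Hadoop API.

```python
from collections import defaultdict

def map_phase(transactions, candidates):
    """Mapper: emit (candidate, 1) whenever a candidate occurs in a transaction."""
    for t in transactions:
        for c in candidates:
            if c <= t:  # candidate itemset is a subset of the transaction
                yield (tuple(sorted(c)), 1)

def reduce_phase(pairs):
    """Reducer: sum the emitted counts for each candidate key."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}, {'a', 'c'}]
candidates = [frozenset('ab'), frozenset('bc'), frozenset('ac')]
support = reduce_phase(map_phase(transactions, candidates))
# e.g. support[('a', 'b')] == 2
```

In a real Hadoop job the shuffle phase groups the mapper output by key across the cluster before the reducers run; the in-memory dictionary above stands in for that step.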
Keywords/Search Tags: Hadoop, MapReduce, association rules, frequent itemsets, parallel computing