
Research And Implementation Of Mining Algorithm For Association Rules In Big Data Based On Hadoop

Posted on: 2016-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: J G Liao
GTID: 2308330479993918
Subject: Computer system architecture
Abstract/Summary:
In recent years, with the explosive growth of data volume, how to mine valuable information from big data has attracted widespread attention. Data mining is currently an important technical means of solving this problem, and mining the frequent itemsets in a dataset to derive association rules is an important part of it. However, traditional data mining algorithms cannot adapt to the characteristics of big data, so studying and proposing new data mining algorithms suited to the big data environment has become urgent and necessary. This thesis analyzes in depth the current domestic and foreign big data mining algorithms and puts forward an effective and fast algorithm for mining association rules in big data, with the aim of solving the problem of slow mining speed on large datasets. The main work of this thesis can be summarized as follows:

(1) The current status and existing problems of data mining technology are analyzed. The main contradictions to be resolved in the current big data environment are those between the growing amount of data and people's desire for valuable information, and between the growth of the amount of data and the pace of hardware development. Big data has not lowered people's expectations of data mining speed; on the contrary, people hope for a fast and accurate method of extracting valuable information from big data.

(2) Current domestic and foreign data mining algorithms, the distributed computing framework Hadoop, and the distributed computing model MapReduce are analyzed. Apriori scans the database many times, which incurs a lot of I/O overhead. FPGrowth uses the FPTree structure to compress the original database, but it builds too many subtrees during iteration, which hinders the mining process. Hadoop reduces the difficulty of distributed programming and is easy to manage, and MapReduce is well suited to association rule mining, so Hadoop and MapReduce have certain advantages for mining association rules in a big data environment.

(3) The PrePost algorithm is studied and an improved version is given. PrePost combines the advantages of the FPGrowth algorithm and vertical mining algorithms, but it obtains frequent itemsets in a way similar to the Apriori algorithm. Although merging two N-lists has linear time complexity, if there are S frequent K-itemsets, the algorithm needs S*(S-1)/2 comparisons, which makes the time overhead non-negligible. In addition, mining (K+1)-itemsets requires keeping all the frequent K-itemsets in memory, which is likely to exceed the memory capacity. Therefore, this thesis proposes a bottom-up, depth-first strategy to improve the PrePost algorithm.

(4) A novel big data mining algorithm based on the Hadoop platform, called MRPrePost, is put forward. To some extent it compensates for the shortcomings of existing data mining algorithms in a big data environment. A major factor affecting the performance of a parallel algorithm is cluster load. To improve the performance of MRPrePost, the thesis proposes a grouping strategy that keeps the cluster load balanced. Experiments show that the MRPrePost algorithm can adapt to association rule mining on big data.
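
To make the S*(S-1)/2 figure in item (3) concrete, the following minimal Python sketch counts the unordered pairs of frequent K-itemsets that a level-wise, Apriori-style join has to consider. It is only an illustration of the counting argument, not the thesis's implementation: the function names and the sample itemsets are hypothetical, and no N-list merging is performed.

```python
from itertools import combinations


def pairwise_comparisons(num_itemsets: int) -> int:
    """Closed form S*(S-1)/2: unordered pairs among S frequent K-itemsets."""
    return num_itemsets * (num_itemsets - 1) // 2


def enumerate_pairs(itemsets):
    """List every unordered pair a level-wise join would have to consider."""
    return list(combinations(itemsets, 2))


if __name__ == "__main__":
    # Hypothetical frequent 2-itemsets, used only for illustration.
    frequent_2_itemsets = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
    pairs = enumerate_pairs(frequent_2_itemsets)

    # The explicit enumeration agrees with the closed form.
    assert len(pairs) == pairwise_comparisons(len(frequent_2_itemsets))

    # The pair count grows quadratically with S.
    for s in (10, 1_000, 100_000):
        print(s, pairwise_comparisons(s))
```

For S = 1,000 this already gives 499,500 pairs, which shows why the number of comparisons matters even though each individual N-list merge is linear.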
Keywords/Search Tags:Big Data, Data Mining, Association rules, Hadoop, PrePost, MRPrePost