Research On The Technology Of Mass Data Parallel Mining

Posted on: 2015-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: F F Sun
Full Text: PDF
GTID: 2268330425488968
Subject: Communication and Information System
Abstract/Summary:
Data mining is the process of extracting previously unknown, interesting knowledge from large amounts of data using specialized algorithms. In the age of networked information, data is growing explosively, and traditional serial algorithms are inefficient when processing mass data. Improving the efficiency of massive data mining has therefore become an urgent problem, and parallel data mining is an effective way to address it. Incremental mining is another approach to improving mining efficiency: it uses existing knowledge to mine a data set that has been updated.

MapReduce is a simple programming model proposed by Google that can process mass data in a distributed, parallel fashion. Compared with other parallel programming models, the programmer does not need to consider issues such as data partitioning, data allocation, and scheduling, and the framework can handle node failures in the cluster.

Association rules have been widely used in electronic commerce, medical diagnosis, weather forecasting, banking, telecommunications, and other industries, and have long been a hot topic in data mining research. This thesis takes the discovery of frequent itemsets in association rule mining as its starting point. To improve the efficiency of finding frequent itemsets in mass data, parallel and incremental association rule mining algorithms are studied on the basis of MapReduce.

Firstly, the thesis analyzes association rule algorithms and identifies the shortcomings of the Apriori algorithm. Using logical operations on vectors, the algorithm is improved in three major areas: the number of database scans, the method of generating candidate itemsets, and transaction compression. The result is an improved association rule algorithm, Apriori_M. Secondly, the MapReduce parallel programming model is analyzed in depth.
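The abstract does not spell out how Apriori_M applies logical operations on vectors. One common way to realize the idea, shown here only as a minimal sketch under that assumption, is to keep one bit vector per item marking the transactions that contain it. The support of an itemset is then the number of set bits in the bitwise AND of its items' vectors, so the transaction database is scanned only once. The names `build_item_vectors` and `support` are hypothetical and are not taken from the thesis.

```python
from functools import reduce


def build_item_vectors(transactions):
    """Scan the transactions once and return {item: int used as a bit vector}.

    Bit i of an item's vector is 1 when transaction i contains the item.
    """
    vectors = {}
    for index, transaction in enumerate(transactions):
        for item in transaction:
            vectors[item] = vectors.get(item, 0) | (1 << index)
    return vectors


def support(itemset, vectors):
    """Count the transactions that contain every item of a non-empty itemset."""
    # AND the per-item vectors; an unknown item contributes an all-zero vector.
    combined = reduce(lambda a, b: a & b, (vectors.get(i, 0) for i in itemset))
    return bin(combined).count("1")


# Toy data, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
vectors = build_item_vectors(transactions)
print(support({"bread", "milk"}, vectors))
```

Because every support query works on the precomputed vectors, candidate itemsets of any length can be checked without rescanning the transactions, which is the kind of scan reduction the abstract refers to.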
To improve the capacity of the Apriori_M algorithm to process massive data, a parallel version of the algorithm is proposed based on the idea of Partition and implemented with the MapReduce programming model.

Thirdly, the thesis studies incremental association rule mining. Two parallel incremental association rule mining algorithms are proposed on the basis of the FUP algorithm, which can handle data sets to which data is dynamically added. Each algorithm can be divided into two stages: generating candidate itemsets and verifying candidate itemsets. The MFUP1 algorithm generates candidate itemsets sequentially and then selects the frequent itemsets from them in parallel, which suits new data sets of smaller scale. The MFUP2 algorithm, by contrast, generates candidate itemsets in parallel and then verifies in parallel which of them are frequent, which suits new data sets of larger scale (though still small compared with the original data set).

Finally, the thesis tests the performance of the proposed MapReduce-based parallel association rule algorithm and parallel incremental mining algorithms. A simulation platform built on the open-source Hadoop cloud platform is used to analyze the algorithms. The experimental results show that the parallel Apriori_M, MFUP1, and MFUP2 algorithms can efficiently find frequent itemsets in mass data, so the improved algorithms are feasible and effective.
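The Partition idea and the generate-then-verify structure described above can be illustrated with a small single-process sketch. It is not the thesis's Hadoop implementation: plain Python functions stand in for the map and reduce steps, and the helper names (`local_frequent_itemsets`, `count_candidates`, `partition_mine`), the brute-force local miner, and the toy data are all assumptions made for illustration. The sketch relies on the known property that an itemset frequent in the whole data set must be frequent in at least one partition.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def local_frequent_itemsets(partition, min_ratio, max_size=3):
    """Map step, phase 1: itemsets frequent within one partition (brute force)."""
    counts = Counter()
    for transaction in partition:
        items = sorted(transaction)
        for size in range(1, min(max_size, len(items)) + 1):
            counts.update(combinations(items, size))
    threshold = min_ratio * len(partition)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def count_candidates(partition, candidates):
    """Map step, phase 2: count each global candidate inside one partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if set(candidate) <= transaction:
                counts[candidate] += 1
    return counts


def partition_mine(partitions, min_ratio, max_size=3):
    """Run both phases; the union and the sum play the role of the reduce steps."""
    # Phase 1 reduce: union of locally frequent itemsets gives global candidates.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent_itemsets(part, min_ratio, max_size)
    # Phase 2 reduce: sum per-partition counts and keep globally frequent itemsets.
    totals = Counter()
    for part in partitions:
        totals.update(count_candidates(part, candidates))
    threshold = min_ratio * sum(len(part) for part in partitions)
    return {itemset: n for itemset, n in totals.items() if n >= threshold}


# Toy partitions; a Fraction avoids floating-point rounding at the threshold.
partitions = [
    [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],
    [{"b", "c"}, {"a", "b"}, {"b"}],
]
result = partition_mine(partitions, Fraction(1, 2))
print(result)
```

Each partition is processed independently in both phases, so the per-partition calls could be distributed across cluster nodes, with only the candidate set and the partial counts exchanged between phases.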
Keywords/Search Tags: Mass data, Parallel mining, Association rules, Incremental mining