Research And Application On The Parallel Algorithm In Big Data Mining

Posted on: 2016-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: H Xie
Full Text: PDF
GTID: 2308330473955890
Subject: Computer application technology
Abstract/Summary:
With the arrival of the big data era, the volume of information has grown explosively. How to mine useful information from massive data sets is the problem we now face. Traditional data mining algorithms perform poorly on big data. When processing on a single machine is inefficient, cluster-based parallel computing can be used instead: distributing the processing of a large data set across multiple machines greatly improves efficiency.

Hadoop and Spark are distributed processing frameworks developed under Apache, used mainly for large-scale data storage and distributed computing. HDFS offers high throughput and high fault tolerance for file reads and writes. Spark and MapReduce provide parallel programming models, so that users can write distributed processing programs simply by calling the associated APIs. These open source frameworks provide favorable conditions for large-scale data processing.

This thesis studies the parallelization of data mining algorithms on the currently popular distributed processing frameworks Hadoop and Spark. The main work is as follows:

(1) Improvement of the parallel collaborative filtering algorithm. The existing parallel collaborative filtering algorithm based on the co-occurrence matrix spends a lot of time constructing the co-occurrence matrix and computing the matrix multiplication. It also ignores the role of neighbor users, which reduces recommendation accuracy. To solve these problems, this thesis proposes an improved parallel collaborative filtering algorithm (ACF) and implements it on Spark. The experimental results show that ACF achieves better running efficiency and higher recommendation accuracy.

(2) Parallel improvement of the association rule algorithm FP_Growth. The existing parallel PFP_Growth algorithm does not consider load balance in the FList grouping step. To solve this problem, this thesis proposes the APFP_Growth algorithm, an improvement on PFP_Growth. The experimental results show that APFP_Growth scales well and balances load better than PFP_Growth.

(3) Design and implementation of a big data mining platform. The platform supports data mining and analysis on big data sets. It provides data preprocessing, clustering, classification, recommendation, association rules, and other functions. The platform encapsulates data mining algorithms based on Hadoop and Spark, and it provides a flexible, configurable algorithm model and data model. With the platform, users can analyze big data sets easily and quickly.
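As an illustration of the baseline described in item (1), the following single-machine Python sketch builds an item co-occurrence matrix and scores unseen items by multiplying it with one user's preference vector. It is not the thesis's ACF algorithm, whose details the abstract does not give, and it does not use Spark. The function names and the toy ratings are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 2, "C": 5},
    "u3": {"B": 4, "C": 3, "D": 1},
}

def build_cooccurrence(user_items):
    """Count, for every item pair, how many users have both items."""
    cooc = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        # Enumerating pairs is quadratic in the number of items per user.
        for i, j in combinations(sorted(items), 2):
            cooc[i][j] += 1
            cooc[j][i] += 1
    return cooc

def recommend(user_prefs, cooc, top_n=2):
    """Score unseen items as (co-occurrence matrix) x (preference vector)."""
    scores = defaultdict(float)
    for item, rating in user_prefs.items():
        for other, count in cooc[item].items():
            if other not in user_prefs:
                scores[other] += count * rating
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]

cooc = build_cooccurrence(ratings)
recommendations = recommend(ratings["u1"], cooc)
```

Note that the scores depend only on item-to-item counts. No user-to-user similarity enters the computation, which is the sense in which this baseline ignores neighbor users.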
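The load balance issue in item (2) can be illustrated with a small Python sketch. The abstract does not say how APFP_Growth groups the FList, so the code below is only an assumed example: it compares a load-unaware split into contiguous chunks with a generic greedy heuristic that always gives the heaviest remaining item to the currently lightest group. The per-item `load` values are hypothetical cost estimates, and all function names are invented for this example.

```python
import heapq

# Hypothetical FList and per-item cost estimates (not from the thesis).
flist = ["a", "b", "c", "d", "e", "f"]
load = {"a": 9, "b": 7, "c": 6, "d": 3, "e": 2, "f": 1}

def naive_groups(flist, num_groups):
    """Split the FList into contiguous chunks with equal item counts."""
    size = -(-len(flist) // num_groups)  # ceiling division
    return [flist[k:k + size] for k in range(0, len(flist), size)]

def balanced_groups(flist, load, num_groups):
    """Greedy heuristic: heaviest item first, into the lightest group."""
    heap = [(0, g) for g in range(num_groups)]  # (total load, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    for item in sorted(flist, key=lambda it: -load[it]):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load[item], g))
    return groups

def group_loads(groups, load):
    """Total estimated load of each group."""
    return [sum(load[it] for it in grp) for grp in groups]

naive = group_loads(naive_groups(flist, 3), load)
balanced = group_loads(balanced_groups(flist, load, 3), load)
```

With these made-up numbers, the contiguous split yields group loads of 16, 9, and 3, while the greedy assignment yields 10, 9, and 9. In a parallel job the slowest group determines the running time, so the gap between the heaviest and lightest group is what a balanced grouping tries to reduce.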
Keywords/Search Tags:Hadoop, Spark, data mining, collaborative filtering, association rules