Research And Application On The Parallel Algorithm In Big Data Mining

Posted on:2016-02-02

Degree:Master

Type:Thesis

Country:China

Candidate:H Xie

Full Text:PDF

GTID:2308330473955890

Subject:Computer application technology

Abstract/Summary:

With the arrival of the big data era, the volume of information is growing explosively. Extracting useful information from massive data sets has become a pressing problem, and traditional data mining algorithms perform poorly at this scale. When single-machine processing is inefficient, cluster-based parallel computing can spread the work across multiple machines and greatly improve processing efficiency.

Hadoop and Spark are distributed processing frameworks developed by Apache, mainly used for large-scale data storage and distributed computing. HDFS offers high throughput and strong fault tolerance for file reads and writes. Spark and MapReduce provide parallel programming models, so users can write distributed processing programs simply by calling the associated APIs. These open source frameworks provide favorable conditions for large-scale data processing.

This thesis studies the parallelization of data mining algorithms on currently popular distributed processing frameworks such as Hadoop and Spark. The main work is as follows:

(1) Improvement of the parallel collaborative filtering algorithm. The existing parallel collaborative filtering algorithm based on a co-occurrence matrix spends a great deal of time constructing the co-occurrence matrix and computing the matrix multiplication. It also ignores the role of neighbor users, which reduces recommendation accuracy. To address these problems, this thesis proposes an improved parallel collaborative filtering algorithm (ACF) and implements it on Spark. The experimental results show that the improved algorithm runs more efficiently and achieves higher recommendation accuracy.

(2) Parallel improvement of the FP_Growth association rule algorithm. The existing parallel PFP_Growth algorithm does not consider load balance in the FList grouping step. To address this problem, this thesis proposes the APFP_Growth algorithm, an improvement on PFP_Growth. The experimental results show that APFP_Growth scales well and balances load better than PFP_Growth.

(3) Design and implementation of a big data mining platform. The platform supports data mining and analysis on big data sets. Its functions include data preprocessing, clustering, classification, recommendation, and association rule mining, among others. The platform encapsulates data mining algorithms based on Hadoop and Spark, and it provides flexible, configurable algorithm and data models. With the platform, users can analyze big data sets easily and quickly.
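
To make the co-occurrence-matrix baseline in item (1) concrete, the following single-machine Python sketch illustrates the general idea: count how often each pair of items is rated by the same user, then score unseen items by multiplying those counts with the user's rating vector. It is an illustration only, not the thesis's ACF algorithm or its Spark implementation, which the abstract does not detail. The function names `build_cooccurrence` and `recommend` and the toy ratings are assumptions made for this example.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(user_items):
    """Count, for each ordered item pair (i, j), how many users rated both."""
    cooc = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for i, j in permutations(items.keys(), 2):
            cooc[i][j] += 1
    return cooc


def recommend(user, user_items, cooc, top_n=3):
    """Score each unseen item as sum(co-occurrence count * user's rating)."""
    rated = user_items[user]
    scores = defaultdict(float)
    for item, rating in rated.items():
        for other, count in cooc[item].items():
            if other not in rated:  # only recommend items the user has not rated
                scores[other] += count * rating
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy data: user -> {item: rating}
ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 3.0},
}
cooc = build_cooccurrence(ratings)
print(recommend("u1", ratings, cooc))  # [('C', 11.0)]
```

The pairwise counting loop and the count-times-rating accumulation correspond to the two steps the abstract identifies as costly: constructing the co-occurrence matrix and computing the matrix multiplication. Note that the scores depend only on item co-occurrence, so similarity between users plays no part in this baseline.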

Keywords/Search Tags:

Hadoop, Spark, data mining, collaborative filtering, association rules

Related items

1	Research And Application On Association Rules Mining Algorithm Base On Hadoop
2	The Research Of Quantitative Association Rules Data Mining Based On Hadoop
3	Research On Collaborative Filtering Recommendation Algorithm Based On Association Rule Optimization
4	Research And Implementation Of Mining Algorithm For Association Rules In Big Data Based On Hadoop
5	Mining Association Rules Algorithm Analysis Based On Hadoop
6	An Improved Algorithm Of Association Rules Based On The Spark
7	Research On Parallel Association Rules Algorithm Based On HADOOP Platform
8	Research On Algorithm And Application Of Big Data Association Rules Mining Based On Hadoop
9	The Research And Implementation Of Algorithm For Mining Association Rules Based On BigData
10	Research On Association Rules Algorithm Based On Hadoop