The Research And Implementation Of Mining Large Data Based On Spark

Posted on:2016-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:W D Li


GTID:2308330461985252

Subject:Software engineering

Abstract/Summary:


We have entered the age of big data, and choosing a platform that can process, mine, and analyze large data sets efficiently has become crucial. Spark is an open source, parallel, distributed computing framework that is well suited to iterative algorithms and interactive data analysis. It can process data in real time with high accuracy while also guaranteeing high fault tolerance and scalability. Spark is already widely used in production at many companies, including Amazon, eBay, and Yahoo abroad and Taobao, Baidu, Huawei, and Youku Tudou in China.

Although Spark is widely used in industrial practice, it is limited by its relatively recent origin and the immaturity of its releases. At this stage, using Spark for big data mining and analysis in some specific application scenarios requires combining its existing modules with rewritten or newly added functions before it can perform well. For some big data mining problems, Spark still has no corresponding functionality. For example, MLlib, its distributed machine learning library, does not yet provide a distributed algorithm for association analysis.

The main work of this thesis is:

(1) To build the test environment, we set up a Spark on YARN cluster consisting of one Master node and multiple Worker nodes. The cluster runs the Linux operating system and is used to verify the experimental data and to test the algorithms and system described in this thesis. To make developing and testing Spark applications more convenient, we also set up a Scala development environment in IDEA, where code is first debugged in stand-alone mode and then packaged as a jar and run on the cluster.

(2) To implement collaborative filtering recommendation for large-scale data processing, this thesis uses Scala, Java, and Spark RDDs, and calls the MatrixFactorizationModel and ALS modules of MLlib, to build a distributed parallel collaborative filtering recommender. The recommender trains models with different parameters on the training set, checks them against the validation set to obtain the optimal model parameters, then predicts ratings on the test set and uses the best model to make recommendations to users.

(3) To implement association analysis for large-scale data processing, this thesis mainly uses Scala and Spark RDD distributed operators to build a distributed parallel version of the classic Apriori algorithm. The implementation is tested and verified on a Spark cluster with GB-scale data sets, and its running efficiency and results are compared with those of a single-machine Apriori implementation written in Java.

The contributions of this thesis are:

(1) Solving the problems of building a Spark on YARN cluster and expanding its scale.

(2) Providing a distributed parallel implementation scheme for collaborative filtering recommendation on a cluster.

(3) Implementing a distributed parallel Apriori algorithm, which offers a solution to the association analysis problem for which MLlib does not yet provide a distributed algorithm.

Together, these contributions provide feasible solutions for collaborative filtering and distributed parallel association analysis in a big data setting, and thereby enrich and improve Spark's mining capability in specific application scenarios.
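For readers unfamiliar with Apriori, the following is a minimal single-machine sketch in Java, the language the abstract names for its non-distributed baseline. It is not the author's code: the class name `AprioriSketch`, the helper `basket`, and the example transactions are all hypothetical, and support is assumed to be an absolute transaction count. The counting pass over the transactions is the step that a distributed version would presumably spread across workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Single-machine Apriori sketch; all names and data here are illustrative assumptions. */
class AprioriSketch {

    /** Convenience helper (hypothetical) for building one transaction. */
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Returns every itemset found in at least minSupport transactions, with its count. */
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> transactions,
                                                      int minSupport) {
        Map<Set<String>, Integer> result = new HashMap<>();

        // Level-1 candidates: each distinct item as a singleton itemset.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(new TreeSet<>(Arrays.asList(item)));
            }
        }

        while (!candidates.isEmpty()) {
            // Counting pass: one scan of all transactions per level.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions) {
                for (Set<String> c : candidates) {
                    if (t.containsAll(c)) {
                        counts.merge(c, 1, Integer::sum);
                    }
                }
            }

            // Keep only candidates that reach the support threshold.
            List<Set<String>> frequent = new ArrayList<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minSupport) {
                    frequent.add(e.getKey());
                    result.put(e.getKey(), e.getValue());
                }
            }
            candidates = nextCandidates(frequent);
        }
        return result;
    }

    /** Joins frequent k-itemsets into (k+1)-candidates, pruning by the Apriori property. */
    static Set<Set<String>> nextCandidates(List<Set<String>> frequent) {
        Set<Set<String>> frequentSet = new HashSet<>(frequent);
        Set<Set<String>> next = new HashSet<>();
        for (int i = 0; i < frequent.size(); i++) {
            for (int j = i + 1; j < frequent.size(); j++) {
                Set<String> union = new TreeSet<>(frequent.get(i));
                union.addAll(frequent.get(j));
                if (union.size() != frequent.get(i).size() + 1) {
                    continue; // the two itemsets differ in more than one item
                }
                // Prune: every k-subset of a candidate must itself be frequent.
                boolean allSubsetsFrequent = true;
                for (String item : union) {
                    Set<String> subset = new TreeSet<>(union);
                    subset.remove(item);
                    if (!frequentSet.contains(subset)) {
                        allSubsetsFrequent = false;
                        break;
                    }
                }
                if (allSubsetsFrequent) {
                    next.add(union);
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                basket("bread", "milk"),
                basket("bread", "butter"),
                basket("bread", "milk", "butter"),
                basket("milk"));
        System.out.println(frequentItemsets(transactions, 2));
    }
}
```

With the four example baskets and a minimum support of 2, the pair {milk, butter} is dropped because it occurs in only one basket, so the three-item set is pruned without being counted. Each level needs a full scan of the transactions, which is why the counting step is the natural place to parallelize.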

Keywords/Search Tags:

Big data, Spark, Distributed Computing, Collaborative Filtering, Apriori


Related items

1	A Study On Spark-based Distributed Collaborative Filtering And Its Tools
2	Research On Collaborative Filtering Recommendation Algorithm Based On Spark And System Implementation
3	Scalable Solution Of Collaborative Filtering Algorithm Based On Dimension And Distributed Computing
4	Research On Improved Distributed Collaborative Filtering Recommendation Algorithm
5	Research And Implementation Of Collaborative Filtering Recommendation System Based On Spark Big Data Processing
6	Enhanced Singular Collaborative Filtering Based Recommender System On Apache Spark
7	An Item-based Collaborative Filtering Recommendation Algorithm Optimization And Parallel Implementation On Spark Platform
8	A System For Distributed MD Data Analysis Based On Spark
9	Research And Implementation Of Collaborative Filtering Recommendation System Based On Spark Large Data Processing
10	Research On Hierarchical Collaborative Filtering Algorithm With Spark Platform