Study Of Some Techniques In Data Mining Based On Spark

Posted on:2016-01-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Ning

GTID:2308330470469328

Subject:Computer application technology

Abstract/Summary:

Distributed frameworks and parallel computing methods have developed rapidly with the rise of big data. Big data brings growth in both data volume and computational complexity, and traditional data mining approaches are stretched to their limits. Current research therefore focuses on how to carry out large-scale data mining tasks quickly and accurately. Traditional processing cannot handle massive data mining workloads, and the popular Hadoop platform has in recent years proved inefficient for data mining tasks. By contrast, the big data platform Spark is based on in-memory computation, is highly flexible and general-purpose, and offers strong advantages for implementing data mining algorithms. This thesis combines the Spark big data platform with data mining: it studies data mining tasks on Spark and optimizes and extends their algorithms. It also tests accuracy, throughput, and processing speed in practical applications to demonstrate the effectiveness of the work.

The thesis designs an implementation structure for association rule algorithms on Spark and implements the Apriori algorithm on the Spark platform. It applies several optimizations to Apriori, based on the characteristics of both Spark and the algorithm, so that the algorithm runs concurrently on very large data sets and produces correct results in a relatively short time. The algorithm is also applied in practice, and throughput and processing-time tests confirm the validity of the implementation.

The thesis studies the implementation structure of classification algorithms on Spark, improves the naïve Bayes classification algorithm, designs its implementation on Spark Streaming, and achieves real-time classification of streaming data. The effectiveness of the algorithm is then tested on spam SMS classification.

The thesis also studies the K-means clustering algorithm in Spark's machine learning library MLlib. Because of the way K-means works, it can only identify convex clusters of similar size, and the thesis aims to remedy this defect. The improved K-means algorithm can handle clusters that differ considerably in size. The improved algorithm is implemented on Spark, where it yields better results when clustering consumer groups.
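
As an illustration of the association rule work described in the abstract, the following is a minimal sketch of the standard level-wise Apriori search, written in plain Python rather than Spark. It is not the thesis's implementation and does not include its optimizations. The data is assumed to be already split into partitions; the function names `count_candidates` and `frequent_itemsets`, and the representation of transactions as frozensets, are hypothetical choices made for this example. Support is counted per partition and the partial counts are then merged, which mirrors how such counting is typically distributed.

```python
from collections import Counter
from itertools import combinations


def count_candidates(partition, candidates):
    """Map step: count the candidate itemsets contained in one partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def frequent_itemsets(partitions, min_support):
    """Level-wise Apriori over a list of partitions (lists of frozensets).

    Returns a dict mapping each frequent itemset to its absolute support.
    """
    # Level 1 candidates: every single item that occurs anywhere.
    items = {item for part in partitions for t in part for item in t}
    candidates = {frozenset([item]) for item in items}
    result = {}
    k = 1
    while candidates:
        # Reduce step: merge the per-partition counts into global counts.
        total = Counter()
        for part in partitions:
            total.update(count_candidates(part, candidates))
        frequent = {c: n for c, n in total.items() if n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return result


# Hypothetical toy data: two partitions of shopping-basket transactions.
partitions = [
    [frozenset("abc"), frozenset("ab")],
    [frozenset("ac"), frozenset("bc"), frozenset("abc")],
]
result = frequent_itemsets(partitions, min_support=3)
```

With this toy data and a minimum support of 3, every single item and every pair is frequent, while the triple {a, b, c} occurs only twice and is rejected. On Spark, the per-partition counting and the merge would map naturally onto RDD operations such as `mapPartitions` and `reduceByKey`, though how the thesis structures this is not stated in the abstract.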

Keywords/Search Tags:

Big Data, Distributed, Spark, Data Mining

Related items

1	Research And Implementation Of Unified Large Data Mining Service Platform Based On Spark MLlib
2	Research On Data Mining Technology Based On Spark
3	Study Of Some Techniques In Data Mining Based On Spark
4	The Research And Implementation Of Mining Large Data Based On Spark
5	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
6	Research On Distributed Frequent Itemset Mining Algorithm Based On Spark
7	A Frequent Serial Episode Mining Algorithm With Time Constraints Based On Spark Platform
8	Research Of Large-scale Data Mining Technology Based On Spark
9	Research On Association Mining Optimization Based On Spark Distributed And Application Of Comprehensive Decision
10	Design And Implementation Of A Distributed ETL Tool Using Spark