Study Of Some Techniques In Data Mining Based On Spark

Posted on: 2016-01-14 | Degree: Master | Type: Thesis
Country: China | Candidate: Y H Ning | Full Text: PDF
GTID: 2308330470469328 | Subject: Computer application technology
Abstract/Summary:
Distributed frameworks and parallel computing methods have developed rapidly with the rise of big data. Big data brings growth in both data volume and computational complexity, and traditional data mining approaches are stretched to their limits. Current research therefore focuses on how to carry out large-scale data mining tasks quickly and accurately. Traditional processing cannot handle massive data mining workloads, and the popular Hadoop platform has in recent years shown low efficiency on data mining tasks. By contrast, the big data platform Spark is based on in-memory computing, is highly flexible and general-purpose, and has clear advantages for implementing data mining algorithms. This thesis combines the Spark platform with data mining: it studies data mining tasks on Spark and optimizes and extends the corresponding algorithms. It also tests accuracy, throughput, and processing speed in practical applications to demonstrate the effectiveness of the work.

First, the thesis designs an implementation structure for association rule algorithms on Spark and implements the Apriori algorithm on the Spark platform. Several optimizations are applied to Apriori based on the characteristics of both Spark and the algorithm, so that it runs in parallel on very large data sets and produces correct results in a relatively short time. The algorithm is also applied in practice, and throughput and processing-time tests confirm the validity of the implementation.

Second, it studies the implementation structure of classification algorithms on Spark, improves the naive Bayes classification algorithm, designs its implementation on Spark Streaming, and achieves real-time classification of streaming data. The effectiveness of the algorithm is tested on spam SMS classification.

Third, it studies the K-means clustering algorithm in Spark's algorithm library MLlib. Because of how K-means works, it can only identify convex clusters of similar size, and the thesis aims to remedy this defect. The improved K-means algorithm can identify clusters that differ considerably in size. The improved algorithm is implemented on Spark and yields better results when clustering consumer groups.
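As a rough illustration of the level-wise Apriori idea mentioned in the abstract, the following plain-Python sketch counts candidate itemsets per data partition and then merges the partial counts, which loosely mirrors a map and reduce step. It is not the thesis's implementation and does not use Spark. The function names `count_candidates` and `apriori`, the `partitions` layout (a list of lists of item sets), and the `min_support` parameter are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def count_candidates(partitions, candidates):
    """Count candidate itemsets per partition, then sum the partial counts."""
    total = Counter()
    for part in partitions:              # "map" side: one local pass per partition
        local = Counter()
        for transaction in part:
            for cand in candidates:
                if cand <= transaction:  # candidate is contained in the transaction
                    local[cand] += 1
        total.update(local)              # "reduce" side: merge the partial counts
    return total


def apriori(partitions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    items = {item for part in partitions for t in part for item in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        counts = count_candidates(partitions, candidates)
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent
```

In a Spark setting the per-partition counting would typically be expressed with RDD or DataFrame operations instead of the explicit loop; the sketch only shows the candidate generation and counting logic.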
Keywords/Search Tags:Big Data, Distributed, Spark, Data Mining