Parallel Research And Application Of Machine Learning Algorithm Based On Cloud Platform

Posted on:2017-10-22

Degree:Master

Type:Thesis

Country:China

Candidate:F F Yuan

Full Text:PDF

GTID:2348330485971364

Subject:Software engineering

Abstract/Summary:

With the advent of the information age, data has become one of the most valuable resources, and data volumes in every sector are growing exponentially: business data from e-commerce sites and banks, genomic data, and so on. Existing platforms cannot process data growing at this rate effectively. Hadoop is currently a relatively efficient parallel platform for extracting useful information from large datasets. It uses the MapReduce (MR) programming framework, and its advantage becomes more pronounced as the data volume grows. Mahout is an open-source machine learning library from Apache. Built on the Hadoop platform and the MR computing framework, it provides application developers with efficient algorithm implementations. However, most machine learning (ML) algorithms are iterative, and MR stores intermediate data on HDFS, which consumes a large amount of resources. The Spark computing framework emerged in response to this shortcoming of Mahout. Spark is built on the resilient distributed dataset (RDD), an abstraction of distributed memory that reduces the cost of I/O and of fault tolerance. Spark can also run on Hadoop YARN and use distributed data storage. With the emergence of Spark MLlib, the parallelization of machine learning algorithms improved substantially. This thesis studies three algorithms in Spark MLlib, namely the K-means clustering algorithm, the decision tree classification algorithm, and the decision tree's ensemble form, random forest, and applies them to genomic data that a single machine cannot process. In the first step of data processing, K-means is used to find the best number of categories. In the second step, a random forest classifier is trained on the existing classes and used for subsequent label prediction.
Although the algorithms studied in this thesis are applied mainly to the analysis of genomic data, they are not limited to it. ML algorithms running on a cloud platform with the Spark framework scale well. The experimental results show that Spark-based ML algorithms can effectively improve the analysis of large genomic datasets, which should help advance scientific research on genomic data.
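
The first step described above, using K-means to find the best number of categories, can be illustrated with a small single-machine sketch. This is not the thesis's Spark MLlib code. It is plain Python under several assumptions that do not come from the abstract: the function names (`kmeans`, `choose_k`), the `min_gain` threshold, the farthest-first initialization, and the synthetic 2-D points are all hypothetical. The idea is to run K-means for several candidate values of k, record the within-cluster sum of squared distances, and stop at the first k where adding another cluster no longer reduces that cost by much (an "elbow" heuristic).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at the first point, then repeatedly add the point that is
    farthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm. Returns (centers, cost), where cost is the
    within-cluster sum of squared distances."""
    centers = farthest_first_centers(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, cost


def choose_k(points, candidates, min_gain=0.3):
    """Elbow heuristic: return the first k for which moving to the next
    candidate lowers the cost by less than `min_gain` (relative)."""
    costs = [kmeans(points, k)[1] for k in candidates]
    for i in range(len(candidates) - 1):
        if costs[i] == 0:
            return candidates[i]
        relative_gain = (costs[i] - costs[i + 1]) / costs[i]
        if relative_gain < min_gain:
            return candidates[i]
    return candidates[-1]


# Synthetic data (hypothetical): three well-separated groups of four points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
best_k = choose_k(points, [1, 2, 3, 4, 5])
print("chosen number of clusters:", best_k)
```

The cluster assignments produced for the chosen k could then serve as class labels for a supervised model, which is the role the abstract gives to random forest in the second step. On a cluster, the per-k runs would be delegated to a distributed K-means implementation rather than the loop shown here.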

Keywords/Search Tags:

Cloud Computing, Spark, K-means, Decision Tree, Random Forest, Data Mining

Related items

1	Thermal Power Plant Energy Saving Analysis Based On Spark Big Data Platform
2	Research On Parallel Random Forest And Fuzzy C-Means Algorithm For Imbalanced Data
3	Research On Spark Data Skewing Improvement And Decision Tree Parallelization Application Under Cloud Edge Collaboration
4	Research And Application Of Decision Tree Algorithm In The Classification Of Bank Personal Credit Users
5	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
6	The Research On Classification And Regression Tree's Parallelization Based On Spark Platform
7	The Application Of Data Mining In Smart Phone Sales Data
8	Research On Code Plagiarism Detection Model Based On Random Forest And Gradient Boosting Decision Tree
9	Based On Decision Tree Incremental Learning Imaging Target Classification Technology Research
10	Research On Data Mining Technology Based On Spark