Parallel Research And Application Of Machine Learning Algorithm Based On Cloud Platform

Posted on: 2017-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: F F Yuan
Full Text: PDF
GTID: 2348330485971364
Subject: Software engineering
Abstract/Summary:
In the information age, data has become one of the most valuable resources, and data in every field is growing exponentially, including business data from e-commerce sites and banks, genomic data, and so on. Existing platforms struggle to process data that grows this explosively. At present, the Hadoop platform is a relatively efficient parallel technology for mining useful information from large datasets. It uses the MapReduce (MR) programming framework, and its advantage becomes more pronounced as the amount of data increases. Mahout is an open-source machine learning (ML) algorithm library from Apache. Built on the Hadoop platform and the MR computing framework, Mahout provides efficient algorithm implementations for application developers. However, most ML algorithms are iterative, while MR stores intermediate data on HDFS, which consumes substantial resources. The Spark computing framework emerged to address this shortcoming. Spark is based on the resilient distributed dataset (RDD), an abstraction of distributed memory that reduces both I/O cost and the cost of fault tolerance. Spark can also run on Hadoop YARN and work with distributed data storage. With the emergence of Spark MLlib, the parallelization of machine learning algorithms has improved qualitatively.

In this thesis, we study algorithms based on Spark MLlib, namely the K-means clustering algorithm, the decision tree classification algorithm, and its ensemble extension, random forest, to handle genomic data that a single machine cannot process. In the first step of data processing, the K-means algorithm is used to find the best number of categories. In the second step, the random forest classification algorithm is applied to train a model that predicts labels for subsequent data based on the existing classes.
The algorithms studied in this thesis are applied mainly to the analysis of genomic data, but are not limited to it. ML algorithms built on a cloud platform and the Spark framework scale well. Experimental results show that Spark-based ML algorithms can effectively improve the analysis of large genomic datasets, which should help advance scientific research on genomic data.
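The first step described above, using K-means to find the best number of categories, can be illustrated with a small single-machine sketch. This is not the thesis's Spark MLlib code. It assumes plain Python, a deterministic farthest-first initialization (a simplification of k-means++-style seeding), and the mean silhouette score as the selection criterion; the abstract does not say which criterion was actually used. All function names and the toy data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (assumption): start at points[0], then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c])) for p in points]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def silhouette(points, labels, k):
    """Mean silhouette score; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        dist_sum, count = [0.0] * k, [0] * k
        for j, q in enumerate(points):
            if i != j:
                dist_sum[labels[j]] += sq_dist(p, q) ** 0.5
                count[labels[j]] += 1
        own = labels[i]
        others = [dist_sum[c] / count[c] for c in range(k) if c != own and count[c] > 0]
        if count[own] == 0 or not others:
            scores.append(0.0)
            continue
        a = dist_sum[own] / count[own]  # mean distance within own cluster
        b = min(others)                 # mean distance to the nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Run K-means for each candidate k and return the k with the best score."""
    scores = {}
    for k in candidates:
        _, labels = kmeans(points, k)
        scores[k] = silhouette(points, labels, k)
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three small, well-separated groups of 2-D points.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (12, 0), (5, 9)] for dx, dy in offsets]
best_k, scores = choose_k(points, range(2, 6))
print(best_k)
```

On this toy data the three groups are far apart, so the sketch selects three clusters. The cluster labels found this way would then serve as the existing classes for the second step, where a classifier such as a random forest is trained; that step is not shown here.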
Keywords/Search Tags: Cloud Computing, Spark, K-means, Decision Tree, Random Forest, Data Mining