
Research And Implementation Of Machine Learning Application Framework On Spark

Posted on: 2016-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: K Sun
Full Text: PDF
GTID: 2308330476453503
Subject: Software engineering
Abstract/Summary:
Clustering and classification are among the most important fields of machine learning. The K-means algorithm is one of the most commonly used clustering algorithms, and the Random Forests (RF) algorithm is one of the most commonly used classification algorithms. However, both have restrictions and weaknesses. The parameter K of K-means, the number of clusters, must be set by the user. This is hard for ordinary users, and a value chosen by an inexperienced user is of doubtful accuracy. RF gives every decision tree the same weight when voting on a classification, so a few poor decision trees can degrade the performance of the whole algorithm. Real-world datasets also bring problems of their own. A dataset with too many outliers may increase the number of iterations and the complexity of K-means and decrease its accuracy. When RF is used on datasets with noisy or redundant features, its accuracy may decrease and its error rate may increase. These problems make it harder for users to adopt K-means and RF.

On the other hand, machine learning frameworks based on distributed computing have been widely applied. However, because of the restrictions and weaknesses of the algorithms themselves, existing frameworks require the user to have sufficient knowledge of machine learning algorithms, which raises the barrier to entry.

To address these problems, this thesis is set against the background of our laboratory's project to build a transportation and logistics cloud computing platform for a province. The K-means clustering algorithm, the RF classification algorithm, and their restrictions and weaknesses are studied, and two improved algorithms are proposed. A machine learning application framework based on Spark is designed and implemented.
This framework has the following features: adaptive data preprocessing, adaptive algorithm optimization, and adaptive parameter selection. The user need not be concerned with the underlying details of the algorithms when using it. Finally, this thesis evaluates the framework with an example application from the transportation and logistics field.

Compared with other similar systems, the work in this thesis has the following characteristics:

First, an adaptive K-means algorithm (AKM) is proposed to address three problems of K-means: inconsistent feature weights, outlier interference, and selection of the parameter K. Experiments show that AKM can standardize the analyzed datasets, recognize outliers, and compute the parameter K automatically.

Second, an adaptive RF algorithm (ARF) is proposed to address three problems of RF: interference from noisy features, interference from redundant features, and the voting strategy. Experiments show that ARF can eliminate noisy features, eliminate redundant features, and select a voting strategy suited to the problem at hand.

Third, an adaptive machine learning application framework based on Spark (AMLF) is designed and implemented on top of AKM and ARF. AMLF provides unified data access interfaces, import and export of machine learning models, and statistics and feedback on machine learning models. The example application shows that the user need not be concerned with the details of machine learning algorithms when developing applications with AMLF, which lowers the barrier to entry.
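The abstract does not say which voting strategies ARF chooses between, so the following is only a minimal sketch of the general idea of treating decision trees unequally when they vote. It assumes each tree has a weight, for example its accuracy on held-out data; the function names, the weights, and the labels are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def majority_vote(predictions):
    """Plain RF-style vote: every tree counts equally."""
    counts = defaultdict(float)
    for label in predictions:
        counts[label] += 1.0
    return max(counts, key=counts.get)


def weighted_vote(predictions, weights):
    """Vote in which each tree contributes its weight instead of 1.

    The weights are an assumption of this sketch (e.g. per-tree
    validation accuracy), not the strategy described in the thesis.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical forest of five trees: three weak trees outvote two
# strong ones under equal voting, but not under weighted voting.
tree_predictions = ["late", "late", "late", "on_time", "on_time"]
tree_accuracies = [0.52, 0.55, 0.51, 0.93, 0.90]

print(majority_vote(tree_predictions))                   # late
print(weighted_vote(tree_predictions, tree_accuracies))  # on_time
```

With equal votes the three weak trees win 3 to 2. With weights, "late" scores 1.58 and "on_time" scores 1.83, so the two more reliable trees decide the result.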
Keywords/Search Tags:Machine learning, K-means algorithm, Random Forests algorithm, Spark