Realization Of Machine Learning Classification Algorithms In The Hadoop Development Environment

Posted on: 2019-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: B Hui
Full Text: PDF
GTID: 2428330572955850
Subject: Engineering
Abstract/Summary:
In the field of machine learning, classification algorithms are widely used in applications such as risk management, user profiling, and image recognition. The most common classification algorithms include the K nearest neighbor algorithm, the logistic regression algorithm, and the BP (Back Propagation) neural network algorithm. However, these three algorithms place heavy demands on memory, data transmission, and data storage when dealing with massive data, and general commercial computer equipment cannot meet the requirements of massive data processing and analysis. Hadoop is a distributed computing framework that is well suited to offline, non-real-time processing and analysis of massive data. Combining traditional machine learning classification algorithms with the MapReduce computing model on the Hadoop platform makes it possible to realize these algorithms in the Hadoop development environment, so that they can be applied to massive data processing and analysis scenarios. The HDFS (Hadoop Distributed File System) on the Hadoop platform provides a solution for massive data storage. This thesis aims to realize classification algorithms in the Hadoop development environment, and the main results of this research are as follows.

(1) To solve the problem of excessive intermediate data transmission in existing algorithms, the existing realization of the K nearest neighbor algorithm in the Hadoop development environment is optimized. In the Map phase, the traditional K nearest neighbor algorithm and the training instances on each Map node are used to obtain a label for each test instance. In the Reduce phase, majority voting is used to obtain the final label prediction for each test instance. To determine the hyperparameter K, different K values within a given range are tried in turn, and the K value that gives the best algorithm performance is selected. To determine the distance metric, which is also a hyperparameter, a variety of distance metrics are tested with a controlled-variable method, and the metric that gives the best generalization ability is selected.

(2) To solve the problem that existing algorithms can only output the model's parameters, the existing realization of the logistic regression algorithm in the Hadoop development environment is optimized. In the Map phase, the traditional logistic regression algorithm and the training instances on each Map node are used to obtain a base classifier, which is then used to predict the labels of the test instances. In the Reduce phase, the outputs of the base classifiers from the Map nodes are averaged to obtain the final label prediction for the test instances.

(3) To improve the operating efficiency of existing algorithms, the existing realization of the BP neural network algorithm in the Hadoop development environment is optimized. When training the local network in the Map phase, the improved algorithm uses an output error threshold and a maximum number of iterations as its termination conditions. In the Reduce phase, a maximum number of iterations is used as the termination condition to control the global iteration. The final network model parameters are obtained by repeatedly averaging the network model parameters from the Map nodes.

(4) This thesis summarizes and analyzes the existing realizations of classification algorithms in the Hadoop development environment. Multiple data sets are used to compare similar existing algorithms with the realizations of the K nearest neighbor algorithm, the logistic regression algorithm, and the BP neural network algorithm proposed in this thesis. All algorithms are analyzed in terms of running time, generalization ability, speedup, and other measures. The experimental results show that the algorithms proposed in this thesis have good generalization ability and operating efficiency and can be used for prediction on massive data.
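
The Map and Reduce roles described in items (1) and (2) can be illustrated with a small single-process sketch. This is not the thesis's Hadoop implementation: it is plain Python standing in for MapReduce, and the function names (`knn_map`, `vote_reduce`, `average_reduce`, `group_by_key`), the toy data, and the assumption that each logistic regression base classifier outputs a class-1 probability are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_map(partition, test_points, k, distance=euclidean):
    """Map phase: run ordinary KNN against one node's training partition.

    partition is a list of (features, label) pairs. Yields one
    (test_index, local_label) pair per test instance.
    """
    for idx, point in enumerate(test_points):
        nearest = sorted(partition, key=lambda inst: distance(inst[0], point))[:k]
        local_label = Counter(label for _, label in nearest).most_common(1)[0][0]
        yield idx, local_label


def group_by_key(pairs):
    """Stand-in for the shuffle step: collect all values per test index."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def vote_reduce(pairs):
    """Reduce phase for KNN: majority vote over the per-node labels."""
    return {idx: Counter(labels).most_common(1)[0][0]
            for idx, labels in group_by_key(pairs).items()}


def average_reduce(pairs, threshold=0.5):
    """Reduce phase for logistic regression: average the per-node outputs.

    Assumption (not stated in the abstract): each output is a probability
    for class 1, and the averaged value is thresholded to get a label.
    """
    return {idx: int(sum(probs) / len(probs) >= threshold)
            for idx, probs in group_by_key(pairs).items()}


# Toy data: three "Map nodes", each holding part of the training set.
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")],
    [((0.3, 0.3), "a"), ((5.2, 5.1), "b"), ((0.1, 0.1), "a")],
]
test_points = [(0.0, 0.1), (5.0, 5.1)]

# Each node emits only one label per test instance, not its neighbor list.
local = [pair for part in partitions for pair in knn_map(part, test_points, k=1)]
print(vote_reduce(local))
```

Because each Map node emits a single label per test instance rather than its candidate neighbors, the data passed between the phases stays small, which matches the motivation given in item (1). Trying several values of `k` or swapping the `distance` argument in a loop would correspond to the hyperparameter search the abstract describes.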
Keywords/Search Tags: Hadoop, Machine Learning Classification Algorithm, K Nearest Neighbor Algorithm, Logistic Regression Algorithm, BP Neural Network Algorithm