Research And Application Of Online Machine Learning Algorithm In Big Data Environment

Posted on: 2018-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: X W Zhu
Full Text: PDF
GTID: 2348330518999228
Subject: Control engineering
Abstract/Summary:
With the rapid development of information technology, the value of big data is attracting more and more attention, and analyzing big data efficiently has become an important topic. Machine learning is one of the common methods of data analysis, but traditional machine learning algorithms are usually designed for offline batch training and are difficult to apply in a big data environment, where the data set is massive and keeps growing. How to adapt current machine learning algorithms to the big data environment has therefore become a research hotspot. The main content of this thesis can be summarized as follows.

First, the thesis introduces and analyzes the most widely used technologies in two fields: big data and machine learning. For big data technology, it introduces the Hadoop ecosystem and analyzes the principles of HDFS, MapReduce and YARN; it introduces Spark's BDAS environment and analyzes the principle of RDD; and it briefly summarizes three commonly used big data tools: Flink, Storm and TensorFlow. For machine learning technology, it discusses the principles of supervised learning algorithms (logistic regression, support vector machine), unsupervised learning algorithms (K-means) and reinforcement learning (Q-learning), and summarizes the generalization ability of machine learning and the evaluation indices of classifiers. This part provides the theoretical basis for the transformation.

Second, the transformation of machine learning algorithms is studied. For converting traditional algorithms to online form, it is pointed out that the key lies in reforming the training method. In supervised learning, the gradient descent method and the Newton method are adapted for online use, yielding the mini-batch gradient descent method and the online BFGS method.
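
As a rough illustration of mini-batch gradient descent applied to logistic regression, the following sketch consumes a stream of samples once and updates the weights after every mini-batch. It is not taken from the thesis: the function names, the batch size and the learning rate are all hypothetical choices made for this example.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def minibatch_step(w, batch, lr):
    """Apply one gradient step of the logistic loss averaged over one mini-batch.

    w     -- list of weights; the last entry is the bias
    batch -- list of (features, label) pairs with label in {0, 1}
    lr    -- learning rate (hypothetical value chosen by the caller)
    """
    grad = [0.0] * len(w)
    for x, y in batch:
        xb = list(x) + [1.0]  # append a constant 1.0 for the bias term
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, xb))) - y
        for j, xj in enumerate(xb):
            grad[j] += err * xj
    return [wi - lr * gj / len(batch) for wi, gj in zip(w, grad)]


def train_online(stream, n_features, batch_size=16, lr=0.5):
    """Read an iterable of samples once, updating after each full mini-batch."""
    w = [0.0] * (n_features + 1)
    batch = []
    for sample in stream:
        batch.append(sample)
        if len(batch) == batch_size:
            w = minibatch_step(w, batch, lr)
            batch = []
    if batch:  # flush the final, possibly smaller, mini-batch
        w = minibatch_step(w, batch, lr)
    return w


def predict(w, x):
    xb = list(x) + [1.0]
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, xb))) >= 0.5 else 0
```

Because `train_online` never stores more than one mini-batch, the same loop can keep running as new samples arrive, which is the property that makes the method suitable for a continuously growing data set.
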
For the transformation to the big data environment, a parameter-updating method based on partitioning along the sample dimension and a feature-partitioning method based on the feature dimension are derived, and the two partitioning methods are compared. This part completes the theoretical analysis of the online and parallel transformation of machine learning algorithms.

Third, a big data experimental platform is designed and built. Three schemes for the platform are given and their advantages and disadvantages are compared, and a scheme using Hadoop Streaming and Python is selected. For the implementation, the detailed process of building the Hadoop platform in a virtual machine and the development tools used on the platform are given. The core algorithm is described, and its pseudo-code and key functions are defined. The system logic is implemented, and solutions to the continuous training problem are proposed. This part prepares for the algorithm experiments that follow.

Finally, online machine learning is applied to the classification of power quality. A brief introduction to the power quality problem is given. After an analysis of how power quality data can be obtained, the method and process for generating the power quality data are described, and the approach and steps for extracting features from the sample data are explained. The classification performance of two traditional machine learning algorithms, logistic regression and the support vector machine, is then analyzed; the conclusion is that different algorithms reach consistent classification performance when enough data is available. The influence of two hyperparameters on the training process of the online support vector machine is analyzed; the conclusion is that the hyperparameters of the training method have little influence on the training process when the amount of data is large.
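
To make the idea of an online support vector machine concrete, here is a minimal sketch of a linear SVM trained by stochastic gradient descent on the L2-regularized hinge loss, one sample at a time. The abstract does not name the two hyperparameters it studies; the learning rate `lr` and regularization strength `lam` below are assumptions used purely for illustration, as are the function names.

```python
def svm_online_update(w, b, x, y, lr, lam):
    """One SGD step on the regularized hinge loss for a sample (x, y), y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Shrink the weights: gradient of the (lam / 2) * ||w||^2 penalty.
    w = [wi * (1.0 - lr * lam) for wi in w]
    if margin < 1.0:
        # The hinge loss is active, so move toward a larger margin on this sample.
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b


def train_online_svm(stream, n_features, lr=0.05, lam=0.01):
    """Single pass over an iterable of (features, label) pairs."""
    w, b = [0.0] * n_features, 0.0
    for x, y in stream:
        w, b = svm_online_update(w, b, x, y, lr, lam)
    return w, b


def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0 else -1
```

Each update touches only the current sample, so the cost per sample is constant and no kernel matrix or full data set has to be held in memory.
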
The training time and classification performance of the online and offline support vector machines are compared; the conclusion is that the online support vector machine classifies slightly worse than the offline one, but its training time is greatly reduced. The training time of the online support vector machine under different numbers of computing nodes is analyzed; the conclusion is that increasing the number of nodes reduces the training time. This part verifies that the online machine learning algorithm is more usable than the traditional machine learning algorithm for large-scale machine learning problems, and also verifies the feasibility of applying online machine learning to the power quality disturbance problem.
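
The sample-dimension partitioning and the Hadoop Streaming setup mentioned in the abstract can be sketched together as a map step that trains on one data split and a reduce step that combines the per-split models. The abstract does not say how the per-split parameters are combined; simple parameter averaging is assumed here, and the input format (`label,f1,f2,...`), the key name `model` and all function names are hypothetical. Hadoop Streaming itself only requires that mapper and reducer read lines from standard input and write tab-separated key/value lines to standard output.

```python
import json


def local_sgd(lines, lr=0.05, lam=0.01):
    """Train a linear SVM on one split; each line is 'label,f1,f2,...' with label -1 or 1."""
    w, b = None, 0.0
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        y, x = float(parts[0]), [float(v) for v in parts[1:]]
        if w is None:
            w = [0.0] * len(x)
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        w = [wi * (1.0 - lr * lam) for wi in w]
        if margin < 1.0:
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b


def mapper(lines):
    """Map side: train on the local split and emit one 'key<TAB>value' record."""
    w, b = local_sgd(lines)
    if w is not None:
        yield "model\t" + json.dumps({"w": w, "b": b})


def reducer(records):
    """Reduce side: average all models emitted under the key 'model'."""
    models = [json.loads(r.split("\t", 1)[1]) for r in records if r.strip()]
    if not models:
        return
    n = len(models)
    w = [sum(m["w"][j] for m in models) / n for j in range(len(models[0]["w"]))]
    b = sum(m["b"] for m in models) / n
    yield "model\t" + json.dumps({"w": w, "b": b})
```

In an actual streaming job, each function would be wrapped in a small script that passes `sys.stdin` in and prints every yielded record. Since each mapper sees only its own split, adding nodes shortens the per-node pass over the data, which is consistent with the reduced training time reported above.
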
Keywords/Search Tags: Big Data, Online Machine Learning, Gradient Descent, Data Partitioning, Power Quality Disturbance