
Research And Application Of Streaming Data Integration Classification Method Based On Spark

Posted on: 2019-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: J G Zhang
Full Text: PDF
GTID: 2428330548986989
Subject: Software engineering
Abstract/Summary:
With the development of information technology and the mobile Internet, the volume of data in the world has increased rapidly. We have entered the era of big data, and the ways people live and work have changed radically. Traditional static data mining can no longer meet the needs of many applications, because data has shifted from small, static data sets to massive, dynamic ones. Streaming data is one of the most important forms of big data. It differs from static data in that it is time-varying, arrives in real time, and is massive in volume, all of which pose challenges for data mining algorithms. How to design a good classification algorithm around the characteristics and goals of streaming data has become an active research topic in academia.

This thesis studies ensemble classification of streaming data under concept drift, drawing on the characteristics of streaming data and recent machine learning theory. Building on a popular big data computing framework, it implements a streaming data ensemble classification method on Spark and then applies the algorithm to the classification of large-scale network traffic. Based on the basic characteristics of streaming data and the advantages of ensemble learning, the thesis makes the following contributions.

(1) A new ensemble classification model, the OC-WE algorithm, is proposed for streaming data classification. The method trains on blocks of the data stream, with a base classifier trained separately for each category of the data. The data blocks are updated in real time according to an optimization strategy in order to handle concept drift. Within the ensemble model, the base classifiers are updated by updating the weights of the classifier matrix, which improves classification accuracy and efficiency.

(2) To handle the massive, real-time nature of streaming data, the ensemble classification model is implemented on the existing big data processing platform Spark, with targeted adjustments that improve the parallelism of the algorithm.

(3) The parallel algorithm is applied to the classification of large-scale network traffic. An online solution for large-scale network traffic classification is proposed, covering data preprocessing and simple feature extraction, and the advantages of using the OC-WE streaming data ensemble classification algorithm for this problem are illustrated.

The main innovations of this thesis are as follows:

(1) When training classifiers, an optimized neighbor strategy is used to adapt to concept drift in the data stream.

(2) The ensemble classification model matrix uses a new update strategy and matrix arrangement to improve classification efficiency.

(3) Combined with a big data computing framework, the proposed algorithm is parallelized on Spark Streaming, which makes it suitable for big data application scenarios.

(4) For network traffic classification, an innovative online solution is proposed that covers data preprocessing and simple feature extraction.
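
The abstract does not give the details of OC-WE, so the following is only a minimal sketch of the general idea it describes: train base classifiers on successive data blocks and re-weight them as new blocks arrive, so that the ensemble can follow concept drift. Everything here is an assumption made for illustration, not the thesis's method: the names `CentroidClassifier` and `ChunkEnsemble`, the nearest-centroid base learner, the accuracy-based weights, and the `max_members` limit are all hypothetical. The sketch uses plain Python rather than Spark so that it stays self-contained.

```python
from collections import defaultdict


class CentroidClassifier:
    """Toy base classifier (assumed for illustration): one mean vector per class."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            sums[label] = [s + v for s, v in zip(sums[label], row)]
            counts[label] += 1
        # Class centroid = per-feature mean of the rows with that label.
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        # Predict the class whose centroid is closest to the row.
        return min(self.centroids, key=lambda c: sq_dist(self.centroids[c]))


class ChunkEnsemble:
    """Hypothetical chunk-based weighted ensemble, not the OC-WE algorithm itself."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []  # each entry is [classifier, weight]

    def predict_one(self, row):
        # Weighted vote over all current members.
        votes = defaultdict(float)
        for clf, weight in self.members:
            votes[clf.predict_one(row)] += weight
        return max(votes, key=votes.get) if votes else None

    def update(self, X, y):
        # Re-weight each existing member by its accuracy on the newest block,
        # so members trained on an outdated concept lose influence.
        for member in self.members:
            clf = member[0]
            correct = sum(clf.predict_one(r) == t for r, t in zip(X, y))
            member[1] = correct / len(y)
        # Train a new member on the newest block and give it full weight.
        self.members.append([CentroidClassifier().fit(X, y), 1.0])
        # Keep only the highest-weighted members.
        self.members.sort(key=lambda m: m[1], reverse=True)
        del self.members[self.max_members:]
```

In use, each incoming block of labeled records would be passed to `update`, and unlabeled records to `predict_one`. If a later block reverses the labels of an earlier one, the older member scores zero accuracy on it and stops contributing to the vote. In a Spark Streaming setting, each micro-batch could play the role of a block, but that mapping is likewise an assumption.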
Keywords/Search Tags: Streaming data, Classification, Ensemble learning, Spark