
Research On Decision Tree Classification Algorithm Parallelization Based On Big Data Platform

Posted on: 2018-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Zhang
Full Text: PDF
GTID: 2348330536479922
Subject: Software engineering
Abstract/Summary:
With the rapid development of databases, the Internet of Things, and other information technologies, industries such as telecom operators, securities, banking, and Internet terminals generate more and more data. As data volumes grow explosively, extracting more value from big data can no longer be ignored, and processing massive data has become urgent. The main features of big data are volume, velocity, veracity, and variety. In the early stage of big data technology, experts in China and abroad focused mainly on processing massive data and handling diverse data types. In the current Internet era, however, big data mostly arises from financial stock markets, operator network traffic, real-time website requests, and traffic data streams, and it is mostly transmitted as high-speed data streams. Unlike static data stored in a traditional database, streaming data is a new form of data that places stricter demands on the speed and accuracy of analysis. Analyzing and processing streaming data requires recording information from the real-time stream quickly while keeping that information timely and accurate.

This thesis investigates and analyzes these problems in depth and studies the features and advantages of streaming data processing platforms and methods. It then proposes a concept drift detection algorithm for data streams and a parallel decision tree classification algorithm for big data environments, both aimed mainly at detecting and handling hidden concept drift in non-stationary data streams. Building on the proposed P-HT parallel decision tree classification algorithm, the thesis designs a parallel modeling algorithm for streaming data on a distributed stream processing platform, together with a real-time classification evaluation framework. First, the thesis makes incremental improvements to traditional classification algorithms to adapt them to the demands of streaming data processing. Second, based on the characteristics of streaming data, it proposes the ADDS concept drift detection algorithm and the Storm-based parallel P-HT decision tree classification algorithm. Finally, the two algorithms are analyzed separately. The experimental results show that the ADDS algorithm achieves better concept drift detection, and that the P-HT decision tree classification algorithm achieves higher efficiency and better resistance to concept drift.
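For readers unfamiliar with concept drift detection, the following is a minimal generic sketch and not the ADDS algorithm, whose details the abstract does not give. It assumes a classifier that emits a 0/1 error per sample, and it flags drift when the error rate in a recent window exceeds the rate in a reference window by more than a Hoeffding-style margin. The class name `WindowDriftDetector`, the helper `first_drift_index`, and the parameters `window` and `delta` are all hypothetical.

```python
import math
from collections import deque


class WindowDriftDetector:
    """Generic two-window drift check (illustrative only, not ADDS)."""

    def __init__(self, window=50, delta=0.01):
        self.window = window
        # Hoeffding-style deviation for the mean of `window` values in [0, 1];
        # doubled because two window means are being compared (an assumption
        # made for this sketch, not a tuned threshold).
        self.margin = 2 * math.sqrt(math.log(1 / delta) / (2 * window))
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)

    def update(self, error):
        """Feed one 0/1 classification error; return True if drift is flagged."""
        if len(self.reference) < self.window:
            self.reference.append(error)  # still filling the reference window
            return False
        self.recent.append(error)  # the oldest recent value drops out when full
        if len(self.recent) < self.window:
            return False
        ref_rate = sum(self.reference) / self.window
        rec_rate = sum(self.recent) / self.window
        if rec_rate - ref_rate > self.margin:
            # Treat the recent window as the new reference concept.
            self.reference = deque(self.recent, maxlen=self.window)
            self.recent.clear()
            return True
        return False


def first_drift_index(errors, window=50, delta=0.01):
    """Return the 0-based index of the first flagged drift, or None."""
    detector = WindowDriftDetector(window, delta)
    for i, e in enumerate(errors):
        if detector.update(e):
            return i
    return None


if __name__ == "__main__":
    # Error-free stream followed by an abrupt change to constant errors.
    stream = [0] * 100 + [1] * 50
    print(first_drift_index(stream))
```

In a streaming classifier, a flagged drift would typically trigger rebuilding or replacing the part of the model trained on the old concept. How P-HT reacts to drift is not described in this abstract.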
Keywords/Search Tags:big data, stream computing, classification algorithms, Storm, P-HT