Research On Stream Data Classification Algorithm Based On STORM

Posted on:2017-03-07

Degree:Master

Type:Thesis

Country:China

Candidate:F Y Zhang

Full Text:PDF

GTID:2308330488497130

Subject:Computer software and theory

Abstract/Summary:

With the rapid growth and wide application of network and sensor technology, an increasing volume of stream data is being generated, and data mining techniques aimed at stream data are gaining ground accordingly. These techniques extract valuable information from large-scale, fast, and heterogeneous data sources. This thesis studies classification algorithms for stream data mining, with the goal of improving both the efficiency and the accuracy of classification. It covers improvements to the algorithms themselves as well as their distributed, parallel implementation on the stream processing platform Storm.
To improve the time efficiency of classifying real-time online stream data, the VFDT (Very Fast Decision Tree) algorithm is deployed on Storm, and a scheme for a distributed parallel implementation of VFDT on Storm is designed. The function of each module is realized through the design of the Spouts and Bolts of the Storm topology, and the classification module is parallelized by assigning multiple tasks to the Classification Bolt. The in-memory database Redis connects the three modules and stores the decision tree. The message middleware Kafka improves tolerance of bursts in the incoming stream.
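For readers unfamiliar with VFDT, its core step is deciding when a leaf has seen enough samples to split, which it does with the Hoeffding bound. The abstract does not give the thesis's parameter choices, so the following is only a minimal sketch of the standard test; the function names, the default tie threshold, and the sample values are assumptions made for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for n observations of a variable with the
    given range, holding with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n, tie_threshold=0.05):
    """Standard VFDT split test (names and defaults are illustrative).

    Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon is so small that the two are effectively tied.
    """
    # Information gain is bounded by log2(number of classes).
    value_range = math.log2(n_classes)
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # Hypothetical gains observed at a leaf after 1000 samples.
    print(should_split(0.30, 0.10, n_classes=2, delta=1e-7, n=1000))
    print(should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=1000))
```

Because the test needs only per-leaf counts rather than stored samples, it fits a setting where the tree is kept in a shared store and read by several classification tasks.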
Implementation and testing of VFDT under the proposed scheme show that its classification efficiency on a Storm cluster is significantly higher than on a single machine, and that efficiency can be improved further by choosing a suitable number of tasks for the Classification Bolt.
For high-dimensional data sets, to further improve the time efficiency of building the classification model for online stream data, VFDT is parallelized vertically, yielding VPVFDT (Vertical Parallelism Very Fast Decision Tree). In this algorithm, the information gain of the attributes is computed in parallel, which speeds up sample processing. VPVFDT is then deployed on Storm, which further improves the time efficiency of the algorithm and also enhances its scalability. Experiments under the proposed scheme show that VPVFDT improves the processing efficiency of high-dimensional training samples to a certain extent on a Storm cluster.
To improve the classification accuracy of VFDT, the Random Forest algorithm is integrated into VFDT's tree-building process, yielding a Random Forest Based Very Fast Decision Tree algorithm named RFVFDT. RFVFDT adopts the tree-building criterion of the Random Forest classifier and extends Random Forest with a sliding window to cope with the unbounded nature of data streams and to avoid processing delay and data loss.
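The vertical parallelism described for VPVFDT, in which the information gain of each attribute is computed independently, can be sketched as follows. This is not the thesis's Storm implementation: a thread pool stands in for the parallel Storm tasks, attributes are assumed to be categorical, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one categorical attribute column."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gains_in_parallel(rows, labels, workers=4):
    """Compute the gain of every attribute, one work item per attribute."""
    columns = list(zip(*rows))  # vertical split: one column per attribute
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda col: information_gain(col, labels), columns))


if __name__ == "__main__":
    rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    labels = [0, 0, 1, 1]
    print(gains_in_parallel(rows, labels))
```

In CPython, threads do not speed up CPU-bound work like this; the pool only illustrates how the computation partitions by attribute, which is the property that lets each attribute be handled by a separate task on a cluster.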
Simulation experiments on the Storm platform show that RFVFDT has advantages in classification accuracy and scalability.
The schemes and algorithms studied in this thesis suit the real-time, fast, unbounded, and continuous nature of large-scale stream data. The research is relatively advanced, and its results have both theoretical value and practical utility. They can be applied to e-commerce, the Internet, and other stream data scenarios.

Keywords/Search Tags:

stream data, classification, Very Fast Decision Tree, distributed parallelization, random forest

Related items

1	Research On Imbalanced Data Classification Algorithm Based On Random Forest And Its Parallelization
2	Research On Decision Tree Classification Algorithm Parallelization Based On Big Data Platform
3	Research And Application Of High Dimensional Imbalanced Data Classification Based On Random Forest
4	Research On Multi-specification Cargo Loading Based On Improved Random Forest Algorithm
5	Research On Efficient Parallelization Of Improved Random Forest Algorithm Based On Spark Platform
6	Parallel Ordinal Decision Tree And Decision Forest Based On MapReduce
7	Research On Parallel Text Categorization Of Random Forest
8	Research And Application Of Decision Tree Algorithm In The Classification Of Bank Personal Credit Users
9	Application Of Machine Learning Algorithms In Total Housing And Classification Statistics
10	The Improved Random Forests Based On The Imbalanced Data Classification