With the rapid growth and wide application of network and sensor technologies, an increasing volume of stream data is being generated, and data mining techniques for stream data have gained importance accordingly. Stream data mining refers to data processing techniques that effectively extract valuable information from large-scale, fast, and heterogeneous data sources. This thesis studies classification algorithms for stream data mining, with the aim of improving the efficiency and accuracy of classification. Specifically, it covers both improvements to the algorithms themselves and distributed, parallel implementations of them on the stream processing platform Storm.

To improve the time efficiency of classifying real-time online stream data, the VFDT (Very Fast Decision Tree) algorithm is deployed on the stream computing platform Storm, and a scheme for a distributed parallel implementation of VFDT on Storm is designed. The functions of each module are implemented through the design of the Spouts and Bolts of the Storm Topology, and the classification module is parallelized by assigning multiple tasks to the Classification Bolt. The in-memory database Redis connects the three modules and stores the decision tree. The message middleware Kafka improves tolerance of bursts in the stream data.
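
For context, VFDT is generally described as deciding when to split a leaf by means of the Hoeffding bound: a leaf is split once the observed gap between the two best attributes exceeds a bound that shrinks as more samples arrive. The following is a minimal sketch of that test, not the implementation used in this thesis; the function names, the tie threshold, and the use of information gain with range log2 of the class count are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R) is the range of the split measure, delta is the allowed
    error probability, and n is the number of samples seen at the leaf.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, n_classes: int,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Decide whether a leaf has seen enough samples to split.

    Assumption: the split measure is information gain, whose range is
    log2(n_classes). tie_threshold is a hypothetical parameter that forces
    a split when the two best attributes are nearly indistinguishable.
    """
    value_range = math.log2(n_classes)
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon or epsilon < tie_threshold
```

With two classes, delta = 1e-7, and 1000 samples, the bound is roughly 0.09, so a gain gap of 0.2 would trigger a split while a gap of 0.02 would not.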
The results of implementing and testing VFDT under the proposed scheme show that its classification efficiency in a Storm cluster environment is significantly higher than in a single-machine environment, and that efficiency can be improved further by choosing an appropriate number of tasks for the Classification Bolt.

For high-dimensional data sets, to further improve the time efficiency of building the classification model for online stream data, VFDT is parallelized vertically, yielding the VPVFDT (Vertical Parallelism Very Fast Decision Tree) algorithm. In this algorithm, the information gain of the attributes is computed in parallel, which improves the efficiency of sample processing. On this basis, VPVFDT is deployed on the stream computing platform Storm, which both further improves the time efficiency of the algorithm and enhances its scalability. The experimental results for VPVFDT under the proposed scheme show that it improves the processing efficiency of high-dimensional training samples to a certain extent in a Storm cluster environment.

To improve the classification accuracy of VFDT, the Random Forest algorithm is integrated into the tree-building process of VFDT, yielding a Random Forest Based Very Fast Decision Tree algorithm named RFVFDT. RFVFDT adopts the tree-building criterion of the Random Forest classifier and extends Random Forest with a sliding window to cope with the unbounded nature of data streams and to avoid processing delay and data loss.
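
The vertical parallelism described above for VPVFDT, in which the information gain of each attribute is computed independently, can be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: attributes are assumed to be discrete, a local thread pool stands in for the Storm tasks that would perform the work in a cluster, and all function names are hypothetical.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one discrete attribute column."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the partitions induced by the attribute values.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def parallel_gains(rows, labels, workers=4):
    """Compute the gain of every attribute, one task per attribute.

    Assumption: a thread pool is used only to show the structure of the
    computation; it is a stand-in for distributed workers.
    """
    columns = list(zip(*rows))  # transpose: one tuple per attribute
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda col: information_gain(col, labels),
                             columns))
```

Because each attribute's gain depends only on its own column and the labels, the per-attribute tasks share no mutable state, which is what makes this step a natural candidate for distribution when the number of attributes is large.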
Simulation experiments on the Storm platform show that RFVFDT has advantages in classification accuracy and scalability.

The schemes and algorithms studied in this thesis are suited to the real-time, fast, unbounded, and continuous nature of large-scale stream data. The research is relatively advanced, and its results have both theoretical value and practical utility. They can be applied to e-commerce, the Internet, and other stream data application scenarios.