
The Study Of Classification Algorithm In Data Mining

Posted on: 2018-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Wang
Full Text: PDF
GTID: 2348330518996843
Subject: Electronics and Communications Engineering
Abstract/Summary:
Data mining is a method of extracting knowledge from large amounts of data. With the global spread of information technology and the growth of the Internet and the Internet of Things, people are surrounded by vast amounts of data. Without powerful analysis tools, understanding and using such large and rapidly growing data is very difficult, so the era of big data places higher demands on data mining technology.

In the context of big data, classification algorithms in data mining still face two problems. First, although collecting examples is easy, labeling them is time-consuming and laborious, so it is difficult to obtain enough labeled examples to learn a model with high generalization ability. Second, huge data sets cannot be loaded into memory, and in the traditional serial mode the response time for training and classifying data is unacceptable.

This thesis studies classification algorithms in data mining. The completed work is as follows:

(1) The Co-training by Committee algorithm from semi-supervised learning is studied, and an improved Co-training by Committee algorithm with higher classification accuracy is proposed. To guarantee the correctness of the examples added to the labeled data set during iteration, the thesis proposes using all previous classifiers to predict the labels of unlabeled examples, and introduces a data editing method to estimate the labeling confidence. Simulation results show that, compared with the original Co-training by Committee algorithm, the proposed algorithm improves classification accuracy by about ten percentage points.

(2) A parallel implementation of the improved algorithm is designed and deployed on the Hadoop distributed computing platform. The steps of training classifiers and classifying examples can be parallelized. The corresponding MapReduce program is written and combined with an iterative framework so that the algorithm can run in parallel on a computer cluster. Simulation results on large-scale data sets show that the improved algorithm retains its advantage in classification accuracy, and the classification of real-world network traffic data on Hadoop verifies the practicality of the algorithm.

The classification approach studied in this thesis can make effective use of a large number of unlabeled examples to learn a model with good generalization performance. By making use of Hadoop, the improved algorithm can handle large-scale data and is both effective and practical.
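The labeling loop described in item (1) can be illustrated with a small, single-machine sketch. It is not the thesis's implementation: the abstract does not name the base learner, the committee construction, or the exact editing rule, so the nearest-centroid learner, the stratified bootstrap, the nearest-neighbour editing check, and every function name and parameter below (`co_train`, `min_share`, `k`, and so on) are assumptions chosen only to make the idea concrete. An unlabeled example is accepted when the vote over all classifiers trained so far is strong enough and the editing check agrees with the voted label.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_centroids(examples):
    """Assumed base learner: one mean vector per class."""
    sums, counts = {}, Counter()
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Label of the closest class centroid."""
    return min(centroids, key=lambda y: sq_dist(centroids[y], x))


def stratified_bootstrap(labeled, rng):
    """Resample with replacement inside each class, so no class is lost."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append((x, y))
    return [rng.choice(group) for group in by_class.values() for _ in group]


def committee_vote(classifiers, x):
    """Majority label and its vote share over every classifier kept so far."""
    votes = Counter(predict(c, x) for c in classifiers)
    label, n = votes.most_common(1)[0]
    return label, n / len(classifiers)


def passes_editing(x, label, labeled, k=3):
    """Assumed editing check: most of the k nearest labeled neighbours carry `label`."""
    nearest = sorted(labeled, key=lambda ex: sq_dist(ex[0], x))[:k]
    agree = sum(1 for _, y in nearest if y == label)
    return agree * 2 > len(nearest)


def co_train(labeled, unlabeled, rounds=3, committee_size=5, min_share=0.8, seed=0):
    """Grow the labeled set from the unlabeled pool; returns (labeled, classifiers)."""
    rng = random.Random(seed)
    labeled, pool, classifiers = list(labeled), list(unlabeled), []
    for _ in range(rounds):
        # Train a new committee and keep the earlier ones: all of them vote.
        for _ in range(committee_size):
            classifiers.append(train_centroids(stratified_bootstrap(labeled, rng)))
        if not pool:
            break
        accepted, remaining = [], []
        for x in pool:
            label, share = committee_vote(classifiers, x)
            if share >= min_share and passes_editing(x, label, labeled):
                accepted.append((x, label))
            else:
                remaining.append(x)
        labeled.extend(accepted)  # added only after the whole pool is scored
        pool = remaining
    return labeled, classifiers


# Toy usage with two well-separated clusters (hypothetical data).
seed_set = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
pool = [(0.1, 0.3), (4.8, 5.2), (0.3, 0.0), (5.2, 5.1)]
grown, committee = co_train(seed_set, pool)
print(len(grown), committee_vote(committee, (0.05, 0.1)))
```

Scoring each unlabeled example is independent of the others, which is the kind of step that item (2) describes as parallelizable; in a MapReduce setting the per-example vote would naturally sit in the map phase, though the abstract does not specify how the thesis divides the work.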
Keywords/Search Tags:data mining, semi-supervised learning, co-training by committee, hadoop