Research On Data Stream Ensemble Classifiers

Posted on:2012-02-13

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X F Yang

Full Text:PDF

GTID:1118330368982909

Subject:Computer application technology

Abstract/Summary:

With the development and application of information technology, large volumes of high-speed, dynamic, continuous data can now be collected, such as sensor network data, telephone records, financial data, and commercial transaction data. The traditional static data set can no longer represent such information efficiently, so the data stream has been proposed as a new data type and is widely used in these fields. A data stream is an ordered sequence of data that arrives continuously and is potentially unbounded. Compared with a traditional static data set, a data stream has the following features: (1) data arrive at high speed; (2) the data are large in scale; (3) the data form an ordered sequence; (4) the data can change dynamically; (5) the data are often high-dimensional. Because of these features, traditional data mining classification algorithms cannot handle data streams effectively, and data stream mining algorithms have become one of the hot research topics in data mining.

This dissertation focuses on classifying data streams with ensemble classifiers. From the two aspects of training individual classifiers and integrating their classification results, it studies noisy data streams, high-speed data streams, and incompletely labeled data streams. The main work is as follows.

First, to address the problem that the classification accuracy of an ensemble classifier is seriously degraded when it is trained on a noisy data set, a cross-validation noise-tolerant ensemble classification algorithm for data streams is proposed. Cross-validation noise-tolerant classification is an important method for removing noise from a data set: it removes noise from the training data before the classifier is trained, so classification accuracy can increase significantly. However, its validity had not previously been proved in theory.
Based on the sample complexity theory of noisy data sets, this dissertation proves the validity of the algorithm. Building on that proof, a new cross-validation noise-tolerant classification algorithm for data streams is proposed, which further increases the classification accuracy of classifiers that handle noisy data sets.

Second, in a high-speed data stream the data rate can exceed the computational capacity of the ensemble classifier, so the ensemble cannot train on all of the data to update itself. To address this, an ensemble classifier based on biased sampling is proposed. Sampling effectively reduces the data scale and therefore the time needed to train and update the ensemble. However, ensembles trained on data sets produced by different sampling strategies differ markedly in classification performance. Therefore, using the bias-variance decomposition of the expected error, the contribution of each candidate instance to the expected error is computed. A geometric analysis of ensemble classification performance then proves that training the ensemble on instances with larger expected-error contributions yields higher classification accuracy. On this basis, an ensemble classification algorithm based on biased sampling is proposed.

Third, to address the problem that it is difficult to label all data in a data stream, a semi-supervised ensemble classification algorithm for data streams based on the cluster assumption is proposed. Although traditional semi-supervised classification algorithms can classify incompletely labeled data sets, how to apply them in a data stream environment, and how to exploit data stream characteristics to improve their accuracy, remain unsolved problems.
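The biased-sampling idea described in the paragraphs above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's method: the per-instance score used here (the fraction of current ensemble members that misclassify the instance) is only a stand-in for the expected-error contribution derived from the bias-variance decomposition, and the function names and the `floor` parameter are hypothetical. Instances are then drawn without replacement with probability weighted by that score, using log-transformed random keys.

```python
import math
import random


def error_scores(members, candidates):
    """Fraction of ensemble members that misclassify each (x, y) candidate.

    Assumption: this disagreement rate stands in for the expected-error
    contribution; the dissertation's actual formula is not given here.
    """
    return [sum(m(x) != y for m in members) / len(members)
            for x, y in candidates]


def biased_sample(candidates, scores, k, seed=0, floor=1e-9):
    """Weighted sampling without replacement, biased toward high scores."""
    rng = random.Random(seed)
    keyed = []
    for i, w in enumerate(scores):
        u = 1.0 - rng.random()                 # uniform in (0, 1]
        # log(u) / w orders items like u ** (1 / w); larger keys win.
        keyed.append((math.log(u) / max(w, floor), i))
    chosen = sorted(keyed, reverse=True)[:k]
    # Return the sampled instances in their original stream order.
    return [candidates[i] for _, i in sorted(chosen, key=lambda t: t[1])]


# Two toy ensemble members (threshold rules) and five candidate instances.
members = [lambda x: int(x > 0), lambda x: int(x > 1)]
candidates = [(-1.0, 0), (0.5, 1), (2.0, 1), (3.0, 0), (-2.0, 1)]
scores = error_scores(members, candidates)
sample = biased_sample(candidates, scores, k=3)
```

Only the sampled subset would be used to update the ensemble, which is how sampling reduces training time when the stream is faster than the learner.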
An analysis of the generalization ability of semi-supervised classifiers based on the cluster assumption indicates that increasing the amount of labeled data during training improves their accuracy. Using this conclusion, a semi-supervised ensemble classification algorithm for data streams based on the cluster assumption is proposed.

Finally, in a selective ensemble classifier, the individual classifiers to be used are fixed once training ends and cannot be adjusted dynamically for specific data. To address this problem, a two-phase selective ensemble classification algorithm for data streams is presented. The analysis indicates that the individual classifiers chosen by a selective ensemble algorithm may achieve the best classification performance over the whole data set, yet may not be the optimal combination for classifying a specific instance. Dynamically and adaptively choosing individual classifiers with the support vector data description (SVDD) algorithm can therefore effectively prevent this situation and improve the classification performance of the selective ensemble.
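The second, dynamic phase described above can be illustrated with a small sketch. It assumes the members have already been chosen by a static selective-ensemble step, and it replaces SVDD with a much cruder data description: a hypersphere given by the centroid of the validation points a member classified correctly and the largest distance to that centroid. All function names are hypothetical. For each query, only members whose sphere contains the query vote; if none does, all members vote.

```python
import math
from collections import Counter


def fit_sphere(points):
    """Crude stand-in for SVDD: centroid plus the largest distance to it."""
    dim = len(points[0])
    center = tuple(sum(p[j] for p in points) / len(points) for j in range(dim))
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def build_regions(members, validation):
    """Describe, per member, the region where it was correct on validation."""
    regions = []
    for member in members:
        correct = [x for x, y in validation if member(x) == y]
        regions.append(fit_sphere(correct) if correct else None)
    return regions


def dynamic_predict(members, regions, x):
    """Majority vote among members whose region contains x."""
    active = [m for m, r in zip(members, regions)
              if r is not None and math.dist(x, r[0]) <= r[1]]
    voters = active or members          # fall back to the full ensemble
    votes = Counter(m(x) for m in voters)
    return votes.most_common(1)[0][0]


# Three toy members: one always predicts 0, two always predict 1.
members = [lambda x: 0, lambda x: 1, lambda x: 1]
validation = [((0.0,), 0), ((1.0,), 0), ((9.0,), 1), ((10.0,), 1)]
regions = build_regions(members, validation)

near_zero = dynamic_predict(members, regions, (0.8,))
near_ten = dynamic_predict(members, regions, (9.6,))
```

A plain majority vote over all three toy members would label `(0.8,)` as class 1, whereas restricting the vote to the member whose region covers the query gives class 0. A real SVDD would fit a soft-margin, possibly kernelized boundary instead of the simple sphere used here.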

Keywords/Search Tags:

Data stream, Ensemble classifiers, Cross validation, Biased sample, Semi-supervised classification

Related items

1	Research On Semi-supervised Data Stream Classification Method Based On Ensemble Model
2	Semi-Supervised Ensemble For Classification Learning
3	The Research On Dynamic Ensemble Classifiers
4	Research On Semi-supervised Classification Algorithm For Data Stream With Concept Drift
5	Research On Semi-supervised Classification Of Data Stream Based On Adaptive Density Clustering
6	Research On Data Stream Classification Algorithm With Limited Amount Of Labeled Data
7	Semi-supervised Based On Multiple Classifiers Ensemble Model For Semantic Classification Of Teaching Evaluation
8	Research On Feature Selection And Semi-Supervised Classification
9	Research On Semi-supervised Classification Of Data Stream Based On Clustering
10	Research On Semi-Supervised Classification Algorithm For Concept Drift Data Streams Based On Model Reuse