Research On Classification Methods For Textual Data Stream Based On Clustering Forest

Posted on: 2015-05-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Song
GTID: 1108330479478725
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of internet technologies, modern applications produce large-scale data streams, especially textual data streams. Traditional classification methods handle textual streams poorly, because a textual stream is characterized by a high-dimensional feature space, a large number of instances, and concept drift. Textual stream mining has therefore attracted broad attention in the research community over the past few years. This dissertation proposes four kinds of ensemble-learning algorithms to solve research problems in textual stream classification.

In summary, a textual stream has the following five features. The first is concept drift. The second is that a textual stream usually contains a large number of instances in a high-dimensional feature space. Third, since it is hard to label all data manually in time, a textual stream includes a large number of unlabeled instances. Fourth, many textual streams have skewed class distributions. Finally, in many real applications, each instance should be assigned more than one label (a label set), and the large number of possible label sets increases the complexity of classification algorithms.

To deal with the classification problems raised by these features, this dissertation proposes four kinds of textual stream classification methods. The main research work is described as follows.

First, to handle large-scale textual streams with a high-dimensional feature space in a concept-drifting environment, Dynamic Clustering Forest (DCF), based on ensemble learning, is proposed. DCF includes two new strategies: an adaptive ensemble strategy and a voting strategy. The adaptive ensemble strategy selects clustering trees flexibly according to their accuracy weight.
To take into account both the historical data and the latest data, the voting strategy combines a credibility weight and an accuracy weight. Moreover, we conduct a theoretical analysis of the performance of DCF. The effectiveness of DCF was examined on eight synthetic and real-world datasets. Experimental results demonstrate that DCF performs better with respect to the average accuracy and the plotting accuracy.

Second, for textual streams with partially labeled instances, a new semi-supervised clustering forest (called CCEM-PL) is presented. In CCEM-PL, a new semi-supervised clustering tree (SCT) is proposed as the sub-classifier. Through the real nodes and virtual nodes generated in an SCT, unlabeled training instances not only help to delineate class boundaries but also reflect the distribution of the current concept. A new real accuracy weight and a similarity weight are defined according to the structure of the SCT. The final prediction is obtained by combining the SCTs using these two kinds of weights. Experimental results on four textual streams demonstrate the effectiveness of CCEM-PL.

Third, to classify textual streams in an imbalanced environment, we propose a new ensemble framework, Clustering Forest for classifying the Imbalanced textual stream with concept drift (CFIM). To handle the drift of rare classes, a new dynamic resampling strategy is designed in CFIM. In this strategy, instances in the rare-class subset and the misclassified subset are sampled from the historical chunks whose distributions are similar to that of the latest chunk. This strategy not only balances the number of instances between rare classes and majority classes, but also strengthens the training on misclassified instances.
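The dynamic resampling idea can be illustrated with a minimal sketch. The abstract does not say how the similarity between chunk distributions is measured, so the cosine similarity between mean term-frequency vectors used here is an assumption, as are the hypothetical names `centroid`, `cosine`, `resample`, and the `min_sim` threshold. Each chunk is assumed to be a list of `(tokens, label)` pairs.

```python
import math
from collections import Counter


def centroid(chunk):
    """Mean term-frequency vector of a chunk of (tokens, label) pairs."""
    total = Counter()
    for tokens, _ in chunk:
        total.update(tokens)
    n = max(len(chunk), 1)
    return {term: count / n for term, count in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def resample(latest, history, rare_labels, misclassified, min_sim=0.5):
    """Augment the latest chunk with rare-class and misclassified instances
    taken from historical chunks that resemble it.

    `misclassified` is a set of (chunk_index, position) pairs marking
    historical instances that the ensemble previously got wrong.
    The similarity measure and threshold are assumptions, not the
    dissertation's definitions.
    """
    target = centroid(latest)
    extra = []
    for idx, chunk in enumerate(history):
        if cosine(centroid(chunk), target) < min_sim:
            continue  # skip chunks whose distribution differs too much
        for pos, (tokens, label) in enumerate(chunk):
            if label in rare_labels or (idx, pos) in misclassified:
                extra.append((tokens, label))
    return latest + extra
```

Filtering historical chunks by similarity before sampling keeps instances from an outdated concept out of the training set, while the retained rare-class and misclassified instances rebalance the data.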
Experimental results on five imbalanced textual streams show that CFIM outperforms traditional textual stream classification methods.

Finally, to address the challenge of classifying multi-label streams in an evolving environment, we propose a new Multi-Label Dynamic Ensemble (MLDE) approach. MLDE integrates a number of multi-label cluster-based classifiers. After suitable multi-label cluster-based classifiers are selected, the final prediction is obtained in the voting strategy using a subset accuracy weight and a subset credibility weight. Experimental results reveal that MLDE achieves better performance than four other state-of-the-art multi-label stream classification algorithms.
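The select-then-vote pattern shared by DCF and MLDE can be sketched as follows. The abstract gives no formulas, so this is only an illustration under stated assumptions: members are ranked by accuracy weight and the top `k` are kept, and each member's vote is scaled by the product of its two weights. The names `Member`, `select_members`, and `weighted_vote` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable, List


@dataclass
class Member:
    """One ensemble member with the two weights named in the abstract."""
    predict: Callable[[object], Hashable]
    accuracy_weight: float
    credibility_weight: float


def select_members(members: List[Member], k: int) -> List[Member]:
    """Keep the k members with the highest accuracy weight (assumed rule)."""
    ranked = sorted(members, key=lambda m: m.accuracy_weight, reverse=True)
    return ranked[:k]


def weighted_vote(members: List[Member], instance: object) -> Hashable:
    """Return the label with the largest total weighted vote.

    Combining the two weights by multiplication is an assumption.
    """
    if not members:
        raise ValueError("at least one ensemble member is required")
    scores = defaultdict(float)
    for m in members:
        scores[m.predict(instance)] += m.accuracy_weight * m.credibility_weight
    return max(scores, key=scores.get)
```

For a multi-label setting such as MLDE, `predict` could return a hashable label set (for example a `frozenset`), so that votes are cast for whole label sets.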
Keywords/Search Tags: Textual stream classification, Concept drift, Imbalanced textual stream, Semi-supervised learning, Ensemble learning