Research On Text Stream Classification By Keywords

Posted on:2012-02-18

Degree:Master

Type:Thesis

Country:China

Candidate:B G Yang

GTID:2218330344951700

Subject:Computer software and theory

Abstract/Summary:

Traditional data stream classification usually requires a large number of fully labeled training examples to build classifiers, which is expensive and time-consuming. In practice, however, most data streams are unlabeled, which makes traditional data stream methods impractical. To address this problem, semi-supervised data stream classification has attracted growing attention in recent years. Some researchers have proposed using partly labeled examples, or only a small number of positive examples together with a large number of unlabeled examples, for data stream classification. Although these approaches reduce the cost of manual labeling, they still require users to label some samples.

To further reduce the burden of manual labeling, this paper proposes a novel approach for text data stream classification that uses keywords to classify text streams without manual labeling. First, the base classifier is built from keywords and unlabeled documents; then the documents in the text stream are classified by an ensemble-based algorithm. In the classifier construction phase, keywords are semantically expanded and then used to label the initial positive documents. In the classification phase, the final label of an unknown document is predicted by the weighted majority voting algorithm.

Concept drift in the text stream is also studied in depth. This work mainly explores concept drift caused by changes in the user's interests: the keywords provided by the user determine the user's current interests and the target concepts, so when the user's interests change, concept drift occurs as well. Two common concept drift scenarios are simulated, namely gradual concept drift and abrupt concept shift, and a comparative analysis is conducted between the concept drift scenarios and the non-drift scenario.

Experimental results demonstrate that the proposed method can build an effective classifier from keywords without using any manually labeled examples, and that it achieves results comparable to the PU learning method, which builds classifiers from labeled positive and unlabeled documents. Moreover, the classifier ensemble method used in this paper can quickly capture and adapt to concept drift in text streams. Experimental results also show that the ensemble-based algorithm performs better than the single-window-based algorithm. Because the proposed method for text stream classification does not require manually labeled documents, it is more practical for real-life applications.
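
The two steps named in the abstract, labeling initial positive documents by keywords and combining base classifiers by weighted majority voting, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the `min_hits` threshold, the whitespace tokenization, the +1/-1 label encoding, and the rule that ties resolve to the negative class are all assumptions made for illustration. The semantic expansion of keywords is not modeled: the sketch takes an already expanded keyword set as input.

```python
from typing import Iterable, List, Sequence


def label_seed_positives(docs: Sequence[str],
                         keywords: Iterable[str],
                         min_hits: int = 1) -> List[int]:
    """Return indices of documents that contain at least `min_hits`
    distinct keywords (hypothetical rule; `min_hits` is an assumption)."""
    keyword_set = {k.lower() for k in keywords}
    seeds = []
    for index, doc in enumerate(docs):
        # Naive whitespace tokenization, case-insensitive.
        tokens = set(doc.lower().split())
        if len(tokens & keyword_set) >= min_hits:
            seeds.append(index)
    return seeds


def weighted_majority_vote(votes: Sequence[int],
                           weights: Sequence[float]) -> int:
    """Combine +1/-1 votes from base classifiers using their weights.

    Returns +1 if the weighted sum is strictly positive, otherwise -1
    (ties go to the negative class; this tie rule is an assumption).
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    score = sum(weight * vote for vote, weight in zip(votes, weights))
    return 1 if score > 0 else -1


if __name__ == "__main__":
    stream_chunk = [
        "stock market rises",
        "football match tonight",
        "Market crash fears",
    ]
    print(label_seed_positives(stream_chunk, {"market", "stock"}))
    # One heavily weighted classifier outvotes two lightly weighted ones.
    print(weighted_majority_vote([1, -1, -1], [0.9, 0.3, 0.3]))
```

In a stream setting, the weights would typically reflect how well each base classifier fits recent data, so that classifiers trained before a concept drift contribute less to the vote. How the thesis computes these weights is not stated in the abstract.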

Keywords/Search Tags:

text stream classification, unlabeled documents, concept drift, classifier ensemble, knowledge acquisition

Related items

1	Research On Text Stream Classification By Keywords
2	Research On Ensemble Classification Algorithms Of Data Stream Based On Concept Drift
3	Research On Concept Drift Detection And Ensemble Classifier Based On Data Stream
4	Research On Hybrid Ensemble Model Based Data Stream Classification With Unlabeled Data
5	Research And Implementation Of Classification Algorithm For Positive And Unlabeled Examples Learning On Uncertain Data Stream
6	Research On Data Streams Classification With Concept Drift
7	Research On Concept Drift Data Stream Classification Based On Ensemble Learning
8	Classifier Ensemble For Data Stream Classification
9	Research On Concept Drift Detection In Data Stream And Classification Algorithms For Imbalanced Data Stream
10	Research On Classification Algorithms For Imbalanced Data Stream With Concept Drift