Research On Short Text Stream Classification Based On Corpus Extension From Wikipedia

Posted on:2020-07-11

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Wang

Full Text:PDF

GTID:2428330575496948

Subject:Computer software and theory

Abstract/Summary:

Many real-world applications, such as social networks, produce high-volume short text streams. On the one hand, traditional text classification methods handle short text streams poorly: short texts carry little semantic information, which leads to high-dimensional and sparse feature spaces, and the streams contain hidden concept drift. On the other hand, with the rapid growth of short texts, manually labeling all of them is time-consuming and practically impossible, so improving classification accuracy by making full use of massive unlabeled short texts, when only a few labeled ones are available, is another major challenge. To address these problems, this dissertation focuses on short text stream classification based on corpus extension from Wikipedia. Our main work is as follows:

(1) We summarize existing work on short text classification, including supervised short text classification with and without data streams, semi-supervised short text classification, and semi-supervised data stream classification.

(2) To handle the high-dimensional, sparse features and the concept drift found in short text streams, we propose a short text stream classification algorithm based on text extension and concept drift detection. First, to compensate for data sparsity, we obtain an external corpus from Wikipedia to extend the short text stream, and use online BTM (Online Biterm Topic Model) to select representative topics, which replace word vectors as the feature representation of short texts. Second, we propose a concept drift detection method based on the topic model to detect hidden concept drifts in short text streams. Third, we build an ensemble model from several data chunks and update it with the newest data chunk and the results of concept drift detection. Experimental results show that this method performs very well in short text stream classification and that the proposed concept drift detection algorithm achieves good detection performance.

(3) Because labeled short texts are scarce while unlabeled data are massive, we propose a semi-supervised short text stream classification method based on label propagation. First, to address the high dimensionality and sparsity of features caused by the short length of texts, we use Word2vec to obtain the original word vector set of the external Wikipedia corpus, which is used to represent the feature space of short texts. Second, we build an ensemble model from classifiers learned from labeled data and cluster models learned from unlabeled data, and propose a cluster-based similarity method for label propagation. To adapt to concept drift, we also propose a new cluster-based concept drift detection algorithm. Experimental results show that the proposed method is effective.
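The abstract does not spell out how the topic-based concept drift test works. As a rough illustration only, the sketch below assumes each short text is already represented by a topic distribution (for example, from a topic model such as online BTM), averages those distributions per data chunk, and flags drift when the Jensen-Shannon divergence between consecutive chunks exceeds a threshold. The divergence measure, the function names, and the threshold value of 0.1 are all assumptions made for this example, not the method of the dissertation.

```python
import math
from typing import List, Sequence


def mean_topic_distribution(chunk: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-document topic distributions of one data chunk."""
    n_topics = len(chunk[0])
    totals = [0.0] * n_topics
    for doc in chunk:
        for k, p in enumerate(doc):
            totals[k] += p
    return [t / len(chunk) for t in totals]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_detected(prev_chunk: Sequence[Sequence[float]],
                   new_chunk: Sequence[Sequence[float]],
                   threshold: float = 0.1) -> bool:
    """Flag drift when the chunk-level topic mix shifts by more than `threshold`.

    The threshold is a hypothetical tuning parameter chosen for illustration.
    """
    prev_mean = mean_topic_distribution(prev_chunk)
    new_mean = mean_topic_distribution(new_chunk)
    return js_divergence(prev_mean, new_mean) > threshold


# Two chunks of two documents each, over two topics.
old_chunk = [[0.9, 0.1], [0.8, 0.2]]
new_chunk = [[0.1, 0.9], [0.2, 0.8]]
print(drift_detected(old_chunk, old_chunk))  # same topic mix
print(drift_detected(old_chunk, new_chunk))  # topic mix has flipped
```

In an ensemble setting like the one described in (2), a positive result from such a check could be used to decide whether older base classifiers should be replaced when the newest data chunk arrives; how the dissertation actually combines the two is not stated in the abstract.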

Keywords/Search Tags:

short text stream, concept drift, Wikipedia external corpus, online BTM, label propagation

Related items

1	Research On Short Text Data Stream Classification Based On Feature Extension And Selection
2	Concept Drift Detection Algorithm Based On Multi-label Learning With Label Special Features
3	Research On Classification Algorithm Of Concept Drift Data Stream Based On Online Transfer Learning
4	Online Learning Technology On Abstract Extraction System In Short Text Stream
5	Research On Ensemble Classification Algorithms Of Data Stream Based On Concept Drift
6	Research Of Concept Drifting Detection In Text Data Stream
7	Research On Online Learning For Concept Drift
8	Online Concept Drift Detection Based On Data-windows
9	Research On Distributed Classification Methods For Short Text Data Streams
10	Research On Online Ensemble Classification Algorithm Based On Concept Drift Detection