
Research On The Short Text Stream Classification Based On The Corpus Extension From Wikipedia

Posted on: 2020-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Wang
GTID: 2428330575496948
Subject: Computer software and theory
Abstract/Summary:
Many real-world applications, such as social networks, produce huge volumes of short text streams. On the one hand, traditional text classification methods handle short text streams poorly, for two reasons: the short length of the texts and their lack of semantic information lead to high-dimensional, sparse features, and concept drift is hidden in the stream. On the other hand, with the rapid growth of short texts, manually labeling all of them is time-consuming and practically impossible, so improving classification accuracy by making full use of massive unlabeled short texts when only a few labeled ones are available is also a major challenge. In view of these problems, this dissertation focuses on short text stream classification based on corpus extension from Wikipedia. Our main work is as follows:

(1) We summarize existing work on short text classification, including supervised short text classification with and without data streams, semi-supervised short text classification, and semi-supervised data stream classification.

(2) To address the high-dimensional, sparse features and the concept drift in short text streams, we propose a short text stream classification algorithm based on text extension and concept drift detection. First, to compensate for data sparsity, we obtain an external corpus from Wikipedia to extend the short text stream, and use online BTM (Online Biterm Topic Model) to select representative topics, rather than word vectors, as the features of short texts. Second, we propose a concept drift detection method based on the topic model to detect hidden concept drifts in short text streams. Third, we build an ensemble model from several data chunks and update it using the newest data chunk and the results of concept drift detection. Experimental results show that the method performs well in short text stream classification and that the proposed concept drift detection algorithm detects drifts well.

(3) To address the scarcity of labeled short texts and the abundance of unlabeled data, we propose a semi-supervised short text stream classification method based on label propagation. First, to handle the high dimensionality and sparsity of features caused by the short length of texts, a word vector set is obtained from an external Wikipedia corpus with Word2vec and used to represent the feature space of short texts. Second, an ensemble model is built from classifiers learned from labeled data and cluster models learned from unlabeled data, and a cluster-based similarity method is proposed for label propagation. To adapt to concept drift, a new cluster-based concept drift detection algorithm is proposed. Experimental results show that the proposed method is effective.
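
The abstract does not state how the topic-model-based drift detection compares data chunks, so the following is only a minimal sketch under stated assumptions: each short text is already represented by a topic distribution (for example, from online BTM), consecutive chunks are compared through their mean topic distributions using the Jensen-Shannon divergence, and a drift is flagged when the divergence exceeds a threshold. The function names, the choice of divergence, and the threshold value of 0.1 are all hypothetical and are not taken from the thesis.

```python
import math
from typing import List, Sequence

TopicDist = Sequence[float]  # one document's topic distribution (sums to 1)


def mean_topic_distribution(chunk: Sequence[TopicDist]) -> List[float]:
    """Average the per-document topic distributions of one data chunk."""
    n_topics = len(chunk[0])
    totals = [0.0] * n_topics
    for doc in chunk:
        for k, p in enumerate(doc):
            totals[k] += p
    return [t / len(chunk) for t in totals]


def js_divergence(p: TopicDist, q: TopicDist) -> float:
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: TopicDist, y: TopicDist) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_detected(prev_chunk: Sequence[TopicDist],
                   new_chunk: Sequence[TopicDist],
                   threshold: float = 0.1) -> bool:
    """Flag a drift when the chunks' mean topic distributions diverge.

    The threshold is an assumed illustrative value, not one from the thesis.
    """
    prev_mean = mean_topic_distribution(prev_chunk)
    new_mean = mean_topic_distribution(new_chunk)
    return js_divergence(prev_mean, new_mean) > threshold
```

In a chunk-based ensemble such as the one described in (2), a positive result from a check like `drift_detected` could trigger replacing or re-weighting the base classifiers trained on older chunks; how the thesis actually uses the detection result is not specified in the abstract.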
Keywords/Search Tags: short text stream, concept drift, Wikipedia external corpus, online BTM, label propagation