
Research On Chinese Short Text Classification Based On Semi-Supervised Clustering

Posted on: 2021-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Du
GTID: 2428330629487248
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, large volumes of short text data have appeared online, and mining the valuable information in these data has become an active research topic. Short text is high-dimensional and sparse, arrives in large quantities, and is updated quickly. These properties lead to problems such as the annotation bottleneck and concept drift, so traditional text classification algorithms perform poorly on short text. A classification algorithm and system designed for short text are therefore needed. Semi-supervised clustering can learn from a small number of labeled samples together with a large number of unlabeled samples, and can effectively discover the distribution of the samples. This thesis uses semi-supervised clustering to improve short text classification, which partly compensates for the shortage of labeled samples and mitigates category imbalance and concept drift. The main contributions are as follows.

Firstly, a new short text classification algorithm that incorporates semi-supervised clustering is proposed. To address the high-dimensional sparsity of short text, the thesis proposes an improved semi-supervised k-means algorithm that modifies both the distance measure and the centroid iteration. A fusion algorithm then combines the predictions of the semi-supervised k-means and an SVM to further improve prediction accuracy. This method lets semi-supervised clustering and classification complement each other in short text classification, makes full use of the many unlabeled samples in short text datasets, and mitigates category imbalance.

Secondly, a new co-training framework (SCC co-training) that combines semi-supervised k-means and SVM is proposed. The framework exploits the differences between semi-supervised clustering and classification to improve generalization, and through iterative training the two algorithms complement each other. Within the SCC co-training framework, the objective functions of the two learning models are redefined, which mitigates the annotation bottleneck and concept drift in short text classification.

Finally, based on the above algorithms, a Chinese short text classification system is designed and implemented. It consists of four modules:
(1) The preprocessing module applies a series of operations to the original short text data, such as parsing, word segmentation, and stop-word removal.
(2) The feature processing module performs feature representation and feature selection on the dataset.
(3) The algorithm training module trains models on the processed short text dataset using the algorithms proposed in this thesis.
(4) The text classification module predicts the category of test texts and saves the result file.

Experiments on 11 short text datasets, compared against other short text classification algorithms, demonstrate the effectiveness of the proposed algorithms. To some extent, they address the shortage of labeled samples and mitigate category imbalance and concept drift.
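
To make the idea of semi-supervised k-means on sparse short text more concrete, the following is a minimal Python sketch. It is not the thesis's algorithm: the abstract does not specify the improved distance measure or centroid update, so this sketch assumes the standard constrained (seeded) k-means scheme with cosine similarity as a stand-in. The function names (`bag_of_words`, `cosine`, `centroid`, `seeded_kmeans`), the toy tokens, and the class labels are all hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Sparse term-frequency vector stored as {term: count}."""
    return dict(Counter(tokens))


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def seeded_kmeans(labeled, unlabeled, max_iter=20):
    """Constrained k-means: labeled samples keep their class throughout.

    labeled   -- list of (vector, label) pairs, at least one per class
    unlabeled -- list of vectors
    Returns (centroids by label, predicted labels for the unlabeled vectors).
    """
    labels = sorted({y for _, y in labeled})
    # Initialise one centroid per class from the labeled seeds only.
    centroids = {
        y: centroid([x for x, lab in labeled if lab == y]) for y in labels
    }
    assignment = None
    for _ in range(max_iter):
        # Assign each unlabeled sample to the most similar centroid.
        new_assignment = [
            max(labels, key=lambda y: cosine(x, centroids[y]))
            for x in unlabeled
        ]
        if new_assignment == assignment:
            break  # converged: no assignment changed
        assignment = new_assignment
        # Recompute centroids from the seeds plus the assigned samples.
        for y in labels:
            members = [x for x, lab in labeled if lab == y]
            members += [x for x, a in zip(unlabeled, assignment) if a == y]
            centroids[y] = centroid(members)
    return centroids, assignment


# Toy data: tokens are assumed to come from a word segmenter.
labeled = [
    (bag_of_words(["足球", "比赛", "进球"]), "sports"),
    (bag_of_words(["股票", "市场", "上涨"]), "finance"),
]
unlabeled = [
    bag_of_words(["比赛", "球队", "进球"]),
    bag_of_words(["股票", "基金", "市场"]),
    bag_of_words(["足球", "教练"]),
]
centroids, predicted = seeded_kmeans(labeled, unlabeled)
print(predicted)
```

Cosine similarity is used here because it ignores vector length and only compares the terms two texts share, which is a common choice for sparse bag-of-words vectors. In a fusion setting such as the one the abstract describes, the cluster assignments from a routine like this would be combined with a classifier's predictions; that step is omitted here because the abstract does not describe the fusion rule.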
Keywords/Search Tags:short text classification, semi supervised clustering, collaborative training, SVM, kmeans