
Research On Text Classification Based On Active Self-Paced Learning

Posted on: 2019-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: T Mei
GTID: 2428330572955611
Subject: Computer system architecture
Abstract/Summary:
Text is one of the most common carriers of the information around us, which makes text classification an important task. Traditional text classification methods often fail to reach satisfactory accuracy and require a large amount of manual annotation, which is labor-intensive. How to improve the accuracy of text classification while reducing the annotation effort has therefore become a significant problem.

This thesis first analyzes the existing text classification methods. Because manual annotation is expensive, only part of the unlabeled text data can be annotated, and supervised learning methods often perform poorly when labeled samples are scarce. Semi-supervised learning methods such as self-training and co-training were proposed to solve this problem. By exploiting unlabeled samples they improve classification to a certain extent, but the results are still not satisfactory. To improve the accuracy of text classification, we use active learning to annotate highly informative data during the iterative process. We then introduce self-paced learning and propose a text classification algorithm based on active self-paced learning to further improve classification performance.

To address the low accuracy of text classification when few manual annotations are available, a text classification algorithm based on active learning is proposed. First, the feature extraction scheme is studied and improved: the word2vec model is used to build text features based on word meaning, and these are combined with frequency-based TF-IDF features to form a comprehensive text feature. Next, a small number of samples are manually annotated and an SVM classifier is trained on them to obtain the initial classification model. This model is used to evaluate the informativeness of the unlabeled samples. The most informative samples are selected, manually annotated, and added to the training set, and the final text classification model is obtained after several iterations.

The active learning method selects only the informative samples and leaves the other samples unused. To address this, we introduce self-paced learning and propose a text classification method based on active self-paced learning. In each iteration, the SVM classification model predicts labels for the unlabeled samples and computes the confidence of each prediction. Samples with higher confidence are automatically pseudo-labeled with their predicted labels. The pseudo-labeled samples from the self-paced learning stage and the manually labeled samples from the active learning stage are added to the training set to update the classification model.

We evaluate these methods experimentally on the Sohu News and Sina News datasets. The results show that the text classification method based on active learning significantly improves classification accuracy. The method based on active self-paced learning improves accuracy further, and with less annotation it reaches the accuracy that supervised learning attains only with a large amount of manual annotation.
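
The iterative procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation. To stay dependency-free it swaps the SVM for a nearest-centroid classifier and uses the gap between the two nearest centroid distances as the confidence score. The names `active_self_paced`, `oracle`, `n_query`, `threshold`, and `relax` are hypothetical, as is the choice to recompute pseudo-labels every round and to relax the confidence threshold over time. The abstract does not specify these details.

```python
import math

def centroids(X, y):
    """Mean feature vector per class (assumed stand-in for training the SVM)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict_with_margin(model, x):
    """Return (label, margin): margin is the distance gap between the two nearest centroids."""
    dists = sorted((math.dist(x, c), label) for label, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def active_self_paced(X_lab, y_lab, X_pool, oracle, rounds=5, n_query=2,
                      threshold=1.0, relax=0.8):
    X_man, y_man = list(X_lab), list(y_lab)   # manually labeled samples
    pool = list(X_pool)                       # unlabeled samples
    model = centroids(X_man, y_man)           # initial model from the small labeled set
    for _ in range(rounds):
        if not pool:
            break
        preds = [predict_with_margin(model, x) for x in pool]
        order = sorted(range(len(pool)), key=lambda i: preds[i][1])
        queried = order[:n_query]
        # Active learning stage: the least confident samples go to the annotator.
        for i in queried:
            X_man.append(pool[i])
            y_man.append(oracle(pool[i]))
        # Self-paced stage: confident samples receive their predicted label.
        confident = [i for i in order[n_query:] if preds[i][1] >= threshold]
        X_pseudo = [pool[i] for i in confident]
        y_pseudo = [preds[i][0] for i in confident]
        # Queried samples leave the pool; pseudo-labeled ones stay and are re-scored next round.
        asked = set(queried)
        pool = [x for i, x in enumerate(pool) if i not in asked]
        model = centroids(X_man + X_pseudo, y_man + y_pseudo)
        threshold *= relax                    # assumed pacing: admit harder samples over time
    return model, len(y_man)

# Toy two-dimensional "documents": two well-separated groups on a small grid.
grid = [(i * 0.2, j * 0.2) for i in range(3) for j in range(3)]
pool = grid[1:] + [(4 + a, 4 + b) for a, b in grid[1:]]
oracle = lambda x: "sports" if x[0] + x[1] > 4 else "finance"  # hypothetical annotator
model, n_manual = active_self_paced([(0.0, 0.0), (4.0, 4.0)],
                                    ["finance", "sports"], pool, oracle)
```

In a real setting the feature vectors would be the combined word2vec and TF-IDF features, and the confidence score would come from the SVM rather than from centroid distances.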
Keywords: Text Classification, Active Learning, Self-paced Learning, Feature Extraction