
Text Classification Based On Semi-supervised Learning

Posted on: 2019-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: X M Sun
Full Text: PDF
GTID: 2428330566996845
Subject: Computer technology
Abstract/Summary:
Faced with the large amount of miscellaneous text information on the Internet, automatic text classification technology can automatically categorize and distinguish this text. It is widely used in e-mail classification, query intent prediction, search engines, topic tracking, information filtering, and other fields. It helps users accurately classify complex and voluminous data, obtain classified text information more efficiently and accurately, and quickly locate the information they need.

Early text categorization methods typically required a large labeled training set to train the text classifier in a supervised way. However, obtaining labeled text data costs a great deal of manpower, and classifiers trained on limited labeled data often generalize poorly. At the same time, the Internet holds a large amount of unlabeled data that is easy to obtain, so researchers began to study semi-supervised methods for text classification. Semi-supervised text classification uses both labeled and unlabeled corpora for training: under the guidance of the supervised information in the labeled data, the unlabeled samples are also exploited to improve the classifier's performance.

The research work of this paper is mainly divided into the following aspects:

(1) Classical text classification methods are introduced and analyzed, their advantages and disadvantages are compared in detail, and related experiments are carried out on their basis.

(2) Based on deep learning, a text classifier built on an LSTM is constructed, and the idea of adversarial training is introduced into it. By adding adversarial perturbations to the LSTM's input word embeddings, the semantic expression of the embeddings becomes more adequate, and words with similar grammatical structure but different semantics can be distinguished. A residual network architecture further improves the semantic expressiveness of the word embeddings. The resulting classifier is then applied to semi-supervised tasks; a minimal sketch of the perturbation step follows.
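To make the adversarial perturbation concrete, here is a minimal PyTorch sketch of embedding-level adversarial training: the loss gradient with respect to the word embeddings gives the locally worst-case direction, and an L2-normalized perturbation of magnitude epsilon is added before a second forward pass. The LSTMClassifier wrapper, the epsilon value, and all dimensions are illustrative assumptions, not the thesis's exact configuration; the residual connections are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    """LSTM text classifier; forward_from_embeddings lets us inject perturbed embeddings."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward_from_embeddings(self, emb):      # emb: (batch, seq, embed_dim)
        _, (h, _) = self.lstm(emb)               # h: (1, batch, hidden_dim)
        return self.fc(h[-1])                    # logits: (batch, num_classes)

    def forward(self, tokens):
        return self.forward_from_embeddings(self.embed(tokens))

def adversarial_loss(model, tokens, labels, epsilon=0.02):
    # First pass: gradient of the loss w.r.t. the embeddings alone.
    emb = model.embed(tokens).detach().requires_grad_(True)
    loss_for_grad = F.cross_entropy(model.forward_from_embeddings(emb), labels)
    grad, = torch.autograd.grad(loss_for_grad, emb)
    # L2-normalized perturbation of magnitude epsilon: the adversarial noise.
    r_adv = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    # Second pass with a full graph: clean loss plus loss on perturbed embeddings.
    emb = model.embed(tokens)
    clean_loss = F.cross_entropy(model.forward_from_embeddings(emb), labels)
    adv_loss = F.cross_entropy(model.forward_from_embeddings(emb + r_adv), labels)
    return clean_loss + adv_loss
```

In training, adversarial_loss(model, tokens, labels) would simply replace the plain cross-entropy loss and be backpropagated as usual.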
(3) To further extract the category information in the document representation and improve classification performance, a self-attention mechanism is introduced into the classifier. Self-attention can simply and efficiently learn the internal structure of a sentence, extracting information from different aspects of the text, which is useful for text categorization. This paper employs single-dimensional and multi-dimensional self-attention, respectively (a sketch of the attention pooling appears after the abstract). Experimental results show that the model with the attention mechanism characterizes documents more fully and classifies them better: compared with the baseline system, accuracy increased by 3%, and under the same word-embedding pre-training strategy the proposed model reached an accuracy of 0.933, likewise achieving better classification results.

(4) Word vectors are pre-trained with an RNN language model (RNNLM) and an autoencoder, respectively, to explore the influence of different pre-training strategies on the classification model; the effect of the amount of labeled data is explored by varying its proportion in the training set. Experiments show that the proposed model achieves better classification results than the baseline when labeled data is scarce: when the labeled data is reduced to 20%, the proposed model improves classification accuracy by about 5 percentage points over the baseline system.
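As a companion to point (3), below is a minimal sketch of single-dimensional additive self-attention pooling over LSTM hidden states, producing one weight per time step; the multi-dimensional variant, which scores each hidden feature separately, is noted in a comment. The module name, attention dimension, and the toy usage at the end are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPool(nn.Module):
    """Additive self-attention pooling of LSTM states into one document vector."""
    def __init__(self, hidden_dim, att_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, att_dim)
        # Single-dimensional attention: one scalar score per time step.
        # A multi-dimensional variant would use nn.Linear(att_dim, hidden_dim)
        # and softmax over time separately for each hidden feature.
        self.score = nn.Linear(att_dim, 1, bias=False)

    def forward(self, states):                               # (batch, seq, hidden)
        scores = self.score(torch.tanh(self.proj(states)))   # (batch, seq, 1)
        weights = F.softmax(scores, dim=1)                   # normalize over time
        return (weights * states).sum(dim=1)                 # (batch, hidden)

# Example: pool the outputs of an LSTM encoder into a document representation.
lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
pool = SelfAttentionPool(hidden_dim=128)
outputs, _ = lstm(torch.randn(4, 20, 128))   # (batch=4, seq=20, hidden=128)
doc_vec = pool(outputs)                      # (4, 128), fed to a linear classifier
```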
Keywords/Search Tags:semi-supervised text classification, LSTM, adversarial training, residual network, self-attention mechanism