
Research On Short Text Classification Of Semi-supervised Pre-training Based On Autoencoders And Word Order Dependencies

Posted on: 2021-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: B Guan
Full Text: PDF
GTID: 2428330614465909
Subject: Software engineering
Abstract/Summary:
With the development of information technology and the arrival of the era of intelligence, global information reserves are growing exponentially. As an important carrier of information interaction, short text is especially active on social networks with large user bases and in everyday comments. These unstructured short texts contain a great deal of valuable information that is labor-intensive and expensive to extract manually. Therefore, using machine learning to annotate the vast quantity of unlabeled short texts on the Internet, and to organize and manage short-text data efficiently, has become one of the hot topics in natural language processing (NLP).

Pre-trained language models based on deep learning have been shown to improve text classification effectively. The basic idea is to pre-train a language model on a large amount of unlabeled text and then fine-tune it on a supervised downstream task. However, these models require large amounts of reliable data and industry-scale computing resources, which limits their use in resource-constrained environments. In addition, compared with long text, short text classification faces the difficulties of fewer feature words and irregular diction. Short text classification is therefore generally optimized at the preprocessing, text representation, and classifier construction stages to improve both the speed and the accuracy of classification.

Motivated by these requirements and problems, this thesis focuses on a lightweight semi-supervised pre-training method for short text classification. First, a variational document model is pre-trained on a large number of unlabeled short texts to extract the probability distribution of the latent variables underlying the text data; the internal state of the pre-trained model is then used as the feature input of the downstream classifier. As a variant of the generative model, this method achieves competitive results on short text classification under limited data and computation. However, the existing model still has problems that need to be optimized; to address them, DPCNN and the Free Bits technique are introduced. Experimental results show that the improved model outperforms the original model on text classification tasks.
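The Free Bits technique mentioned above is usually described as flooring the per-dimension KL term of the variational objective so that latent dimensions cannot collapse onto the prior. Below is a minimal NumPy sketch of that idea for a diagonal-Gaussian posterior against a standard-normal prior; the function names and the floor value `lam` are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def gaussian_kl_per_dim(mu, log_var):
    """KL(q(z|x) || N(0, I)), computed separately for each latent dimension."""
    return 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def free_bits_kl(mu, log_var, lam=0.5):
    """Free Bits: floor each dimension's KL at `lam` nats, so the optimizer
    gains nothing by pushing individual latent dimensions all the way to the
    prior (the posterior-collapse failure mode of text VAEs)."""
    kl_per_dim = gaussian_kl_per_dim(mu, log_var)
    return np.maximum(kl_per_dim, lam).sum()

# A posterior identical to the prior has zero KL in every dimension,
# but Free Bits still charges lam nats per dimension:
mu, log_var = np.zeros(4), np.zeros(4)
print(free_bits_kl(mu, log_var))  # 2.0 with lam=0.5 over 4 dimensions
```

In a full training loop this clamped KL would replace the plain KL term in the evidence lower bound, alongside the reconstruction loss; the pre-trained encoder's latent state (e.g. `mu`) would then serve as the feature vector for the downstream classifier.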
Keywords/Search Tags: Short Text Classification (STC), Semi-supervised, Variational Autoencoder, Neural Network, Pre-trained Language Model