Research On Short Text Classification Algorithms Based On Topic Model And Convolutional Neural Network

Posted on: 2018-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Liu
GTID: 2348330563452543
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, a large volume of short text is being produced online. These short texts cover diverse content and fields, and have gradually become a frequently used and widely accepted form of communication. E-commerce reviews, information retrieval, and intelligent question answering systems are all sources of massive amounts of short text, and how to mine useful information from them has been widely studied in recent years. Text classification is an effective method of text mining. Because short texts are short and their words are sparse, classification methods designed for long text are no longer applicable. Short text classification has been studied and explored extensively in China and abroad, and the main methods fall into two categories: 1) Feature expansion based on an external corpus or knowledge base. This approach is time-consuming and easily introduces noise, so the improvement in classification is limited. 2) Neural network methods. These use randomly initialized or pre-trained word vectors as input, and because the text is short, the features available for classification are insufficient. To address these two shortcomings, this thesis carries out the following two pieces of research.

First, a short text classification algorithm based on a topic model is proposed, aimed at the problems of word sparsity in short texts and of feature expansion that relies on an external corpus. A fast biterm topic model is first proposed on the basis of the biterm topic model; it reduces the complexity of a single sampling step in each iteration from O(K) to O(1), and the corresponding algorithm is given. The fast biterm topic model is then used to model the short texts, and topic words that fall within the same sliding window over a text are combined into word pairs. Finally, the topic distribution is used as a further part of the features, and the word, topic-word, and topic-distribution features are combined and used for classification. Experimental results on a Weibo data set show that feature expansion and classification based on the fast biterm topic model effectively improve the accuracy, recall, and F1 value of short text classification.

Second, a short text classification algorithm (CNN-RF) based on a convolutional neural network and a random forest is proposed. Two sets of word vectors are first pre-trained in different ways and used as input to two convolution-pooling layers, which yields two sets of pooling-layer feature maps. A second convolution is then applied to the two sets of pooling-layer feature maps. Finally, model training is divided into two phases: 1) Softmax pre-training: a Softmax classifier is attached to the second-layer convolutional feature maps, the model is trained, and its parameters are saved. 2) Random forest training: the parameters from the pre-training phase are kept fixed, the classifier is replaced with a random forest, and the random forest is trained on the second-layer convolutional features to improve the generalization of the model. Experimental results on three public data sets show that CNN-RF effectively improves the accuracy, recall, and F1 value of short text classification.
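
As an illustration of the feature-expansion step described in the abstract, the following Python sketch forms word pairs from a sliding window and merges them with word and topic-distribution features. It is not the thesis's implementation: the function names `extract_biterms` and `combine_features`, the window size, the feature-key format, and the example topic words and topic distribution are all assumptions, and the topic model that would supply the topic words and the distribution is not shown.

```python
from collections import Counter


def extract_biterms(tokens, window=3):
    """Return unordered word pairs whose positions are less than `window` apart."""
    biterms = []
    for i, left in enumerate(tokens):
        # Pair the current token with each later token still inside the window.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                # Sort so that (a, b) and (b, a) count as the same pair.
                biterms.append(tuple(sorted((left, right))))
    return biterms


def combine_features(tokens, topic_words, topic_dist, window=3):
    """Merge word, topic-word-pair, and topic-distribution features (hypothetical layout).

    topic_words: set of words assumed to have been marked as topic words by a topic model.
    topic_dist:  list of per-topic probabilities assumed to come from the same model.
    """
    # Word features: plain term counts.
    features = Counter(f"w={t}" for t in tokens)
    # Pair features: keep only pairs in which both words are topic words.
    for a, b in extract_biterms(tokens, window):
        if a in topic_words and b in topic_words:
            features[f"pair={a}|{b}"] += 1
    # Topic-distribution features: one entry per topic.
    for k, p in enumerate(topic_dist):
        features[f"topic={k}"] = p
    return dict(features)


if __name__ == "__main__":
    tokens = ["apple", "phone", "price", "drop"]
    # Illustrative values only; a trained topic model would provide these.
    print(combine_features(tokens, {"phone", "price"}, [0.8, 0.2], window=2))
```

The resulting dictionary could be passed to any classifier that accepts sparse named features; which classifier the thesis uses for this step is not stated in the abstract.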
Keywords/Search Tags: Short text classification, topic model, feature expansion, convolutional neural network, random forest