
Research On Topic Classification For Texts Based On Deep Learning

Posted on: 2018-06-22 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Y Zhou | Full Text: PDF
GTID: 2428330596489266 | Subject: Electronic and communication engineering
Abstract/Summary:
With the arrival of the big data era, user-generated content has become an important foundation of the Internet. Topic classification of this content, a basic task in data mining, is widely applied in content search and information filtering. The core problem of text topic classification can be divided into two parts: the text representation method and the classification model. In natural language processing, bag-of-words is the most widely used text representation method; it treats each document as an unordered collection of words. This method discards structural information such as word order and grammar, and it also suffers from data sparsity. As a result, early text classification research based on this method usually improved performance only on specific corpora and cannot meet the demands of massive user-generated data. In recent years, research on improved text classification methods has focused mainly on deep learning. This thesis studies text classification for user-generated content, using a distributed word embedding model as the text representation method and convolutional neural networks (CNNs) as the classification model. The details are as follows:

Research on distributed word embedding methods and experimental analysis. In this part, we study different word embedding models, including random vectors, word2vec, and GloVe. We point out the drawbacks of these three existing text representation models and present topic2vec, a new text representation method based on a topic model. This model adds overall topic information to the context space of words, addressing the loss of document-level information in earlier word embedding methods. We also evaluate and compare the word embedding methods through a semantic comparison experiment and a document classification experiment; the results show that topic2vec performs markedly better than the other existing word embedding methods.

Research on text classification models based on convolutional neural networks for Chinese corpora. In this part, we study the application of convolutional neural networks to the classification of Chinese corpora and conduct experiments on data from Zhihu, a representative Chinese community for user-generated content, using different word vectors as input. The experimental results indicate that a convolutional neural network with topic2vec improves on current word embedding methods for both long-text and short-text corpora.
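As a small illustration of the bag-of-words limitations mentioned in the abstract (loss of word order and data sparsity), the following is a minimal sketch in plain Python. It is not taken from the thesis: the example sentences, the extra vocabulary words, and the helper name `bag_of_words` are assumptions chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map a text to a vector of word counts over a fixed vocabulary.

    Hypothetical helper for illustration; word order is discarded.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Two sentences with opposite meanings but the same words.
docs = ["the dog bit the man", "the man bit the dog"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Word order is lost: both documents map to the same count vector.
same_representation = vectors[0] == vectors[1]

# Sparsity: as the vocabulary grows, most entries of a document vector are zero.
big_vocabulary = vocabulary + ["cat", "bird", "fish", "tree", "car", "house"]
sparse_vector = bag_of_words(docs[0], big_vocabulary)
zero_fraction = sparse_vector.count(0) / len(sparse_vector)

print(vocabulary)
print(vectors)
print(same_representation, zero_fraction)
```

Dense word embeddings such as those discussed in the abstract avoid the second problem by mapping each word to a short real-valued vector instead of one dimension per vocabulary entry.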
Keywords/Search Tags: Convolutional neural networks, topic model, deep learning, word embedding