Research On Chinese Short Text Classification Based On Word Embedding

Posted on: 2019-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2428330569996090
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and the widespread use of mobile terminals, people freely comment, express emotions, and share instant news on various social media platforms anytime and anywhere. As a result, a large amount of information carried by short texts has been generated. However, the explosive growth of information resources also poses great challenges to the screening and utilization of effective information. Automatic classification of short texts can address this problem to a certain extent, replacing traditional manual management and helping users locate needed information quickly so that massive texts can be read and processed selectively. The main contents of this thesis are as follows:

1. This thesis first describes the current application background of short text classification and shows that the text representation model is the key link for further study. After summarizing the characteristics of short texts, it analyzes the shortcomings of representing them with the traditional Vector Space Model. It then proposes describing short texts with the newer representation model, word embedding, aiming to use the rich contextual semantic information in word embeddings to improve classification performance. At present, mainstream neural network classification methods confine word embeddings to the text preprocessing stage and do not optimize them deeply. Therefore, starting from the word embedding representation model, this thesis discusses improvements to the word embedding model that raise embedding quality and thereby improve short text classification.

2. This thesis further discusses how the embeddings are generated and presents a new concept, "topic word embedding," to address the problem that ordinary word embeddings cannot handle the polysemy of Chinese text or express the semantic features of polysemous words. A topic word embedding expresses not only contextual semantic information but also topic information. Word embeddings are fine-grained feature representations, while topic embeddings express broader relationships between words; in this thesis the two are merged to improve the accuracy with which polysemous words are expressed. Furthermore, a modified Topic-SG model is proposed to compute topic word embeddings: it incorporates a topic model into the Skip-Gram model of Word2vec, so that it learns not only word embeddings but also the corresponding topic embeddings according to context. The topic word embeddings of the same polysemous word under different topics are then obtained from the word embedding and topic embedding, which reduces the negative influence of the polysemous words that appear frequently in short texts.

3. This thesis discusses how to compose a short text representation from topic word embeddings, addressing the problem that different words contribute differently to the meaning of a short text. A weighted summation of topic word embeddings produces the short text vector, which is fed to a classifier for short text classification.

4. The topic word embeddings are evaluated on the Sogou News corpus for expressing polysemy and for short text classification. Experimental results show that the Topic-SG language model presented in this thesis can solve the polysemy problem of traditional word embeddings and achieves better performance than existing methods.
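As background to the Skip-Gram model of Word2vec mentioned above, a minimal sketch of how its training data is formed (the function name, toy tokens, and window size are illustrative assumptions, not from the thesis): Skip-Gram predicts context words from a center word, so the corpus is turned into (center, context) pairs with a sliding window.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-Gram.

    For each position i, every token within `window` positions of
    tokens[i] (excluding i itself) counts as a context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A toy tokenized short text (illustrative only).
pairs = skipgram_pairs(["short", "text", "classification"], window=1)
# pairs == [("short", "text"), ("text", "short"),
#           ("text", "classification"), ("classification", "text")]
```

In the full model these pairs drive the learning of the word vectors; the Topic-SG extension described in the thesis additionally learns topic embeddings during this process.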
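The merging of word and topic embeddings can be illustrated with a simple scheme (concatenation is an assumption for illustration; the thesis's exact Topic-SG combination may differ): the same polysemous word receives a distinct topic word embedding under each topic.

```python
import numpy as np

def topic_word_embedding(word_vec, topic_vec):
    """Merge a word embedding with a topic embedding by concatenation,
    so one polysemous word gets a distinct vector under each topic."""
    return np.concatenate([word_vec, topic_vec])

# A hypothetical polysemous word under two hypothetical topics.
word = np.array([0.2, 0.5])
topic_a = np.array([1.0, 0.0])
topic_b = np.array([0.0, 1.0])

v_a = topic_word_embedding(word, topic_a)
v_b = topic_word_embedding(word, topic_b)
# v_a and v_b share the word part but differ in the topic part,
# giving the word different representations in different topics.
```

This is what lets downstream classification distinguish the senses of a polysemous word that plain word embeddings would conflate into one vector.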
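The weighted-summation step in point 3 can be sketched as follows (the weight values here are placeholders; in practice they could come from a scheme such as TF-IDF, which is an assumption, as the thesis does not specify the weights in this abstract):

```python
import numpy as np

def short_text_vector(embeddings, weights):
    """Weighted sum of (topic) word embeddings, yielding one
    fixed-length vector that represents the whole short text."""
    embeddings = np.asarray(embeddings, dtype=float)  # shape (n, d)
    weights = np.asarray(weights, dtype=float)        # shape (n,)
    weights = weights / weights.sum()                 # normalize contributions
    return weights @ embeddings                       # (n,) @ (n, d) -> (d,)

# Three word vectors; the middle word is weighted most heavily.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = short_text_vector(vecs, [1.0, 2.0, 1.0])
# v == [0.5, 0.75]
```

The resulting fixed-length vector is what gets fed to the classifier, regardless of how many words the short text contains.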
Keywords/Search Tags:Word2vec Model, Skip-gram Model, Polysemy, Text Vector Representation, Short Text Classification