Research On Chinese Short Text Classification Based On Word Embedding

Posted on: 2019-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2428330569996090
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and the widespread use of mobile terminals, people freely comment, express emotions, and share instant news on various social media platforms anytime and anywhere. As a result, a large amount of information carried by short texts has been generated. However, the explosive growth of information resources also poses great challenges to the screening and utilization of effective information. Automatic classification of short texts can address this problem to a certain extent, replacing traditional manual management and helping users locate needed information quickly so that massive texts can be read and processed selectively. The main contents of this thesis are as follows:

1. This thesis first describes the current application background of short text classification and shows that the text representation model is the key link for further study. After summarizing the characteristics of short texts, it analyzes the shortcomings of representing them with the traditional Vector Space Model. It then proposes describing short texts with the newer representation model, word embedding, aiming to use the rich contextual semantic information in word embeddings to improve classification performance. At present, mainstream neural network classification methods confine word embeddings to the text preprocessing stage and do not optimize them deeply. Therefore, starting from the word embedding representation model, this thesis discusses improvements to the word embedding model that raise embedding quality and thereby improve short text classification.

2. This thesis further discusses how the embeddings are generated and presents a new concept, "topic word embedding," to address the problem that ordinary word embeddings cannot handle the polysemy of Chinese text or express the semantic features of polysemous words. A topic word embedding expresses not only contextual semantic information but also topic information. Word embeddings are fine-grained feature representations, while topic embeddings express broader relationships between words; in this thesis the two are merged to improve the accuracy with which polysemous words are expressed. Furthermore, a modified Topic-SG model is proposed to compute topic word embeddings: it incorporates a topic model into the Skip-Gram model of Word2vec, so that it learns not only word embeddings but also the corresponding topic embeddings according to context. The topic word embeddings of the same polysemous word under different topics are then obtained from the word embedding and topic embedding, which reduces the negative influence of the polysemous words that appear frequently in short texts.

3. This thesis discusses how to compose a short text representation from topic word embeddings, addressing the problem that different words contribute differently to the meaning of a short text. A weighted summation of topic word embeddings produces the short text vector, which is fed to a classifier for short text classification.

4. The topic word embeddings are evaluated on the Sogou News corpus for expressing polysemy and for short text classification. Experimental results show that the Topic-SG language model presented in this thesis can solve the polysemy problem of traditional word embeddings and achieves better performance than existing methods.
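As background to the Skip-Gram model of Word2vec mentioned above, a minimal sketch of how its training data is formed (the function name, toy tokens, and window size are illustrative assumptions, not from the thesis): Skip-Gram predicts context words from a center word, so the corpus is turned into (center, context) pairs with a sliding window.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-Gram.

    For each position i, every token within `window` positions of
    tokens[i] (excluding i itself) counts as a context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A toy tokenized short text (illustrative only).
pairs = skipgram_pairs(["short", "text", "classification"], window=1)
# pairs == [("short", "text"), ("text", "short"),
#           ("text", "classification"), ("classification", "text")]
```

In the full model these pairs drive the learning of the word vectors; the Topic-SG extension described in the thesis additionally learns topic embeddings during this process.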
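The merging of word and topic embeddings can be illustrated with a simple scheme (concatenation is an assumption for illustration; the thesis's exact Topic-SG combination may differ): the same polysemous word receives a distinct topic word embedding under each topic.

```python
import numpy as np

def topic_word_embedding(word_vec, topic_vec):
    """Merge a word embedding with a topic embedding by concatenation,
    so one polysemous word gets a distinct vector under each topic."""
    return np.concatenate([word_vec, topic_vec])

# A hypothetical polysemous word under two hypothetical topics.
word = np.array([0.2, 0.5])
topic_a = np.array([1.0, 0.0])
topic_b = np.array([0.0, 1.0])

v_a = topic_word_embedding(word, topic_a)
v_b = topic_word_embedding(word, topic_b)
# v_a and v_b share the word part but differ in the topic part,
# giving the word different representations in different topics.
```

This is what lets downstream classification distinguish the senses of a polysemous word that plain word embeddings would conflate into one vector.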
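The weighted-summation step in point 3 can be sketched as follows (the weight values here are placeholders; in practice they could come from a scheme such as TF-IDF, which is an assumption, as the thesis does not specify the weights in this abstract):

```python
import numpy as np

def short_text_vector(embeddings, weights):
    """Weighted sum of (topic) word embeddings, yielding one
    fixed-length vector that represents the whole short text."""
    embeddings = np.asarray(embeddings, dtype=float)  # shape (n, d)
    weights = np.asarray(weights, dtype=float)        # shape (n,)
    weights = weights / weights.sum()                 # normalize contributions
    return weights @ embeddings                       # (n,) @ (n, d) -> (d,)

# Three word vectors; the middle word is weighted most heavily.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = short_text_vector(vecs, [1.0, 2.0, 1.0])
# v == [0.5, 0.75]
```

The resulting fixed-length vector is what gets fed to the classifier, regardless of how many words the short text contains.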
Keywords/Search Tags:Word2vec Model, Skip-gram Model, Polysemy, Text Vector Representation, Short Text Classification