
The Research Of Short Text Feature Extension Method Based On Word Embedding

Posted on: 2018-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: X Meng
Full Text: PDF
GTID: 2348330515474048
Subject: Computer software and theory
Abstract/Summary:
With the popularity and development of the Internet and mobile devices, communication between people has become more immediate and convenient. SMS, QQ, micro-blog, and many other online social media have become indispensable in our lives. This has also produced an explosion of short texts, which poses new challenges to traditional long-text-based automatic information processing and text-mining technology. How to overcome the sparse features and low coverage of short texts has become an active research subject that has attracted the attention of many experts and scholars. At present, extending the features of short texts is the most direct and effective approach.

In Natural Language Processing, Word Embedding is an important and inevitable trend. Word Embedding is a vector representation of words that differs from the original independent (one-hot) word representation: it arranges words according to the strength of their semantic relations in a relatively low-dimensional vector space and encodes both explicit and implicit linguistic rules, so that each word vector carries rich semantic information. In this paper we propose a new short-text feature extension method that takes Word Embedding as its basis. The main steps are as follows:

1. Training Word Embedding on a large-scale corpus. Word Embedding training models are neural-network models. Following the development of Word Embedding and its different requirements, this paper introduces four models: the neural network language model, the recurrent neural network model, CBOW, and Skip-gram. Combining other scholars' research on these models with the task requirements of this paper, the Skip-gram model is selected as the training model. The English Wikipedia corpus, with its rich content and large volume, is chosen as the training data, yielding Word Embedding vectors for more than 2 million words.

2. Using simple vector computation to perform simple reasoning over short texts, based on the properties of Word Embedding. Some linguistic regularities can be represented by addition and subtraction between word vectors. In this paper, we apply this property to the word sequence of a short text to obtain a semantic expression vector. The resulting inference vector lies in the same vector space as the Word Embedding vectors.

3. Using Word Embedding clustering to build the extended feature space. Unlike traditional fine-grained semantic representation units (words, phrases, concepts, etc.), this paper obtains "semantic units" by automatically partitioning the vocabulary according to semantic similarity and the distribution of the Word Embedding vectors. With "semantic units" as the extension features, vector expressions of the same dimension (including the Word Embedding vectors of the words in the short text and the inference vectors introduced above) can be mapped onto the extended feature space.

Finally, we apply the proposed Word-Embedding-based short-text feature extension method to short-text classification and short-text clustering experiments. On the Google search-snippet data set and the China Daily news data set, classification accuracy improves by 3.7% and 1% over an LDA-based method, and clustering F-measure improves by 30.64% and 17.54% over traditional clustering methods. The experimental results show that the proposed method expresses short-text information better and alleviates the sparse features and low coverage of short texts.
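To make step 1 concrete, the following is a minimal NumPy-only sketch of Skip-gram training with negative sampling. The tiny corpus, embedding dimension, window size, learning rate, and negative-sample count are all illustrative placeholders, not the thesis's actual Wikipedia-scale configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus standing in for the large Wikipedia dump used in the thesis.
corpus = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the man walks the dog".split(),
    "the woman walks the dog".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, lr = len(vocab), 16, 2, 0.05

# Target-word and context-word embedding matrices.
W_in = rng.normal(scale=0.1, size=(V, dim))
W_out = rng.normal(scale=0.1, size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(100):
    for sent in corpus:
        for pos, target in enumerate(sent):
            t = idx[target]
            for cpos in range(max(0, pos - window),
                              min(len(sent), pos + window + 1)):
                if cpos == pos:
                    continue
                # One observed (positive) context plus 3 random negatives.
                pairs = [(idx[sent[cpos]], 1.0)]
                pairs += [(int(rng.integers(V)), 0.0) for _ in range(3)]
                for o, label in pairs:
                    grad = sigmoid(W_in[t] @ W_out[o]) - label
                    g_in = grad * W_out[o]      # save before W_out changes
                    W_out[o] -= lr * grad * W_in[t]
                    W_in[t] -= lr * g_in

print(W_in.shape)  # one dense vector per vocabulary word
```

After training, each row of `W_in` serves as the Word Embedding of one vocabulary word; a production run would instead stream sentences from the Wikipedia corpus and use a much larger vocabulary and dimension.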
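The additive regularity exploited in step 2 can be sketched as follows. The hand-built three-dimensional vectors below are hypothetical stand-ins for trained embeddings, chosen only so the classic king − man + woman ≈ queen relation holds by construction.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings; real vectors would come from a trained
# Skip-gram model.
emb = {
    "king":  np.array([0.9,  0.9, 0.1]),
    "queen": np.array([0.9, -0.9, 0.1]),
    "man":   np.array([0.1,  0.9, 0.2]),
    "woman": np.array([0.1, -0.9, 0.2]),
    "dog":   np.array([0.0,  0.1, 0.9]),
}

# Linguistic regularity via vector arithmetic: king - man + woman.
query = emb["king"] - emb["man"] + emb["woman"]

# Nearest remaining word by cosine similarity.
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(query, emb[w]))
print(best)  # → queen
```

The inference vector `query` lives in the same space as the word vectors, which is exactly what allows it to be compared against (and later clustered with) ordinary Word Embedding vectors.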
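Step 3 can be sketched with a plain k-means partition of the embedding space; the thesis does not specify its exact clustering algorithm, so k-means here is an illustrative substitute, and the two-dimensional toy vectors and word list are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Plain k-means, standing in for the automatic partition into 'semantic units'."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Toy embeddings with two obvious semantic groups (royalty vs. animals).
words = ["king", "queen", "prince", "dog", "cat", "horse"]
X = np.vstack([
    [1.0, 0.10], [0.9, 0.20], [0.95, 0.15],   # royalty-like vectors
    [0.1, 1.00], [0.2, 0.90], [0.15, 0.95],   # animal-like vectors
])
labels, centers = kmeans(X, k=2)
unit_of = dict(zip(words, labels))

def extend_features(short_text, unit_of, k):
    """Map a short text onto the k 'semantic unit' dimensions."""
    vec = np.zeros(k)
    for w in short_text.split():
        if w in unit_of:
            vec[unit_of[w]] += 1.0
    return vec

feat = extend_features("queen dog king", unit_of, k=2)
print(feat)  # one count per semantic unit
```

Each short text is thereby represented in a fixed, low-dimensional extended feature space, which mitigates the sparseness of raw word features; the inference vectors from step 2 can be assigned to semantic units the same way, since they share the embedding space.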
Keywords/Search Tags:Word Embedding, Text feature extension, Short text classification, Short text clustering