
Research On Short Text Modeling Based On Word Embedding

Posted on: 2018-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Liu
Full Text: PDF
GTID: 2348330515974046
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of social networks and social media, a growing number of users can access the network conveniently, anytime and anywhere, and easily enjoy services such as Sina Weibo, Twitter, Baidu Q&A, news comments and shopping reviews. As users enjoy these services, they leave behind massive amounts of short text on the internet, in which valuable information is hidden. Faced with such a huge volume of short text resources, it is therefore urgent to mine this information effectively. Topic modeling has achieved great success in recent years and has become one of the important approaches to processing text information intelligently. However, applying traditional topic models directly to short text runs into the sparsity problem: short texts are brief and lack word co-occurrence information, so inferring topics from document-level co-occurrence statistics, as traditional topic models do, faces a huge challenge. To address the sparsity of short text, we use word embeddings to expand the representation of short text and propose the Latent Word Embedding Model, abbreviated LWEM. Our main contributions are as follows:

(1) Analyzing and studying the sparsity of short text modeling. After pre-processing, a short text typically contains only a few words, or at most a dozen or so, so both word frequencies and document-level word co-occurrence information are sparse. It is very difficult to infer the topic structure of short text from such limited information.

(2) Using word embeddings to expand the representation of short text, and proposing a short text topic model based on word embeddings. Word embeddings can learn the semantic relations of words from large corpus collections, so our aim is to strengthen short text modeling with word embeddings. Specifically, we exploit the additivity of word embeddings, a basic mathematical property: we merge word embedding A and word embedding B into word embedding C, then add C to the original short document, enlarging the semantics of the short text and alleviating the sparsity problem. LWEM assumes a three-layer structure of document, topic and word embedding; taking the sparsity of short text into account, it assumes the corpus follows a topic distribution and each topic follows a Gaussian distribution over word embeddings. LWEM is then trained on both the word embeddings observed in the original document and the generated word embeddings, each of which is produced from two observed embeddings and added back to the document.

(3) Comparing the topic modeling performance of LWEM through experiments. Our data sets consist of short texts from Twitter and Sina Weibo, both drawn from real applications, and we use Word2Vec from gensim to train the word embeddings. We then apply DMM, LDA, BTM and the proposed LWEM to the two data sets, analysing each model's topic quality through topic coherence and classification performance. The experimental results show that LWEM is effective in addressing the sparsity problem of short text modeling.
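As a concrete illustration of the additivity step described in (2), the following is a minimal Python sketch of expanding a short text with merged embeddings. The five-dimensional vectors and the words in the table are invented toy values, not embeddings actually trained by Word2Vec, and the pairing and nearest-neighbour choices are one plausible reading of the expansion scheme, not the thesis's exact procedure.

```python
import math

# Hypothetical embedding table (word -> toy 5-dimensional vector);
# in the thesis these would come from Word2Vec trained with gensim.
embeddings = {
    "phone":   [0.9, 0.1, 0.0, 0.2, 0.1],
    "cheap":   [0.1, 0.8, 0.1, 0.0, 0.2],
    "bargain": [0.5, 0.5, 0.1, 0.1, 0.15],
    "river":   [0.0, 0.1, 0.9, 0.7, 0.0],
}

def add_vectors(a, b):
    """Merge word embedding A and word embedding B into C by addition."""
    return [x + y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand(doc_words):
    """Expand a short text: for each adjacent word pair, merge their
    embeddings and append the vocabulary word nearest to the sum."""
    expanded = list(doc_words)
    for w1, w2 in zip(doc_words, doc_words[1:]):
        merged = add_vectors(embeddings[w1], embeddings[w2])
        # nearest neighbour of the merged vector, excluding the pair itself
        best = max((w for w in embeddings if w not in (w1, w2)),
                   key=lambda w: cosine(embeddings[w], merged))
        expanded.append(best)
    return expanded

print(expand(["cheap", "phone"]))  # -> ['cheap', 'phone', 'bargain']
```

Here the two-word document "cheap phone" is enlarged with the pseudo-word "bargain", whose vector lies closest to the sum of the two observed embeddings; a topic model then sees three words instead of two, which is the sense in which additivity alleviates sparsity.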
Keywords/Search Tags:short text, topic modeling, Word2vec, word embedding, Gaussian distribution