
Research On Jointly Learning Word Embeddings And Latent Topics In Text

Posted on: 2019-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: L J Wang
Full Text: PDF
GTID: 2428330548972418
Subject: Computer application technology
Abstract/Summary:
The first step in casting natural language processing tasks as machine learning problems is to encode the input symbols mathematically. The most intuitive word representation is the one-hot representation, which represents each word as a sparse, high-dimensional vector. Although this representation is simple, it suffers from the curse of dimensionality and the vocabulary gap, since it cannot express similarity between words. Distributed word representations address these shortcomings: they reduce the vector dimensionality and capture the semantic information of words. As a result, word embedding models are widely used in a variety of natural language processing tasks.

A word embedding model exploits local word collocations in a text corpus to capture the semantic and syntactic information of each word and learns a vector representation for it. A topic model, by contrast, uses global word collocations within the same document to observe the global structure of the text: it maps each document to a low-dimensional topic space and assigns a topic to each word according to the word distributions in the corpus. The two models are therefore complementary, and combining them captures corpus information from both a global and a local perspective.

Existing research on combining word embedding models and topic models can be roughly divided into three categories: 1) using word embeddings to enhance topic modeling; 2) using topic models to enhance word embedding models; 3) jointly learning word embeddings and topic models so that the two mutually enhance each other. However, most studies either use the topic information from a topic model to resolve polysemy in the word embedding model, or use word embeddings as external knowledge to improve topic modeling; few studies jointly learn word embeddings and latent topics and explore the interaction between the two models. Moreover, the topic model used in current research is typically the directed graphical model LDA, whereas the undirected graphical model RSM (Replicated Softmax Model) is simple and stable to train, can model documents of different lengths, and makes the posterior distribution of the latent topics easy to compute.

This thesis therefore proposes to combine the skip-gram word embedding model with the RSM topic model in a unified framework, jointly learning word embeddings and latent topics and exploring their mutual influence. In addition, to improve the accuracy of skip-gram's predictions, the proposed model adds the center word embedding and the document embedding to obtain a context embedding, and uses this context embedding to predict the words surrounding the center word. Experiments on the public 20 Newsgroups dataset verify the feasibility and effectiveness of the proposed model.
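The sketch below is a minimal, hypothetical illustration of the prediction step described above, not the thesis implementation: the center word embedding and the document embedding are summed to form a context embedding, which is scored against an output layer over the vocabulary to predict a surrounding word. The vocabulary size, embedding dimension, parameter names, and the plain softmax output are illustrative assumptions.

```python
# Illustrative sketch (assumed setup, not the thesis code): predict a surrounding
# word from the sum of a center-word embedding and a document embedding.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, num_docs, dim = 1000, 50, 100          # assumed sizes

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word embeddings
D = rng.normal(scale=0.1, size=(num_docs, dim))         # document embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output (prediction) weights

def predict_context_word_probs(center_word_id, doc_id):
    """Combine the center word and its document into a context embedding,
    then score every vocabulary word as a candidate surrounding word."""
    context_embedding = W_in[center_word_id] + D[doc_id]
    scores = W_out @ context_embedding
    scores -= scores.max()                              # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

# Example: probability distribution over surrounding words for word 42 in document 3.
probs = predict_context_word_probs(center_word_id=42, doc_id=3)
print(probs.argmax(), probs.max())
```

In training, the document embedding would be tied to the latent topic representation learned by RSM, so that the topic side and the embedding side influence each other; the sketch only shows the forward prediction under the stated assumptions.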
Keywords/Search Tags: natural language processing, word embedding, skip-gram, topic model, LDA, RSM