
Research On Jointly Learning Word Embeddings And Latent Topics In Text

Posted on: 2019-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: L J Wang
Full Text: PDF
GTID: 2428330548972418
Subject: Computer application technology
Abstract/Summary:
The first step in casting natural language processing tasks as machine learning problems is to encode the input symbols mathematically. The most intuitive word representation is the one-hot representation, which represents each word as a sparse, high-dimensional vector. Although this representation is simple, it suffers from the curse of dimensionality and the vocabulary gap, since it cannot express similarity between words. Distributed word representations address these shortcomings: they reduce the vector dimensionality and capture the semantic information of words. As a result, word embedding models are widely used in a variety of natural language processing tasks.

A word embedding model exploits local word collocations in a text corpus to capture the semantic and syntactic information of each word and learns a vector representation for it. A topic model, by contrast, uses global word collocations within the same document to observe the global structure of the text: it maps each document to a low-dimensional topic space and assigns a topic to each word according to the word distributions in the corpus. The two models are therefore complementary, and combining them captures corpus information from both a global and a local perspective.

Existing research on combining word embedding models and topic models can be roughly divided into three categories: 1) using word embeddings to enhance topic modeling; 2) using topic models to enhance word embedding models; 3) jointly learning word embeddings and topic models so that the two mutually enhance each other. However, most studies either use the topic information from a topic model to resolve polysemy in the word embedding model, or use word embeddings as external knowledge to improve topic modeling; few studies jointly learn word embeddings and latent topics and explore the interaction between the two models. Moreover, the topic model used in current research is typically the directed graphical model LDA, whereas the undirected graphical model RSM (Replicated Softmax Model) is simple and stable to train, can model documents of different lengths, and makes the posterior distribution of the latent topics easy to compute.

This thesis therefore proposes to combine the skip-gram word embedding model with the RSM topic model in a unified framework, jointly learning word embeddings and latent topics and exploring their mutual influence. In addition, to improve the accuracy of skip-gram's predictions, the proposed model adds the center word embedding and the document embedding to obtain a context embedding, and uses this context embedding to predict the words surrounding the center word. Experiments on the public 20 Newsgroups dataset verify the feasibility and effectiveness of the proposed model.
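The sketch below is a minimal, hypothetical illustration of the prediction step described above, not the thesis implementation: the center word embedding and the document embedding are summed to form a context embedding, which is scored against an output layer over the vocabulary to predict a surrounding word. The vocabulary size, embedding dimension, parameter names, and the plain softmax output are illustrative assumptions.

```python
# Illustrative sketch (assumed setup, not the thesis code): predict a surrounding
# word from the sum of a center-word embedding and a document embedding.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, num_docs, dim = 1000, 50, 100          # assumed sizes

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word embeddings
D = rng.normal(scale=0.1, size=(num_docs, dim))         # document embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output (prediction) weights

def predict_context_word_probs(center_word_id, doc_id):
    """Combine the center word and its document into a context embedding,
    then score every vocabulary word as a candidate surrounding word."""
    context_embedding = W_in[center_word_id] + D[doc_id]
    scores = W_out @ context_embedding
    scores -= scores.max()                              # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

# Example: probability distribution over surrounding words for word 42 in document 3.
probs = predict_context_word_probs(center_word_id=42, doc_id=3)
print(probs.argmax(), probs.max())
```

In training, the document embedding would be tied to the latent topic representation learned by RSM, so that the topic side and the embedding side influence each other; the sketch only shows the forward prediction under the stated assumptions.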
Keywords/Search Tags: natural language processing, word embedding, skip-gram, topic model, LDA, RSM