Research On Joint Learning Of Topic And Embedding Model

Posted on:2019-07-04

Degree:Master

Type:Thesis

Country:China

Candidate:Q Xiao

Full Text:PDF

GTID:2428330545477513

Subject:Computer Science and Technology

Abstract/Summary:

In the era of big data, the rapid development of Internet technology has led to explosive growth in the data generated online, much of it natural language text. Text is one of the most valuable data resources, so analyzing and mining it has both theoretical significance and practical value. The primary task of text analysis is to mine the latent semantics of text. Topic models and embedding models are the two most important classes of models for learning latent semantics from text. Because the two are complementary, many researchers have tried to combine them in recent years. However, existing combination methods cannot improve the performance of both models at the same time through joint learning, and they lack generality. To address these problems, this thesis studies methods for the joint learning of topic models and embedding models. The main work and contributions are as follows:

(1) We propose HieraVec, a general method and algorithmic framework for the joint learning of topic models and embedding models. On the one hand, HieraVec can use more information to improve the quality of the original distributed vectors. On the other hand, distributed representations of natural language can be used to produce more coherent topic models, leading to better training results in practice. Because the HieraVec framework has several different kinds of parameters, a single optimization method has difficulty learning all of them at the same time. This thesis therefore designs a rotation optimization method, the 3-stage learning procedure, to optimize the parameters of the framework.

(2) We propose two algorithms based on this framework. HieraVecPD incorporates the multi-layer semantic information of the pLSA model into the embedding model Doc2Vec, and HieraVecLW combines Word2Vec and the LDA model to train a topic model enhanced by distributed vectors. We conducted several experiments to evaluate the accuracy improvements of the distributed vectors and topic models learned by the two algorithms. The experimental results show that joint learning under the HieraVec framework can improve the performance of the topic model and the embedding model at the same time, and that the framework has good generality.

(3) We design and implement a parallel joint training method and framework for large-scale corpora on the Spark platform, and implement distributed versions of HieraVecPD and HieraVecLW on this framework. The experimental results show that the parallel joint training method and framework can effectively handle latent semantic analysis of text in large corpora, and that HieraVecPD and HieraVecLW scale well with both data size and the number of nodes.
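
The abstract names a rotation optimization method (the 3-stage learning procedure) but does not describe its stages. The sketch below is therefore only a generic illustration of the idea, assuming a toy objective with three coupled parameter blocks; the objective, the update rules, and the names `objective` and `rotation_optimize` are hypothetical and are not taken from HieraVec. Each stage minimizes the objective over one block while the other two are held fixed, and the stages are repeated in rotation.

```python
# Toy illustration of rotation (alternating) optimization.
# This is NOT the HieraVec 3-stage procedure: the objective and the
# closed-form updates below are hypothetical stand-ins.

def objective(a, b, c):
    """Toy objective with three coupled blocks; its minimum is at a = b = c = 1."""
    return (a - 1.0) ** 2 + (b - a) ** 2 + (c - b) ** 2


def rotation_optimize(rounds=200, a=0.0, b=0.0, c=0.0):
    """Cycle through three stages, each updating one block with the others fixed."""
    history = [objective(a, b, c)]
    for _ in range(rounds):
        # Stage 1: minimize over a (b and c fixed); solves 2(a - 1) - 2(b - a) = 0.
        a = (1.0 + b) / 2.0
        # Stage 2: minimize over b (a and c fixed); solves 2(b - a) - 2(c - b) = 0.
        b = (a + c) / 2.0
        # Stage 3: minimize over c (a and b fixed); solves 2(c - b) = 0.
        c = b
        history.append(objective(a, b, c))
    return (a, b, c), history


if __name__ == "__main__":
    (a, b, c), history = rotation_optimize()
    print(f"a={a:.6f}, b={b:.6f}, c={c:.6f}, objective={history[-1]:.3e}")
```

Because every stage solves its own subproblem exactly, the objective never increases from one stage to the next, which is the usual motivation for rotating over parameter groups when no single optimizer fits all of them.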

Keywords/Search Tags:

Natural Language Processing(NLP), Text Modelling, Topic model, Embedding model, Text Mining, Representation learning, Parallel

Related items

1	A Research On Text Vector Representations And Modelling Based On Neural Networks
2	Research And Application Of Topic Model For Short Texts Based On Part-of-Speech Feature And Semantic Enhancement
3	Methods For Phrase-based Text Mining And Analysis
4	Research On Text Representation Model And Application In Text Classification And Natural Language Inference
5	The Research On Chinese Sentential Semantic Model Parsing And Text Representation
6	Improved Sentence Embedding Based On BERT And Prompt-learning
7	Research On Chinese-Oriented Hybrid Embedding Text Representation Method
8	Improved Text Topic Representation And Learning Method
9	Research On Jointly Learning Word Embeddings And Latent Topics In Text
10	Joint Learning Methods For Distributed Representations Of Natural Language