Research On Joint Learning Of Topic And Embedding Model

Posted on: 2019-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: Q Xiao
GTID: 2428330545477513
Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, the rapid development of Internet technology has led to explosive growth in the data generated online, much of it natural language text. Text is one of the most valuable data resources, so the analysis and mining of text data has both theoretical significance and practical value. The primary task of text analysis is to mine the latent semantics of text. Topic models and embedding models are the two most important kinds of model for learning latent text semantics. Because the two are complementary, many researchers have tried to combine them in recent years. However, existing combination methods cannot improve the performance of both models at the same time through joint learning, and they lack generality. To address these problems, this thesis studies joint learning of topic models and embedding models. The main research work and contributions are as follows:

(1) We propose a general method and algorithm framework for joint learning of topic and embedding models, named HieraVec. On the one hand, HieraVec makes use of more information to improve the quality of the original distributed vectors. On the other hand, the distributed representation of natural language helps produce more coherent topic models, giving better results in practice. Because the HieraVec framework has several different kinds of parameters, a single optimization method has difficulty learning all of them at the same time. We therefore design a rotation optimization method, the 3-stage learning procedure, to optimize the parameters of the framework.

(2) We propose two algorithms based on this framework. HieraVecPD incorporates the multi-layer semantic information of the pLSA model into the embedding model Doc2Vec, and HieraVecLW combines Word2Vec and the LDA model to train a topic model enhanced by distributed vectors. We conducted several experiments to evaluate the accuracy improvement of the distributed vectors and topic models learned by the two algorithms. The experimental results show that joint learning under the HieraVec framework improves the performance of the topic model and the embedding model at the same time, and that the framework has good generality.

(3) We design and implement a parallel joint training method and framework for large-scale corpora on the Spark platform, and implement distributed versions of the HieraVecPD and HieraVecLW algorithms on this framework. The experimental results show that the parallel joint training method and framework can effectively handle latent semantic analysis of text in large corpora, and that the HieraVecPD and HieraVecLW algorithms have good data scalability and node scalability.
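The abstract does not spell out the three stages of the rotation optimization method, so the following is only a generic sketch of the underlying idea: when parameters fall into groups that are hard to optimize jointly, each group can be updated in turn while the others are held fixed. The toy objective, the three scalar "parameter groups" `a`, `b`, `c`, and the function names are all hypothetical and are not taken from HieraVec.

```python
# Generic rotation (alternating) optimization over three parameter blocks.
# ASSUMPTION: the objective and the blocks a, b, c are a toy stand-in;
# they are not the HieraVec loss or its actual three stages.

def objective(a, b, c):
    """Toy coupled quadratic: each block interacts with its neighbour."""
    return (a - 1.0) ** 2 + (a - b) ** 2 + (b - c) ** 2 + (c - 3.0) ** 2


def rotate_optimize(a=0.0, b=0.0, c=0.0, rounds=200, tol=1e-12):
    """Cycle through the three blocks, minimizing over one at a time."""
    prev = objective(a, b, c)
    for _ in range(rounds):
        # Stage 1: fix b and c, solve d/da = 2(a - 1) + 2(a - b) = 0.
        a = (1.0 + b) / 2.0
        # Stage 2: fix a and c, solve d/db = 2(b - a) + 2(b - c) = 0.
        b = (a + c) / 2.0
        # Stage 3: fix a and b, solve d/dc = 2(c - b) + 2(c - 3) = 0.
        c = (b + 3.0) / 2.0
        cur = objective(a, b, c)
        # Stop once a full rotation no longer reduces the objective.
        if prev - cur < tol:
            break
        prev = cur
    return a, b, c


if __name__ == "__main__":
    a, b, c = rotate_optimize()
    print(a, b, c, objective(a, b, c))
```

Each stage here has a closed-form update, so the objective never increases from one rotation to the next. In a real joint model the per-stage updates would more likely be gradient steps or sampling passes, but the control flow of rotating through parameter groups would look similar.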
Keywords/Search Tags: Natural Language Processing (NLP), Text Modelling, Topic model, Embedding model, Text Mining, Representation learning, Parallel