
Research On Text Semantic Mining Based On Topic Model And Paragraph Vector

Posted on: 2021-01-28 | Degree: Master | Type: Thesis
Country: China | Candidate: W W Zhang | Full Text: PDF
GTID: 2428330605960982 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, all kinds of information and data are growing at an exponential rate. Massive amounts of unstructured text are distributed across all walks of life, and text-mining tasks such as user feature analysis, recommendation systems, and public opinion monitoring rely on the acquisition of high-quality text data. How to quickly and effectively mine meaningful semantic information from these complex and chaotic texts has become an important task in the field of natural language processing.

The topic model is an effective method for mining text topics. It divides texts into several meaningful clusters according to topic, and all documents in the same cluster share a topic. This method has been widely used in the field of text mining. However, short texts distributed on the Internet mostly suffer from data sparseness, and words carry different meanings in different contexts. Therefore, this thesis introduces the Doc2vec model into the topic model. Based on the LDA model and the Doc2vec model, the thesis carries out research in the following two aspects:

1. To address the lack of contextual semantics in the topic model, this thesis proposes the Doc-LDA algorithm, which combines the Doc2vec model, containing contextual feature information, with the LDA model, containing global text information, to process text. In the Doc-LDA model, the texts in the corpus are first trained with Doc2vec to obtain document vectors; then the high-probability words in each topic produced by the LDA model are used to represent that topic, and the topic words are mapped into the vector space to obtain topic vectors; finally, the text is represented by computing the cosine distance between the topic vectors and the document vector. The precision, recall, and F1 score of the Doc-LDA model were tested on a crawled abstract corpus. The experimental results show that the text representation model based on LDA and Doc2vec improves further on the traditional basic models and other similar algorithms.

2. To address the loss of accuracy caused by representations lying in different vector spaces, this thesis proposes the DBOW-LDA algorithm, which integrates the global topics obtained by LDA into the DBOW algorithm. First, the LDA model is trained to obtain the topic distribution. Second, the topic distribution obtained by LDA is vectorized, as the product of the word-vector matrix of all words in the text and the text's topic distribution, and averaged with the text vector from DBOW. Finally, a vector representation of the given text containing topic-level semantic information is output. The DBOW-LDA model trains text vectors and topic vectors in the same semantic vector space, which further improves the accuracy of the algorithm. Comparison experiments on the clustering results of other basic text representation methods show that the DBOW-LDA algorithm proposed in this thesis performs better. A code sketch of both constructions is given below.
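The two constructions above can be pieced together from standard components. The following is a minimal Python sketch using gensim's LdaModel and Doc2Vec (DBOW) of how Doc-LDA-style and DBOW-LDA-style representations could be built; the toy corpus, the hyperparameters, the choice to average the vectors of each topic's top words into a topic vector, and the simple (document vector + topic mixture) / 2 averaging are illustrative assumptions, not the thesis's exact implementation.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy tokenized corpus (hypothetical stand-in for the crawled abstract corpus).
docs = [
    ["topic", "model", "mines", "latent", "themes", "from", "text"],
    ["paragraph", "vectors", "capture", "context", "semantics"],
    ["lda", "and", "doc2vec", "can", "be", "combined", "for", "clustering"],
]

# LDA: global topic structure over bag-of-words documents.
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Doc2Vec in DBOW mode (dm=0): contextual document vectors.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=50, dm=0, min_count=1, epochs=40)

def topic_vector(topic_id, topn=5):
    """Represent a topic by averaging the word vectors of its high-probability words."""
    words = [w for w, _ in lda.show_topic(topic_id, topn=topn)]
    vecs = [d2v.wv[w] for w in words if w in d2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d2v.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

doc_id = 0
doc_vec = d2v.dv[doc_id]

# Doc-LDA idea: express the text by cosine similarity between each topic vector
# and the document vector.
doc_lda_repr = [cosine(topic_vector(t), doc_vec) for t in range(lda.num_topics)]

# DBOW-LDA idea: vectorize the document's LDA topic distribution (as a weighted
# mixture of topic vectors) and average it with the DBOW document vector.
theta = dict(lda.get_document_topics(bow_corpus[doc_id], minimum_probability=0.0))
topic_mix = sum(theta[t] * topic_vector(t) for t in range(lda.num_topics))
dbow_lda_repr = (doc_vec + topic_mix) / 2.0

print(doc_lda_repr)        # Doc-LDA-style representation of document 0
print(dbow_lda_repr.shape) # DBOW-LDA-style vector in the shared semantic space
```

In a full experiment, the clustering and precision/recall/F1 comparisons described above would be run on representations built this way from the real corpus rather than on the toy documents used here.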
Keywords/Search Tags:Text Semantic Mining, Topic Model, Paragraph Vector, Text Clustering