
Research On Text Semantic Mining Based On Topic Model And Paragraph Vector

Posted on: 2021-01-28 | Degree: Master | Type: Thesis
Country: China | Candidate: W W Zhang | Full Text: PDF
GTID: 2428330605960982 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, all kinds of information and data are growing at an exponential rate. Massive amounts of unstructured text are distributed across all walks of life, and text-mining tasks such as user feature analysis, recommendation systems, and public opinion monitoring rely on the acquisition of high-quality text data. How to quickly and effectively mine meaningful semantic information from these complex and chaotic texts has become an important task in the field of natural language processing.

The topic model is an effective method for mining text topics. It divides texts into several meaningful clusters according to topic, and all documents in the same cluster share a topic. This method has been widely used in the field of text mining. However, short texts distributed on the Internet mostly suffer from data sparseness, and words carry different meanings in different contexts. Therefore, this thesis introduces the Doc2vec model into the topic model. Based on the LDA model and the Doc2vec model, the thesis carries out research in the following two aspects:

1. To address the lack of contextual semantics in the topic model, this thesis proposes the Doc-LDA algorithm, which combines the Doc2vec model, containing contextual feature information, with the LDA model, containing global text information, to process text. In the Doc-LDA model, the texts in the corpus are first trained with Doc2vec to obtain document vectors; then the high-probability words in each topic produced by the LDA model are used to represent that topic, and the topic words are mapped into the vector space to obtain topic vectors; finally, the text is represented by computing the cosine distance between the topic vectors and the document vector. The precision, recall, and F1 score of the Doc-LDA model were tested on a crawled abstract corpus. The experimental results show that the text representation model based on LDA and Doc2vec improves further on the traditional basic models and other similar algorithms.

2. To address the loss of accuracy caused by representations lying in different vector spaces, this thesis proposes the DBOW-LDA algorithm, which integrates the global topics obtained by LDA into the DBOW algorithm. First, the LDA model is trained to obtain the topic distribution. Second, the topic distribution obtained by LDA is vectorized, as the product of the word-vector matrix of all words in the text and the text's topic distribution, and averaged with the text vector from DBOW. Finally, a vector representation of the given text containing topic-level semantic information is output. The DBOW-LDA model trains text vectors and topic vectors in the same semantic vector space, which further improves the accuracy of the algorithm. Comparison experiments on the clustering results of other basic text representation methods show that the DBOW-LDA algorithm proposed in this thesis performs better. A code sketch of both constructions is given below.
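The two constructions above can be pieced together from standard components. The following is a minimal Python sketch using gensim's LdaModel and Doc2Vec (DBOW) of how Doc-LDA-style and DBOW-LDA-style representations could be built; the toy corpus, the hyperparameters, the choice to average the vectors of each topic's top words into a topic vector, and the simple (document vector + topic mixture) / 2 averaging are illustrative assumptions, not the thesis's exact implementation.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy tokenized corpus (hypothetical stand-in for the crawled abstract corpus).
docs = [
    ["topic", "model", "mines", "latent", "themes", "from", "text"],
    ["paragraph", "vectors", "capture", "context", "semantics"],
    ["lda", "and", "doc2vec", "can", "be", "combined", "for", "clustering"],
]

# LDA: global topic structure over bag-of-words documents.
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Doc2Vec in DBOW mode (dm=0): contextual document vectors.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=50, dm=0, min_count=1, epochs=40)

def topic_vector(topic_id, topn=5):
    """Represent a topic by averaging the word vectors of its high-probability words."""
    words = [w for w, _ in lda.show_topic(topic_id, topn=topn)]
    vecs = [d2v.wv[w] for w in words if w in d2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d2v.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

doc_id = 0
doc_vec = d2v.dv[doc_id]

# Doc-LDA idea: express the text by cosine similarity between each topic vector
# and the document vector.
doc_lda_repr = [cosine(topic_vector(t), doc_vec) for t in range(lda.num_topics)]

# DBOW-LDA idea: vectorize the document's LDA topic distribution (as a weighted
# mixture of topic vectors) and average it with the DBOW document vector.
theta = dict(lda.get_document_topics(bow_corpus[doc_id], minimum_probability=0.0))
topic_mix = sum(theta[t] * topic_vector(t) for t in range(lda.num_topics))
dbow_lda_repr = (doc_vec + topic_mix) / 2.0

print(doc_lda_repr)        # Doc-LDA-style representation of document 0
print(dbow_lda_repr.shape) # DBOW-LDA-style vector in the shared semantic space
```

In a full experiment, the clustering and precision/recall/F1 comparisons described above would be run on representations built this way from the real corpus rather than on the toy documents used here.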
Keywords/Search Tags:Text Semantic Mining, Topic Model, Paragraph Vector, Text Clustering