
Research On Semantic Representation Of Text Based On Topic Model

Posted on: 2022-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhu
Full Text: PDF
GTID: 2518306491453104
Subject: Master of Engineering
Abstract/Summary:
As the prerequisite foundation of text mining, text representation directly affects the results and efficiency of text mining tasks such as classification, clustering, retrieval, and automatic summarization. At present, the main problems of text representation include the "dimensionality disaster", "sparseness", and "semantic loss"; in particular, the semantic representation of text is widely recognized as a research difficulty in academia. This paper focuses on the difficult problem of text semantic representation and carries out in-depth research by integrating supervised learning, transfer learning, topic models, and word embedding methods. The key research content of this paper includes the following aspects:

(1) This paper proposes a semantic word embedding representation method, wt2svec, based on the supervised topic model (SLDA). It generates the global topic embedding word vector w_i^z using SLDA, which can discover global semantic information through the latent topics over the whole document set. Meanwhile, it obtains the local semantic embedding word vector w_i^c based on Word2vec. The new semantic word vector w_i^s is then obtained by combining the global semantic information with the local semantic information (a minimal illustrative sketch of this fusion is given after the abstract).

(2) This paper proposes a semantic word embedding representation method, Tr-wt2svec, fused with the transfer topic model (Tr-SLDA). It uses the Tr-SLDA model to identify the latent semantics of topics shared across domains. The target-domain category distribution and the latent shared-topic distribution provide the global semantic information; the resulting global semantic word vector is combined with the local semantic word vector generated by Word2vec to produce the final Tr-wt2svec word vector.

(3) On the basis of the above methods, this paper proposes the document semantic vector representations doc2svec and Tr-doc2svec. The document semantic vector doc2svec is generated from the wt2svec model, and Tr-doc2svec is generated from the Tr-wt2svec model. The doc2svec text semantic representation can improve the performance of supervised text classification, and the Tr-doc2svec text semantic representation can improve the performance of cross-domain text classification.

(4) A verification platform is implemented in Python, and the above methods are experimentally verified on different datasets. The paper conducts experiments comparing word semantic similarity and text classification results. A large number of comparative experiments demonstrate the effectiveness of the proposed wt2svec semantic embedding model, the Tr-wt2svec semantic embedding model, and the text semantic representation methods based on these two models.
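Item (1) above describes fusing a global, topic-based word vector with a local Word2vec context vector. The Python sketch below illustrates that general idea only, under stated assumptions: gensim's plain LDA stands in for the thesis's supervised SLDA, concatenation stands in for the fusion operator (which the abstract does not specify), and the toy corpus, dimensions, and helper names (global_vector, semantic_vector) are hypothetical rather than the thesis's implementation.

```python
# Minimal sketch: fuse a global topic-based word vector with a local Word2vec
# vector. Plain LDA from gensim is a stand-in for supervised SLDA, and
# concatenation is an assumed fusion operator; corpus and sizes are toy values.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [
    ["topic", "model", "discovers", "global", "semantics"],
    ["word2vec", "captures", "local", "context", "semantics"],
    ["text", "classification", "uses", "semantic", "vectors"],
]

# Local semantic vectors w_i^c from Word2vec.
w2v = Word2Vec(docs, vector_size=50, window=2, min_count=1, epochs=50)

# Global topic space over the same corpus (LDA as a stand-in for SLDA).
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=4, id2word=dictionary, passes=50, random_state=0)
phi = lda.get_topics()  # shape (num_topics, vocab_size): p(word | topic)

def global_vector(word):
    """Topic-space vector w_i^z: the word's column of phi, normalized."""
    col = phi[:, dictionary.token2id[word]]
    return col / (col.sum() + 1e-12)

def semantic_vector(word):
    """Fused vector w_i^s = [w_i^z ; w_i^c] (concatenation, for illustration)."""
    return np.concatenate([global_vector(word), w2v.wv[word]])

print(semantic_vector("semantics").shape)  # (4 + 50,) = (54,)
```

Concatenation keeps the topic-space and context-space components separable downstream; a weighted sum or a learned projection would be equally plausible fusion choices under this sketch.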
Keywords: Topic model, Semantic word vector, Document semantic vector, Semantic similarity, Text classification