
Research On Analysis And Mining Of Scientific And Technical Literatures With Topic Model

Posted on: 2018-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: M M Wang
GTID: 2428330512998264
Subject: Computer Science and Technology
Abstract/Summary:
With the development of technology and networks, the volume of scientific and technical literature is growing rapidly. Analyzing and mining this literature with data mining, text analysis, and big data techniques makes it possible to discover hot research topics, identify popular papers in a field, and track the research directions of experts. Scientific and technical literature is therefore significant for both research and practical applications. Keyphrase extraction and author topic analysis are two basic tasks in this area. However, the performance of some keyphrase extraction algorithms is unsatisfactory, and training the author topic model for large-scale author topic analysis is time-consuming. This thesis therefore focuses on keyphrase extraction and large-scale author topic analysis.

First, this thesis presents a phrase-based topical ranking algorithm for keyphrase extraction. After analyzing the characteristics of keywords, the algorithm filters the candidate keyphrases. It then uses the LDA topic model to infer the topics of the document, builds a phrase-based relational graph from the topic information, and uses a weighted PageRank algorithm to rank and recommend candidates. Experiments first analyze the influence of the parameters, including the candidate filter thresholds, the PageRank damping factor, and the number of topics. They then show that the proposed method outperforms TextRank and TopicRank on several datasets.

Second, this thesis uses the author topic model for large-scale author topic analysis. Training the author topic model is a two-dimensional sampling problem, since each word is assigned both an author and a topic, and its complexity is high. As a result, training on a massive corpus is time-consuming and cannot be completed on a single machine. It is therefore necessary to optimize the sampling algorithm and to train the author topic model in parallel. By analyzing how the sampling can be optimized, this thesis presents a delayed-updating sampling concept and a corresponding Gibbs sampling algorithm, MCATM, and then proposes two improved sampling algorithms based on this idea, MHATM and ErgodicATM. MHATM uses Metropolis-Hastings sampling and the sparsity of each author's topic distribution to reduce the sampling complexity. ErgodicATM reduces the sampling complexity by splitting the two-dimensional sampling problem. Experiments show that MCATM, MHATM, and ErgodicATM reach the same degree of convergence as the original Gibbs sampling algorithm of the author topic model, which supports the correctness of the three proposed algorithms. The experiments also show that MHATM and ErgodicATM effectively reduce the sampling complexity and thus improve sampling efficiency.

Finally, this thesis designs and implements an author topic model sampling framework on Spark, which uses the parameter server concept to update and transmit the global count parameters. Parallel versions of MCATM, MHATM, and ErgodicATM are implemented on this framework, enabling parallel training of the author topic model. Experiments show that the proposed framework handles large-scale training of the author topic model well, and that ErgodicATM and MHATM scale well with data size, number of topics, and number of nodes.
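To make the "two-dimensional sampling problem" concrete, the following is a minimal sketch of the standard collapsed Gibbs sampler for the author topic model, which is the baseline the abstract compares against. It is not MCATM, MHATM, or ErgodicATM, whose details the abstract does not give. All function names, parameter names, and the toy corpus are illustrative assumptions. Each word token requires a draw over every (author, topic) pair, so the cost per token grows with the number of document authors times the number of topics.

```python
import random


def sample_pair(word, authors, counts, alpha, beta, vocab_size, rng):
    """Jointly draw an (author, topic) pair for one word token.

    The loop over authors x topics is the two-dimensional sampling step.
    """
    n_ak, n_a, n_kw, n_k = counts
    num_topics = len(n_k)
    pairs, weights = [], []
    for a in authors:
        for k in range(num_topics):
            p_topic = (n_ak[a][k] + alpha) / (n_a[a] + num_topics * alpha)
            p_word = (n_kw[k][word] + beta) / (n_k[k] + vocab_size * beta)
            pairs.append((a, k))
            weights.append(p_topic * p_word)
    return rng.choices(pairs, weights=weights, k=1)[0]


def _update(counts, author, topic, word, delta):
    """Add delta (+1 or -1) to every count touched by one assignment."""
    n_ak, n_a, n_kw, n_k = counts
    n_ak[author][topic] += delta
    n_a[author] += delta
    n_kw[topic][word] += delta
    n_k[topic] += delta


def gibbs_atm(docs, doc_authors, num_authors, num_topics, vocab_size,
              alpha=0.1, beta=0.01, iters=50, seed=0):
    """Baseline collapsed Gibbs sampling for the author topic model.

    docs: list of documents, each a list of word ids.
    doc_authors: list of author-id lists, one per document.
    """
    rng = random.Random(seed)
    counts = (
        [[0] * num_topics for _ in range(num_authors)],  # author-topic counts
        [0] * num_authors,                               # tokens per author
        [[0] * vocab_size for _ in range(num_topics)],   # topic-word counts
        [0] * num_topics,                                # tokens per topic
    )
    # Random initial assignment of an author and a topic to every token.
    assign = []
    for d, words in enumerate(docs):
        row = []
        for w in words:
            a = rng.choice(doc_authors[d])
            k = rng.randrange(num_topics)
            _update(counts, a, k, w, +1)
            row.append((a, k))
        assign.append(row)
    # Resample every token with its own assignment removed from the counts.
    for _ in range(iters):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                a, k = assign[d][i]
                _update(counts, a, k, w, -1)
                a, k = sample_pair(w, doc_authors[d], counts,
                                   alpha, beta, vocab_size, rng)
                _update(counts, a, k, w, +1)
                assign[d][i] = (a, k)
    return assign, counts[0], counts[2]
```

In this sketch the global counts are updated immediately after every token. The delayed-updating idea named in the abstract presumably relaxes that, but how it does so is not specified here.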
Keywords/Search Tags:Analysis of Scientific and Technical Literatures, Keyphrase Extraction, Author Topic Analysis, Topic Model, Parallelization
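The ranking step of the keyphrase extraction method described in the abstract can be illustrated with a minimal weighted PageRank over an undirected phrase graph. This is a sketch under stated assumptions: how the thesis derives edge weights from LDA topic information is not given in the abstract, so the weights, phrase names, and the function name `weighted_pagerank` below are hypothetical.

```python
def weighted_pagerank(edges, damping=0.85, iters=100, tol=1e-9):
    """Rank nodes of an undirected weighted graph.

    edges: list of (phrase_u, phrase_v, weight) with positive weights.
    Returns a dict mapping each phrase to its score; scores sum to 1.
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    strength = {n: 0.0 for n in nodes}   # total edge weight at each node
    neighbors = {n: [] for n in nodes}
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new_score = {}
        for n in nodes:
            # Each neighbor passes on its score in proportion to edge weight.
            received = sum(score[m] * w / strength[m] for m, w in neighbors[n])
            new_score[n] = (1.0 - damping) / len(nodes) + damping * received
        change = sum(abs(new_score[n] - score[n]) for n in nodes)
        score = new_score
        if change < tol:
            break
    return score


# Hypothetical candidate phrases and topic-similarity weights.
example_edges = [
    ("topic model", "gibbs sampling", 3.0),
    ("topic model", "keyphrase extraction", 2.0),
    ("gibbs sampling", "parameter server", 1.0),
]
ranking = sorted(weighted_pagerank(example_edges).items(),
                 key=lambda item: item[1], reverse=True)
```

The `damping` argument corresponds to the PageRank damping factor that the abstract lists among the tuned parameters. The top entries of `ranking` would be the recommended keyphrases.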