Research On Text Keyphrases Extraction Algorithm Based On TopicRank

Posted on: 2022-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Zheng
Full Text: PDF
GTID: 2518306350951629
Subject: Computer Science and Technology
Abstract/Summary:
Text keyphrase extraction is a natural language processing technique that extracts the phrases most closely related to the meaning of a text. It is an important research direction in natural language processing and has practical value in information retrieval, library science, information science, and related fields. In the 21st century, with the rapid development of the Internet, the mobile Internet, and the Internet of Things, the volume of text data has grown exponentially. For enterprises and other organizations, this text data has great potential value, but understanding and using it quickly remains a major practical problem. Building an effective automatic keyphrase extraction system is one feasible and necessary way to make use of it.

Among the many keyphrase extraction algorithms, TopicRank is a typical graph-based ranking algorithm. Building on the well-known TextRank algorithm, it adds a step that clusters the candidate keyphrases, ranks the resulting clusters on a graph, and finally extracts the most representative candidate from each of the highest-ranked clusters as the keyphrases of the text. Compared with single words, the topics formed by clustering represent the content of the article better and eliminate the semantic duplication that single words cause. Like TextRank, TopicRank essentially relies on the word frequency information of the candidate keyphrases, although it uses that information in a more refined way.

This paper argues that, in addition to word frequency, other statistical features of the text, such as the length of a candidate keyphrase and its position in the article, also have a significant impact on whether the candidate can represent the content of the article. Moreover, TopicRank clusters candidate keyphrases by their word-form similarity. This paper argues that such clustering favors longer candidates and ignores shorter ones, and therefore does not achieve a true clustering effect. Two methods are proposed to improve TopicRank. The first adds a word-vector clustering step on top of TopicRank's word-form clustering, so that candidates are clustered according to their semantics. The second integrates statistical features, namely the length and position of candidate keyphrases, into TopicRank to form TopicLPRank. By combining the graph model with statistical methods in an unsupervised keyphrase extraction algorithm, this paper explores an effective way to further improve the accuracy of keyphrase extraction.

To verify the effectiveness of TopicLPRank, several comparative experiments are conducted on multiple datasets of different sizes and types. The results show that adding word-vector clustering improves the clustering effect of TopicRank. Incorporating either the length or the position of candidate keyphrases raises the model's F1 value by more than 1.5 in the best case, which is equivalent to more than 10% of the original model's result. Fusing both the length and the position information raises the F1 value by more than 3.5, which is equivalent to 21% of the original model's result. In summary, this paper addresses the shortcomings of the TopicRank keyphrase extraction algorithm with two improved methods, and the experiments show that both methods effectively improve TopicRank's ability to extract keyphrases.
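
The pipeline summarized in the abstract (cluster candidates by word form, rank the clusters on a graph, keep one representative per top cluster) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration rather than the thesis's implementation: the greedy clustering, the 0.25 overlap threshold, the reciprocal-distance edge weights, and in particular `length_position_factor`, which is a hypothetical stand-in for the length and position features, since the abstract does not give the TopicLPRank formula. The word-vector clustering step is not shown.

```python
from itertools import combinations

def overlap(a, b):
    """Jaccard overlap between the word sets of two phrases."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def cluster(candidates, threshold=0.25):
    """Greedy single-pass clustering by word-form overlap (a simplification)."""
    topics = []
    for phrase in candidates:
        for topic in topics:
            if any(overlap(phrase, other) >= threshold for other in topic):
                topic.append(phrase)
                break
        else:  # no existing topic was similar enough
            topics.append([phrase])
    return topics

def edge_weight(topic_a, topic_b, positions):
    """Sum of reciprocal distances between occurrences of two topics."""
    return sum(
        1.0 / abs(pa - pb)
        for a in topic_a for b in topic_b
        for pa in positions[a] for pb in positions[b]
        if pa != pb
    )

def rank(topics, positions, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the complete topic graph."""
    n = len(topics)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = edge_weight(topics[i], topics[j], positions)
    out = [sum(row) for row in w]  # total edge weight leaving each topic
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return scores

def length_position_factor(topic, positions, doc_len):
    """Hypothetical weight: longer and earlier phrases count for more."""
    longest = max(len(p.split()) for p in topic)
    first = min(min(positions[p]) for p in topic)
    return longest * (1.0 - first / doc_len)

def extract(positions, doc_len, top_n=3, use_stats=False):
    """Return one representative phrase for each of the top_n topics."""
    first_pos = lambda p: min(positions[p])
    topics = cluster(sorted(positions, key=first_pos))
    scores = rank(topics, positions)
    if use_stats:  # assumed way of fusing statistical features with graph scores
        scores = [s * length_position_factor(t, positions, doc_len)
                  for s, t in zip(scores, topics)]
    order = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
    # Representative of a topic: its earliest-occurring candidate.
    return [min(topics[i], key=first_pos) for i in order[:top_n]]

# Toy input: candidate phrase -> word offsets where it occurs (made-up data).
positions = {
    "keyphrase extraction": [0, 10],
    "keyphrase extraction algorithm": [20],
    "graph ranking": [5],
    "word vector": [15],
}
print(extract(positions, doc_len=30, top_n=2, use_stats=True))
```

In this toy input, "keyphrase extraction" and "keyphrase extraction algorithm" share most of their words and fall into one topic, so only one of them can be returned. Multiplying the graph score by a statistical factor is just one plausible way to combine the two signals; an additive or normalized combination would be an equally reasonable assumption.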
Keywords/Search Tags: phrase extraction, TextRank, TopicRank, TopicLPRank