Since the beginning of the 21st century, with the development of information technology and the spread of mobile terminals, the Internet has been generating massive amounts of data every second. At the same time, the problem of information overload has become increasingly prominent. In particular, extracting and filtering large volumes of unstructured text data poses a great challenge to scholars and engineers. Automatic keyword extraction is an efficient solution for extracting and filtering text data. It has been widely used in information retrieval, search engines, natural language processing and other fields, and is an important starting point for accurately matching users with information. The TextRank algorithm is one of the most commonly used techniques for automatic keyword extraction; in essence, it is an undirected, unweighted graph model. Because it is lightweight and performs well, TextRank has attracted considerable attention. The traditional TextRank algorithm uses only the co-occurrence of words in a text to construct the topology of the graph, and its performance can still be improved. Some scholars have improved TextRank by adding more complex features or by integrating more textual information. This paper presents the Sim-TextRank algorithm, which extends TextRank by additionally incorporating semantic similarity between words, built from Word2Vec word vectors, and topic similarity between words, built from the LDA topic model. Experimental results on scientific-paper and news corpora show that adding topic and semantic similarity information to TextRank helps improve the accuracy of keyword extraction. This paper also presents suitable hyperparameters for the Sim-TextRank algorithm and compares semantic similarity with topic similarity.
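
As an illustration of the graph model described above, the following is a minimal sketch of co-occurrence-based TextRank in plain Python. It is not the paper's implementation: the function names, the window size, the damping factor of 0.85 and the iteration count are assumptions chosen for the example. The optional `similarity` argument is a hypothetical hook showing one possible way a word-to-word similarity (for instance one derived from Word2Vec or LDA) could replace the uniform edge weight; how Sim-TextRank actually combines the two similarities is not specified here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Graph = Dict[str, Dict[str, float]]


def build_graph(tokens: List[str], window: int = 2,
                similarity: Optional[Callable[[str, str], float]] = None) -> Graph:
    """Build an undirected word graph from co-occurrence within a sliding window.

    Without `similarity`, every edge has weight 1.0 (the unweighted case).
    With `similarity` (a hypothetical hook), the edge weight is the similarity
    of the two words; non-positive weights are dropped.
    """
    graph: Graph = defaultdict(dict)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v == w:
                continue  # no self-loops
            weight = 1.0 if similarity is None else similarity(w, v)
            if weight <= 0:
                continue
            graph[w][v] = weight
            graph[v][w] = weight
    return graph


def textrank(graph: Graph, damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Score nodes with the iterative PageRank-style update used by TextRank."""
    scores = {w: 1.0 for w in graph}
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    for _ in range(iterations):
        new_scores = {}
        for w in graph:
            # Each neighbour v passes on a share of its score proportional
            # to the weight of the edge (v, w).
            incoming = sum(graph[v][w] / out_weight[v] * scores[v]
                           for v in graph[w])
            new_scores[w] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


def top_keywords(tokens: List[str], k: int = 3, window: int = 2,
                 similarity: Optional[Callable[[str, str], float]] = None) -> List[str]:
    """Return the k highest-scoring words as keyword candidates."""
    scores = textrank(build_graph(tokens, window, similarity))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In practice the token list would first be filtered (for example by part of speech and stop words), and `similarity` would be backed by trained models; both steps are left out of this sketch.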