
Keywords Extraction Based On Word2Vec And TextRank

Posted on: 2021-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
GTID: 2428330605964018
Subject: Software engineering
Abstract/Summary:
Since artificial intelligence programs first defeated Go masters, computing and related technologies have achieved notable results and breakthroughs in many fields. The volume of text on the Internet keeps growing. Structured and unstructured data differ greatly in how easily they can be processed, and the applications and research built on them differ as well. For both types of data, keyword extraction remains an important route to intelligent text analysis. Traditionally, keywords are assigned manually by experts and authors, but given the volume of text on the Internet, manual tagging increasingly fails to meet efficiency requirements. Automatic keyword extraction and tagging has therefore become an active research topic in recent years, and it is widely applied in information classification, information retrieval, automatic summarization, personalized recommendation, and other areas.

Working with computer science literature, this thesis proposes a keyword extraction model that combines Word2Vec and TextRank to improve the recall and precision of automatic keyword extraction. It first presents the research background and current state of keyword extraction technology. It then introduces Chinese and English word segmentation, text representation methods, and the Word2Vec and TextRank models. Next, it proposes combining external document information (the Word2Vec model) with internal document information (the TextRank model) for automatic keyword extraction. The proposed algorithm, W-TextRank, is then compared with the traditional word-frequency method TF-IDF, the word-graph method TextRank, and UNI-TextRank, a similar method without the title factor. In the experiments, W-TextRank improves on all three in both precision and recall. Finally, the thesis summarizes the work and identifies several aspects of the current model that can be improved.

The main work of the model proceeds in four steps: (1) Preprocess the collected corpus, mainly retaining nouns, verbs, adjectives, and other such parts of speech, and use the author-assigned keywords as a word segmentation dictionary. (2) Use the Skip-gram model in the deep learning tool Word2Vec to train word vectors on the preprocessed word set from external computer science documents, and compute the cosine distance between the word vectors. (3) Set the size of the sliding window in TextRank and use the internal document information (title and context) to determine the edges between nodes. (4) Construct a new probability transition matrix and fuse the two models: compute the distance between the word vectors obtained from Skip-gram training, and use this distance as the new probability transition matrix in the TextRank model.

The method makes two specific contributions: (1) Training on text with Word2Vec increases, to some extent, the probability that keywords appear, strengthens the edge weights of the probability transition matrix in the subsequent network graph, and reduces redundancy. (2) When designing the sliding window in TextRank, the thesis joins the internal document information (title and abstract) together, which improves semantic coherence within the text.

In summary, by combining external and internal document information, the proposed Word2Vec and TextRank keyword extraction model addresses both the semantic consistency of internal documents and the redundancy of candidate keywords. The thesis compares this model with the other three methods, and the experimental results show that it performs considerably better in both precision and recall.
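
The fusion step described above, where distances between word vectors feed the TextRank transition matrix, can be illustrated with a minimal sketch. It is not the thesis implementation, and it rests on several assumptions: the function names (cosine_similarity, build_edges, w_textrank) are hypothetical; the two-dimensional vectors are hand-made stand-ins for trained Skip-gram vectors; cosine similarity is used as the edge weight, which is one plausible reading of the "cosine distance" in the abstract; and the window size of 3 and damping factor of 0.85 are illustrative values only.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_edges(tokens, window):
    """Undirected co-occurrence edges between distinct words that fall
    inside the same sliding window of `window` consecutive tokens."""
    edges = set()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != word:
                edges.add(frozenset((word, tokens[j])))
    return edges


def w_textrank(tokens, vectors, window=3, damping=0.85, iterations=50):
    """Rank words with TextRank, weighting each edge by the similarity
    of the two word vectors (assumed fusion rule, see lead-in)."""
    weights = defaultdict(dict)
    for edge in build_edges(tokens, window):
        a, b = tuple(edge)
        # Clamp to a small positive value so every row can be normalized.
        sim = max(cosine_similarity(vectors[a], vectors[b]), 1e-6)
        weights[a][b] = sim
        weights[b][a] = sim

    words = sorted(weights)
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        updated = {}
        for w in words:
            # Each neighbour n passes on its score in proportion to the
            # normalized weight of the edge n -> w (the transition matrix).
            rank = sum(
                scores[n] * weights[n][w] / sum(weights[n].values())
                for n in weights[w]
            )
            updated[w] = (1 - damping) / len(words) + damping * rank
        scores = updated
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input: title and abstract tokens are concatenated before windowing.
title = ["keyword", "extraction"]
abstract = ["textrank", "graph", "keyword", "extraction", "model", "graph"]
# Hypothetical 2-d vectors standing in for trained Word2Vec output.
vectors = {
    "keyword": [0.9, 0.2],
    "extraction": [0.8, 0.3],
    "textrank": [0.4, 0.9],
    "graph": [0.3, 0.8],
    "model": [0.6, 0.6],
}

ranking = w_textrank(title + abstract, vectors, window=3)
for word, score in ranking:
    print(f"{word}\t{score:.3f}")
```

In a fuller pipeline the toy vectors would come from a Skip-gram model trained on the external corpus, and the top-ranked words would be compared against the author-assigned keywords to compute precision and recall.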
Keywords/Search Tags: Deep learning, keyword extraction, Word2Vec, TextRank