
Research On Keyword Extraction Method Based On Document Topical Structure And Word Graph Iteration

Posted on: 2020-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: M Z Sun
Full Text: PDF
GTID: 2428330590972568
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the internet, the volume of online text has grown exponentially, and locating the required information accurately and quickly within it has become particularly important. A keyword is the smallest unit that can represent the content of a document. Keywords concisely express a document's main point and have become the main tool for grasping its content quickly. Traditionally, keywords were assigned by experts who annotated each document; given today's massive volume of online text, manual labeling is no longer realistic, and keywords need to be assigned automatically by computer. Automatic keyword extraction has therefore gradually become a research hotspot. It is also widely used in search engines, news services and other fields to support information retrieval, and it is the basis of text tasks such as automatic summary generation, text classification and clustering. This thesis therefore proposes a keyword extraction method based on document topic structure and word graph iteration to improve the accuracy and recall of keyword extraction.

The thesis first describes the background and significance of the topic and summarizes the state of keyword extraction research in China and abroad. It then briefly introduces the underlying theory: clustering algorithms, the LDA topic model and the complex network model. Next, using only the internal information of a document, it takes the word clusters of the document as nodes in a word graph and constructs a fully connected network graph for keyword extraction. This method improves the topic coverage of the extracted keywords to a certain extent and reduces redundancy among candidate words. Because a single document provides limited information, a second method based on multi-document topic structure and word graph iteration is proposed. It considers multi-document topic information together with the internal structure of the single document, and uses the results of topic modeling to change the word graph structure so that keywords are extracted more effectively. Finally, the thesis uses crawled web text data to run comparison experiments on the two proposed models, which verify their validity and superiority.

The specific innovations are as follows:

(1) Based on the internal information of a single document, the similarity between the candidate keywords of the document is computed with a Word2vec model trained on Wikipedia, the candidate words are clustered, and each resulting cluster is used as a node of the word graph. A fully connected network graph is then constructed to rank the nodes. This method reduces the redundancy of candidate keywords to a certain extent and improves the topic coverage and extraction accuracy of the keywords.

(2) Topic model information and document structure information are used together. Multiple documents are modeled with a topic model, and the results are used to change the weights of the word graph nodes and the random jump probability. This addresses the limited information available in a single document and improves the precision and recall of keyword extraction.
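The word graph iteration described in innovation (2) can be illustrated with a small sketch. The abstract does not give the thesis's exact update rule, so the code below assumes a standard TextRank-style iteration in which the random jump probability is biased by per-word topic weights. The function name `topic_biased_rank`, its parameters, the damping factor 0.85 and the toy data are all hypothetical; in practice the jump weights would come from a topic model such as LDA.

```python
def topic_biased_rank(edges, jump_prior, damping=0.85, iterations=50):
    """Score the nodes of a weighted, undirected word graph.

    edges:      dict mapping (u, v) pairs to positive edge weights
    jump_prior: dict mapping every node to a non-negative topic weight
                (assumed here to come from a topic model)
    Assumes every node has at least one edge, so no score mass is lost.
    """
    nodes = sorted(jump_prior)

    # Build a symmetric adjacency map from the edge list.
    neighbors = {n: {} for n in nodes}
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    # Normalize the topic weights into a random jump distribution.
    total_prior = sum(jump_prior.values())
    jump = {n: jump_prior[n] / total_prior for n in nodes}

    # Total edge weight leaving each node, used to normalize transitions.
    out_weight = {n: sum(neighbors[n].values()) for n in nodes}

    scores = dict(jump)  # start from the jump distribution
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Score flowing in from each neighbor, split in proportion
            # to that neighbor's edge weights.
            incoming = sum(
                scores[m] * weight / out_weight[m]
                for m, weight in neighbors[n].items()
            )
            new_scores[n] = (1 - damping) * jump[n] + damping * incoming
        scores = new_scores
    return scores


# Toy graph: nodes could be words or word clusters (hypothetical data).
toy_edges = {
    ("keyword", "extraction"): 2.0,
    ("extraction", "graph"): 1.0,
    ("graph", "topic"): 1.0,
    ("keyword", "topic"): 1.0,
}
toy_prior = {"keyword": 0.4, "extraction": 0.3, "graph": 0.1, "topic": 0.2}

toy_scores = topic_biased_rank(toy_edges, toy_prior)
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

With a uniform `jump_prior` this reduces to an ordinary weighted TextRank iteration; a non-uniform prior shifts score toward words that the topic model considers important, which is one plausible reading of "changing the random jump probability".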
Keywords/Search Tags: keyword extraction, TextRank, LDA, graph model