Keyword Automatic Extraction Based On Similar Documents

Posted on: 2020-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2518305972965009
Subject: Information Science
Abstract/Summary:
Keywords are the smallest units that can characterize a document's features and topics. The keyword extraction task refers to automatically extracting topical or important words or phrases from a document. With the continuous development of information technology and the explosive growth of text data, quickly and accurately locating text content and understanding text information has become an urgent problem. In the era of big data, automatic keyword extraction has therefore gradually become a research hotspot in natural language processing. Current automatic keyword extraction techniques remain unsatisfactory: the extracted keywords leave considerable room for improvement in readability, relevance, and topic coverage, and it is difficult to extract keywords that do not appear in the document.
To overcome these shortcomings and explore a simpler and more efficient approach, this paper proposes a keyword extraction method based on similar documents. The basic idea is that authors often reuse keywords from similar documents they have read in the past as keywords for their new documents, both to improve the chances of the documents being retrieved and to make them easier to understand. That is, documents with similar content often have highly overlapping keywords. Therefore, when extracting keywords from a document, the existing keywords of similar documents can be used as external knowledge for reference. This not only improves on extracting keywords from a single document alone, but also makes it possible to extract keywords that are not present in the document.
The proposed method consists of four parts: extraction of internal keywords from the document, selection of similar documents, fusion of the keywords of the external similar reference documents, and integration of internal and external keywords based on unsupervised methods. Unsupervised keyword extraction has two stages: the first stage selects candidate keywords that meet the requirements from the document; the second stage scores the candidates with a chosen indicator, sorts them by score, and takes the top-ranked words as the document's keywords. The proposed method follows the same flow, but in the second stage it computes the keyword score with a fusion method. The core of the method lies in the selection of related documents, the fusion of internal and external candidate keywords, and the calculation of fusion scores from the original scores. The main parameters are the number of similar documents selected, the weight distribution between internal and external keyword scores, and the number of keywords kept after fusion.
This paper sets up multiple groups of experiments to analyze extraction performance under different parameter settings and to summarize the optimal fusion parameters. The dataset contains 567,830 metadata records for high-quality scientific papers, drawn mainly from major online digital libraries, including ACM Digital Library, Science Direct, Wiley, and Web of Science. The experimental results show that the proposed method outperforms the baseline experiments set up in this paper in precision, recall, and the combined F-measure, and that it can effectively retrieve keywords that do not appear in the document.
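The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not name the internal scoring indicator, the similarity measure, or the fusion formula, so relative term frequency, bag-of-words cosine similarity, and a linear weighted sum are all assumptions made here for illustration. The function names and the parameters `k` (number of similar documents), `alpha` (weight of the internal score), and `n` (number of keywords kept) are hypothetical, chosen to mirror the three parameters the abstract mentions.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercased whitespace tokens; a placeholder for real candidate selection."""
    stripped = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in stripped if w]


def internal_scores(text):
    """Score in-document candidates by relative term frequency (assumed indicator)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def external_scores(text, corpus, k):
    """Take the k most similar documents and credit each of their existing
    keywords with that document's similarity, then normalize to sum to 1."""
    query = Counter(tokenize(text))
    sims = [(cosine(query, Counter(tokenize(d["text"]))), d) for d in corpus]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    scores = Counter()
    for sim, doc in top:
        for keyword in doc["keywords"]:
            scores[keyword] += sim
    total = sum(scores.values())
    return {kw: s / total for kw, s in scores.items()} if total else {}


def fuse(internal, external, alpha, n):
    """Linear fusion (assumed): alpha * internal + (1 - alpha) * external.
    Returns the n best candidates; ties are broken alphabetically."""
    candidates = internal.keys() | external.keys()
    fused = {
        c: alpha * internal.get(c, 0.0) + (1 - alpha) * external.get(c, 0.0)
        for c in candidates
    }
    return sorted(fused, key=lambda c: (-fused[c], c))[:n]


def precision_recall_f(predicted, gold):
    """Precision, recall, and their harmonic mean (F-measure)."""
    hits = len(set(predicted) & set(gold))
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy data, invented for this example only.
document = "graph ranking for text summarization graph"
corpus = [
    {"text": "graph ranking for keyword extraction",
     "keywords": ["textrank", "graph"]},
    {"text": "protein folding simulation", "keywords": ["biology"]},
]

result = fuse(
    internal_scores(document),
    external_scores(document, corpus, k=1),
    alpha=0.5,
    n=2,
)
print(result)  # ['graph', 'textrank']
```

In this toy run, "textrank" never occurs in the target document but is still returned, because it is a keyword of the most similar corpus document. This is the behavior the abstract attributes to using similar documents as external knowledge.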
Keywords/Search Tags: Document similarity, Data Fusion, Keyword Extraction