| Keyphrases are a condensed representation of a document's subject information; with their help, the subject of a document can be grasped quickly. Keyphrase extraction is used in fields such as text retrieval, text classification, and text topic segmentation. Traditional graph-based keyphrase extraction methods fail to jointly consider the semantic similarity and topic diversity of keywords when using word embeddings to represent semantic features, and they make insufficient use of the cohesion between words when scoring phrases. This thesis studies how to use the semantic information of words and the cohesion of words within phrases, combining multiple attributes in a graph model, to extract keyphrases. The specific work is as follows. First, based on the PageRank algorithm, word embeddings are used to calculate the comprehensive semantic weight of keywords in the text, and the restart probability of the PageRank model is modified accordingly. To address the shortcoming that existing graph-based extraction methods do not use word embeddings to express the comprehensive semantics of words in a text, this thesis designs SDRank, a word scoring algorithm that combines semantic similarity and topic difference. The algorithm assigns words to different attributes according to their occurrence in the sentences of the text, calculates the semantic similarity and topic difference between each candidate keyword and the words of each attribute, and thereby obtains the comprehensive semantic weight of the candidate keyword in the text. The restart probability of the PageRank model is then modified using these comprehensive semantic weights to improve the scores of keywords. Experimental results show that the algorithm combining the two semantic features is more stable on the dataset, and all evaluation metrics improve over the traditional extraction method. Second, the frequency and position features of the words inside a phrase are combined to score candidate keyphrases, reducing redundancy in the extraction results. To address the shortcomings that candidate phrase scores in traditional graph methods are easily affected by word scores and that the extraction results contain many redundant phrases, a scoring method that integrates the frequency, position, and length attributes of phrases is proposed. The method relies only on the statistical features of phrases and distinguishes redundant phrases from keyphrases by measuring the number of co-occurrences of the words in a phrase and the positions of the words that appear frequently in the phrase. Experimental results show that the proposed method outperforms traditional phrase scoring methods on the evaluation metrics across the three datasets, and that the two proposed features extract keyphrases more effectively. |
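
The idea of modifying the PageRank restart probability with per-word weights can be illustrated with a minimal sketch. This is not the SDRank implementation: the abstract does not give the formula for the comprehensive semantic weight, so the `weights` dictionary below is a hypothetical placeholder for values that would be derived from word embeddings, and the function names, window size, and damping factor are assumptions chosen for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected, weighted co-occurrence graph from a token list."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def biased_pagerank(graph, weights, damping=0.85, iterations=100, tol=1e-8):
    """PageRank whose restart distribution is proportional to `weights`.

    `weights` stands in for the comprehensive semantic weight of each word
    (an assumption: any non-negative score per word works here).
    """
    nodes = sorted(graph)
    total = sum(weights.get(n, 0.0) for n in nodes)
    if total > 0:
        restart = {n: weights.get(n, 0.0) / total for n in nodes}
    else:
        # Fall back to the uniform restart of standard PageRank.
        restart = {n: 1.0 / len(nodes) for n in nodes}

    out_weight = {n: sum(graph[n].values()) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # The graph is undirected, so the neighbours of n are its in-links.
            incoming = sum(
                scores[m] * graph[m][n] / out_weight[m] for m in graph[n]
            )
            new_scores[n] = (1.0 - damping) * restart[n] + damping * incoming
        delta = sum(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


tokens = ("keyphrase extraction graph model keyphrase ranking "
          "graph ranking word embedding graph").split()
graph = cooccurrence_graph(tokens)

# Hypothetical semantic weights: "keyphrase" is treated as most on-topic.
weights = {word: 1.0 for word in graph}
weights["keyphrase"] = 5.0

scores = biased_pagerank(graph, weights)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {score:.4f}")
```

With uniform weights the function reduces to ordinary PageRank on the co-occurrence graph; raising a word's weight moves restart mass toward it, which is the mechanism the abstract describes for improving keyword scores.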