Multi-Feature Keyword Extraction Algorithm Based On Dependency Parsing

Posted on: 2024-07-19

Degree: Master

Type: Thesis

Country: China

Candidate: Y R Li

GTID: 2568307127960449

Subject: Computer Science and Technology

Abstract/Summary:

Keyword extraction is a technique for extracting the words most relevant to the meaning of a text. It is a basic task in text mining and underlies applications such as literature retrieval, automatic summarization, document classification, and text analysis. Traditional keyword extraction models rely too heavily on word frequency and fail to recognize compound neologisms and polysemous words. To address these problems, this thesis proposes a keyword extraction algorithm based on syntactic structure and a multi-feature keyword extraction algorithm based on dependency parsing. The specific research contents are as follows:

(1) To address the misjudgment of high-frequency modifiers caused by excessive dependence on word frequency in traditional keyword extraction models, this thesis improves the TF-IDF method and proposes a syntactic structure-based keyword extraction algorithm (KEDP). TF-IDF scores each word by the product of its term frequency and its inverse document frequency; the larger the TF-IDF value, the more likely the word is to be selected as a keyword. As a result, some high-frequency modifiers are mistakenly selected as keywords, which seriously reduces the accuracy of keyword extraction. To solve this problem, syntactic rules and the pyhanlp dependency relation extraction tool are introduced, and a text syntax binary tree based on the syntactic rules is constructed to identify the modifiers in dependency relations. By traversing this tree, high-frequency modifiers are filtered out of the TF-IDF candidate keywords to obtain the final keywords. The algorithm takes the syntactic information in the text into account and can effectively eliminate the interference of irrelevant words, so that the extracted keywords are more accurate.

(2) To address the failure of traditional keyword extraction models to recognize compound neologisms and polysemous words, this thesis proposes a multi-feature keyword extraction algorithm based on dependency parsing (MFKEDP). Taking TF-IDF again as an example, the method relies heavily on word segmentation results, so some compound new words are incorrectly split into two words during segmentation and cannot be recognized. In addition, because TF-IDF does not distinguish between the senses of a polysemous word, the different meanings of such a word are ignored. To solve these problems, this thesis adopts a two-route optimization approach. The first route constructs dependency syntax trees to obtain the set of words matching fixed syntactic rules, focusing on identifying attributive phrases and phrases in the predicate, which serve as the basis for word combination and retain the corresponding complete Chinese sentence structure. The second route combines the TF-IDF value, part of speech, position, and semantic features of words to obtain a set of multi-feature words. These multi-feature words, which carry rich word information, are the dominant words (heads) in the dependency pairs; they correspond to the subordinate words in the syntactic-rule word set and so link the two sets through dependency relations. The two routes run in parallel. Finally, the two word sets are combined and optimized to obtain the final keywords. The algorithm can identify compound neologisms effectively and addresses the polysemy problem from the perspective of dependency parsing.

(3) Experimental analysis and comparison. At present, keyword extraction algorithms are evaluated mainly by the quality of the extracted keywords. Following the evaluation method for unranked result sets described by Manning et al., this thesis uses precision (P), recall (R), and the F value to evaluate keyword extraction. The experimental data were collected from websites such as Sina.com and Chinanews.com and cover 10 fields, including sports, finance, education, science and technology, games, and current politics. In the experiments, the KEDP and MFKEDP algorithms are first compared with the traditional TF-IDF and TextRank algorithms; then the high-frequency modifier filtering quality and keyword extraction quality of KEDP are tested and analyzed. Finally, the ability of MFKEDP to recognize compound new words and handle polysemy, and its keyword extraction quality under multi-feature combination, are tested and analyzed. The experimental results show that the proposed KEDP algorithm can effectively eliminate the interference of irrelevant words and make the extracted keywords more accurate. The MFKEDP algorithm preserves the complete structure of Chinese sentences through the dependency syntax tree and integrates the information of multi-feature words, which gives it an advantage in identifying compound new words. It also solves the polysemy problem to a certain extent from the perspective of dependency parsing.
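
The TF-IDF scoring and modifier filtering described in part (1) can be illustrated with a minimal Python sketch. It is not the thesis's KEDP implementation: the function names (`tf_idf`, `top_k`, `filter_modifiers`), the toy English tokens, the hand-written dependency arcs, and the relation label `ATT` are all assumptions made for illustration. In practice the arcs would come from a dependency parser such as pyhanlp, and the thesis traverses a text syntax binary tree rather than a flat arc list.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per tokenized, non-empty document.

    score = (count of word in doc / doc length) * ln(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


def top_k(score, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(score, key=lambda w: (-score[w], w))[:k]


def filter_modifiers(candidates, arcs, modifier_relations=frozenset({"ATT"})):
    """Drop candidates that occur as the dependent of a modifier relation.

    arcs: (dependent, relation, head) triples. The label "ATT" is a
    placeholder for whatever label the parser uses for attributive modifiers.
    """
    modifiers = {dep for dep, rel, _head in arcs if rel in modifier_relations}
    return [word for word in candidates if word not in modifiers]


# Hypothetical toy corpus: "new" is a frequent modifier in the first document.
DOCS = [
    ["new", "parser", "new", "model", "new", "parser"],
    ["old", "parser", "model"],
    ["old", "corpus", "model", "corpus"],
]
# Hypothetical dependency arcs for the first document.
ARCS = [("new", "ATT", "parser"), ("new", "ATT", "model")]

candidates = top_k(tf_idf(DOCS)[0], 2)      # plain TF-IDF ranks "new" first
keywords = filter_modifiers(candidates, ARCS)
print(keywords)                             # prints ['parser']
```

The filter here is deliberately crude: a word that is a modifier in one sentence and a head in another would still be dropped, which a tree-based traversal could treat more carefully.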

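The evaluation measures named in part (3) can be sketched as follows. This is an illustrative helper and not the thesis's evaluation code: the function name `precision_recall_f` and the example keyword sets are hypothetical, and the F value is assumed to be the balanced F1 (the harmonic mean of P and R), which the abstract does not state explicitly.

```python
def precision_recall_f(extracted, reference):
    """Set-based precision, recall, and balanced F for unranked keyword sets."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)          # correctly extracted keywords
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # assumption: balanced F1
    return p, r, f


# Hypothetical example: 4 extracted keywords, 3 of them among 6 reference keywords.
extracted = ["parser", "keyword", "corpus", "new"]
reference = ["parser", "keyword", "corpus", "syntax", "tree", "polysemy"]
p, r, f = precision_recall_f(extracted, reference)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")        # prints P=0.75 R=0.50 F=0.60
```

Because the keyword lists are treated as sets, the order in which an algorithm returns its keywords does not affect the scores.
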
Keywords/Search Tags:

Keyword extraction, Dependency parsing, TF-IDF, New word recognition

Related items

1	Word Sense Disambiguation Research Based On Dependency Parsing
2	Research On Mongolian Dependency Parsing Based On The Conversion Of Chinese-Mongolian Dependency Parsing Tree
3	The Research Of Multi-feature Word Sense Disambiguation Based On Dependency Parsing
4	Topic Recognition Of Policy Texts Based On Dependency Parsing
5	Research On Question Keywords Extraction Techniques For Question Answering
6	Research On Graph-based Chinese Dependency Parsing
7	Research And Implement On Chinese Dependency Parsing
8	Chinese Multiword Expression Extraction And Application On Chinese Dependency Parsing
9	Chinese Dependency Parsing Based On Deep Learning
10	Improving Word Vector Model With Part-of-Speech And Dependency Grammar Information