Keyphrase extraction is the task of extracting phrases from a target text that effectively summarize its main content. It not only provides users with a phrase-level summary of the text but also serves as an additional feature that strongly influences the performance of downstream natural language processing tasks. Unsupervised keyphrase extraction methods based on vector embeddings are a popular research direction because of their strong interpretability, good extraction performance, and independence from annotated datasets. However, most existing methods rely on bag-of-words embedding models and distance-based similarity: they treat keyphrases independently of the surrounding text, ignore their contextual information, and apply the same computation parameters to every document, which makes it difficult to summarize the main theme of the text. In addition, with the development of pre-trained language models, supervised methods built on them have demonstrated powerful keyphrase extraction capabilities, but they often reduce the extraction task to sequence labeling and use binary classification to decide whether each candidate phrase is a keyphrase. Judging candidates independently in this way overlooks the global semantic relationship between the candidate phrases and the original text, so highly summarizing keyphrases cannot be extracted. This thesis therefore conducts the following targeted research.

(1) To address the shortcomings of embedding-based unsupervised models, which ignore the contextual information of keyphrases and cannot handle different features in a targeted way, this thesis proposes an unsupervised keyphrase extraction model based on bidirectional multi-granularity attention. In this model, candidate phrase features use multi-granularity cross-attention to learn contextual semantic information from the original text features, and conversely, the original text features use multi-granularity cross-attention to assign targeted attention scores to the candidate phrase features, as sketched below. The attention scores are used directly for keyphrase extraction and also weight the candidate phrase features for a downstream prediction task, whose loss serves as the supervision signal for training. Comparative experiments demonstrate the effectiveness of the model and its core module.

(2) To address the insufficient consideration of the global semantic correlation between the text and its keyphrases in extraction methods based on pre-trained language models, this thesis proposes a keyphrase extraction model based on a dual-tower pre-trained model. The model contains two independent BERT networks, a candidate matrix generation network and a text vector generation network, which embed the features of the candidate phrases and of the original text, respectively. Cross-attention then combines the two sets of features and produces targeted scores for the candidate phrases, as shown in the second sketch below. The model's effectiveness is validated on three test sets for keyphrase extraction.
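A minimal PyTorch sketch of the bidirectional cross-attention idea in (1) is shown below. The class and tensor names (BidirectionalCrossAttention, phrase_feats, doc_feats), the use of nn.MultiheadAttention, and the way attention mass is pooled into candidate scores are illustrative assumptions rather than the thesis implementation, and the multi-granularity design is not reproduced here.

# Hypothetical sketch: candidate phrases attend to the document to gather
# context, and the document attends back to score each candidate.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # phrase -> text: candidates absorb contextual semantics from the document
        self.phrase_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # text -> phrase: the document assigns targeted attention to each candidate
        self.text_to_phrase = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, phrase_feats: torch.Tensor, doc_feats: torch.Tensor):
        # phrase_feats: (batch, num_candidates, dim); doc_feats: (batch, num_tokens, dim)
        contextual_phrases, _ = self.phrase_to_text(
            query=phrase_feats, key=doc_feats, value=doc_feats)
        _, attn_weights = self.text_to_phrase(
            query=doc_feats, key=contextual_phrases, value=contextual_phrases)
        # Average attention mass each candidate receives from the document tokens;
        # higher mass indicates a stronger candidate keyphrase.
        candidate_scores = attn_weights.mean(dim=1)  # (batch, num_candidates)
        return contextual_phrases, candidate_scores

In a setup like this, the scores could both rank candidates directly and weight the candidate features passed to a downstream prediction task whose loss supervises training, in line with the description in (1).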
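The dual-tower scoring of (2) can be sketched in a similar spirit: two independent BERT encoders embed the candidate phrases and the full text, and cross-attention over the two feature sets scores the candidates. The model name, the [CLS]-vector pooling, and the score aggregation below are assumptions made for illustration, not the thesis implementation.

# Hypothetical sketch of dual-tower candidate scoring with two BERT encoders.
import torch
import torch.nn as nn
from transformers import AutoModel

class DualTowerScorer(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", num_heads: int = 8):
        super().__init__()
        # Candidate matrix generation network and text vector generation network
        self.candidate_encoder = AutoModel.from_pretrained(model_name)
        self.text_encoder = AutoModel.from_pretrained(model_name)
        dim = self.text_encoder.config.hidden_size
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, candidate_inputs: dict, text_inputs: dict) -> torch.Tensor:
        # One [CLS] vector per candidate phrase: (num_candidates, dim)
        cand_vecs = self.candidate_encoder(**candidate_inputs).last_hidden_state[:, 0]
        cand_vecs = cand_vecs.unsqueeze(0)                              # (1, num_candidates, dim)
        text_vecs = self.text_encoder(**text_inputs).last_hidden_state  # (1, num_tokens, dim)
        # Document tokens attend to the candidates; accumulated attention is the score.
        _, attn_weights = self.cross_attn(query=text_vecs, key=cand_vecs, value=cand_vecs)
        return attn_weights.mean(dim=1).squeeze(0)                      # (num_candidates,)

A typical call would tokenize the candidate phrases as one padded batch and the document as a single sequence (for example with AutoTokenizer from the transformers library) before passing both encodings to the scorer.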
(3) Building on the two methods above, this thesis develops a complaint information early warning system based on keyphrase extraction. The system can model complaint corpora in different scenarios, automatically detect and classify the corpus, and provide keyphrase information to assist users in verifying the complaint classification.