Research On Automatic Keyphrase Technology In Academic Corpus

Posted on: 2024-09-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T H Li
Full Text: PDF
GTID: 1528307064475174
Subject: Computer system architecture
Abstract/Summary:
With the advent of the Internet and information technology, the way users obtain information and data has gradually shifted from traditional channels to the cloud. In some fields, data acquisition depends almost entirely on online databases, with academic texts being the most representative case. As of now, the Google Scholar online database has collected hundreds of millions of academic documents.

Keyphrases are tags made up of words or phrases that summarize the core content of a text; they support retrieval and guide reading. In most academic papers, authors provide a set of self-annotated keyphrases. However, many academic texts still lack appropriate keyphrases or have only low-quality ones, such as early papers, science articles, and technology news. Automatic keyphrase technology uses computers to extract a representative set of words or phrases from a text as its keyphrases. Annotating such texts with high-quality keyphrases saves the time and manpower required for manual secondary annotation and provides effective tags for information retrieval in academic databases.

Research in this field is mainly divided into unsupervised keyphrase extraction and supervised sequence-to-sequence keyphrase generation. Unsupervised keyphrase extraction models are compact, simple in structure, and require few computational resources. Nevertheless, they cannot produce absent keyphrases, that is, keyphrases that do not occur in the source text. Conversely, supervised keyphrase generation models demand more computation, more complex model parameters, and larger training datasets, yet they are more accurate and can produce absent keyphrases. Models based on graph data structures are highly regarded in unsupervised keyphrase extraction and are the main research trend in the unsupervised domain. For keyphrase generation, numerous models based on Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and Transformers have been presented as sequence-to-sequence frameworks have developed.

This thesis identifies three problems that still exist in automatic keyphrase technology and provides corresponding optimization solutions. In unsupervised keyphrase extraction, current models commonly suffer from keyphrase duplication: the extracted keyphrases frequently contain the same high-scoring terms. To address this issue, this thesis proposes an unsupervised keyphrase extraction model based on the fusion of three features, which alleviates keyphrase duplication from a modeling perspective. The main contributions of this thesis are the following three aspects:

1. Proposes a model named Triple Rank for unsupervised keyphrase extraction based on feature fusion scoring. To reduce keyphrase duplication effectively, it models and scores three features separately (keyphrase diversity, keyphrase coverage, and position information) and then fuses them hierarchically. In addition, it saves inference time by avoiding iteration over the graph data structure. In comparisons with baseline models on four datasets, Triple Rank performs better and alleviates the keyphrase duplication problem.

2. Proposes an optimizer called C-Decay for unsupervised keyphrase extraction models, which employs an auto-regressive structure. It addresses the under-use of mutual information among extracted keyphrases and can significantly improve the performance of keyphrase extraction models based on graph data structures. Experiments combining four datasets and three baseline models demonstrate that C-Decay has a significant optimization effect.

3. Explores the characteristics of absent keyphrases in academic text datasets, proposes a new classification standard and evaluation method for absent keyphrases, and conducts empirical studies on three widely used training paradigms. Based on this research, the main cause of the low quality of absent keyphrases generated by deep generation models is identified, and a joint model that can generate high-quality absent keyphrases is proposed.
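The two notions the abstract relies on, absent keyphrases and keyphrase duplication, can be made concrete with a minimal Python sketch. This is an illustration only, not the thesis's classification standard, evaluation method, or any part of Triple Rank or C-Decay. The function names (`tokenize`, `split_present_absent`, `shared_terms`) and the sample text are hypothetical, and the matching rule (lowercased exact token match, no stemming) is an assumption made for simplicity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumption: no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_present(phrase, source_tokens):
    """Return True if the phrase's tokens occur contiguously in the source tokens."""
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        source_tokens[i:i + n] == phrase_tokens
        for i in range(len(source_tokens) - n + 1)
    )


def split_present_absent(source, keyphrases):
    """Split keyphrases into those found in the source text and those that are absent."""
    source_tokens = tokenize(source)
    present = [k for k in keyphrases if is_present(k, source_tokens)]
    absent = [k for k in keyphrases if not is_present(k, source_tokens)]
    return present, absent


def shared_terms(keyphrases):
    """Return terms that appear in more than one keyphrase, a rough duplication signal."""
    counts = Counter(t for k in keyphrases for t in set(tokenize(k)))
    return sorted(t for t, c in counts.items() if c > 1)


# Hypothetical sample data for illustration.
source = "Graph-based models rank candidate phrases for keyphrase extraction."
gold = ["keyphrase extraction", "graph-based models", "text summarization"]

present, absent = split_present_absent(source, gold)
print("present:", present)
print("absent:", absent)
print("shared terms:", shared_terms(["keyphrase extraction", "keyphrase generation", "graph model"]))
```

An extraction-only model can return at most the phrases in the `present` list, which is why absent keyphrases require a generation model. The `shared_terms` helper only flags repeated terms across a keyphrase list; it does not reproduce the diversity or coverage scoring described above.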
Keywords/Search Tags: Automatic Keyphrase Technology, Information Extraction, Text Generation, Unsupervised Learning, Graph-Based Data, Pre-Trained Model