| Entity linking (EL) is a fundamental task in natural language processing. It aims to match entity mentions in text to the corresponding entities in a knowledge base, providing a deeper semantic understanding of natural language text. It typically consists of three steps: named entity recognition, candidate generation, and candidate ranking, with the latter two being the key steps. In recent years, with the development of deep learning techniques, entity linking has achieved high precision on long texts by leveraging rich contextual information and explicit thematic information. For Chinese short text, however, it still faces significant challenges due to the lack of contextual information, non-standard expressions, and the characteristics of the language itself. For candidate generation, current Bi-Encoder-based methods use two independent encoders to map entity mentions and all entities in the knowledge base into a semantic space, and retrieve candidate entities by maximizing dot-product similarity. However, they do not explicitly consider the differences in input representation between entity descriptions and short text, or the differences in how the generated vectors are distributed in the vector space. For candidate ranking, mainstream methods use BERT to mine semantic relationships between text pairs from a text-matching perspective, but they suffer from efficiency issues due to the large number of BERT parameters, as well as information loss caused by overly simple processing of entity descriptions. To address these issues, this thesis focuses on improving the performance and efficiency of the two-stage entity linking method. The main research content and innovations are as follows. First, we propose a candidate generation method for dense vector retrieval based on the Sia BERT-CG model. The algorithm uses a shared-parameter Siamese BERT model to reduce the difference in representation between
short text and entity information and to obtain a vector distribution better suited to retrieval, thereby improving entity retrieval effectiveness. The thesis also proposes a type-aware negative sampling strategy that keeps negative examples balanced with positive examples in category and semantic similarity, which increases the difficulty of model training and improves the model’s ability to distinguish ambiguous samples. Second, we propose a candidate ranking method based on ALBERT latent topic clustering. The ALBERT model is introduced to rank candidates for short texts, capturing the matching patterns between entity mentions and candidate entities while avoiding the massive parameter size and model complexity of BERT. Additionally, we design a method for splitting entity descriptions into latent topic clusters, which allows full interaction between short texts and entity descriptions while reducing the redundancy caused by text splitting. This improves the matching accuracy of the candidate ranking model. Third, a Chinese entity linking system is designed and implemented to verify the effectiveness and practicality of the proposed methods in real-world application scenarios. The system consists mainly of a Web interaction module, an offline processing module, and an online processing module, and supports entity recognition, candidate generation, and entity linking for texts and text files, as well as API access and user registration and login. Finally, a comprehensive analysis of all test results verifies the functionality and practicality of the system. In summary, this thesis addresses the issues in the two stages of entity linking for Chinese short text by introducing new model structures and optimizing the model’s training and prediction strategies, improving both the effectiveness and the practical applicability of the model. Through
detailed experimental verification and application scenario testing, the proposed methods are shown to be feasible and effective. |
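The candidate generation step described above (dot-product retrieval over a shared vector space, trained with type-aware negatives) can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy vectors stand in for embeddings that a shared-parameter Siamese BERT encoder would produce, and the function names, entity identifiers, type labels, and the specific sampling rule (prefer same-type entities, then the most similar ones) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_candidates(
    mention_vec: Vector, entity_vecs: Dict[str, Vector], k: int
) -> List[str]:
    """Return the k entity ids with the highest dot product to the mention."""
    ranked = sorted(
        entity_vecs,
        key=lambda eid: dot(mention_vec, entity_vecs[eid]),
        reverse=True,
    )
    return ranked[:k]


def type_aware_negatives(
    gold_id: str,
    entity_vecs: Dict[str, Vector],
    entity_types: Dict[str, str],
    n: int,
) -> List[str]:
    """Pick n hard negatives for a gold entity (hypothetical rule).

    Entities sharing the gold entity's type are preferred; within each
    group, entities more similar to the gold entity come first.
    """
    gold_vec = entity_vecs[gold_id]
    gold_type = entity_types[gold_id]
    others = [eid for eid in entity_vecs if eid != gold_id]
    others.sort(
        key=lambda eid: (
            entity_types[eid] != gold_type,      # same type sorts first
            -dot(gold_vec, entity_vecs[eid]),    # then higher similarity
        )
    )
    return others[:n]


# Toy knowledge base: hand-written vectors and types, not real embeddings.
ENTITY_VECS = {
    "apple_company": [0.9, 0.1, 0.0],
    "apple_fruit": [0.1, 0.9, 0.0],
    "microsoft": [0.8, 0.0, 0.2],
    "banana": [0.0, 0.8, 0.1],
}
ENTITY_TYPES = {
    "apple_company": "ORG",
    "apple_fruit": "FOOD",
    "microsoft": "ORG",
    "banana": "FOOD",
}

if __name__ == "__main__":
    mention = [1.0, 0.2, 0.0]  # stand-in for an encoded short-text mention
    print(retrieve_candidates(mention, ENTITY_VECS, k=2))
    print(type_aware_negatives("apple_company", ENTITY_VECS, ENTITY_TYPES, n=2))
```

With real embeddings, the exhaustive sort in `retrieve_candidates` would normally be replaced by an inner-product vector index. The sampling rule is one plausible reading of "balanced category and semantic similarity" and may differ from the strategy actually used in the thesis.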