Research And Implementation Of Entity Recognition And Linking System

Posted on: 2018-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: D Zheng
Full Text: PDF
GTID: 2348330518995691
Subject: Computer technology
Abstract/Summary:
With the development of Internet technology, the volume of text from news, blogs, and social media is exploding. Mining valuable information from this mass of unstructured and semi-structured text is an important task in Natural Language Processing. Entities are an important unit of information in text, and recognizing and analyzing them correctly plays a key role in understanding the text. As the key technology for processing entities, entity linking has attracted the attention of scholars all over the world.

Entity linking is the process of linking name mentions in text to their referent entities in a knowledge base. In natural language, entities are ambiguous and diverse. The main task of entity linking is therefore to recognize mentions, which may take different forms and types, and to resolve their ambiguity by linking each one to a specific entity in the knowledge base. Entity linking is an important technology for information extraction, query understanding, question answering, and related applications. In recent years, with the popularity of knowledge bases and the development of knowledge graph technology, entity linking has become more and more important.

In this paper, after analyzing the main problems facing entity linking in the KBP (Knowledge Base Population) 2015 TEDL (Tri-Lingual Entity Discovery and Linking) task and improving on previous methods, we propose an approach to entity recognition and linking based on random walk with restart. First, mentions in text in three different languages are recognized and expanded according to their context. Then candidate entities are retrieved from the knowledge base, and two kinds of semantic similarity are computed: between candidate entities, and between each mention and its candidate entities.
Finally, a random walk is performed on the graph constructed from the mentions and the candidate entities to obtain the probability distributions of the entities and of the mentions, and the entity most similar to each mention is selected as its linked entity. The F value of the method in the TEDL task is 0.665, higher than that of other systems and ranking first. The experimental results show that the method can effectively improve the performance of the system.

The main contributions of this paper are as follows:

1. This paper indexes the knowledge base. Traditional methods use string matching to retrieve entities, which is very inefficient. In this paper, we build an index over the knowledge base and design and implement an improved search strategy, so that the system retrieves candidate entities from the knowledge base more quickly and flexibly.
2. This paper expands the mention-entity graph. Previous methods construct the graph from mentions and candidate entities only, which yields a disconnected graph and poor performance. In this paper, we use relations between entities in the knowledge base to expand the graph, which avoids disconnected graphs and improves the graph's capacity for semantic expression.
3. This paper uses topic information to perform entity linking across different texts. Most studies focus on a single text and do not use the similarity of topic distributions between texts to enrich the context of mentions and reduce their ambiguity. In this paper, we use the LDA (Latent Dirichlet Allocation) topic model to cluster texts with similar topic distributions and perform entity linking across documents.
4. This paper uses an easy-first strategy based on the ambiguity of mentions. Traditional methods do not take the order of mentions into account, nor do they use linking results to prune the mention-entity graph. When the graph contains too many irrelevant candidate entities, the algorithm performs poorly.
In this paper, we first link the mention with the least ambiguity and use the linking result to prune the graph, excluding irrelevant candidate entities and improving the performance of the random walk algorithm. We then select the least ambiguous of the remaining mentions to link, and repeat until all mentions are linked. As mentions are disambiguated in this process, the semantic expression of the graph becomes more and more accurate.
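
The random walk with restart step described in the abstract can be illustrated with a short sketch. This is only a minimal example in Python, not the implementation used in the thesis: the toy mention-entity graph, its node names and edge weights, the restart probability `alpha`, and the simplified rule of ranking candidates by their visiting probability are all assumptions made for illustration.

```python
def add_edge(graph, u, v, weight):
    """Add an undirected weighted edge to a dict-of-dicts graph (hypothetical helper)."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


def random_walk_with_restart(graph, restart, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate p <- alpha * restart + (1 - alpha) * P^T p until it stops changing.

    Assumes every node appears as a key of `graph`, has at least one edge with
    positive weight, and that `restart` sums to 1.
    """
    nodes = list(graph)
    # Row-normalise edge weights so each node's outgoing weights sum to 1.
    trans = {}
    for u in nodes:
        total = sum(graph[u].values())
        trans[u] = {v: w / total for v, w in graph[u].items()}

    prob = {u: restart.get(u, 0.0) for u in nodes}
    for _ in range(max_iter):
        # With probability alpha the walker jumps back to the restart nodes.
        nxt = {u: alpha * restart.get(u, 0.0) for u in nodes}
        # Otherwise it follows an edge chosen in proportion to its weight.
        for u in nodes:
            for v, w in trans[u].items():
                nxt[v] += (1.0 - alpha) * prob[u] * w
        change = sum(abs(nxt[u] - prob[u]) for u in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Toy graph: two mentions, three candidate entities (all names are made up).
graph = {}
add_edge(graph, "m:Jordan", "e:Michael_Jordan", 0.5)   # mention-entity similarity
add_edge(graph, "m:Jordan", "e:Jordan_country", 0.5)   # mention-entity similarity
add_edge(graph, "m:Bulls", "e:Chicago_Bulls", 1.0)     # mention-entity similarity
add_edge(graph, "e:Michael_Jordan", "e:Chicago_Bulls", 1.0)  # knowledge-base relation

# Restart uniformly at the mentions of the document.
restart = {"m:Jordan": 0.5, "m:Bulls": 0.5}
scores = random_walk_with_restart(graph, restart)

# Simplified linking rule: pick the candidate the walk visits most often.
candidates = ["e:Michael_Jordan", "e:Jordan_country"]
best = max(candidates, key=scores.get)
```

In this toy graph the knowledge-base relation between the two basketball-related entities gives the walk an extra path into one candidate for "Jordan", which is the kind of effect the graph expansion in contribution 2 is meant to provide. An easy-first variant would link the least ambiguous mention first, remove its rejected candidates from `graph`, and rerun the walk.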
Keywords/Search Tags: entity linking, random walk, Freebase, topic model