Research On Context-based Entity Linking Technique

Posted on: 2015-11-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Guo
GTID: 1228330422492421
Subject: Computer Science and Technology
Abstract/Summary:
The main purpose of entity linking is to identify which real-world entity a name refers to in context. Specifically, entity linking maps a name string in a given text to its referent entry in a knowledge base; if the referent entity is not included in the knowledge base, the system returns a NIL mark. In recent years, the National Institute of Standards and Technology (NIST) has held several international evaluations centered on entity linking. Entity linking is useful in many natural language processing tasks, such as information extraction, question answering, machine translation and information retrieval.

The major problem in entity linking is ambiguity: the same name can refer to many entities, and the same entity can have many names. The main work in entity linking is to improve accuracy and efficiency. High-accuracy entity linking results provide reliable entity objects for many other natural language processing tasks, and efficient entity linking systems are a direct demand of Internet-scale and big data processing. Entity linking can be broken into two parts, candidate generation and disambiguation. Context is the main evidence for both parts, and the core problem is how to use the context effectively to improve accuracy and efficiency. This paper is therefore based on the context of entities.

Previous entity linking work has focused mainly on the disambiguation part; by contrast, little work has focused on improving candidate generation. In fact, candidate generation is an essential part of entity linking. If the candidate set does not contain the target entity, the disambiguation part cannot give the right result, so the recall of candidate generation is an upper bound on the recall of disambiguation. However, if the candidate set is too large, the efficiency of disambiguation suffers.
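
The trade-off between candidate recall and candidate set size can be made concrete with a small sketch. The function names and the toy queries, entities and candidate sets below are hypothetical illustrations, not data or code from the dissertation. The sketch shows that even an oracle disambiguator, which always picks the correct entity when it is available, cannot exceed the recall of the candidate sets.

```python
from typing import Dict, Set

def candidate_recall(candidates: Dict[str, Set[str]], gold: Dict[str, str]) -> float:
    """Fraction of queries whose gold entity appears in its candidate set."""
    hits = sum(1 for query, entity in gold.items()
               if entity in candidates.get(query, set()))
    return hits / len(gold)

def oracle_linking_recall(candidates: Dict[str, Set[str]], gold: Dict[str, str]) -> float:
    """Recall of a perfect disambiguator that can only choose among the candidates."""
    correct = 0
    for query, entity in gold.items():
        # The oracle returns the gold entity only if candidate generation kept it.
        prediction = entity if entity in candidates.get(query, set()) else None
        correct += prediction == entity
    return correct / len(gold)

def average_candidate_set_size(candidates: Dict[str, Set[str]]) -> float:
    """Mean number of candidates per query (a rough proxy for disambiguation cost)."""
    return sum(len(c) for c in candidates.values()) / len(candidates)

# Hypothetical toy data: three queries and their gold entities.
gold = {
    "q1": "Michael_Jordan_(basketball)",
    "q2": "Paris",
    "q3": "Apple_Inc.",
}
candidates = {
    "q1": {"Michael_Jordan_(basketball)", "Michael_I._Jordan"},
    "q2": {"Paris", "Paris_Hilton", "Paris,_Texas"},
    "q3": {"Apple"},  # the gold entity was lost during candidate generation
}

recall = candidate_recall(candidates, gold)
oracle = oracle_linking_recall(candidates, gold)
size = average_candidate_set_size(candidates)
```

In this toy setting the third query can never be linked correctly, however good the disambiguator is, which is why candidate recall has to be preserved while the candidate sets are made smaller.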
The main purpose of candidate generation is therefore to reduce the size of the candidate set under a recall constraint. In this paper, we search the context for names that are coreferent with the query name, which improves the recall of the candidate set. At the same time, we propose a novel similarity calculation method to filter the candidate set and obtain a smaller one. Experimental results show that our context-based candidate generation method obtains a smaller candidate set with higher recall.

The main goal of disambiguation is to infer which entity the query name refers to in a given context. We propose two solutions to the disambiguation problem: a context graph based disambiguation method, and a probabilistic disambiguation method based on the context of the entity. The two methods approach the disambiguation problem from context modeling and entity modeling, respectively.

Traditional entity linking methods are mainly based on context similarity. However, this is not how people disambiguate entities in context; people use background knowledge to analyze the semantics of the context. Knowledge bases such as Wikipedia contain many inter-links. These links show the connections between pieces of knowledge and, at the same time, form a directed graph. If the context around the entity can be modeled as a graph, the graph structure of the knowledge base can be used for disambiguation. In this paper, we model the names in the context and their candidate entities as nodes in a graph, and connect the context graph structure with the knowledge base graph structure for disambiguation. Experimental results show that this context graph based disambiguation method matches or is competitive with state-of-the-art methods in micro-averaged accuracy.

The accuracy of a disambiguation system depends mainly on how much detail the entities are represented in.
Usually, the more training text is available for modeling an entity, the more detail of the entity is represented. However, entities differ in popularity, so the amount of training text differs from entity to entity. Sometimes this imbalance is so severe that it harms disambiguation accuracy. This paper proposes a probabilistic model to address the data imbalance problem, based on the smoothing techniques used in language modeling. In addition, this paper proposes using an alias feature in the probabilistic model. Experimental results show that the smoothing technique and the alias feature improve system accuracy significantly.

In current entity linking, context mainly refers to the text around the target entity. However, for short texts such as microblog posts, the effective features in the context are insufficient for disambiguation, so the performance of current entity linking systems drops on microblog posts. Although the information in a single microblog post is not enough, the information across the whole microblog platform is redundant. In this paper, we propose leveraging relevant microblog posts as cross-document context for entity linking. We propose a pseudo relevance feedback based method and a graph based method. The pseudo relevance feedback based method uses relevant posts to expand the context of the query post and obtain more features for disambiguation. The graph based method overcomes the noise introduced by the pseudo relevance feedback method by weighting the expansion posts. Specifically, the graph based method models the candidate entities and the microblog posts as nodes in a graph and weights the edges by the similarity between the nodes. It then uses an iterative algorithm to propagate labels from the entity nodes to the post nodes.
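
The graph based idea can be sketched with a generic label propagation scheme. This is a minimal illustration under stated assumptions, not the dissertation's actual algorithm: the function `propagate_labels`, the clamping of entity nodes, the number of iterations, and all node names and edge weights are hypothetical.

```python
from typing import Dict, List, Tuple

def propagate_labels(
    entities: List[str],
    posts: List[str],
    edges: Dict[Tuple[str, str], float],
    iterations: int = 20,
) -> Dict[str, Dict[str, float]]:
    """Spread entity labels to post nodes over a weighted, undirected graph."""
    # Build an adjacency list; each edge weight stands for a node similarity.
    neighbors: Dict[str, List[Tuple[str, float]]] = {n: [] for n in entities + posts}
    for (a, b), weight in edges.items():
        neighbors[a].append((b, weight))
        neighbors[b].append((a, weight))

    # Entity nodes are seeds: each one is fully confident in its own label.
    scores = {e: {x: 1.0 if x == e else 0.0 for x in entities} for e in entities}
    # Post nodes start with a uniform distribution over the candidate entities.
    for post in posts:
        scores[post] = {x: 1.0 / len(entities) for x in entities}

    for _ in range(iterations):
        updated = {}
        for post in posts:
            total = sum(weight for _, weight in neighbors[post])
            if total == 0:
                updated[post] = scores[post]  # isolated post keeps its scores
                continue
            # New score: similarity-weighted average of the neighbors' scores.
            updated[post] = {
                x: sum(weight * scores[n][x] for n, weight in neighbors[post]) / total
                for x in entities
            }
        scores.update(updated)  # entity nodes stay clamped to their seed labels
    return scores

# Hypothetical toy graph: the query post alone is ambiguous, but a similar
# post on the platform is strongly tied to one of the candidates.
entities = ["Apple_Inc.", "Apple_(fruit)"]
posts = ["query_post", "related_post"]
edges = {
    ("query_post", "Apple_Inc."): 0.2,
    ("query_post", "Apple_(fruit)"): 0.2,
    ("query_post", "related_post"): 0.8,
    ("related_post", "Apple_Inc."): 0.9,
    ("related_post", "Apple_(fruit)"): 0.1,
}

scores = propagate_labels(entities, posts, edges)
best = max(scores["query_post"], key=scores["query_post"].get)
```

Because the related post contributes in proportion to its similarity to the query post, a weakly similar (noisy) expansion post would shift the scores only slightly, which is the intuition behind weighting the expansion posts.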
Experimental results show that both the pseudo relevance feedback based method and the graph based method effectively improve system accuracy, and that the graph based method performs better than the pseudo relevance feedback based method.

In all, this paper focuses on the candidate generation and disambiguation problems in the entity linking task. From the perspective of context, we propose solutions for candidate set quality, context modeling, entity modeling and context expansion. This paper makes some progress on these problems, and we anticipate that this progress can help promote the development of information extraction, automatic question answering and other natural language processing tasks.
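
As an illustration of the smoothing idea behind the probabilistic entity model summarized above, the sketch below applies Jelinek-Mercer interpolation, a standard language modeling smoothing technique. The abstract does not say which smoothing method the dissertation uses, so this choice, the interpolation weight `lam`, and the toy entity texts are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def smoothed_log_likelihood(
    context: List[str],
    entity_tokens: List[str],
    collection_counts: Counter,
    collection_size: int,
    lam: float = 0.5,
) -> float:
    """Log P(context | entity) under a Jelinek-Mercer smoothed unigram model."""
    counts = Counter(entity_tokens)
    length = len(entity_tokens)
    total = 0.0
    for word in context:
        p_entity = counts[word] / length if length else 0.0
        p_collection = collection_counts[word] / collection_size
        # Interpolate the sparse entity model with the collection model.
        p = (1 - lam) * p_entity + lam * p_collection
        if p == 0.0:
            return float("-inf")  # word unseen in the whole collection
        total += math.log(p)
    return total

# Hypothetical training text: a popular entity with more text and a
# less popular entity with very little text (the imbalance case).
entity_texts: Dict[str, List[str]] = {
    "Michael_Jordan_(basketball)": "basketball nba bulls player basketball champion nba".split(),
    "Michael_I._Jordan": "machine learning professor".split(),
}
collection_counts = Counter(w for tokens in entity_texts.values() for w in tokens)
collection_size = sum(collection_counts.values())

# "player" never occurs in the smaller entity's text; smoothing keeps the
# likelihood finite instead of zeroing it out.
context = ["learning", "professor", "player"]
log_likelihoods = {
    name: smoothed_log_likelihood(context, tokens, collection_counts, collection_size)
    for name, tokens in entity_texts.items()
}
best_entity = max(log_likelihoods, key=log_likelihoods.get)
```

Without the interpolation term, a single context word missing from an entity's limited training text would give that entity zero probability, which penalizes exactly the entities with the least training data.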
Keywords/Search Tags: Entity Linking, Context, Candidate Generation, Disambiguation, Graph Model, Probabilistic Model