
Studies On Entity Search And Resolution

Posted on: 2013-01-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L Jiang
Full Text: PDF
GTID: 1118330371985701
Subject: Radio Physics
Abstract/Summary:
Quickly and accurately finding entities (e.g., person names, organizations, locations, products, and drugs) in unstructured or semi-structured data is increasingly important in a wide range of applications, such as information retrieval, recommendation systems, and social network mining. Surveys in recent years show that entity search accounts for a large share of Internet queries, and that this share has been rising. Compared with words and n-grams, entities are better at describing context features, which helps users quickly grasp the key points of a document. However, as Internet data keeps growing, entity search becomes more challenging, especially because of entity ambiguity. First, many different entities may have exactly the same name. For example, more than 290,000 people in China are named "Zhang Wei"; given an entity name as a query to a search engine, the top 100 results may refer to a number of different entities that share that name. Second, a single entity is often mentioned in a variety of forms (i.e., aliases). For example, "the People's Republic of China" is well known as "China" or "P.R.C.", and Liu Xiang has the nickname "Asian night". In the pharmaceutical industry, cases where several drugs share the same name, or where one drug has different variants, are not negligible and are dangerous for medication. Entity name disambiguation and entity alias discovery are two closely related tasks, and they are known as the two most important problems in entity search and entity resolution. This thesis surveys previous research on entity search and analyzes the characteristics of data from different sources, including surface networks, social networks, and internal networks. Moreover, we propose effective solutions for entity disambiguation and entity alias discovery, respectively.
In addition, based on the entity disambiguation solution, we develop a people search system, GRAPE. Moreover, we extend the proposed entity alias discovery solution to adapt to dynamic environments. The main contributions of this thesis are as follows:
1. A survey on entity search. We present the various problems and solutions in entity search and describe some existing entity search systems. Moreover, some issues and future research directions for people search systems are summarized.
2. Entity name disambiguation. Given a person name as a query, we obtain unstructured documents returned by existing search engines (e.g., Google or Bing). We then use a natural language processing tool to extract eight types of named entities from the documents as tags. Based on the extracted tags, an entity-relationship graph is built, and the tags are finally grouped into several clusters, each of which uniquely describes one person entity. Additionally, a practical entity search system, GRAPE, is deployed based on the proposed solution; it presents a cluster of tags for each of the different persons sharing the same name.
3. Entity alias discovery. We design a string match method to extract a few alias candidates for each given entity. By exploring the entity relationships in both structured and unstructured data, an entity-relationship graph is built, and we then search the graph-based connectivity between a given entity and all of its alias candidates. Finally, each given entity is assigned a list of candidates. Moreover, to handle aliases that have no string similarity with the original entity, we present a subset-based method to choose alias candidates and ultimately obtain a few aliases for each given entity through prediction by a logistic regression classifier.
4. Dynamic entity alias discovery. As the data corpus is updated, the corresponding entity-relationship graph keeps changing, so the previous solutions based on static datasets are no longer applicable. In this thesis, we propose an entity-index strategy for path searching in a dynamic graph and then apply this strategy to the real application of incremental entity alias discovery.
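The tag-clustering step of entity name disambiguation can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so the code below assumes the simplest possible choice: tags are linked when they co-occur in enough documents, and the connected components of that graph are taken as clusters. The function name `cluster_tags`, the `min_cooccur` threshold, and the sample tags are all hypothetical and are not taken from the thesis or from GRAPE.

```python
from collections import defaultdict
from itertools import combinations


def cluster_tags(doc_tags, min_cooccur=2):
    """Group tags into clusters; each cluster is meant to describe one person.

    doc_tags: list of tag lists, one per retrieved document (assumed input).
    min_cooccur: hypothetical threshold on shared documents for an edge.
    """
    # Count in how many documents each pair of tags appears together.
    pair_counts = defaultdict(int)
    for tags in doc_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1

    # Register every tag as a node, so isolated tags are not lost.
    graph = defaultdict(set)
    for tags in doc_tags:
        for t in tags:
            graph[t]

    # Keep only edges between tags that co-occur often enough.
    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:
            graph[a].add(b)
            graph[b].add(a)

    # Connected components of the graph are the tag clusters.
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Made-up tags from four documents returned for one ambiguous person name.
docs = [
    ["Peking University", "Beijing", "physics"],
    ["Peking University", "Beijing"],
    ["Shanghai", "football"],
    ["Shanghai", "football", "Beijing"],
]
for cluster in cluster_tags(docs):
    print(sorted(cluster))
```

With this toy input, tags that appear together in only one document stay unlinked, which is why a threshold above 1 is used here; a real system would likely need a weighted graph and a more robust clustering method.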
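The two stages of entity alias discovery (string matching to propose candidates, then a connectivity search in the entity-relationship graph) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the string match method or the connectivity measure, so the sketch substitutes the similarity ratio from Python's standard `difflib.SequenceMatcher` and a hop-limited breadth-first search. The function names, the `threshold` and `max_hops` parameters, and the sample names are hypothetical.

```python
from collections import deque
from difflib import SequenceMatcher


def alias_candidates(entity, names, threshold=0.6):
    """Return names whose string similarity to `entity` reaches the threshold."""
    return [
        n for n in names
        if n != entity
        and SequenceMatcher(None, entity.lower(), n.lower()).ratio() >= threshold
    ]


def hop_distance(graph, source, target, max_hops=3):
    """Breadth-first search; hop count to target, or None if beyond max_hops."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def connected_aliases(entity, names, graph, threshold=0.6, max_hops=3):
    """Keep string-similar candidates that are also reachable in the graph,
    ordered from the closest to the farthest."""
    scored = []
    for cand in alias_candidates(entity, names, threshold):
        dist = hop_distance(graph, entity, cand, max_hops)
        if dist is not None:
            scored.append((dist, cand))
    return [cand for _, cand in sorted(scored)]


# Made-up entity-relationship graph as an adjacency mapping.
graph = {
    "J. Smith": ["Acme Corp"],
    "Acme Corp": ["J. Smith", "John Smith"],
    "John Smith": ["Acme Corp"],
    "J Smith": [],
}
names = ["John Smith", "J Smith", "Alice Wong"]
print(connected_aliases("J. Smith", names, graph))
```

In this toy graph, "J Smith" passes the string filter but has no path to the query entity, so only the candidate that shares a neighbor with it is kept. Aliases with no string similarity at all would not be found this way, which is the case the thesis addresses separately with its subset-based method and logistic regression classifier.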
Keywords/Search Tags: entity search, entity resolution, entity disambiguation, entity alias, people search system, dynamic graph mining, entity-relationship graph