Theory And Key Techniques Of Entity Retrieval

Posted on: 2015-06-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
GTID: 1108330509961011
Subject: Control Science and Engineering
Abstract/Summary:
With the ubiquity of the Internet and the rapid development of information retrieval (IR) technologies, the scope of IR has gone far beyond document retrieval. Entity retrieval, which aims to find entities such as people, places, organizations and products that satisfy a user's information need, has become a new research focus. This dissertation presents a systematic study of several open problems in entity retrieval.

Expert finding, one of the most important tasks in entity retrieval, was studied first:

1. A document-query-candidate association model was proposed. Classical expert finding models usually assume that the candidate and the query are conditionally independent given the document. This assumption often does not hold in real-world applications, which limits the performance of classical models. This dissertation proposed a topic model for expert finding that is based on latent Dirichlet allocation and does not rely on this conditional independence assumption. The model was evaluated on the CSIRO Enterprise Research Collection, and the results showed that it could substantially improve expert finding performance.

2. The document priors in entity retrieval models were studied and a Doc Rank-based expert finding model was proposed. In addition, a Topic Rank algorithm was developed to deal with synonyms in the documents. Latent Dirichlet allocation was used to extract the topics of the documents, and a document graph was then constructed by analyzing the topic distribution of each document. Finally, link analysis techniques were applied to obtain Topic Rank-based document priors, from which a Topic Rank-based expert finding model was developed.

3. The candidate priors in entity retrieval models were studied. Candidate priors encode the importance of the candidates and, if well designed, can substantially improve the performance of an expert finding system. Most studies, however, assume uniform candidate priors, which treats all candidates as equally important. This assumption is unrealistic. This dissertation proposed a topic-centric candidate prior model that exploits information from the whole corpus to obtain more reasonable candidate priors.

Because expert finding systems cannot retrieve the relations between different entities, which are of great importance for entity retrieval, this dissertation further studied related entity finding:

1. Methods for extracting entities from tables and lists were studied. Tables and lists in Web pages contain a large number of entities, but named entity recognition tools usually fail to extract them effectively because of the lack of context. This dissertation presented a new method for extracting entities from tables and lists: tables and lists were classified according to their characteristics, and entities were then extracted by considering their fine-grained types.

2. Entity filtering models were studied. The candidate entity lists obtained from entity recognition usually contain a large amount of noise and therefore need to be filtered after extraction. This dissertation presented an entity filtering model based on document frequency. Unlike traditional filtering methods, the model relies mainly on document frequency information, which keeps its computational cost low and makes it efficient enough for real-time use.

3. An entity ranking method based on a topic model was proposed. Latent Dirichlet allocation was first used to extract the topics of the related documents. Candidate entities were then ranked according to the co-occurrence frequency between the candidate entities and the queries.

Finally, entity name disambiguation models were studied. Name ambiguity is very common in entity retrieval: an entity may have many different names, while multiple entities may share the same name. This dissertation focused on person name disambiguation, one of the most important branches of entity name disambiguation, and proposed a three-stage model. In the first stage, a person name disambiguation algorithm based on latent Dirichlet allocation was proposed, and this algorithm and hierarchical agglomerative clustering were each used separately to disambiguate person names. In the second stage, a voting model integrated the first-stage results to obtain high-purity results. In the third stage, an agglomerative model was used to further improve the inverse purity of the results.
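
The abstract refers to purity and inverse purity without defining them. The sketch below assumes the definitions commonly used in clustering evaluation: purity is the size-weighted best overlap of each system cluster with any gold class, and inverse purity swaps the two roles. The function names and the toy document identifiers are hypothetical, and the example does not reproduce the dissertation's disambiguation model.

```python
def purity(clusters, gold):
    """Fraction of items that fall in the best-matching gold class of their cluster."""
    n = sum(len(c) for c in clusters)
    # For each system cluster, take its largest overlap with any gold class.
    return sum(max(len(c & g) for g in gold) for c in clusters) / n


def inverse_purity(clusters, gold):
    """Purity with the roles of system clusters and gold classes swapped."""
    return purity(gold, clusters)


# Hypothetical toy data: five pages mentioning the same ambiguous person name.
gold = [{"d1", "d2", "d3"}, {"d4", "d5"}]        # two real people
system = [{"d1", "d2"}, {"d3"}, {"d4", "d5"}]    # over-split clustering
merged = [{"d1", "d2", "d3"}, {"d4", "d5"}]      # after merging two clusters

print(purity(system, gold), inverse_purity(system, gold))   # 1.0 0.8
print(purity(merged, gold), inverse_purity(merged, gold))   # 1.0 1.0
```

In this toy case the over-split clustering is perfectly pure but splits one person across two clusters, which lowers inverse purity. Merging the two clusters raises inverse purity without reducing purity, which is the kind of trade-off the second and third stages address.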
Keywords/Search Tags: entity retrieval, expert finding, related entity finding, topic model, latent Dirichlet allocation, entity extracting, entity ranking, entity name disambiguation