
Knowledge Mining For Web Information Retrieval

Posted on: 2011-08-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H J Di
Full Text: PDF
GTID: 1118360305466713
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, search engines have become an important tool for finding information. The core technique of most information retrieval (IR) systems today is keyword matching, which satisfies users' needs to a certain degree. However, users' queries are usually short, informative, and ambiguous, which poses a major challenge for current IR technology. Mining knowledge from Web data to improve IR performance and build IR-related services has therefore become an active research topic.

In this dissertation, we focus on mining knowledge from user query logs and webpages in order to improve IR performance and build IR-related services. The work covers three problems: query classification based on query logs, named entity mining from query logs, and entity relation mining from webpages.

1. Query Classification (QC) Based on Query Logs

We first studied the problem of QC. Because queries are usually short and ambiguous, we propose a QC framework based on a regularized correlated topic model (RCTM) to address these problems. The framework captures the relationship between queries and target categories through an intermediate taxonomy and a probabilistic topic model. In addition, it includes a feature expansion method that enriches the semantics of both queries and target categories. Experiments on the KDDCUP 2005 data set show that our RCTM-based QC framework outperforms the baseline methods.

2. Named Entity Mining (NEM) From Query Logs

We further studied the problem of mining named entities from query logs. Because queries are usually short (e.g., 2-3 words) and ambiguous, methods from named entity recognition and classification (NERC) cannot be applied directly and effectively to query logs. These characteristics pose a strong challenge for research on mining named entities from query logs. We explored both weakly-supervised and supervised learning methods for this task. The two methods approach the problem from different modeling perspectives. Experimental evaluations on real query logs show that both methods mine named entities from query logs effectively and greatly outperform the baseline methods.

3. Entity Relation Mining From Webpages

In contrast to entity relation extraction from text, we studied the problem of related entity finding (REF), which was proposed by the TREC 2009 Entity Track. The overall aim of REF is to perform entity-related search on Web data, which addresses common information needs that are not well modeled as ad hoc document search. The main challenges of REF are how to build an effective framework for processing a huge dataset and how to model the relation between entities. In this dissertation, we propose a novel framework for REF in a Web collection, in which a probabilistic model is built to model the entity relation and rank candidate entities. Experimental evaluations on the TREC 2009 Entity Track dataset show the effectiveness of our REF framework.

Because research on knowledge mining for Web information retrieval began only recently, our work contributes to the development of this field. Further study is still needed to improve IR performance and build IR-related services based on knowledge mining.
Keywords/Search Tags: information retrieval, query log, query classification, named entity mining, entity relation mining, topic model, semantic expansion, knowledge mining, transfer learning, supervised learning, weakly-supervised learning