
Research On Query Reformulation Based On Machine Learning

Posted on: 2013-09-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Q Wang
Full Text: PDF
GTID: 1228330395451181
Subject: Computer application technology
Abstract/Summary:
Uncovering users' information needs is an important research topic in text retrieval, and several problems remain open. One is the "vocabulary problem": truly relevant documents are not retrieved because they do not contain the query terms. Another is that the keywords in a query are not equally important, so the retrieval system must assign an appropriate weight to each query term. A third is "relevance feedback": how to use explicit and implicit relevance feedback to refine the query.

This thesis focuses on reformulating users' queries, with machine learning algorithms at the core of the work. The author studies the problems described above from three aspects: (1) a query expansion method based on supervised learning, (2) a query term weighting method based on a linear classification model, and (3) a constrained clustering algorithm designed to integrate explicit relevance feedback information.

First, to address the mismatch between query terms and relevant documents, the author proposes a query expansion approach based on machine learning, which trains a supervised model to select expansion terms. The advantage of this method is that it can integrate various features of a candidate expansion term, which are used to predict whether the term is good or bad. However, because training data is rarely available, the author proposes an algorithm that automatically generates the training dataset from the answer set used for evaluation. Based on this generation algorithm, the thesis provides a detailed analysis of the auto-generated training data, which can guide the training process.
The supervised expansion method performs well on several benchmark evaluation datasets and shows a large improvement over previous expansion methods.

Second, to address the problem of assigning a weight to each query term, the thesis proposes a framework based on a probabilistic classification model, which transforms the term weight assignment task into a parameter estimation task in supervised learning. A generative model and a discriminative model are each adopted in the parameter estimation process. Experiments on the TREC benchmark dataset indicate that retrieval performance improves significantly with either the generative or the discriminative estimation method.

Finally, the thesis turns to explicit relevance feedback. The author studies a constrained clustering algorithm that takes explicit relevance information as constraints in the pseudo-relevant document clustering process, so that the system can gather high-quality pseudo-relevant documents from which a better new query can be derived. The author first conducts simulated experiments on the benchmark dataset and then, given real relevance information from users, explores the performance of constrained clustering on a large-scale dataset (ClueWeb09). All experimental results consistently demonstrate the effectiveness of constrained clustering for relevance feedback.
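The abstract does not spell out how the training data for the expansion-term classifier is generated from the evaluation answer set. One plausible reading, sketched below purely as an assumption, is to label a candidate term as good (1) when adding it to the query raises average precision against the known relevant documents, and bad (0) otherwise. The toy term-frequency ranker, the function names (`rank`, `average_precision`, `label_candidates`), and the sample documents are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def rank(query_terms, docs):
    """Rank document ids by a toy score: summed term frequency of the query terms."""
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(text.split())
        scores[doc_id] = sum(tf[t] for t in query_terms)
    # Descending score; ties are broken by document id so the ranking is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranking, relevant):
    """Average precision of a ranking with respect to a set of relevant document ids."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def label_candidates(query_terms, candidates, docs, relevant):
    """Assumed labeling rule: 1 if adding the term improves average precision, else 0."""
    baseline = average_precision(rank(query_terms, docs), relevant)
    labels = {}
    for term in candidates:
        expanded = average_precision(rank(query_terms + [term], docs), relevant)
        labels[term] = 1 if expanded > baseline else 0
    return labels


# Hypothetical toy collection and answer set (relevance judgments).
docs = {
    "d1": "car engine repair manual",
    "d2": "automobile engine maintenance",
    "d3": "car rental prices",
    "d4": "cooking recipes",
}
relevant = {"d1", "d2"}

labels = label_candidates(["car", "repair"], ["automobile", "rental"], docs, relevant)
print(labels)  # {'automobile': 1, 'rental': 0}
```

In this toy run, "automobile" pulls the relevant document d2 above the non-relevant d3 and is labeled good, while "rental" leaves the ranking unchanged and is labeled bad. Labels of this kind, paired with features of each candidate term, could then serve as training examples for a supervised selector; the actual features and labeling criterion used in the thesis may differ.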
Keywords/Search Tags: Query Expansion, Query Reformulation, Machine Learning, Relevance Feedback, Constrained Clustering