Font Size: a A A

Research On Information Retrieval Model Based On Learning To Rank

Posted on:2013-07-18Degree:DoctorType:Dissertation
Country:ChinaCandidate:F ChengFull Text:PDF
GTID:1228330395455215Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Information Retrieval is a procedure of storing information and finding out the information which user need. The key problem of it is how to create a retrieval model that can sort documents (web pages) based on their relevance to the given query. Early retrieval models are very simple and only have a few parameters to tune, but their precision is low which cause the user dissatisfied with the retrieval results. To solve this problem, recently, methods of learning to rank have been applied to the retrieval model construction in order to get promising results. The so called learning to rank is to automatically create a retrieval (ranking) model by using labeled training data and machine learning techniques. For its wide usage in the area of document retrieval and collaborative filtering, the research of learning to rank has received more and more attentions, and become a hot research field in machine learning.The aim of this dissertation is to explore the theories and mechanism of retrieval model based on learning to rank and design effective algorithms from the view of Pairwise, Pointwise and Listwise, so as to obtain more accurate ranking model. The main research works in this dissertation can be summarized as followings:(1) For the problem that the ranking model obtained by traditional Ranking SVM is not accurate when measured by NDCG, an improved ranking algorithm is proposed. The algorithm firstly designs a query level framework, and then defines an objective function orient to NDCG. To solve the problem that the objective function is non-smooth, the algorithm presents to use the cutting plane algorithm. Experimental results on the benchmark datasets prove the effectiveness of the proposed algorithm.(2) The existed ranking algorithms based on direct optimization of rank measure adopt either Pairwise method or Listwise method, but lack of considering the method of Pointwise. To fill the gap, two novel ranking algorithms based on Pointwise method are proposed, these two algorithms are both optimization of NDCG, while have different objective functions and use different optimization technique. Specifically, the first one is under the framework of margin-rescaling, then designs a convex objective related to NDCG. Faced with this non-smooth function, the algorithm presents to employ cutting plane algorithm to overcome it. For the problem that the existed cutting plane algorithm may lead the master problem to decrease non-monotonically and converge slowly, an efficient line search algorithm is designed to improve the choices of cutting planes. The second algorithm adopts the framework of slack-rescaling, and designs a non-convex objective oriented to NDCG, which is tighter than the convex ones. Faced with the difficulty that the object function is non-convex and non-smooth, the dissertation proposes to use Concave-Convex Procedure and Cutting Plane Algorithm to overcome. Results on the benchmark datasets prove the effectiveness of these two proposed algorithms.(3) The dissertation proposes a novel ranking algorithm which combined Listwise method and Pairwise method. The algorithm divides the process of learning to rank into two stages. The first stage is a "base learner", which uses Listwise method. In this stage the algorithm selects one-slack SVM as tool, and defines an optimization objective related to NDCG. In order to solve the problem of exponential size of constraints in the optimization, a cutting plane algorithm is introduced. For the sub-problem of finding the most violated constraint, we address it by Quick Sort. The second stage is a refinement of the first one, which belongs to Pairwise method. In this phase, a non-convex sigmoid loss function is proposed which guarantees the result is a local maximum of the first stage. Empirical studies on benchmark collection show that the model learned by the proposed algorithm is not only more accurate but also more stable than those learned by Listwise method or Pairwise method.(4) For the problem that Ranking SVM is sensitive to noise and outliers, the dissertation proposes to use non-convex Ramp Loss to suppress the influence of outliers. Specifically, we present two algorithms to solve above problem. The first one directly designs a novel non-convex objective function based on Ramp Loss. For the problem that the objective function is neither convex nor smooth, firstly, we use the Concave-Convex Procedure (CCCP) to approach original function, then present an Online learning algorithm to solve it. The second algorithm applies selective sampling technology into pretreatment of data sets, which means using the Ramp Loss as a filter and discard the data that maybe outliers, then the algorithm is trained by the rest of data. Results on the different datasets show those two algorithms we proposed have significant robustness to outliers, which can generate more accurate ranking models, and at the same time, because of their less support vectors, the algorithms also reduce the training time.As a crossover issue between information retrieval and machine learning, the works in this dissertation are very important. On one hand, the research results of the dissertation can be directly applied to the area of document retrieval, collaborative filtering, experts finding, sentiment analysis, spam etc. On the other hand, the theories and mechanism proposed in the dissertation also provide the technical support to other related research such as natural language parsing, sequence alignment, label sequence learning.
Keywords/Search Tags:Machine Learning, Information Retrieval, Learning to Rank, RankingModel, Cutting Plane Algorithm, Support Vector Machine
PDF Full Text Request
Related items