Font Size: a A A

Similarity Learning For Heterogeneous Data And Its Application To Web Search

Posted on:2013-01-03Degree:DoctorType:Dissertation
Country:ChinaCandidate:W WuFull Text:PDF
GTID:1118330362963432Subject:Probability theory and mathematical statistics
Abstract/Summary:PDF Full Text Request
This thesis studies the problem of similarity learning for heterogeneous data and itsapplication to web search. Similarity learning plays a crucial rule in applications likeweb search, collaborative filtering, image annotation, and machine translation. The tasksin all these applications can be formalized as learning and utilizing a similarity functionto match heterogeneous object pairs. The object pairs are query and document in websearch, user and item in collaborative filtering, key word and image in image annotation,and translations in two diferent languages in machine translation. Particularly, in websearch, the search engine performs matching between queries and documents. The rapidincrease in the amount of information available on the web makes search engines moreand more important in people's life. A search engine aims to retrieve relevant documentsto the queries issued by users and create ranking lists of documents based on the rele-vance between the queries and documents. Queries and documents are heterogeneousand their relevance is determined by the similarity between them. The similarity measureemployed by a search engine determines its performance.The thesis proposes defining the dot product in a Hilbert space as a similarity func-tion. Specifically, two mappings are defined for the two types of heterogeneous objects,and map the two objects into a common Hilbert space. After that, the dot product of their images in the common space is taken as a similarity function. Two approaches areconsidered to learn such a similarity function: (1) first learning two mappings, then cal-culating the dot product of images after the mapping; (2) directly learning a similarityfunction. Within each of the two approaches, the thesis aims to solve three problems:(1) how to leverage information from diferent sources. For example, in web search, thecontent of queries and documents as well as click through data, can be used to learn asimilarity function; (2) how to enhance the efciency and the scalability of the learningalgorithm such that it can handle large scale data; (3) how to analyze the generalizationability of the learning algorithm.In the thesis, the approach that learning two mappings and then calculating similar-ity as the dot product of images after the mapping is first considered. Particularly, Thetwo mappings are specified as two linear mappings, and the similarity function is repre-sented by a bilinear form. To learn the linear mappings, two types of hypothesis spacesare proposed. First, the columns of the linear mappings are required to be orthonormal.The assumption leads to a multi-view method that can efectively leverage informationfrom diferent sources. Next, to enhance the efciency and the scalability of learning,a regularization approach is proposed. The approach constrains the rows of the linearmappings with 1norm and 2norm. By doing so, the solutions are sparse and the paral-lelization of the algorithm is also facilitated. Finally, the thesis also systematically studiesthe generalization ability of the similarity learning method.Secondly, the approach that learning a similarity function by directly defining itshypothesis space is considered. Particularly, a kernel method in machine learning isemployed and a kernel based similarity learning approach is proposed. Specifically, thethesis defines a special positive semi-definite kernel named S-kernel. An S-kernel cangenerate a hypothesis space for similarity functions. The use of the kernel method canguarantee the optimal solution and its generalization ability. To improve the efciency oflearning, an online approximation of the algorithm is also proposed.The similarity learning methods proposed in the thesis are applied to web search.The thesis shows that with the approaches one can solve the"term mismatch"problemin web search. Experiments are conducted on large-scale real-world data in enterprise search and web search. The experimental results indicate that the proposed methodscan successfully overcome the term mismatch problem and significantly outperform thebaseline methods in relevance ranking and similar query finding.
Keywords/Search Tags:similarity learning, web search, dot product, bilinear, kernel method
PDF Full Text Request
Related items