
Web Information Retrieval Based On Semi-supervised Manifold Learning

Posted on: 2010-03-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Wang
GTID: 1118360302458560
Subject: Computer Science and Technology
Abstract/Summary:
The dawning of Web 2.0 has witnessed the gradual evolution of the World Wide Web from a vast information repository into a worldwide platform for user participation, sharing, and interaction. This has led to rapid growth of heterogeneous data on the Web, such as images, audio clips, and video clips, as well as an increasing demand for personalized Web information retrieval. As a result, heterogeneous information and personalized user demands have become two major challenges for Web information retrieval.

In this thesis, we study Web information retrieval techniques based on semi-supervised manifold learning. Semi-supervised manifold learning aims to build a classifying function by exploiting the intrinsic manifold structure collectively revealed by labeled and unlabeled data. In many Web information retrieval applications, data of various types, such as text, images, and video, are represented in the Vector Space Model (VSM), in which relevant data dwell on submanifolds embedded in the ambient space. Semi-supervised manifold learning techniques can therefore be applied effectively in these applications to make better use of User Generated Content (UGC) by learning from the underlying manifold structure, and to enhance the personalized user experience in Web information retrieval.

In this thesis, we explore the application of semi-supervised manifold learning techniques in the following Web information retrieval tasks:

1. Content-based image retrieval (CBIR). Relevance feedback is introduced into CBIR to bridge the "semantic gap" between low-level features and high-level concepts. This, however, brings a new problem into CBIR, known as the "curse of dimensionality". To address this issue, this thesis proposes a novel semi-supervised learning method for dimensionality reduction, kernel maximum margin projection (KMMP), based on maximum margin projection (MMP). By projecting the images into a lower-dimensional subspace, KMMP effectively improves the performance of image retrieval.

2. Face retrieval in Web news. News consists mostly of stories about people, so queries for text and images related to a specific person are in demand. Because labeling faces in news photos is expensive, most existing approaches for retrieving faces in the news are unsupervised. In this thesis, we propose a new semi-supervised approach based on ranking on face manifolds. Using only a very small number of labeled faces, the proposed approach achieves better precision and reduces the high error rate of unsupervised approaches when there exist many negative samples of the same person in the dataset.

3. Web page summarization. In social networks, tags on a Web page are both highly generalized descriptions of the topics it contains and annotations for the contents users are interested in. This makes the tags on a Web page a good source for user-oriented summarization. In this thesis, we propose a graph-based social summarization approach that generates a user-oriented Web page summary in two steps: (1) a weighted graph is derived from the tripartite collaborative tagging model by analyzing user tagging behavior; (2) user interest propagation on the weighted graph is performed using the manifold ranking algorithm to generate a summary focusing on the contents users are interested in.

4. Identification of Web news titles. Traditional Web news title identification approaches are template-based and therefore vulnerable to template updates. In this thesis, we propose a template-independent Web news title identification approach based on the visual features of the title in a news page. We first segment a news page into blocks using the VIPS algorithm and extract the visual features and content features of each block. The title block dwells on a manifold in the resulting feature space. By ranking the data on this manifold, we effectively identify the news title block.
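
Tasks 2 to 4 above all rely on ranking data points on a manifold. The abstract does not give the formulation used in the thesis, so the following is only a minimal sketch of the commonly used iterative form of manifold ranking, f ← αSf + (1 − α)y, where S is a symmetrically normalized affinity matrix and y marks the labeled query points. The function name `manifold_rank`, the Gaussian affinity, and the parameters `sigma`, `alpha`, and `iters` are assumptions made for illustration, not details taken from the thesis.

```python
import math

def manifold_rank(points, query_idx, sigma=1.0, alpha=0.9, iters=200):
    """Score every point by its relevance to the labeled query points.

    Hypothetical helper: assumes at least two points and that every
    point has a nonzero affinity to at least one other point.
    """
    n = len(points)
    # Gaussian affinity matrix W with a zero diagonal (assumed kernel).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    deg = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # Initial scores: 1 for labeled queries, 0 for everything else.
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    # Spread scores along the graph while pulling back toward y.
    for _ in range(iters):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

# Toy data: two well-separated clusters, with one labeled point (index 0).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
scores = manifold_rank(points, query_idx={0}, sigma=0.5)
```

In this toy setup, the unlabeled points in the query's cluster receive higher scores than the points in the distant cluster, which is the behavior a small number of labeled faces, tags, or title blocks would be expected to exploit.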
Keywords/Search Tags:Web Information Retrieval, Semi-supervised Manifold Learning, Dimensionality Reduction, Manifold Ranking, Web Image Retrieval, Web Page Summarization, Face Retrieval