Research And Implementation On Chinese Information Retrieval System Based On Structured Vector Space Model

Posted on: 2009-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: W P Cao
Full Text: PDF
GTID: 2178360242494088
Subject: Computer application technology
Abstract/Summary:
Information Retrieval (IR) is the process of extracting relevant information and documents from data sets. The emergence of the Internet has provided a new way of retrieving information, with structured data gradually giving way to semi-structured and even unstructured data. Traditional web information retrieval technologies struggle to deliver high-quality results from the growing volume of web text. This thesis studies a web-based information retrieval algorithm.

First, the thesis briefly outlines the development of information retrieval technology, including an analysis and comparison of keyword-based and hyperlink-based methods. To cope with low recall in keyword-based retrieval and topic drift in hyperlink-based retrieval, it proposes a new algorithm that combines the two methods. The algorithm ranks retrieval results by the hub and authority values derived from the links between web pages, together with a relevance weight for each page obtained by matching link anchor text and document content against the user query.

Second, considering the characteristics of web information retrieval, the thesis proposes a structured vector space model, motivated by an analysis of problems in the traditional vector space model. The new model represents a web document as a logically structured vector containing several sub-vectors, each corresponding to a relatively independent part of the page, such as the title, subtitles, plain text and anchor text.

Third, the thesis gives a detailed introduction to the web page collector and indexer, and to the pertinent principles and techniques of web information retrieval systems. It also discusses methods for denoising web content and extracting its themes using page markup trees, and presents an implementation that improves the quality, efficiency and compression ratio of web indexes.

Finally, building on traditional information retrieval algorithms, the thesis designs and implements a web-based Chinese information retrieval system that combines keyword-based and hyperlink-based retrieval algorithms through the structured vector space model. In the SEWM2007 (Symposium of Search Engine and Web Mining 2007) evaluation, the search algorithm used by the system was shown to greatly improve both the recall and the precision of web information retrieval.
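
The idea of a structured vector, with one sub-vector per page part, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the field names, the weights in FIELD_WEIGHTS, the raw term-frequency weighting and the weighted sum of per-field cosine similarities are all assumptions made for illustration, and the input is assumed to be already segmented into tokens (as Chinese text would need to be).

```python
import math
from collections import Counter

# Hypothetical per-field weights; the abstract does not state any values.
FIELD_WEIGHTS = {"title": 0.4, "subtitle": 0.2, "anchor": 0.2, "text": 0.2}


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def structured_vector(fields):
    """Build one term-frequency sub-vector per page part.

    `fields` maps a part name (e.g. "title") to a list of tokens.
    """
    return {name: Counter(tokens) for name, tokens in fields.items()}


def structured_score(query_tokens, doc_vec, weights=FIELD_WEIGHTS):
    """Assumed scoring rule: weighted sum of per-field cosine similarities."""
    q = Counter(query_tokens)
    return sum(
        w * cosine(q, doc_vec.get(name, Counter()))
        for name, w in weights.items()
    )


# Two toy pages containing the same words in different parts.
doc_a = structured_vector({"title": ["search"], "text": ["engine"]})
doc_b = structured_vector({"title": ["engine"], "text": ["search"]})
score_a = structured_score(["search"], doc_a)
score_b = structured_score(["search"], doc_b)
```

Under these assumed weights, a query term that matches a page's title contributes more to the score than the same term matched only in the plain text, which is the kind of distinction a flat vector space model cannot express.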
Keywords/Search Tags:IR, search engine, vector space model, inverted index