Study Of Text Information Retrieval Algorithms Based On Web

Posted on: 2007-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: K Z Fu
Full Text: PDF
GTID: 2178360182460596
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the number of documents on the Internet is growing exponentially, and how to handle this large volume of online documents has become an important research topic. Text information retrieval is the task of finding the documents in a collection that are most relevant to a user query. This thesis studies web-based information retrieval algorithms.

First, the development and techniques of information retrieval are briefly introduced, and on this basis the content-based, link-based, and fusion-based retrieval algorithms are analyzed. To avoid the low recall of content-based retrieval and the topic drift of hyperlink-based retrieval, a new algorithm based on hyperlinks and anchors is proposed that combines content-based and link-based retrieval. In this algorithm, PageRank values are first calculated from the links between web pages; the relevance weight of each page is then obtained from both its PageRank value and its document content; and the retrieval results are ranked by that weight. The experimental results show that the new algorithm achieves much higher precision and recall.

Second, to improve precision and reduce retrieval time, an information retrieval algorithm based on an N-level VSM is put forward, building on the traditional vector space model (VSM) to improve the content similarity measure. In addition, a web noise reduction algorithm is applied when the index is built. Once information irrelevant to the topic is removed, index construction and retrieval become noticeably faster and the storage space required is greatly reduced. Compared with the traditional VSM, the new algorithm reduces time complexity and improves precision.
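As a point of reference for the traditional vector space model mentioned above, the following is a minimal sketch of TF-IDF weighting with cosine similarity. It is not the N-level VSM of the thesis, whose details the abstract does not give. The whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the collection are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_content(query, docs):
    """Return (document index, similarity) pairs, most similar first."""
    idf = build_idf(docs)
    q = tfidf_vector(query, idf)
    scores = [(i, cosine(q, tfidf_vector(doc, idf))) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_content("text retrieval model", docs)` on a small list of strings returns the documents ordered by content similarity alone. A multi-level variant would presumably refine how the vectors are built, but that step is left out here because the abstract does not specify it.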
The experimental results show that the new algorithm achieves much higher precision and recall.

Finally, using the improved algorithms, a web-based text information retrieval system is implemented on top of the traditional information retrieval algorithm.
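The hybrid ranking outlined in the abstract (PageRank computed from links, then combined with content relevance) could be sketched roughly as follows. The abstract does not say how the two scores are combined or how anchor text enters the weight, so the linear mix, the `alpha` parameter, the damping factor of 0.85, and the function names are all assumptions rather than the author's method.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def fuse_scores(content_scores, link_scores, alpha=0.7):
    """Rank pages by alpha * content + (1 - alpha) * normalized PageRank.

    The linear combination is an illustrative assumption, not the thesis formula.
    """
    top = max(link_scores.values(), default=0.0) or 1.0
    fused = {
        page: alpha * score + (1.0 - alpha) * link_scores.get(page, 0.0) / top
        for page, score in content_scores.items()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Here `content_scores` would come from a content-based measure such as cosine similarity between the query and each page. Dividing PageRank by its maximum keeps both terms on a comparable 0 to 1 scale before they are mixed.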
Keywords/Search Tags: Text Information Retrieval, Vector Space Model, Link Analysis, Recall, Precision