
Study Of Web Information Retrieval Based On Structure And Subject

Posted on: 2008-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: J J Liu
Full Text: PDF
GTID: 2178360242967342
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the number of web pages on the Internet is increasing exponentially. A popular way to find relevant information in this mass of data is to use a search engine. However, the sheer volume of information on the Internet makes search engines difficult to develop. How to process such large volumes of online data and quickly return more relevant information to the user has become an urgent and meaningful research topic.

This paper studies current Web information retrieval (IR) technology based on the structural features and topic information of both HTML and XML documents. HTML retrieval is a mature field, while XML retrieval is still developing.

First, the development of IR technology is briefly introduced, followed by the operating principles, research hot spots, categorization, and evaluation of search engines, which apply IR technology to the Web successfully. Because current retrieval methods in the HTML IR domain have low precision, this paper proposes an algorithm that exploits the hyperlinks between Web pages and their anchor texts to rerank retrieval results, taking Web structure information into account to improve on current ranking methods. The experimental results show that the new algorithm achieves much higher precision and recall.

Second, a large set of retrieval results presented as a single ranked list is hard for users to browse. This paper proposes a method that automatically classifies the results into categories using an extended hyperlink algorithm, so that users can browse retrieval results by the subject they are interested in. The experimental results show that this algorithm improves the quality of Web page categorization and performs better in the SEWM2007 Chinese Web Page Categorization Evaluation.

This paper also studies XML IR on the basis of traditional HTML IR theory. It proposes a ranking method that considers both the structural characteristics of XML documents and the subject of the user's query, using a combination strategy and topic categorization. The results show that this method improves XML IR quality and performs better than other results in INEX2007.

This paper provides further study and discussion of current Web IR and solves some existing problems. Further research and improvement remain to be done in the future.
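
The abstract does not give the details of the reranking algorithm, so the following is only a minimal illustrative sketch of the general idea of reranking with hyperlinks and anchor text, not the thesis's method. The function name `rerank`, the weights `alpha`, `beta`, and `gamma`, and the logarithmic damping are all assumptions made for this example: each result's base retrieval score is combined with the number of query terms matched in the anchor texts of incoming links and with its in-link count.

```python
import math
from collections import defaultdict


def rerank(results, links, query, alpha=0.6, beta=0.3, gamma=0.1):
    """Rerank results using anchor text and in-link counts (hypothetical sketch).

    results: list of (url, base_score) from some first-pass retrieval.
    links:   list of (source_url, target_url, anchor_text) tuples.
    query:   the user's query string.
    alpha, beta, gamma: illustrative weights, not taken from the thesis.
    """
    query_terms = set(query.lower().split())
    inlink_count = defaultdict(int)
    anchor_matches = defaultdict(int)

    # Collect link evidence for every link target.
    for _source, target, anchor in links:
        inlink_count[target] += 1
        anchor_matches[target] += len(query_terms & set(anchor.lower().split()))

    reranked = []
    for url, base_score in results:
        # log1p damps the effect of pages with very many links.
        anchor_score = math.log1p(anchor_matches[url])
        link_score = math.log1p(inlink_count[url])
        final = alpha * base_score + beta * anchor_score + gamma * link_score
        reranked.append((url, final))

    # Highest combined score first.
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a page with a slightly lower base score can move ahead of another page when several incoming links describe it with the query terms, which is the kind of structural evidence the abstract refers to.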
Keywords/Search Tags: Search Engine, Link Analysis, Anchor Text, Hypertext Categorization, XML Retrieval