Sifting and winnowing: Approaches to finding useful information on the web

Posted on: 2004-03-17
Degree: Ph.D
Type: Dissertation
University: The University of Wisconsin - Madison
Candidate: Zeidenberg, Matthew
Full Text: PDF
GTID: 1468390011971389
Subject: Computer Science
Abstract/Summary:
The Web has started a social transformation in which information flows more widely and could be made more reliable. But we need tools to make this flood of information more useful.

I argue that web search engines, directories, and collaborative filtering systems should be combined into unified systems based on the general concept of information filtering. Such filtering can be used with Web pages, user reputations, and anything subject to a price or a poll. An information filtering mechanism can supplement or perhaps supplant markets and other institutions.

Computers are best at processing large amounts of textual information quickly and making a good first guess at such attributes of a page as the category it belongs in, the other pages it is close to, and its rank relative to other pages. Communities supporting user reputations are best at final judgments.

A system that combines a Web directory and search engine can be built by spidering outward from the initial directory pages and using a multi-resolution version of the Naive Bayes algorithm that classifies the new pages in a scalable manner. I also show that links can add valuable information about the category to which a page belongs.

By collecting human judgments about a set of pages, I find only a weak relationship between these judgments and a count of in-links to these pages. If human judgments are what one is looking for, there is no better way to get them than gathering them directly, through collaborative filtering.

I experiment with various ways of clustering web pages: first by using links and pruning out highly-referenced pages, second by using the WordNet electronic lexicon, and third by using semantic networks constructed from the document text. All of these methods are successful, to varying degrees.

I build a linear system for collaborative filtering with an explicit relation between document and author. Ratings of documents reflect on their authors, and these in turn influence the weight given to the authors' ratings of other documents. Such a system is shown to be stable after relaxation.
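
The abstract does not give the equations of the linear system, so the following is only a minimal sketch of how such a relaxation might look, not the dissertation's actual method. It assumes ratings lie in [0, 1], that a document's score is the mean of its ratings weighted by each rater's reputation, and that an author's reputation is a damped average of the scores of the documents they wrote. The function name `relax`, the `damping` parameter, and the choice to hold non-authors at a fixed reputation of 1.0 are all hypothetical.

```python
from collections import defaultdict


def relax(ratings, author_of, damping=0.5, tol=1e-9, max_iter=1000):
    """Iterate a linear reputation/score system to a fixed point.

    ratings:   dict mapping (rater, doc) -> rating in [0, 1]
    author_of: dict mapping doc -> author
    Returns (reputation per user, score per document).
    All modelling choices here are illustrative assumptions.
    """
    raters_of = defaultdict(list)
    for (rater, doc), value in ratings.items():
        raters_of[doc].append((rater, value))

    docs_by = defaultdict(list)
    for doc, author in author_of.items():
        docs_by[author].append(doc)

    users = {rater for rater, _ in ratings} | set(author_of.values())
    rep = {u: 1.0 for u in users}  # everyone starts with full reputation
    score = {}

    for _ in range(max_iter):
        # Document score: ratings scaled by the rater's current reputation.
        score = {}
        for doc in author_of:
            pairs = raters_of.get(doc, [])
            if pairs:
                score[doc] = sum(rep[r] * v for r, v in pairs) / len(pairs)
            else:
                score[doc] = 0.0

        # Author reputation: damped average of the author's document scores.
        new_rep = {}
        for u in users:
            docs = docs_by.get(u, [])
            if docs:
                mean_score = sum(score[d] for d in docs) / len(docs)
                new_rep[u] = (1.0 - damping) + damping * mean_score
            else:
                new_rep[u] = 1.0  # assumption: non-authors stay neutral

        delta = max(abs(new_rep[u] - rep[u]) for u in users)
        rep = new_rep
        if delta < tol:
            break

    return rep, score


ratings = {
    ("bob", "d1"): 1.0, ("carol", "d1"): 1.0,
    ("alice", "d2"): 0.0, ("carol", "d2"): 0.0,
}
author_of = {"d1": "alice", "d2": "bob"}
rep, score = relax(ratings, author_of)
```

Because the ratings are fixed, each update is affine in the reputations, and with `damping` below 1 and ratings at most 1 the update is a contraction, so the iteration settles to a single fixed point. This is one simple way a system of this kind can be stable under relaxation.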
Keywords/Search Tags: Information, Web, Pages