Algorithms for information extraction and dissemination on the World-Wide Web

Posted on: 2007-08-28
Degree: Ph.D
Type: Thesis
University: Polytechnic University
Candidate: Irmak, Utku
Full Text: PDF
GTID: 2448390005972870
Subject: Computer Science
Abstract/Summary:
While much of the data on the Web is unstructured, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. In addition, the use of RSS feeds has increased greatly in recent years. RSS is a semi-structured data format that allows web sites to syndicate their new content at a specified URL.

In this dissertation, we first investigate the information extraction problem, that is, how to extract the embedded data from web pages into a relational format in an easy and reliable manner. Information extraction is usually performed by software tools called wrappers. Manual construction of wrappers is tedious and error-prone, and fully automatic approaches are often not reliable enough. We follow the semi-automatic approach and describe a new wrapper generation system and framework based on active learning. Our goal is to minimize the user effort needed to train reliable wrappers through an interactive training interface built on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.

Millions of new pages appear on the Web every day, yet most search engines limit users to searching pages that already exist on the Web and in the engine's index. With the help of RSS feeds, however, such new pages can be easily discovered. In the second part of this thesis, we describe algorithms for information dissemination on the Web. This task is performed by prospective search engines, which allow users to upload queries that will be applied to newly discovered pages. We focus on keyword queries and present optimizations that enable matching large numbers of query subscriptions against a stream of newly discovered documents. Our experimental evaluation shows that the proposed techniques can improve the throughput of a well-known algorithm by more than a factor of 20, and allow matching millions of subscription queries per second against hundreds or thousands of incoming documents per node.
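
To make the prospective search setting concrete, the following is a minimal sketch of a baseline matcher, not the optimized algorithm developed in the thesis. It assumes conjunctive (AND) keyword queries and whitespace tokenization, and it uses a simple counting scheme over an inverted index of queries. The class name `SubscriptionIndex` and its methods are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class SubscriptionIndex:
    """Inverted index over subscribed keyword queries (assumed AND semantics)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of queries containing it
        self.query_size = {}              # query id -> number of distinct terms

    def subscribe(self, query_id, keywords):
        """Register a keyword query; every keyword must appear for a match."""
        terms = {k.lower() for k in keywords}
        if not terms:
            raise ValueError("a query needs at least one keyword")
        self.query_size[query_id] = len(terms)
        for term in terms:
            self.postings[term].add(query_id)

    def match(self, document_text):
        """Return the ids of all queries whose keywords all occur in the document."""
        # Assumption: naive lowercase + whitespace tokenization.
        doc_terms = set(document_text.lower().split())
        hits = defaultdict(int)
        for term in doc_terms:
            for query_id in self.postings.get(term, ()):
                hits[query_id] += 1
        # A query matches when every one of its terms was seen in the document.
        return sorted(q for q, n in hits.items() if n == self.query_size[q])


index = SubscriptionIndex()
index.subscribe("q1", ["rss", "feeds"])
index.subscribe("q2", ["stock", "prices"])
index.subscribe("q3", ["rss"])
print(index.match("New RSS feeds announced today"))
```

Here the roles of queries and documents are reversed relative to ordinary search: the queries are indexed, and each incoming document is run against that index. The cost per document in this sketch grows with the length of the posting lists it touches, which is the kind of overhead that throughput optimizations for large subscription sets aim to reduce.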
Keywords/Search Tags: Web, Information extraction, Data, Algorithms