Algorithms for information extraction and dissemination on the World-Wide Web

Posted on:2007-08-28

Degree:Ph.D

Type:Thesis

University:Polytechnic University

Candidate:Irmak, Utku

GTID:2448390005972870

Subject:Computer Science

Abstract/Summary:

While much of the data on the Web is unstructured, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. In addition, the use of RSS feeds has recently increased greatly. RSS is a semi-structured data format that allows web sites to syndicate their new content at a specified URL.

In this dissertation, we first investigate the information extraction problem, that is, how to extract the embedded data from web pages into a relational format in an easy and reliable manner. Information extraction is usually performed by software tools called wrappers. The manual construction of wrappers is tedious and error-prone, and fully automatic approaches are often not reliable enough. We follow the semi-automatic approach and describe a new wrapper generation system and framework based on active learning. Our goal is to minimize the user effort needed to train reliable wrappers through the design of an interactive training interface built on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.

Millions of new pages appear on the Web every day, and most search engines limit users to searches against pages that already exist on the Web and in their index. However, with the help of RSS feeds, such new pages can be easily discovered. In the second part of this thesis, we describe algorithms for information dissemination on the Web. This task is performed by prospective search engines, which allow users to upload queries that will be applied to newly discovered pages. We focus on keyword queries and present optimizations that enable matching large numbers of query subscriptions against a stream of newly discovered documents. Our experimental evaluation shows that the proposed techniques can improve the throughput of a well-known algorithm by more than a factor of 20, and allow matching millions of subscription queries per second against hundreds or thousands of incoming documents per node.
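
To illustrate what matching keyword subscriptions against incoming documents involves, the following is a minimal sketch of a generic counting approach: subscribed queries are stored in an inverted index, and a query matches a new document when every one of its keywords occurs in it. This is not the thesis's algorithm or its optimizations. The class name `SubscriptionIndex`, its methods, the AND semantics, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


class SubscriptionIndex:
    """Hypothetical inverted index over subscribed keyword queries (AND semantics)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of queries containing it
        self.query_size = {}              # query id -> number of distinct terms

    def subscribe(self, query_id, keywords):
        """Register a query; assumes each query_id is registered only once."""
        terms = {k.lower() for k in keywords}
        if not terms:
            raise ValueError("a subscription needs at least one keyword")
        self.query_size[query_id] = len(terms)
        for term in terms:
            self.postings[term].add(query_id)

    def match(self, document_text):
        """Return ids of queries whose keywords all appear in the document."""
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        doc_terms = set(document_text.lower().split())
        hits = defaultdict(int)
        for term in doc_terms:
            for query_id in self.postings.get(term, ()):
                hits[query_id] += 1
        # A query matches when the number of its terms seen equals its size.
        return sorted(q for q, n in hits.items() if n == self.query_size[q])


index = SubscriptionIndex()
index.subscribe(1, ["rss", "feeds"])
index.subscribe(2, ["stock", "prices"])
index.subscribe(3, ["rss"])
print(index.match("New RSS feeds discovered today"))
```

The index is built over the queries rather than the documents, so each incoming document only touches the posting lists of the terms it actually contains.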

Keywords/Search Tags:

Web, Information extraction, Data, Algorithms

Related items

1	Algorithms, theory and applications for information extraction from textual data
2	Research And Implementation Of Data Extraction Oriented To Knowledge Graph
3	Research On Web Data Extraction Technology
4	Related Studied On Information Extraction And Information Recommendation Based On Web Data Mining
5	The Design And Implementation Of Web Information Extraction System
6	Research On Efficient Web Data Extraction Technology Based On Visual Information
7	A Study On Feature Design Algorithms With Application To Image Annotation And Information Extraction
8	XML-based WEB Information Extraction System Research And Implementation
9	Information Extraction Research And Application From Network Data
10	Design And Implementation Of Web Information Extraction System SEU-WIE