Post-supervised template induction for information extraction from lists and tables in Web sources

Posted on:2003-01-31

Degree:M.Sc

Type:Thesis

University:Dalhousie University (Canada)

Candidate:Shi, Zhongmin

Full Text:PDF

GTID:2468390011487962

Subject:Computer Science

Abstract/Summary:

Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific format is straightforward but time-consuming, so it is desirable to generate extraction programs automatically from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labeling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples while involving the user only minimally to achieve high accuracy. We have developed unsupervised algorithms to extract the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance while requiring minimal user input compared with fully supervised techniques. For the 14 typical web sources tested with our method, all lists and tables are correctly found, and the average accuracy of extracting data fields is almost 100 percent.
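The abstract does not specify the dynamic programming algorithm used for column extraction. As a rough illustration only, the sketch below aligns the token sequences of two rows with a longest-common-subsequence (LCS) table, a standard dynamic programming technique that is assumed here and not taken from the thesis. Tokens shared by both rows are treated as the template, and the tokens that differ are treated as data fields. The function names `lcs_table` and `align_rows` and the sample rows are hypothetical.

```python
def lcs_table(a, b):
    """Fill the standard LCS dynamic-programming table over suffixes of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def align_rows(a, b):
    """Align two tokenized rows (illustrative assumption, not the thesis's method).

    Returns (template, fields_a, fields_b):
      template  tokens common to both rows, in order (one LCS of a and b)
      fields_x  tokens of row x that are not in the template, grouped into
                the slots before, between, and after the template tokens
    """
    dp = lcs_table(a, b)
    template = []
    fields_a, fields_b = [[]], [[]]
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Shared token: extend the template and open a new slot.
            template.append(a[i])
            fields_a.append([])
            fields_b.append([])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            # Skipping a[i] loses nothing, so it is a varying token of row a.
            fields_a[-1].append(a[i])
            i += 1
        else:
            fields_b[-1].append(b[j])
            j += 1
    # Any leftover tokens belong to the final slot.
    fields_a[-1].extend(a[i:])
    fields_b[-1].extend(b[j:])
    return template, fields_a, fields_b


# Hypothetical rows, already split into HTML tags and text tokens.
row1 = ["<tr>", "<td>", "Alice", "</td>", "<td>", "30", "</td>", "</tr>"]
row2 = ["<tr>", "<td>", "Bob", "</td>", "<td>", "41", "</td>", "</tr>"]

template, fields1, fields2 = align_rows(row1, row2)
columns1 = [slot for slot in fields1 if slot]  # non-empty slots of row1
columns2 = [slot for slot in fields2 if slot]  # non-empty slots of row2
```

In this toy case the shared tags form the template and each non-empty slot corresponds to one column. Real pages would need tokenization, handling of optional fields, and alignment across more than two rows, none of which is covered here.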

Keywords/Search Tags:

Lists and tables, Extraction, Web, Accuracy

Related items

1	Optimization And Implementation Of Waf Based On Reverse Proxy Integrated Black And White Lists
2	An Exploration of the Identifying Characteristics of Spam Campaign Address Lists
3	Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction
4	Analyzing and extracting lists on the web
5	A Research On The Best-seller Lists Of China In The Last Ten Years
6	Research On Non-blocking Unordered Lists
7	A Study On China Recommended-book-lists (2005-2014)
8	Statistical tools for disclosure limitation in multi-way contingency tables
9	Discovering Relations Between Web Tables
10	Accuracy assessment of airborne LIDAR data and automated extraction of features