
TableSeer: Automatic table extraction, search, and understanding

Posted on: 2010-05-08
Degree: Ph.D.
Type: Dissertation
University: The Pennsylvania State University
Candidate: Liu, Ying
GTID: 1448390002482865
Subject: Information Technology
Abstract/Summary:
Tables are ubiquitous, with a history that pre-dates that of sentential text. Authors often summarize their most important findings in tabular form. For example, scientists widely use tables to present the latest experimental results or statistical data in a condensed fashion. With the explosive growth of digital libraries and the Internet, tables have become a valuable source for information seeking and data analysis.

Interest in and use of table data necessitates table indexing and search. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. Searching table data effectively and efficiently has become an urgent need.

In this dissertation, we present an automatic table extraction and search engine, TableSeer. TableSeer crawls the web and digital libraries, detects tables in documents using heuristic-based and machine-learning-based methods, represents tables using an extensive, reusable set of medium-independent table metadata, indexes the table metadata files, ranks tables, and provides a user-friendly search interface. To improve the performance of table boundary detection, a novel page-box-cutting method and a sparse-line detection method are proposed. Given a keyword-based table search query, TableSeer ranks the matched tables and returns the most relevant ones using a novel table ranking algorithm, TableRank. TableRank tailors the classic vector space model and adopts an innovative term weighting scheme that aggregates multiple features from three levels: the term, table, and document levels.

Although tables are widely used, there is no standard for table structure design. Many design issues that impair the readability, accessibility, and reusability of table data are ignored. To better understand the characteristics of tables and to improve table extraction and search performance, we also conduct the first large-scale quantitative study of the nature of tables in digital libraries.

We demonstrate the value of TableSeer with empirical studies on scientific documents. The experimental results show that our table search engine outperforms existing search engines on table search. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to examine tables automatically. TableSeer has been successfully deployed and is in current use in several scientific digital libraries, for example CiteSeerX.
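
The ranking idea described in the abstract (a vector space model whose scores draw on term-, table-, and document-level features) can be illustrated with a minimal sketch. This is not the dissertation's TableRank formula, which the abstract does not give. The names `Table`, `table_boost`, `doc_boost`, and `rank` are hypothetical, the term-level weight is a plain TF-IDF variant, and the table- and document-level features are reduced to two multiplicative factors applied to the cosine similarity.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Table:
    table_id: str
    text: str                  # caption plus cell text, assumed already extracted
    table_boost: float = 1.0   # hypothetical table-level feature (assumption)
    doc_boost: float = 1.0     # hypothetical document-level feature (assumption)


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def inverse_table_frequency(tables):
    # Term-level feature: terms that occur in fewer tables get a higher weight.
    n = len(tables)
    df = Counter()
    for table in tables:
        df.update(set(tokenize(table.text)))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def norm(vec):
    return math.sqrt(sum(w * w for w in vec.values()))


def rank(query, tables):
    """Return (score, table_id) pairs sorted from most to least relevant."""
    itf = inverse_table_frequency(tables)
    q_vec = {t: c * itf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    q_norm = norm(q_vec)
    scored = []
    for table in tables:
        t_vec = {t: c * itf[t] for t, c in Counter(tokenize(table.text)).items()}
        t_norm = norm(t_vec)
        if q_norm == 0 or t_norm == 0:
            cosine = 0.0
        else:
            dot = sum(w * t_vec.get(t, 0.0) for t, w in q_vec.items())
            cosine = dot / (q_norm * t_norm)
        # Aggregate the three levels; a simple product is assumed here.
        scored.append((cosine * table.table_boost * table.doc_boost, table.table_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored
```

Calling `rank("gene expression", tables)` on a small list of `Table` objects returns the tables ordered by score. In this sketch the boosts are applied after the cosine step because folding them into the per-term weights would cancel out under length normalization.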
Keywords/Search Tags: Tables, Search, Automatic table extraction, Digital libraries, Information, Table data