Context-based content extraction of HTML documents

Posted on:2007-11-03

Degree:Ph.D

Type:Dissertation

University:Columbia University

Candidate:Gupta, Suhit

GTID:1458390005486094

Subject:Computer Science

Abstract/Summary:

Web pages often contain "clutter" (defined by us as unnecessary images, navigational menus and extraneous links) around the body of an article that may distract a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including speech rendering for the visually disabled, cell phone and PDA browsing, and text summarization. Most existing approaches to making content more directly accessible involve changing font size or removing HTML and data components such as images, which may take away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction" and "Clutter Removal".

We introduce Crunch, a framework that employs an easily extensible set of techniques, for enabling and integrating heuristics concerned with "content extraction" from HTML web pages. Crunch is implemented as a transparent web proxy and is practically usable by end-users. We use DOM tree based content extraction rather than directly processing HTML as flat files. Crunch is a versatile solution, allowing programmers and administrators to add heuristics to the framework. These heuristics act as filters that can be parameterized and toggled to perform the content extraction. Crunch reduces human involvement in the application of thresholds for the heuristics by automatically detecting and utilizing the content genre of a given website. Genre detection is accomplished via the use of frequency distributions of words associated with the website and associated search engine snippets. These distributions are used to improve the extraction process by comparing them to previously known results that work well for certain genres of sites and utilizing those settings.

We have measured the usability and performance of the content extraction proxy in terms of the quality of the output generated by the heuristics that act as filters after the proxy has inferred the context of a webpage. Ultimately, we show that rather than going with current approaches that are pre-packaged "one size fits all" and programmer controlled, going with a more flexible approach will produce a more content-full result.
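
The pipeline described in the abstract (infer a genre from word frequencies, then apply toggled, parameterized filters to a DOM tree) might be sketched roughly as follows. This is not Crunch's actual code. The cosine-similarity comparison, the genre profiles, the threshold values, and every function name here are assumptions made for illustration. The sketch also assumes well-formed XHTML input so that the standard-library `xml.dom.minidom` parser can be used; a real proxy would need a parser that tolerates malformed HTML.

```python
import math
import re
from collections import Counter
from xml.dom import minidom


def word_distribution(text):
    """Return relative word frequencies for a piece of text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse frequency vectors (an assumed metric)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Hypothetical genre profiles and per-genre filter settings (illustrative values only).
GENRE_PROFILES = {
    "news": word_distribution("news report reporter said election government police story today"),
    "shopping": word_distribution("price cart buy shipping sale product order checkout deal"),
}
GENRE_SETTINGS = {
    "news": {"remove_images": True, "remove_link_lists": True, "link_text_ratio": 0.5},
    "shopping": {"remove_images": False, "remove_link_lists": True, "link_text_ratio": 0.8},
}


def detect_genre(site_text, snippets):
    """Pick the known genre whose word distribution is closest to the site's."""
    dist = word_distribution(" ".join([site_text, *snippets]))
    return max(GENRE_PROFILES, key=lambda g: cosine_similarity(dist, GENRE_PROFILES[g]))


def text_of(node):
    """Concatenate all text beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def remove_images(doc, settings):
    """Filter: drop every <img> element from the tree."""
    for img in list(doc.getElementsByTagName("img")):
        img.parentNode.removeChild(img)


def remove_link_lists(doc, settings):
    """Filter: drop <ul>/<div> subtrees whose text is mostly link text."""
    for tag in ("ul", "div"):
        for element in list(doc.getElementsByTagName(tag)):
            total = len(text_of(element).strip())
            linked = sum(len(text_of(a).strip()) for a in element.getElementsByTagName("a"))
            if total and linked / total > settings["link_text_ratio"]:
                element.parentNode.removeChild(element)


# Filters are looked up by name so that settings can toggle them on or off.
FILTERS = {"remove_images": remove_images, "remove_link_lists": remove_link_lists}


def extract_content(xhtml, snippets=()):
    """Parse the page, infer its genre, and run the filters that genre enables."""
    doc = minidom.parseString(xhtml)
    settings = GENRE_SETTINGS[detect_genre(text_of(doc), snippets)]
    for name, apply_filter in FILTERS.items():
        if settings.get(name):
            apply_filter(doc, settings)
    return doc.documentElement.toxml()
```

Keeping each heuristic as a separate function that receives the shared settings is one way to make filters both toggleable and parameterized: adding a new heuristic means registering one more entry in `FILTERS` and one more key in each genre's settings.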

Keywords/Search Tags:

Content, HTML

Related items

1	Research Of Key Techniques Of Large-scale Web Text Fast Categorization
2	Research On Content Extraction In HTML Web Pages Based Multi-Features
3	The Development Of Finical Content Management And Publication System
4	Content based image retrieval using evidence combination
5	Friend Lens: Novel Web content sharing through strategic manipulation of cached HTML
6	Research Web Content Mining Based On XML
7	Research And Implementation Of Network Application Content Audit Key Techniques Concerning HTTP
8	Product Demonstration Design And Development Demonstration Based On HTML 5 Technology
9	Research Of Conversion From HTML Web Based On Contect Personalization
10	The Research And Implementation Of HTML Pages Cleanup Based On Web