Context-based content extraction of HTML documents

Posted on: 2007-11-03
Degree: Ph.D.
Type: Dissertation
University: Columbia University
Candidate: Gupta, Suhit
Full Text: PDF
GTID: 1458390005486094
Subject: Computer Science
Abstract/Summary:
Web pages often contain "clutter" (which we define as unnecessary images, navigational menus, and extraneous links) around the body of an article that may distract a user from the actual content. Extracting "useful and relevant" content from web pages has many applications, including speech rendering for the visually disabled, cell phone and PDA browsing, and text summarization. Most existing approaches to making content more directly accessible involve changing font size or removing HTML and data components such as images, which may take away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction" and "Clutter Removal".

We introduce Crunch, a framework that employs an easily extensible set of techniques for enabling and integrating heuristics concerned with "content extraction" from HTML web pages. Crunch is implemented as a transparent web proxy and is practically usable by end users. We use DOM tree based content extraction rather than processing HTML directly as flat files. Crunch is a versatile solution that allows programmers and administrators to add heuristics to the framework. These heuristics act as filters that can be parameterized and toggled to perform the content extraction. Crunch reduces human involvement in setting thresholds for the heuristics by automatically detecting and utilizing the content genre of a given website. Genre detection is accomplished using frequency distributions of words associated with the website and with associated search engine snippets. These distributions are compared to previously known results that work well for certain genres of sites, and those settings are then used to improve the extraction process.

We have measured the usability and performance of the content extraction proxy in terms of the quality of the output generated by the heuristics that act as filters after the proxy has inferred the context of a webpage. Ultimately, we show that, rather than current approaches that are pre-packaged, "one size fits all", and programmer controlled, a more flexible approach produces a more content-rich result.
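
As an illustration only, the pipeline the abstract describes (infer a genre from word frequencies, look up settings known to suit that genre, then run parameterized filters over the DOM tree) might be sketched as below. This is not Crunch's implementation. The genre profiles, the cosine-similarity comparison, the link-to-text ratio heuristic, and every name and threshold here are assumptions made for the example, and the page is assumed to be well-formed markup so that the standard library parser can build the tree.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical genre profiles (not taken from Crunch): a reference word
# distribution plus filter settings assumed to suit that genre.
GENRE_PROFILES = {
    "news": (Counter(news=5, report=3, world=2),
             {"max_link_ratio": 0.5, "drop_images": True}),
    "shopping": (Counter(price=5, cart=3, buy=3),
                 {"max_link_ratio": 0.9, "drop_images": False}),
}
# Container elements that the link-ratio filter is allowed to remove.
BLOCK_TAGS = {"div", "ul", "ol", "table", "td"}


def word_frequencies(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-frequency distributions."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def detect_genre(text):
    """Return the profile whose distribution is most similar to the text."""
    freqs = word_frequencies(text)
    return max(GENRE_PROFILES, key=lambda g: cosine(freqs, GENRE_PROFILES[g][0]))


def link_ratio(node):
    """Fraction of a node's text characters that sit inside <a> elements."""
    total = len("".join(node.itertext()).strip())
    linked = sum(len("".join(a.itertext()).strip()) for a in node.iter("a"))
    return linked / total if total else 0.0


def apply_filters(node, settings):
    """Walk the tree, removing images and link-heavy blocks per the settings."""
    for child in list(node):  # copy, since children are removed while looping
        if settings["drop_images"] and child.tag == "img":
            node.remove(child)
        elif child.tag in BLOCK_TAGS and link_ratio(child) > settings["max_link_ratio"]:
            node.remove(child)
        else:
            apply_filters(child, settings)


def extract(page, snippets=""):
    """Parse the page, infer its genre, and filter the tree accordingly."""
    root = ET.fromstring(page)
    genre = detect_genre("".join(root.itertext()) + " " + snippets)
    apply_filters(root, GENRE_PROFILES[genre][1])
    return genre, root


PAGE = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>
<div><p>World news report: the council approved the budget today.</p><img src="ad.png"/></div>
</body></html>"""

if __name__ == "__main__":
    genre, root = extract(PAGE)
    print(genre, "".join(root.itertext()).strip())
```

In this sketch each filter is switched on or off and tuned through the settings dictionary, so supporting a new genre means adding a profile rather than changing the filters. Real-world HTML is rarely well-formed, so a practical version would need a tolerant HTML parser in place of the XML parser used here.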
Keywords/Search Tags: Content, HTML