
The Formant Criteria Of Degree Paper At Ustl

Posted on: 2016-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Zhang
GTID: 2308330470479826
Subject: Software engineering
Abstract/Summary:
Web page content extraction aims to extract the text of a web page that is structurally complete and relevant to the page's subject. Some traditional methods extract the body of the page with wrappers; this approach relies on the DOM tree structure to locate the body area in the page's HTML source. Others extract text according to the link density of the page. These methods, however, each rely on a single strategy and suit only the fixed features of the web pages they were designed for. If a page does not match the assumptions of the extraction method, extraction precision drops, and extraction may even fail. Therefore, this paper presents an improved algorithm that can configure the strategies used to extract the body of a web page. First, the algorithm divides the page into several text blocks according to the page structure; these text blocks are then filtered by a variety of extraction strategies. The strategies have been optimized so that the algorithm can adapt to web pages with varied features and structures. The algorithm offers high accuracy, robustness, and adaptability. In addition to the actual content, web pages typically contain a large amount of interfering content, such as navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content of the page and can degrade extraction accuracy, so it needs to be detected properly. In this paper, we analyze a small set of shallow text features to classify the individual text blocks in a web page.
Moreover, in analyzing the algorithm, we derive a simple and plausible stochastic model that describes the process by which a text block is created. Modeling this process allows noise content to be removed accurately, which improves extraction performance and accuracy. Finally, we extend this principled approach with straightforward heuristics, achieving remarkable accuracy.
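The block-splitting and filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not name its features, thresholds, or strategies, so the choice of block-level tags, the two shallow features (word count and link density), the threshold values, and the names `BlockSegmenter`, `TextBlock`, and `extract_main_text` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags whose boundaries start a new text block (an illustrative choice,
# not taken from the thesis).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "br"}
# Elements whose text is never page content.
SKIP_TAGS = {"script", "style"}


@dataclass
class TextBlock:
    text: str
    num_words: int
    link_density: float  # share of the block's words inside <a> elements


class BlockSegmenter(HTMLParser):
    """Splits a page into text blocks and records shallow features."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._linked_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any words.
        if self._words:
            n = len(self._words)
            self.blocks.append(
                TextBlock(" ".join(self._words), n, self._linked_words / n))
        self._words = []
        self._linked_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by links.

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    kept = [b for b in parser.blocks
            if b.num_words >= min_words
            and b.link_density <= max_link_density]
    return "\n".join(b.text for b in kept)
```

Calling `extract_main_text(html)` on a page's HTML source returns the text of the blocks that pass both filters, one block per line. Short, link-heavy blocks such as navigation bars are dropped, which is the kind of boilerplate detection the abstract describes; a multi-strategy approach would presumably combine more features than the two used here.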
Keywords/Search Tags:Text extraction, Web pages, navigation, retrieval performance