The Formant Criteria Of Degree Paper At Ustl

Posted on:2016-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Zhang

GTID:2308330470479826

Subject:Software engineering

Abstract/Summary:

Web page content extraction aims to extract the text in a web page that is structurally complete and related to the page's subject. Traditional methods fall into a few groups: some extract the page body with wrappers, some use the DOM tree structure to locate the body area in the page's HTML source, and others extract text according to link density. Each of these methods relies on a single criterion and suits only pages with the fixed features it expects. If a page does not match the assumptions of the extraction method, precision drops and extraction may fail altogether. This paper therefore presents an improved algorithm in which the policies used to extract the page body can be configured. The algorithm first divides the page into several text blocks according to its structure and then filters these blocks with a variety of extraction strategies. The strategies have been optimized so that the algorithm adapts to pages with varied features and structures, giving it high accuracy, robustness, and adaptability.
In addition to the actual content, web pages typically contain a lot of interfering content, such as navigational elements, templates, and advertisements. This boilerplate text is usually unrelated to the main content of the page and can degrade extraction accuracy, so it needs to be detected properly. In this paper, we analyze a small set of shallow text features to classify the individual text blocks in a web page. Moreover, during the analysis we derive a simple and plausible stochastic model that describes the process by which a text block is created; modeling this process allows noise content to be removed accurately, which improves extraction performance and accuracy. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.
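
The block-splitting and filtering idea summarized above can be illustrated with a minimal sketch. This is not the thesis's algorithm: the set of block-level tags, the two filter strategies (link density and word count), their thresholds, and all names such as `BlockSplitter` and `extract_main_text` are assumptions chosen for illustration, and the code uses only the Python standard library.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumption: these tags mark text-block boundaries (illustrative choice).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "h1", "h2", "h3", "section", "article", "nav", "footer"}
SKIP_TAGS = {"script", "style"}  # content that is never visible text


@dataclass
class TextBlock:
    text: str
    num_words: int
    num_link_words: int  # words that appear inside <a> elements

    @property
    def link_density(self) -> float:
        return self.num_link_words / self.num_words if self.num_words else 0.0


class BlockSplitter(HTMLParser):
    """Splits an HTML page into text blocks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any words.
        if self._words:
            self.blocks.append(
                TextBlock(" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


# Two shallow-feature strategies; the thresholds are hypothetical values.
def not_link_heavy(block, max_link_density=0.33):
    return block.link_density <= max_link_density


def long_enough(block, min_words=10):
    return block.num_words >= min_words


DEFAULT_STRATEGIES = (not_link_heavy, long_enough)


def extract_main_text(html, strategies=DEFAULT_STRATEGIES):
    """Keeps only the blocks that pass every configured strategy."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    kept = [b.text for b in splitter.blocks if all(s(b) for s in strategies)]
    return "\n".join(kept)
```

Passing a different tuple of predicates as `strategies` changes the extraction policy without touching the block splitter, which is one simple way to make the filtering step configurable.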

Keywords/Search Tags:

Text extraction, Web pages, navigation, retrieval performance

Related items

1	The Research On Text Extraction From Web Pages
2	Research On Elimination Of Similar Web Pages Based On Text Structure And Extraction Of Long Sentences
3	Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies
4	Detection Of Near-replicas Of Web Pages Based On Text Structure
5	The Research And Realization Of Web-Personalized Navigation Pages Based On Types Of Users
6	A collaborative filtering approach to predict web pages of interest from navigation patterns of past users within an academic website
7	The Design And Implementation Of Vertical Search Engine Based On Duplicated Web Pages Elimination
8	Multi-document Retrieval System Design And Development
9	The Design And Implementation Of Information Retrieval And Retrieval Analysis Subsystem Of Scientific Research Literature
10	Research And System Implementation On Trademark Retrieval