Design And Implementation Of A Conventional Template About Page Extraction

Posted on:2016-10-15

Degree:Master

Type:Thesis

Country:China

Candidate:C R Luo

Full Text:PDF

GTID:2298330467493022

Subject:Computer Science and Technology

Abstract/Summary:

After decades of development, the Internet has become the main source of information, and most of that information is presented as HTML pages, because HTML is easy to use and very flexible. HTML also has a disadvantage: it mixes data and display rules together with no boundary between them, so the large amount of information on the Web is difficult to use apart from the HTML pages themselves. To take advantage of the information on the Internet, it is necessary to extract and format the data in HTML pages.

In this paper, a method for extracting formatted information is presented. We divide web pages into three classes: simple static pages, self-alike pages and dynamic pages.

(1) Extraction from simple static pages is mainly based on an XML document, which tells the system which nodes contain useful information and how to extract it.

(2) Self-alike pages are pages that contain repeated, similar parts, such as list pages. The key points in extracting self-alike pages are to find the parts that are alike and to extract all of the information nodes from them.

(3) Dynamic pages are pages whose information changes as the user reads them. We first convert the dynamic page into a static page and then extract the useful information in the same way as for static pages.

After introducing how these three kinds of pages are extracted, the implementation of these approaches is presented. We then show test results for simple static pages by extracting news detail pages, and evaluate the efficiency of self-alike and dynamic page extraction by extracting list pages. We also report the resource consumption of the extraction system.
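The template-driven idea in (1) and the repeated-part idea in (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the template format (`<field name=... path=...>`), the sample page, and the function names `extract_fields` and `extract_repeated` are all assumptions made for illustration. It also assumes the page is well-formed XHTML so that the standard-library XML parser can read it, and that the path to the repeated items is given rather than discovered automatically.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: each <field> names an output key and gives an
# ElementTree-style path to the node that holds the value.
TEMPLATE = """
<template>
  <field name="title" path=".//h1"/>
  <field name="body" path=".//div[@class='content']/p"/>
</template>
"""

# Hypothetical well-formed page used only for this example.
PAGE = """
<html><body>
  <h1>Sample headline</h1>
  <div class="content"><p>First paragraph.</p></div>
  <ul class="list">
    <li><a href="/a">Item A</a></li>
    <li><a href="/b">Item B</a></li>
  </ul>
</body></html>
"""


def node_text(node):
    """Return all text inside a node, stripped of surrounding whitespace."""
    return "".join(node.itertext()).strip()


def extract_fields(template_xml, page_xml):
    """Simple static page: the XML template says where each value lives."""
    template = ET.fromstring(template_xml)
    page = ET.fromstring(page_xml)
    record = {}
    for field in template.findall("field"):
        node = page.find(field.get("path"))
        # A missing node yields None instead of raising an error.
        record[field.get("name")] = node_text(node) if node is not None else None
    return record


def extract_repeated(page_xml, item_path, link_path="a"):
    """Self-alike page: apply the same extraction to every repeated item."""
    page = ET.fromstring(page_xml)
    rows = []
    for item in page.findall(item_path):
        link = item.find(link_path)
        if link is not None:
            rows.append({"text": node_text(link), "href": link.get("href")})
    return rows


if __name__ == "__main__":
    print(extract_fields(TEMPLATE, PAGE))
    print(extract_repeated(PAGE, ".//ul[@class='list']/li"))
```

Real-world HTML is rarely well-formed XML, so a practical system would need a tolerant HTML parser in place of `xml.etree.ElementTree`; the sketch only shows how a declarative template can drive extraction.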

Keywords/Search Tags:

HTML, Data Extraction, Ajax, Format, Website Information

Related items

1	Data Extraction And Integration In HTML Tables
2	Research On The HTML And PDF Information Extraction Technology Based XML
3	Research And Application On The Technology Of Web Information Extraction Based On The HTML
4	ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data
5	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
6	Based On The Html Pages Of Web Information Extraction
7	The Technology Of Web Information Extraction Based On HTML Parser
8	Study Of Web Data Extraction Based On Webpage Structure
9	Context-based content extraction of HTML documents
10	The Research On Web Information Extraction Based On HMM