
Design And Implementation Of A Conventional Template About Page Extraction

Posted on: 2016-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: C R Luo
Full Text: PDF
GTID: 2298330467493022
Subject: Computer Science and Technology
Abstract/Summary:
After decades of development, the Internet has become the main source of information, and most of that information is presented as HTML pages, because HTML is easy to use and very flexible. HTML also has a disadvantage: it mixes the data and the display rules together with no boundary between them. As a result, it is very difficult to use this large amount of information outside the HTML pages themselves. To take advantage of the information on the Internet, it is necessary to extract the data from HTML pages and put it into a structured format.

This thesis presents a method for extracting structured information from web pages. We divide web pages into three classes: simple static pages, self-alike pages and dynamic pages.

(1) Extraction from simple static pages is driven mainly by an XML document. The XML document tells the system which nodes contain useful information and how to extract it.

(2) Self-alike pages are pages that contain repeated, similar parts, such as list pages. The key points in extracting from self-alike pages are to find the parts that are alike and then to extract all the information nodes from them.

(3) Dynamic pages are pages whose content changes as the user reads them. We first convert the dynamic page into a static page and then extract the useful information using the static-page methods.

After introducing how each of these three kinds of pages is extracted, we describe the implementation of these approaches. We then show test results for simple static pages, obtained by extracting news detail pages, and evaluate the efficiency of the self-alike and dynamic page methods by extracting list pages. We also report the resource consumption of the extraction system.
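The first two page classes can be illustrated with a minimal sketch. It is not the thesis's implementation: the template format (`<field name=... path=.../>`), the function names and the toy page are all assumptions made for illustration, and the page is kept well-formed so that Python's standard XML parser can read it. Real HTML would usually need a tolerant HTML parser. Repeated parts are detected here by comparing a simple structural signature of sibling nodes, which is one possible way to find "alike" parts and not necessarily the one used in the thesis.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical template format: each <field> names an output key and gives
# an ElementTree path to the node that holds the value.
TEMPLATE = """
<template>
  <field name="title" path=".//h1"/>
  <field name="body" path=".//div[@class='content']"/>
</template>
"""

# Toy page; well-formed so that the stdlib XML parser can read it.
PAGE = """
<html><body>
  <h1>Sample headline</h1>
  <div class="content">First paragraph. <b>Bold</b> tail.</div>
  <ul>
    <li><a href="/a">Item A</a></li>
    <li><a href="/b">Item B</a></li>
    <li><a href="/c">Item C</a></li>
  </ul>
</body></html>
"""

def node_text(node):
    """Return all text under a node, with whitespace collapsed."""
    return " ".join("".join(node.itertext()).split())

def extract_static(page_root, template_xml):
    """Apply the template: one output entry per <field> whose path matches."""
    record = {}
    for field in ET.fromstring(template_xml).iter("field"):
        node = page_root.find(field.get("path"))
        if node is not None:
            record[field.get("name")] = node_text(node)
    return record

def signature(node):
    """Structural signature: the tag plus the signatures of all children."""
    return (node.tag, tuple(signature(child) for child in node))

def extract_repeated(page_root, min_repeats=3):
    """Return the text of the first group of >= min_repeats alike siblings."""
    for parent in page_root.iter():
        counts = Counter(signature(child) for child in parent)
        if not counts:
            continue  # leaf node, nothing to compare
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats:
            return [node_text(c) for c in parent if signature(c) == sig]
    return []

root = ET.fromstring(PAGE)
print(extract_static(root, TEMPLATE))
print(extract_repeated(root))
```

For the third class, a dynamic page would first have to be rendered into a static snapshot (for example by a browser engine) before either function could be applied; that step is outside this sketch.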
Keywords/Search Tags: HTML, Data Extraction, Ajax, Format, Website Information