Research And Application In Automatic Data Extraction From WEB Pages

Posted on:2016-12-07

Degree:Master

Type:Thesis

Country:China

Candidate:R Li

Full Text:PDF

GTID:2308330473954485

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, a vast amount of usable data has become available on the Web. One particularly important type of Web data is the data record: an object retrieved from a Web database and displayed in a response page through a predefined template. Such records, which typically present product or service information, make up the main content of the page and carry much of its valuable information. Research on how to extract data from pages that contain such records is therefore of real practical value.

For data-intensive Web pages that contain multiple records, this thesis proposes a method for recognizing the main data area based on visual information. The method can effectively identify the main data area, which contains several data records, and obtain the corresponding tag subtree. First, it builds an extended tag tree of the page, annotated with visual position information, and removes the page's irrelevant tag nodes. Then it identifies the main data area from the page's visual features and obtains the corresponding tag tree. Depending on the content to be extracted, the algorithm removes useless nodes and noise blocks, which shrinks the tag tree, reduces the computation needed in later extraction steps, and improves extraction efficiency.

In addition, this thesis designs and implements an automatic Web data extraction system based on the tag tree. The system automatically extracts data from the semi-structured data records contained in data-intensive Web pages and outputs the data in structured form. The core extraction process consists of three modules: tree matching calculation, data record identification, and data item extraction. Starting from the visually extended tag tree generated by the main data area recognition algorithm, the system performs tag tree matching, followed by data region determination, data record identification, and data item extraction and alignment, narrowing the target area step by step until the data is extracted.

Extraction tests show that the system can automatically and effectively extract data records from data-intensive pages containing multiple records and output them in structured form. It can adapt to a wide range of practical requirements and has application value that merits further promotion.
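The abstract does not say which tree matching calculation the system uses. As an illustration only, the sketch below assumes the classic Simple Tree Matching algorithm, a common way to compare tag subtrees when looking for repeated data records. The `Node` class, the function names, and the sample records are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: a tag name plus its ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving match between two tag trees.

    Assumption: this is Simple Tree Matching, used here as a stand-in for
    the thesis's unspecified tree matching calculation.
    """
    if a.tag != b.tag:
        return 0  # different root tags: nothing below them can match
    m, n = len(a.children), len(b.children)
    # best[i][j] = best match between the first i children of a
    # and the first j children of b (dynamic programming table).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Match size normalized by the average size of the two trees."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two hypothetical product records; the second lacks the image node.
record_1 = Node("li", [Node("img"), Node("a"), Node("span")])
record_2 = Node("li", [Node("a"), Node("span")])
print(simple_tree_matching(record_1, record_2), similarity(record_1, record_2))
```

In a pipeline like the one described, sibling subtrees whose similarity exceeds some chosen threshold could be grouped as candidate data records. The threshold and the grouping rule are design choices that the abstract leaves open.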

Keywords/Search Tags:

Web data extraction, automatic extraction, tag tree, visual information

Related items

1	Research On Efficient Web Data Extraction Technology Based On Visual Information
2	Research On Web Data Extraction Technology
3	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
4	Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM
5	Study On Automatic Extraction Of Web Data Based On DOM
6	Research On Automatic Web Information Extraction Technique
7	Automatic Ranking List Extraction From Web Pages Based On Visual And Semantic Information
8	Research Of Web Information Extraction Technology Based On Tree Structure
9	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
10	Research On Automatic And Efficient Technologies For Web Information Extraction