Research And Application In Automatic Data Extraction From WEB Pages

Posted on: 2016-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
GTID: 2308330473954485
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, a vast amount of usable data is available on the Web. One very important type of Web data is the data record: an object retrieved from a Web database and displayed in a response page through a predefined template. Such records, which present product or service information, make up the main content of the page and carry a great deal of valuable information. Research on how to extract data from web pages that contain such records is therefore of considerable practical value.

For data-intensive web pages with multiple records, this thesis proposes a main data area recognition method based on visual information. The method effectively identifies the main data area, which contains several data records, and obtains the corresponding tag subtree. First, it builds an extended tag tree of the page based on visual position information and removes the page's irrelevant tag nodes. Then it identifies the main data area from the page's visual features and obtains the corresponding tag tree. Depending on the content to be extracted, the algorithm removes useless nodes and noise blocks and shrinks the tag tree, which reduces the computation required by the later extraction steps and improves extraction efficiency.

In addition, this thesis designs and implements an automatic Web data extraction system based on the tag tree. The system automatically extracts data from the semi-structured data records contained in data-intensive web pages and outputs the data in structured form. The core extraction process consists of three modules: tree matching calculation, data record identification, and data item extraction.
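The main data area recognition step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: the `VNode` class, the `NOISE_TAGS` set, the zero-area test for invisible nodes, and the `min_records` and `min_share` thresholds are all hypothetical, since the abstract does not specify which visual features or cut-offs are used.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Tags treated as irrelevant noise (an assumption; the abstract does not list them).
NOISE_TAGS = {"script", "style", "noscript"}


@dataclass
class VNode:
    """Tag-tree node extended with its rendered bounding box (hypothetical layout)."""
    tag: str
    x: float
    y: float
    w: float
    h: float
    children: List["VNode"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.w * self.h


def prune(node: VNode) -> Optional[VNode]:
    """Return a copy of the tree without noise tags or invisible (zero-area) nodes."""
    if node.tag in NOISE_TAGS or node.area == 0:
        return None
    kept = [c for c in map(prune, node.children) if c is not None]
    return VNode(node.tag, node.x, node.y, node.w, node.h, kept)


def main_data_area(root: VNode, min_records: int = 3,
                   min_share: float = 0.3) -> Optional[VNode]:
    """Pick the deepest node that covers at least min_share of the page area
    and has at least min_records children (an assumed heuristic)."""
    best, best_depth = None, -1
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.area < min_share * root.area:
            continue  # too small; assumes children are rendered inside their parent
        if len(node.children) >= min_records and depth > best_depth:
            best, best_depth = node, depth
        stack.extend((child, depth + 1) for child in node.children)
    return best


# Toy page: a header, a hidden script, a result list with four records, a sidebar.
records = [VNode("li", 100, 100 + 400 * i, 800, 400) for i in range(4)]
page = VNode("html", 0, 0, 1000, 2000, [
    VNode("body", 0, 0, 1000, 2000, [
        VNode("div", 0, 0, 1000, 100),
        VNode("script", 0, 0, 0, 0),
        VNode("ul", 100, 100, 800, 1600, records),
        VNode("div", 900, 100, 100, 1600),
    ]),
])

region = main_data_area(prune(page))
print(region.tag, len(region.children))  # prints "ul 4"
```

Returning only the selected subtree is what keeps the later steps cheap: tree matching then runs over the children of `region` rather than over the whole page.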
Starting from the visually extended tag tree produced by the main data area recognition algorithm, the system performs tag tree matching, followed by data region determination, data record identification, and data item extraction and alignment. This step-by-step process progressively narrows the target area until the data is extracted.

The extraction tests show that the system automatically and effectively extracts data records from data-intensive pages with multiple records and outputs them in structured form. It can adapt to a wide range of practical requirements and has value for wider application.
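The three core modules (tree matching, data record identification, data item extraction) could be sketched as below. The abstract does not name the matching algorithm, so this sketch substitutes Simple Tree Matching, a dynamic-programming measure commonly used for tag trees. The `Node` class, the 0.8 similarity threshold, and the rule of grouping adjacent similar siblings are assumptions made for illustration, and the alignment of optional fields across records is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal tag-tree node (hypothetical; visual attributes omitted)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving match between two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j],
                score[i][j - 1],
                score[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return score[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Match size normalized by the average tree size, in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def find_records(region: Node, threshold: float = 0.8) -> List[Node]:
    """Longest run of adjacent children whose neighbours are similar enough."""
    best, run = [], []
    for child in region.children:
        if run and similarity(run[-1], child) >= threshold:
            run.append(child)
        else:
            run = [child]
        if len(run) > len(best):
            best = list(run)
    return best if len(best) >= 2 else []


def extract_items(record: Node) -> List[str]:
    """Leaf texts of a record in document order (no field alignment)."""
    if not record.children:
        return [record.text] if record.text else []
    return [t for child in record.children for t in extract_items(child)]


def item(name: str, price: str, note: str = "") -> Node:
    """Build a toy record; the optional note makes one record slightly different."""
    kids = [Node("a", name), Node("span", price)]
    if note:
        kids.append(Node("em", note))
    return Node("li", children=kids)


region = Node("div", children=[
    Node("h2", "Results"),
    item("Pen", "1.20"),
    item("Ink", "3.50"),
    item("Pad", "2.00", "sale"),
])
rows = [extract_items(r) for r in find_records(region)]
print(rows)  # [['Pen', '1.20'], ['Ink', '3.50'], ['Pad', '2.00', 'sale']]
```

The heading is excluded because its tag does not match its neighbour, while the third record is still grouped with the others because one extra child only lowers its similarity to 6/7.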
Keywords/Search Tags: Web data extraction, automatic extraction, tag tree, visual information