
Research On Efficient Web Data Extraction Technology Based On Visual Information

Posted on: 2020-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: P Wang
GTID: 2428330575495057
Subject: Software engineering
Abstract/Summary:
Web data extraction technology plays an important role in network data mining, intelligence acquisition, business competition and big data analysis. With the popularity and rapid development of the Internet, the massive amount of data carried on the network has become a valuable resource. However, because web pages are difficult to obtain, data formats are inconsistent, and noise is ubiquitous, Web data is hard to exploit fully. How to effectively extract the structured data contained in web pages has therefore become an active research direction. This thesis analyzes why web pages are difficult to obtain and what visual information characterizes dynamic web pages, and studies in depth the DOM tree matching algorithm, a rule description language for web data extraction, and automatic web data extraction technology. The main contributions are as follows:

(1) The thesis analyzes the characteristics of traditional DOM tree matching and, in the context of the automatic web data extraction process, proposes a DOM tree matching algorithm based on XPath and LCS. The algorithm reduces the time complexity of DOM tree matching, which improves matching efficiency, and uses XPath for data extraction, which improves extraction accuracy.

(2) The thesis analyzes the dynamic web page technologies and visual features of web pages common in Web 2.0, and proposes the WDERD language for describing extraction rules, which addresses the difficulty of obtaining pages when extracting dynamic web data. The language describes web page operations, data markup, looping and page rendering, and so covers the whole web data extraction process in detail. Following the way WDERD extraction rules are generated, the thesis designs a custom plug-in extension on the Chromium Embedded Framework that records the user's actions and the page elements they operate on, and automatically generates the WDERD description of the extraction rules. Ordinary users can thus produce WDERD rule descriptions through simple page interaction and complete a web data extraction task.

(3) The thesis designs and implements a web data extraction system based on visual information. The system is divided into a WDERD language parsing module, a page acquisition module, a DOM tree matching module, a data recording module, a data item extraction module and a data storage module. In a comparison with the OXPath wrapper and the Octopus data collector, the results show that the web data extraction system based on visual information improves the efficiency of web data extraction.
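
The abstract does not spell out how XPath and LCS are combined in contribution (1), so the following is only a minimal sketch of one plausible building block, not the thesis's algorithm. It assumes each DOM node is identified by an absolute XPath, drops positional predicates such as [2], and scores two nodes by the longest common subsequence (LCS) of their tag-name steps. The function names xpath_steps, lcs_length and path_similarity are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def xpath_steps(xpath: str) -> List[str]:
    """Split an absolute XPath into tag names, dropping predicates like [2].

    Assumption: paths are simple, e.g. /html/body/div[2]/ul/li[1].
    """
    steps = []
    for step in xpath.strip("/").split("/"):
        if step:  # skip empty pieces produced by '//'
            steps.append(step.split("[", 1)[0])
    return steps


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic dynamic-programming LCS length, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_similarity(xpath_a: str, xpath_b: str) -> float:
    """Similarity in [0, 1]: 2 * LCS / (len(a) + len(b)) over tag-name steps."""
    a, b = xpath_steps(xpath_a), xpath_steps(xpath_b)
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    # Two sibling list items share the same tag path once indices are dropped.
    print(path_similarity("/html/body/div[1]/ul/li[1]",
                          "/html/body/div[1]/ul/li[2]"))
    # A list item and a paragraph under the same div share only a prefix.
    print(path_similarity("/html/body/div/ul/li", "/html/body/div/p"))
```

Comparing flattened paths rather than whole subtrees is the design choice that keeps each comparison quadratic in path depth; whether the thesis obtains its reduced time complexity this way is not stated in the abstract.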
Keywords/Search Tags: Web data extraction, Visual information, DOM tree, Extraction rule descriptive language, Ajax dynamic pages