Research On Efficient Web Data Extraction Technology Based On Visual Information

Posted on:2020-12-01

Degree:Master

Type:Thesis

Country:China

Candidate:P Wang

Full Text:PDF

GTID:2428330575495057

Subject:Software engineering

Abstract/Summary:

Web data extraction technology plays an important role in network data mining, intelligence acquisition, business competition, and big data analysis. With the popularity and rapid development of the Internet, the massive amount of data carried on the network has become a valuable resource. However, because web pages are difficult to obtain, data formats are inconsistent, and noise is ubiquitous, Web data is difficult to utilize fully. How to effectively extract the structured data contained in web pages has therefore become an active research direction. This thesis analyzes why web pages are difficult to obtain and the visual information characteristics of dynamic web pages, and conducts in-depth research on DOM tree matching algorithms, a rule description language for web data extraction, and automatic web data extraction technology. The main contributions are as follows:

(1) The thesis analyzes the characteristics of traditional DOM tree matching and, in combination with the automatic web data extraction process, proposes a DOM tree matching algorithm based on XPath and LCS. The algorithm reduces the time complexity of DOM tree matching, which improves matching efficiency, and uses XPath for data extraction, which improves extraction accuracy.

(2) The thesis analyzes the dynamic web page technologies and visual features of web pages commonly used in Web 2.0 and proposes the WDERD language for describing extraction rules, which addresses the difficulty of obtaining web pages in dynamic web data extraction. The language describes web page operations, data markup, looping, and web page rendering, and so describes the whole web data extraction process in detail. Following the generation process of WDERD extraction rules, the thesis designs a custom plug-in extension built on the Chromium Embedded Framework that records user actions and the page elements they operate on, and automatically generates the WDERD description of the extraction rules. Ordinary users can thus generate WDERD rule descriptions through simple page interaction and complete web data extraction tasks.

(3) The thesis designs and implements a web data extraction system based on visual information. The system is divided into a WDERD language parsing module, a page acquisition module, a DOM tree matching module, a data recording module, a data item extraction module, and a data storage module. Compared with the OXPath wrapper and the Octopus data collector, the results show that the system improves the efficiency of web data extraction.
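
The abstract does not spell out the XPath-and-LCS matching algorithm, so the following is only an illustrative sketch and not the thesis's method. It assumes that LCS stands for longest common subsequence, represents each DOM subtree as its sequence of simplified XPath-like tag paths in document order, and scores two subtrees by the length of the LCS of those sequences. The function names (`tag_paths`, `lcs_length`, `similarity`) and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(root):
    """Return the root-to-node tag path of every element, in document order."""
    paths = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(tree_a, tree_b):
    """Score in [0, 1]: twice the LCS length over the total number of paths."""
    paths_a, paths_b = tag_paths(tree_a), tag_paths(tree_b)
    if not paths_a or not paths_b:
        return 0.0
    return 2 * lcs_length(paths_a, paths_b) / (len(paths_a) + len(paths_b))


# Hypothetical data records: two list items sharing a template, one unrelated block.
record_a = ET.fromstring("<li><a><img/></a><span/><p/></li>")
record_b = ET.fromstring("<li><a><img/></a><span/></li>")
record_c = ET.fromstring("<div><h2/><p/></div>")

print(similarity(record_a, record_b))  # high score: same template
print(similarity(record_a, record_c))  # no shared paths
```

Comparing flat path sequences costs time proportional to the product of the two sequence lengths, which is one plausible way a sequence-based comparison could be cheaper than a recursive tree-to-tree matching; whether the thesis achieves its speed-up this way is not stated in the abstract.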

Keywords/Search Tags:

Web data extraction, Visual information, DOM tree, Extraction rule descriptive language, Ajax dynamic pages

Related items

1	Research On Web Information Extraction Technology Based On
2	Research And Application In Automatic Data Extraction From WEB Pages
3	Research Of Web Information Extraction Technology Based On Tree Structure
4	The Research Of Semi-structured Web Pages Information Extraction
5	Design And Implementation Of Web Information Extraction Rules
6	Ontology-Based Structured Information Extraction From Web Pages
7	Automatic Ranking List Extraction From Web Pages Based On Visual And Semantic Information
8	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
9	Research Of Web Information Extraction Method Based On Multi-feature Mining
10	Research Of Automatic Metadata Extraction From Template Web Pages