Webpage Data Automatic Extraction Technology

Posted on: 2019-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Wang
Full Text: PDF
GTID: 2428330572998090
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Web technology, the Web has become the main carrier of information and the main channel through which people access it. As a result, large amounts of data are stored on the Internet in the form of webpages. However, because HTML coding styles differ, structured data cannot be extracted directly from webpages, which causes a huge waste of resources. With the development of big data, the importance of data has become increasingly apparent. To access the huge volume of data on the Internet, a variety of web data extraction methods have been proposed. According to their goals, these methods fall into two types: (1) webpage text extraction, which mainly aims to extract the text from article-type webpages; (2) webpage structured data extraction, which mainly aims to extract the instance objects present in a webpage. This paper proposes an extraction method for each of these two targets.

For webpage text extraction, webpages contain not only the text content but also noise unrelated to the topic, such as navigation bars, advertisements, and copyright notices. This large amount of noise poses a great challenge to extraction. Therefore, this paper proposes a text extraction method based on webpage clustering. The method has two parts: first, clustering webpages by their structural characteristics; second, generating location features of the text content blocks for each collection of similar webpages. This method can extract text content from various types of webpages.

For web structured data extraction, the DOM tree path is used as the extraction rule. However, rules based on DOM paths are hard to apply accurately when the page structure changes even slightly. Therefore, this paper presents a semi-automatic, tree-based wrapper generation method. The method mainly consists of three parts: first, with tree generation and abstract tree construction; second, node localization and wrapper generation on the merged trees; third, with tree reconstruction and extraction of the target web data. This method makes it possible to extract structured data accurately even when the structure of the webpage changes slightly.

Based on the two methods presented in this paper, a corresponding data extraction system was implemented and a large number of experiments were carried out. The experimental results show the feasibility and effectiveness of the proposed methods.
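
The idea of grouping webpages by structure can be illustrated with a minimal sketch. This is not the algorithm from the thesis, whose details the abstract does not give. It assumes, purely for illustration, that a page's structure is summarized as the set of its root-to-node tag paths and that two pages are compared with Jaccard similarity. The names `TagPathCollector`, `tag_paths`, and `structural_similarity` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # every distinct tag path seen so far

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets (assumed measure)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two article-like pages sharing a template, and one table-based listing page.
article_a = "<html><body><div><h1>T</h1><p>x</p></div><ul><li>n</li></ul></body></html>"
article_b = "<html><body><div><h1>U</h1><p>y</p><p>z</p></div><ul><li>m</li></ul></body></html>"
listing = "<html><body><table><tr><td>1</td></tr></table></body></html>"

print(structural_similarity(article_a, article_b))  # same template, high score
print(structural_similarity(article_a, listing))    # different layout, low score
```

Pages whose similarity exceeds a chosen threshold could then be placed in the same cluster. The same tag paths also show why a fixed DOM path is a brittle extraction rule: wrapping the article in one extra `div` changes `html/body/div/p` to `html/body/div/div/p`, and a rule keyed to the old path no longer matches.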
Keywords/Search Tags: Web Data Extraction, Webpage Clustering, node density, Wrapper, Structured data extraction