
The Study of Rule Induction for Automatic Web Data Extraction

Posted on: 2016-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Shen
Full Text: PDF
GTID: 2348330461960091
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, more and more enterprises and organizations publish their information on websites. As a result, the amount of information on the Web is growing explosively, and more and more studies and applications seek to obtain useful information from the Web in order to perform deep analysis and offer in-depth information services. Web information extraction studies how to extract structured data of interest to users or applications from unstructured Web pages. Over the past two decades, research on Web information extraction has made great progress; however, existing Web information extraction systems still have the following main shortcomings: (1) semi-automatic systems are not sufficiently automated; (2) automatic systems do not achieve high enough precision and recall; (3) automatic systems lack automatic annotation; (4) the extraction granularity of these systems is not fine enough; (5) they do not perform Web information extraction in parallel, and thus cannot handle large-scale Web information extraction efficiently.

To address these shortcomings, this dissertation studies techniques for automatically generating Web information extraction rules. The main research work of this dissertation is as follows:

(1) Several basic Web information extraction models are proposed: the process model of Web information extraction, the Web data extraction model, the Web data record model, and the data item model. Based on these models, this dissertation designs a Web information extraction rule language with strong descriptive power.

(2) A multi-feature-based automatic Web page analysis technique is proposed. To address the insufficient precision, recall, and granularity of existing automatic Web information extraction systems, this dissertation combines multiple features, including DOM tree structure features, visual features, and semantic features, to automatically recognize Web data records and data items and to align the data items across similar data records. This dissertation also studies how to annotate data items automatically based on the aligned data items.

(3) An extraction rule generation technique is proposed based on automatic Web page analysis. This dissertation studies how to generate extraction rules from the results of automatically analyzing example Web pages. The technique covers the automatic generation of extraction rules for data regions, data records, and data items.

(4) Based on the above studies, a prototype Web information extraction system is designed and implemented. In addition, to meet the needs of large-scale Web information extraction, this dissertation proposes a Hadoop-based parallel approach.

Experiments are carried out to evaluate the above techniques. The experimental results show that the multi-feature-based automatic Web page analysis technique achieves high precision and recall, that the automatically generated extraction rules also achieve high precision and recall, and that the parallel approach to large-scale Web information extraction obtains linear speedup.
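As a rough illustration of what generating a data item extraction rule from analyzed example pages can look like, the Python sketch below generalizes the DOM paths of aligned data items into a single path pattern. It is only a minimal sketch under stated assumptions, not the rule language or induction algorithm of the dissertation: the path syntax, the `[*]` wildcard, and the function names `parse_path`, `induce_rule`, and `matches` are all hypothetical and chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical path syntax: "/tag[index]/tag[index]/..."; a missing index means 1.
STEP = re.compile(r"^([A-Za-z][A-Za-z0-9]*)(?:\[(\d+)\])?$")


def parse_path(path: str) -> List[Tuple[str, int]]:
    """Split a DOM path into (tag, index) steps."""
    steps = []
    for raw in path.strip("/").split("/"):
        m = STEP.match(raw)
        if m is None:
            raise ValueError(f"unsupported step: {raw!r}")
        tag, idx = m.group(1), m.group(2)
        steps.append((tag, int(idx) if idx else 1))
    return steps


def induce_rule(example_paths: List[str]) -> str:
    """Generalize aligned example paths into one pattern.

    Steps whose index is the same in every example keep that index;
    steps whose index varies (typically the repeated data record)
    are replaced by the wildcard "[*]".
    """
    if not example_paths:
        raise ValueError("at least one example path is required")
    parsed = [parse_path(p) for p in example_paths]
    if len({len(p) for p in parsed}) != 1:
        raise ValueError("example paths must have the same depth")
    out = []
    for position in zip(*parsed):
        tags = {tag for tag, _ in position}
        if len(tags) != 1:
            raise ValueError("example paths must share the same tags")
        indices = {idx for _, idx in position}
        index = str(indices.pop()) if len(indices) == 1 else "*"
        out.append(f"{tags.pop()}[{index}]")
    return "/" + "/".join(out)


def matches(rule: str, path: str) -> bool:
    """Check whether a concrete DOM path is covered by an induced rule."""
    rule_steps = rule.strip("/").split("/")
    path_steps = parse_path(path)
    if len(rule_steps) != len(path_steps):
        return False
    for rule_step, (tag, idx) in zip(rule_steps, path_steps):
        rule_tag, rule_idx = rule_step[:-1].split("[")
        if rule_tag != tag:
            return False
        if rule_idx != "*" and int(rule_idx) != idx:
            return False
    return True


# Two aligned data items taken from different data records of one example page.
examples = [
    "/html/body/div[2]/ul/li[1]/span[1]",
    "/html/body/div[2]/ul/li[3]/span[1]",
]
rule = induce_rule(examples)
```

Here the induced rule keeps the fixed parts of the path (the data region under `div[2]` and the item position `span[1]`) and generalizes only the record index, so it also covers records that did not appear among the examples. A real system would additionally need the visual and semantic features mentioned above, which this sketch does not model.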
Keywords/Search Tags: Web information extraction, extraction rule, rule induction, multi-features, parallel extraction