Study On Key Techniques And System For Accurate Web Information Extraction

Posted on: 2018-03-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S S Shi
GTID: 1318330512499396
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the World Wide Web has become the major platform for information publication and application deployment for enterprises and organizations around the world. The emergence of a large number of websites and web applications has led to an explosion of web data, in which much valuable information is embedded. To obtain, analyze and utilize this information, one usually needs to obtain accurate and useful structured data from the web and then perform deep analysis on that data. However, because of the wide distribution and autonomy of web systems, the heterogeneous and unstructured nature of web data, and the mismatch between the presentation structure of web data and the representation structure of the target data, effectively obtaining accurate and useful structured data from the web is a hard technical problem. Web information extraction is the research field that emerged to deal with this problem. It studies how to extract the data a user is interested in and transform it into data with the target structure.

An integrated web information extraction process can be divided into three stages: web page navigation, data extraction and data integration. However, most existing research focuses on web page data extraction while neglecting web page navigation and data integration, so the capability and process for integrated web information extraction are lacking. Meanwhile, most existing work overemphasizes fully automated analysis and extraction in theory. There are two main kinds of such methods: automatic web page data extraction methods, and methods for open data extraction from heterogeneous web pages. The former do not consider the user's requirements and may extract much redundant data of no interest to the user, so analytical applications must perform secondary processing such as transformation, cleansing and filtering. The latter do not use any extraction rule template for particular web pages and try to extract the data of interest from heterogeneous web pages describing the same entity, which generally gives them low extraction accuracy.

To overcome these shortcomings of existing work, this thesis tries to combine automated methods with the practical application requirements of accurate web information extraction. Addressing the integrated web information extraction process, it studies the basic models, language and key technical methods for accurate web information extraction, and presents the design and implementation of a corresponding prototype system. The major research work and contributions are as follows:

(1) Research on the basic models for three-stage accurate web information extraction. First, the three-stage accurate web information extraction model is studied and proposed. Then the web page navigation model, the data extraction model and the data integration model are studied and proposed for the three stages respectively. The web page navigation model builds an interaction and navigation action model, a web page navigation path model, and a web page interlinkage relationship model, which describe user interaction actions, the web page navigation process, and the interlinkage relationships between web pages respectively. The web page data extraction model builds a basic model of web page data extraction, a web page data record model, and an extraction rule model for data records and data fields, which describe the web page data extraction process, the structural form of web page data records, and the extraction rule framework for data records and data fields respectively. The web page data integration model describes the process of transforming the source web page data into data with the target structure.

(2) Research on the rule system and the language for three-stage accurate web information extraction. Based on the basic models above, a rule system and a language for three-stage accurate web information extraction are studied and designed. Corresponding to the three stages of the process, the rule system and the language consist of three parts: a web page navigation rule language, a data extraction rule language, and a data integration rule language. Compared with existing web information extraction languages, the major advantages of this language are: 1) the web page navigation rule language can define navigation rules for various complex web page navigation processes; 2) the data extraction rule language can define extraction rules for various data records with complicated structure; 3) the data integration rule language can define data integration rules conveniently and flexibly.

(3) Research on automatic web page data extraction. Existing automatic web page data extraction approaches are mainly suited to extracting simple-structure data records (continuous, fixed-length and linear data records) and have difficulty extracting complicated-structure data records (non-continuous, variable-length or nested data records) effectively. To address this shortcoming, two automatic web page data extraction methods are studied and proposed: automatic web page data extraction based on cohesion and a DAG (Directed Acyclic Graph), and automatic web page data extraction based on a deterministic finite automaton (DFA). The former is suited to extracting continuous, fixed-length (variable-length) and linear data records, while the latter can extract various simple-structure or complicated-structure data records.

(4) Research on accurate web information extraction rule generation. To help users efficiently generate robust rules for accurate web information extraction, a rule generation method based on user interaction, automatic web page structure analysis and supervised rule learning is studied and proposed. Web page navigation rules are generated by automatically recording user interaction and navigation actions. For web page data extraction rules, there are two cases: for web pages containing regular data records, the automatic web page data extraction methods introduced above are used to analyze the web page structure, and the rules are then generated automatically by supervised rule learning; for web pages containing irregular data records, the rules are generated by user interaction and supervised rule learning. Web page data integration rules are generated by simple script programming.

(5) The design and implementation of a prototype accurate web information extraction system. To evaluate the proposed models, the rule language and the key technical methods, the thesis designs and implements a prototype accurate web information extraction system. The experimental results demonstrate that the models and key technical methods proposed for accurate web information extraction are effective, and that they achieve better accuracy and greater extraction capability than existing technical methods.
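The three-stage process described in the abstract (web page navigation, data extraction, data integration) can be illustrated with a minimal Python sketch. This is not the thesis's rule language or prototype system: the in-memory `PAGES` site, the regular expressions standing in for navigation and extraction rules, and the function names `navigate`, `extract`, `integrate` and `run` are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Iterator, List

# Hypothetical in-memory "website" (URL -> HTML); a real system would fetch pages.
PAGES: Dict[str, str] = {
    "/list?page=1": (
        '<li><b>Pen</b> <i>$1.50</i></li>'
        '<li><b>Ink</b> <i>$4.00</i></li>'
        '<a rel="next" href="/list?page=2">next</a>'
    ),
    "/list?page=2": '<li><b>Pad</b> <i>$2.25</i></li>',
}

# Assumed stand-ins for a navigation rule and an extraction rule.
NEXT_LINK = re.compile(r'<a rel="next" href="([^"]+)">')
RECORD = re.compile(r"<li><b>(?P<name>[^<]+)</b> <i>(?P<price>[^<]+)</i></li>")


def navigate(start_url: str) -> Iterator[str]:
    """Stage 1: follow 'next' links from the start page, yielding each page's HTML."""
    url, seen = start_url, set()
    while url is not None and url in PAGES and url not in seen:
        seen.add(url)
        html = PAGES[url]
        yield html
        match = NEXT_LINK.search(html)
        url = match.group(1) if match else None


def extract(html: str) -> List[Dict[str, str]]:
    """Stage 2: pull data records (with name and price fields) out of one page."""
    return [m.groupdict() for m in RECORD.finditer(html)]


def integrate(record: Dict[str, str]) -> Dict[str, object]:
    """Stage 3: map a source record onto the (assumed) target structure."""
    return {
        "title": record["name"].strip(),
        "price_usd": float(record["price"].lstrip("$")),
    }


def run(start_url: str) -> List[Dict[str, object]]:
    """Chain the three stages into one pipeline."""
    return [integrate(r) for html in navigate(start_url) for r in extract(html)]
```

Calling `run("/list?page=1")` walks both pages and returns three records in the target structure. Keeping the stages as separate functions mirrors the abstract's point that navigation and integration are distinct concerns from extraction, each governed by its own rules.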
Keywords/Search Tags: accurate web information extraction, navigation, data integration, data record, data field