
Research On Web Data Extraction Technology Based On Template And Visual Features

Posted on: 2019-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: K J Wu
Full Text: PDF
GTID: 2428330545474860
Subject: Software engineering
Abstract/Summary:
As Web databases continue to grow, querying them through search interfaces and receiving the results as dynamically generated HTML pages has gradually become the main means of acquiring information. Effectively acquiring and integrating the database resources distributed across the Web therefore has important practical significance and broad application prospects. This thesis addresses the acquisition and integration of Web database resources. Because Web data is heterogeneous, dynamic, and multi-source, existing Web data extraction methods suffer from low extraction accuracy, low extraction efficiency, and an inability to integrate data from multiple sources. To address these problems, vision-based Web data extraction, Web data template construction, and multi-source Web data fusion are proposed. The main work of the thesis is as follows:

(1) Based on the visual characteristics of Web data records, the structural similarity of the data records in query result pages and the diversity of their text organization are studied. Because existing Web data extraction methods cannot extract Web data records accurately, the VDLE (Vision and DOM-tree based Web data Location and Extraction) method is proposed. The method introduces the visual block centroid offset to locate the data area, uses a spectral clustering algorithm to find clusters of similar nodes within the data area, and combines this with text organization diversity to locate the data records. Experimental results show that the extraction accuracy of VDLE is 99%, which is 8.51% higher than D-EEM (the DOM-tree based Entity Extraction Mechanism for Deep Web) and 4.32% higher than ViDE; the recall of VDLE is 98.75%, which is 13.33% higher than D-EEM and 8.17% higher than ViDE. VDLE accurately extracts data records that share the same format, but it cannot extract data records with different structures and cannot filter the noise inside the data title attribute items.

(2) Building on VDLE, the NFTC (Nonlinear Fitting Web data Template Construction) method is proposed by analyzing the common features of the data record attribute items in query result pages. Starting from the visual, structural, textual, and semantic features of data records, the method introduces nonlinear data fitting to construct Web data templates, which solves the problem that VDLE cannot extract heterogeneous data records. In addition, the text features of the data title are analyzed and the DOM subtree is pruned, which solves the failure of VDLE to filter the noise inside the data title attribute items. Experimental results show that the accuracy obtained with templates constructed by NFTC is 100%, which is 5.32% higher than ViDE, 9.51% higher than D-EEM, and 1% higher than VDLE. The recall of NFTC is 100%, which is 9.42% higher than ViDE, 14.58% higher than D-EEM, and 1.25% higher than VDLE. The average extraction time is 55.15 ms, which is 69% lower than ViDE, 44% lower than D-EEM, and 56% lower than VDLE.

(3) To integrate data extracted from different Web databases, the MSDF (Multilevel Semantic measurement Data Fusion) algorithm is proposed, based on a study of existing lexical similarity measurement methods. First, the HowNet-based semantic similarity measure is improved according to the density, depth, and information content of the semantic information. Then, the normalized Google distance is introduced to improve the search engine-based semantic relevance measure. Finally, the analytic hierarchy process is used to fuse the dictionary-based similarity results with the search engine-based relevance results, and the extracted Web data is mapped into a unified, structured form, which solves the problem of multi-source Web data fusion. Experimental results show that the accuracy of MSDF is 98.5%, which is 82.3% higher than the HowNet-based similarity measure and 16.5% higher than the search engine-based relevance measure; the fusion rate of MSDF is 97%, which is 76% higher than the HowNet-based similarity measure and 12.5% higher than the search engine-based relevance measure.

(4) Using the research results of this thesis, a Web data extraction system was designed and developed to achieve accurate and efficient Web data extraction and integration.
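
Item (3) relies on the normalized Google distance (NGD) as the basis of its search engine-based relevance measure. The abstract does not describe the thesis's improved variant, so the following is only a minimal sketch of the standard NGD definition, not the author's method. The function name, parameter names, and the hit counts in the usage line are hypothetical and chosen purely for illustration.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD between two terms x and y.

    fx, fy -- number of pages containing x and y respectively (assumed inputs)
    fxy    -- number of pages containing both x and y
    n      -- total number of pages indexed by the search engine
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if fxy <= 0:
        # Terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    # NGD = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y)))
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Illustrative (made-up) hit counts, not data from the thesis.
print(normalized_google_distance(1000, 100, 10, 10**6))
```

A value near 0 means the two terms almost always appear together, and larger values mean they are less related. With the made-up counts above the result is roughly 0.5. How such a distance is converted into a relevance score and weighted against the HowNet-based similarity is specific to MSDF and is not shown here.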
Keywords/Search Tags:Web data extraction, Template, Semantics, Data fusi