
Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM

Posted on: 2017-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Sun
Full Text: PDF
GTID: 2278330485466747
Subject: Computer application technology
Abstract/Summary:
Information that can be obtained only from result pages is called the Deep Web. These result pages are generated when a user submits a query form and the results are returned from a back-end database. Research on the Deep Web is currently a popular topic. However, as page structures grow more complex and dynamic Web page technologies are introduced, Deep Web pages have become heterogeneous and semi-structured. Quickly and efficiently extracting the data that users are interested in from these semi-structured result pages, in order to provide a specific service, has therefore become difficult. The main problems that need to be studied are: (1) how to quickly and efficiently identify noise information, so that the page can be cleaned as thoroughly as possible before the original page is analyzed; (2) how to quickly locate the main data region of a Web page from the DOM tree structure and the visual information of the page; (3) how to extract the page data automatically without being affected by differences in page structure.

The traditional Web page analysis method based on the DOM tree can no longer meet these needs. It depends mainly on the structural characteristics of the DOM tree: it must parse every tag in the page and then transform the page into a DOM tree. This method ignores useful visual features of the page, and whenever the page structure changes, the structure must be analyzed again before data can be extracted. Microsoft Research Asia proposed a different page data extraction method, the VIPS (Vision-based Page Segmentation) algorithm. Unlike traditional DOM-tree-based extraction, VIPS starts from the human visual point of view: it first divides the page into a number of effective visual blocks and then reconstructs a visual tree based on semantics. VIPS thus breaks with traditional approaches such as the DOM-tree-based method and builds a bridge between the DOM tree structure and the semantics of the page.

This thesis analyzes the characteristics of Deep Web result pages, combines them with the characteristics of human vision, and proposes a method called Extraction Data Information based on the Basic Visual Block, which builds on the VIPS algorithm. The method first analyzes the tags of the page and, before the parser parses the Web document into a syntax tree, removes information unrelated to the topic, such as navigation and advertising. It then divides the DOM tree into semantic blocks using the VIPS algorithm. After the division, it first finds the standard block according to block position, then takes the standard block as the center block and traverses the DOM tree in reverse and in sequential order, using the linear feature vector criterion to find all similar visual blocks. These blocks are the information blocks to be extracted. The experimental results show that the method based on the Basic Visual Block is feasible and improves data extraction accuracy to some extent compared with the traditional method.
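
The final step of the abstract, searching outward from a center block for similar visual blocks, can be illustrated with a minimal Python sketch. The abstract does not define the linear feature vector criterion, so the weighted relative-difference score below is an assumption made only for illustration, as are the `VisualBlock` type, the chosen features (width, height, link count, text length), the weights, and the threshold. The sketch also flattens the blocks into a list in document order instead of walking a real DOM tree.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VisualBlock:
    """Hypothetical visual block: a label plus a numeric feature vector."""
    name: str
    features: Sequence[float]  # assumed: (width, height, link count, text length)


def is_similar(a: VisualBlock, b: VisualBlock,
               weights: Sequence[float], threshold: float) -> bool:
    """Assumed criterion: weighted sum of relative feature differences."""
    score = 0.0
    for w, x, y in zip(weights, a.features, b.features):
        denom = max(abs(x), abs(y))
        if denom > 0:
            score += w * abs(x - y) / denom
    return score <= threshold


def find_similar_blocks(blocks: List[VisualBlock], center: int,
                        weights: Sequence[float],
                        threshold: float = 0.2) -> List[VisualBlock]:
    """Scan backward, then forward, from the center block and keep
    every block that is similar to it; return them in document order."""
    center_block = blocks[center]
    before = []
    for i in range(center - 1, -1, -1):          # reverse traversal
        if is_similar(center_block, blocks[i], weights, threshold):
            before.append(blocks[i])
    after = []
    for i in range(center + 1, len(blocks)):     # sequential traversal
        if is_similar(center_block, blocks[i], weights, threshold):
            after.append(blocks[i])
    return list(reversed(before)) + [center_block] + after


# Made-up blocks in document order; values are illustrative only.
sample_blocks = [
    VisualBlock("nav",     (960, 40, 8, 60)),
    VisualBlock("result1", (700, 120, 2, 300)),
    VisualBlock("result2", (700, 118, 2, 280)),
    VisualBlock("ad",      (200, 250, 1, 20)),
    VisualBlock("result3", (700, 125, 3, 320)),
    VisualBlock("footer",  (960, 60, 5, 100)),
]
sample_weights = (0.25, 0.25, 0.25, 0.25)

matches = find_similar_blocks(sample_blocks, center=2, weights=sample_weights)
print([b.name for b in matches])
```

With these made-up values, the navigation, advertisement, and footer blocks differ strongly from the center block in size and link density, so only the result records are kept. A real implementation would take its features from the VIPS output and would need to choose the center block by position, as the abstract describes.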
Keywords/Search Tags:Data extraction, DOM tree, VIPS algorithm, Visual characteristics, Basic Visual Block