Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM

Posted on:2017-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:L Sun

GTID:2278330485466747

Subject:Computer application technology

Abstract/Summary:

Deep Web information is information that can be obtained only from result pages: the user submits a form query, and the results are returned from a back-end database. Research on the Deep Web is currently a popular topic. However, as page structures grow more complex and dynamic Web page technologies are introduced, Deep Web pages have become heterogeneous and semi-structured. Quickly and efficiently extracting the data that users are interested in from these semi-structured result pages, in order to provide a specific service, has therefore become difficult. The main problems to be studied are: (1) how to quickly and efficiently identify noise information, so that the page is cleaned as thoroughly as possible before the original page is analyzed; (2) how to quickly locate the main data region of the Web page using the DOM tree structure and the visual information of the page; (3) how to extract page data automatically without being affected by differences in page structure.

With respect to these problems, the traditional DOM-tree-based page analysis method can no longer meet users' needs. It depends mainly on the structural characteristics of the DOM tree and must parse every tag of the page in order to transform the page into a DOM tree. It ignores useful visual features of the page, and whenever the page structure changes, the structure must be analyzed again before data can be extracted. Microsoft Research Asia has proposed a different page data extraction method, the VIPS algorithm. Unlike traditional DOM-tree-based extraction, VIPS works from the viewpoint of human visual perception: it first divides the page into visual blocks and then reconstructs a visual tree based on semantics. VIPS thus departs from earlier DOM-tree-based extraction methods and builds a bridge between the DOM tree structure and the semantics of the page.

This thesis analyzes the characteristics of Deep Web result pages, combines them with the characteristics of human vision, and proposes a data extraction method based on the Basic Visual Block, built on the VIPS algorithm. The method first analyzes the tags of the page and, before the parser turns the Web document into a syntax tree, removes information unrelated to the page topic, such as navigation and advertising. It then divides the DOM tree into semantic blocks using the VIPS algorithm. After segmentation, it first finds the standard block according to block position, then takes the standard block as the center and traverses the DOM tree backward and forward, using a linear feature vector criterion to find all similar visual blocks. These blocks are the information blocks to be extracted. The experimental results indicate that the method is feasible and improves data extraction accuracy to some extent compared with the traditional method.

Keywords/Search Tags:

Data extraction, DOM tree, VIPS algorithm, Visual characteristics, Basic Visual Block

Related items

1	Research Of Text Extraction Algorithm Based On Visual Semantic Block
2	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
3	Research On Technique Of Self-adaptive Web Data Extraction
4	Algorithms Based On Visual Similarity Of The Research In Information Extraction And Implementation
5	Web Page Metadata Extraction Method Based On Visual Block Recognition
6	Research On Efficient Web Data Extraction Technology Based On Visual Information
7	WEB Page Theme Block Identification According To Combination Features
8	Image Visual Security Assessment Based On HVS Characteristics
9	Research And Implementation Of Webpage Cleaning Algorithm Based On Visual Information
10	A Study On Web Information Extraction Algorithm And Agricultural Application