Research On Page Segmentation Based On CEF

Posted on:2016-11-01

Degree:Master

Type:Thesis

Country:China

Candidate:B Y Zhu

Full Text:PDF

GTID:2308330473958505

Subject:Computer technology

Abstract/Summary:

Research on automatic data extraction from the Deep Web has produced many results. Compared with other approaches, methods based on the visual information of a webpage perform better: they remove the dependence on the DOM tree, use only the page's visual information, and improve extraction accuracy. Many page segmentation methods exist; among them, the VIPS (Vision-based Page Segmentation) algorithm achieves the best results. Unlike other existing methods, VIPS is independent of the webpage's structure and works well even when the HTML structure changes. Because of these advantages, this thesis uses the VIPS algorithm to segment webpages. To improve efficiency, the thesis implements VIPS in CEF (Chromium Embedded Framework) and develops a method for obtaining the visual information of webpage nodes in CEF.

The work includes the following aspects:

(1) Visual Block Extraction. First, the visual information of each node in the webpage is obtained using JavaScript. Heuristic rules are then applied to this information to judge whether the DOM node can be segmented further. If it cannot, the node is put into a set as a visual block.

(2) Separator Detection. Initially, the whole page is treated as a single separator, and the position and size of the separators are calculated from the visual blocks obtained. Each separator is then split, removed, or updated according to its relation to each visual block. Finally, weights are assigned to the separators.

(3) Content Structure Construction. The process starts from the separators with the lowest weight, and the blocks beside these separators are merged to form new blocks. The separators with the next lowest weight are then selected and the blocks beside them are merged, and so on. This merging iterates until the separators with the maximum weight are reached. Finally, the content structure is constructed and the visual block tree corresponding to the page is obtained.

Experiments show that the proposed approach can divide a webpage into visual blocks effectively.
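
The separator detection step (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: it handles horizontal separators only, represents each visual block by its top and bottom y coordinates, and assumes that a separator's weight is simply its height, since the abstract does not state the weighting rule. The function name `detectHorizontalSeparators` and the field names are hypothetical.

```javascript
// Simplified horizontal separator detection over visual blocks.
// A block is {top, bottom}; a separator is {start, end} on the y axis.
// All names are hypothetical and chosen for illustration only.
function detectHorizontalSeparators(pageHeight, blocks) {
  // Begin with a single separator that spans the whole page.
  let separators = [{ start: 0, end: pageHeight }];

  for (const block of blocks) {
    const next = [];
    for (const sep of separators) {
      if (block.bottom <= sep.start || block.top >= sep.end) {
        // The block does not overlap the separator: keep it unchanged.
        next.push(sep);
      } else if (block.top > sep.start && block.bottom < sep.end) {
        // The block lies strictly inside the separator: split it in two.
        next.push({ start: sep.start, end: block.top });
        next.push({ start: block.bottom, end: sep.end });
      } else if (block.top <= sep.start && block.bottom >= sep.end) {
        // The block covers the whole separator: remove it (push nothing).
      } else if (block.top <= sep.start) {
        // The block overlaps the upper part: update the start.
        next.push({ start: block.bottom, end: sep.end });
      } else {
        // The block overlaps the lower part: update the end.
        next.push({ start: sep.start, end: block.top });
      }
    }
    separators = next;
  }

  // Drop separators that touch the page border, then attach a weight.
  // Assumption: weight equals the separator height (taller gap, higher weight).
  return separators
    .filter((s) => s.start > 0 && s.end < pageHeight)
    .map((s) => ({ start: s.start, end: s.end, weight: s.end - s.start }));
}

// Usage: three stacked blocks on a page 100 units tall.
const demoSeparators = detectHorizontalSeparators(100, [
  { top: 0, bottom: 20 },
  { top: 30, bottom: 60 },
  { top: 65, bottom: 100 },
]);
console.log(demoSeparators);
```

In this sketch the separator with the smaller weight would be the first candidate for merging in step (3), because the blocks on either side of it are closest together.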

Keywords/Search Tags:

Deep Web, data extraction, vision information, VIPS, CEF

Related items

1	Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM
2	Deep Web Mining Combining Vision And DOM Information
3	Research On Technique Of Self-adaptive Web Data Extraction
4	Web Topic Information Extraction System Design And Implementation
5	Research On Vision-based Web Page Information Extraction Technology
6	Research On Web Information Extraction Technology In Vertical Search Engine
7	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
8	Research On Image Key Information Extraction Algorithm Based On Computer Vision
9	Research On Web Information Extraction Technology Based On Deep Web
10	Automatic Data Extraction Of The Query Answer From The Deep Web