Research On Page Segmentation Based On CEF

Posted on: 2016-11-01  Degree: Master  Type: Thesis
Country: China  Candidate: B Y Zhu  Full Text: PDF
GTID: 2308330473958505  Subject: Computer technology
Abstract/Summary:
Research on automatic data extraction from the Deep Web has produced many results. Compared with other approaches, methods based on the visual information of a webpage perform better: they do not depend on the DOM tree, rely only on visual information, and improve extraction accuracy. Many webpage segmentation methods now exist, and among them the VIPS (Vision-based Page Segmentation) algorithm achieves the best results. Unlike other existing methods, VIPS is independent of the webpage structure and works well even when the HTML structure changes. Because of these advantages, this thesis uses the VIPS algorithm to segment webpages.

To improve efficiency, this thesis implements the VIPS algorithm in CEF (Chromium Embedded Framework) and develops a method for obtaining the visual information of webpage nodes in CEF. Our work includes the following aspects:

(1) Visual Block Extraction. First, the visual information of each node in the webpage is obtained using JavaScript. Heuristic rules are then applied to this information to judge whether the DOM node can be segmented further. If not, the node is put into a set as a visual block.

(2) Separator Detection. Initially, the whole page is treated as a single separator, and the position and size of the separators are calculated from the visual blocks obtained. Each separator is then split, removed, or updated according to its relation to each visual block. Finally, weights are assigned to the separators.

(3) Content Structure Construction. The process starts from the separators with the lowest weight: the blocks on either side of these separators are merged to form new blocks. The separators with the next-lowest weight are then selected and the blocks beside them are merged, and so on. This merging iterates until the separators with the maximum weight are reached. At that point the content structure is constructed and the visual block tree corresponding to the page is obtained.

Experiments show that the proposed approach can divide web pages into visual blocks effectively.
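
The separator detection step can be illustrated with a minimal Python sketch. It is not the thesis implementation: it handles horizontal separators only, and the names `Block` and `detect_horizontal_separators` are hypothetical. It assumes each visual block is reduced to its top and bottom pixel coordinates, starts from one separator covering the whole page, and then splits, removes, or shrinks separators depending on how each block overlaps them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Vertical extent of a visual block, in page pixels (hypothetical type)."""
    top: float
    bottom: float


def detect_horizontal_separators(page_height, blocks):
    """Return (top, bottom) gaps between blocks, excluding page borders."""
    # Start with a single separator that spans the whole page.
    separators = [(0.0, float(page_height))]
    for block in blocks:
        updated = []
        for top, bottom in separators:
            if block.bottom <= top or block.top >= bottom:
                # No overlap: the separator is unchanged.
                updated.append((top, bottom))
            elif block.top <= top and block.bottom >= bottom:
                # The block covers the separator: remove it.
                continue
            elif block.top > top and block.bottom < bottom:
                # The block lies inside the separator: split it in two.
                updated.append((top, block.top))
                updated.append((block.bottom, bottom))
            elif block.top <= top:
                # The block overlaps the upper part: shrink from the top.
                updated.append((block.bottom, bottom))
            else:
                # The block overlaps the lower part: shrink from the bottom.
                updated.append((top, block.top))
        separators = updated
    # Separators touching the page border do not divide any blocks.
    return [(t, b) for t, b in separators if t > 0 and b < page_height]


if __name__ == "__main__":
    page_blocks = [Block(10, 30), Block(40, 60), Block(70, 90)]
    print(detect_horizontal_separators(100, page_blocks))
```

In a fuller version, each remaining separator would also receive a weight, and the merging step would process separators in increasing order of weight. How the weights are computed is left out here because the abstract does not specify the rules.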
Keywords/Search Tags: Deep Web, data extraction, vision information, VIPS, CEF