
Web Page Segmentation Based On Semi-supervised Structured Learning

Posted on: 2018-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Feng
Full Text: PDF
GTID: 2348330512497197
Subject: Computer technology
Abstract/Summary:
Web page segmentation aims to break a page into visually and semantically coherent blocks, emulating human visual perception. Most existing methods use heuristic rules or machine learning models to recognize blocks. However, heuristics are myopic and lack a full structural analysis of the whole page. Supervised learning requires a large and representative labeled training set to generalize well, but labeled web pages are expensive to obtain. To counter these shortcomings, this thesis proposes a new web page segmentation method based on semi-supervised structured learning. Using the segmentation graph structure extracted from a web page, we formulate segmentation as a label assignment task over the candidate boundaries: each boundary is labeled according to whether the current block should be split at it. We cast the computation of the highest-scoring label assignment as a 0-1 integer linear programming problem, and use an extended co-training structured support vector machine to learn the joint feature weights.

The work of this thesis is as follows:

1. We review and analyze the existing methods. To overcome their problems, we extract the web segmentation graph structure, which reflects the candidate segmentation boundaries and the parent and adjacency dependency relations among them. The segmentation of a web page is thus transformed into a structured label assignment task on the graph.

2. Local features and context features are extracted to build the joint feature representation of the segmentation graph. Inference of the best label assignment is cast as a 0-1 linear programming problem, which is solved by optimizing its linear programming relaxation. The Co-Structured SVM model, which extends the co-training framework with an ensemble method, learns the feature weights.

3. The experimental results demonstrate that Co-Structured SVM, which combines the advantages of structured learning and semi-supervised learning, performs better than the other methods compared. It makes use of unlabeled data and provides a good structural analysis of the web page.
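The inference step described above (choosing the highest-scoring 0-1 label for every candidate boundary) can be illustrated with a small sketch. This is not the thesis's implementation: the score function, the names `score` and `best_assignment`, and the toy weights are all assumptions made for illustration, and exhaustive enumeration stands in for the 0-1 integer program and its linear programming relaxation, which would be needed for graphs of realistic size.

```python
from itertools import product


def score(labels, unary, pairwise):
    """Score one 0-1 assignment over candidate boundaries.

    unary[i] is a hypothetical local score for cutting at boundary i;
    pairwise[(i, j)] is a hypothetical context score that applies only
    when the dependent boundaries i and j are both cut.
    """
    total = sum(unary[i] * y for i, y in enumerate(labels))
    total += sum(w * labels[i] * labels[j] for (i, j), w in pairwise.items())
    return total


def best_assignment(unary, pairwise):
    """Return the highest-scoring assignment by brute force.

    Enumerates all 2**n assignments, so it is only usable on toy graphs.
    """
    candidates = product((0, 1), repeat=len(unary))
    return max(candidates, key=lambda y: score(y, unary, pairwise))


# Toy segmentation graph (assumed values): three candidate boundaries,
# with dependency edges between boundaries 0-1 and 1-2.
unary = [2.0, -1.0, 0.5]
pairwise = {(0, 1): 1.2, (1, 2): -2.0}

best = best_assignment(unary, pairwise)
print(best, score(best, unary, pairwise))
```

On this toy graph the best assignment is `(1, 0, 1)` with score 2.5: boundaries 0 and 2 are cut and boundary 1 is not. In an integer-programming formulation of the same objective, each product `labels[i] * labels[j]` would typically be replaced by an auxiliary 0-1 variable with linear constraints, which is what makes a linear programming relaxation possible.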
Keywords/Search Tags:Web Page Segmentation, Semi-Supervised Learning, Structured Learning, Co-Structured SVM