
Research Of Web Page Purifying Method Based On Document Object Model

Posted on: 2010-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
Full Text: PDF
GTID: 2178360278961167
Subject: Computer application technology
Abstract/Summary:
A commercial web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements, included for business purposes and for easy user access. We call the blocks that are not main content blocks of the page noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Detecting and eliminating this noise is thus of great importance.

This thesis proposes a new page segmentation model called DSS_DOM, based on the following observation: many popular commercial web pages are designed with the help of <DIV> tags and style sheets. Web designers tend to put semantically related content into one <DIV> block and control the layout of that <DIV> block through style sheets. This technique is called "DIV plus CSS". Based on this observation, a web page is first partitioned into several blocks using DSS_DOM. Second, importance values are assigned to all the blocks using an evaluation algorithm. The algorithm uses information from the style sheets and the structure of DSS_DOM. The contents of blocks with low importance values are treated as unrelated content.

DSS_DOM identifies the basic data unit by its structural and semantic features and determines the logical structure of web pages. The algorithm based on DSS_DOM estimates the importance of DIV blocks and identifies the unrelated blocks.

The proposed technique is evaluated with two data mining tasks: a Web search engine and Web page classification. Experimental results show that our noise elimination technique is able to improve the mining results significantly.
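
The two-step idea in the abstract (partition a page into DIV blocks, then score each block and discard the low-scoring ones) can be illustrated with a minimal sketch. This is not the thesis's DSS_DOM model or its evaluation algorithm, which the abstract does not specify. The class `DivBlockParser`, the functions `importance` and `purify`, the `keep_ratio` threshold, and the link-density score are all assumptions introduced here for illustration; in particular, the sketch ignores style sheets, which the thesis's algorithm does use.

```python
from html.parser import HTMLParser


class DivBlockParser(HTMLParser):
    """Collect text and link-text lengths for each top-level <div> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # one dict per top-level <div>
        self.div_depth = 0   # current nesting depth of <div> tags
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth == 0:
                attr_map = dict(attrs)
                self.link_depth = 0  # start each block with a clean link state
                self.blocks.append({
                    "id": attr_map.get("id") or attr_map.get("class") or "",
                    "text_len": 0,
                    "link_len": 0,
                })
            self.div_depth += 1
        elif tag == "a" and self.div_depth > 0:
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.div_depth == 0:
            return  # text outside any <div> is ignored in this sketch
        n = len(data.strip())
        block = self.blocks[-1]
        block["text_len"] += n
        if self.link_depth > 0:
            block["link_len"] += n


def importance(block):
    """Hypothetical score: amount of text, discounted by link density."""
    if block["text_len"] == 0:
        return 0.0
    link_density = block["link_len"] / block["text_len"]
    return block["text_len"] * (1.0 - link_density)


def purify(html, keep_ratio=0.5):
    """Return ids of blocks scoring at least keep_ratio times the best score."""
    parser = DivBlockParser()
    parser.feed(html)
    parser.close()
    scored = [(b["id"], importance(b)) for b in parser.blocks]
    top = max((score for _, score in scored), default=0.0)
    if top <= 0:
        return []
    return [block_id for block_id, score in scored if score >= keep_ratio * top]


SAMPLE = """
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main"><p>Detecting and eliminating noisy blocks improves Web data mining results.</p></div>
<div id="footer">Copyright notice. <a href="/privacy">Privacy</a></div>
"""

print(purify(SAMPLE))  # ['main']
```

On the sample page, the navigation block scores zero because all of its text is link text, and the footer falls below the threshold, so only the `main` block is kept. Link density is a common stand-in for "noisy" blocks; a closer approximation of the thesis's approach would also need layout information from the style sheets.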
Keywords/Search Tags: Web Page Purifying, DOM, Web Page Segmentation, Web Page Noises, Web Page Classification