
Design And Implementation Of DOM-based Noise Reduction System

Posted on: 2011-11-17  Degree: Master  Type: Thesis
Country: China  Candidate: L M Luo  Full Text: PDF
GTID: 2248330371963666  Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, the amount of information on the Internet has grown exponentially. While the Internet offers abundant information sources, it also makes it challenging to find the information we need quickly. To use these sources efficiently, they must be preprocessed. The key task in web page preprocessing is noise reduction: removing noise data such as advertisements, navigation bars, and copyright notices so that the main information can be obtained.

First, this paper introduces the methods and technologies of web page purification. Noise reduction methods include the web-based approach, the template-based approach, and methods based on visual information. Web purification technology covers three areas: information extraction, page segmentation, and web page adaptation.

Second, this paper introduces the definition and structure of XHTML and DOM. On this basis, it proposes a web layout-based DOM model (WLB_DOM), which comprises preprocessing, label filtering, and model structure.

Then, this paper proposes a WLB_DOM-based noise reduction algorithm that combines web page structure with visual information, and describes the idea, structure, and process of the algorithm. In this algorithm, the biggest block at the same level of the web page is marked as the subject information block. To verify the correctness and effectiveness of the algorithm, a series of experiments was carried out on part of the CWT200G test set. The experimental results show that the algorithm achieves high accuracy.

Finally, based on the proposed model and algorithm, a prototype WLB_DOM-based noise reduction system was implemented in C# on the Microsoft Visual Studio 2008 development platform.
Keywords/Search Tags: noise reduction, web page noise, DOM, page segmentation
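
The "biggest block at the same level" idea from the abstract can be illustrated with a minimal sketch. This is not the thesis's WLB_DOM implementation (which was written in C# and also uses visual information); it is a Python approximation under stated assumptions: block size is measured by visible text length, the label filter simply drops `script`, `style`, and `noscript` content, and the descent stops when no single child holds at least half of its parent's text. All names here (`Node`, `TreeBuilder`, `parse`, `main_block`, `ratio`) are hypothetical.

```python
from html.parser import HTMLParser

# Assumed label filter: text inside these tags is treated as noise.
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements never get a closing tag, so they are not pushed on the tree.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A simplified DOM node that records its own visible text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def text_len(self):
        # Size of a block = its own text plus the text of all descendants.
        return len(self.text) + sum(c.text_len() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP_TAGS:
            self.cur.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def main_block(root, ratio=0.5):
    """Descend level by level, always following the biggest child block.

    Stops when the biggest child holds less than `ratio` of the current
    block's text, i.e. the content is spread across several siblings.
    """
    node = root
    while node.children:
        best = max(node.children, key=Node.text_len)
        if best.text_len() == 0 or best.text_len() < ratio * node.text_len():
            break
        node = best
    return node


SAMPLE = """
<html><body>
  <div id="nav"><a>Home</a><a>News</a></div>
  <div id="main">
    <p>Web pages mix main content with noise.</p>
    <p>Noise includes ads and navigation bars.</p>
    <p>Removing it leaves the subject block.</p>
  </div>
  <div id="footer">Copyright 2011</div>
  <script>var tracker = "a long script string that is not content";</script>
</body></html>
"""

block = main_block(parse(SAMPLE))
print(block.tag, block.attrs)
```

On the sample page, the descent passes through `html` and `body`, picks the `div` with `id="main"` over the navigation bar and footer, and stops there because none of its three paragraphs dominates. Text length is only one possible size measure; a layout-based model could equally use rendered area or other visual features.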