Web Page Noise Reducing Based On Tag Feature Vector

Posted on: 2011-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Chen
Full Text: PDF
GTID: 2178360302964545
Subject: Computer application technology
Abstract/Summary:
The Internet provides all kinds of information published as web pages. However, the useful information is usually accompanied by a large amount of noise, such as navigation panels, advertisements, and copyright notices. Existing methods, such as the CSS model and the block-based visual approach, can remove web page noise with good results, but they still have problems: they are often limited to a certain type of web page, some need many thresholds to be set, and some models lack generality.

In this paper, we propose a tag feature vector to deal with web page noise. First, a vector set is established for the leaf nodes of the DOM tree. Because irregular use of tags produces mixed nodes, the DOM tree is normalized so that useful textual information is not lost. Then, following the feature definitions, the DOM tree is traversed and each node is marked with its feature values to build the feature vector set. Second, a clustering algorithm groups the feature vector set into K classes. Finally, the class with the strongest text characteristics is selected, and a further step removes the noise remaining in that target class.

This approach has two features. First, as an important contribution of the paper, web page nodes are mapped to points in space, so that popular data mining techniques, such as clustering algorithms, can be applied. Second, it does not need a large number of pages to establish the model.

The approach also overcomes shortcomings of some other models. First, it uses only one parameter rather than many, so it depends less on parameter settings. Second, the model is not limited to certain types of web pages. Experimental results show that the method reduces noise with good results on different types of web pages.
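The three steps described in the abstract (build feature vectors for DOM leaves, cluster them into K classes, keep the class with the strongest text characteristics) can be illustrated with a minimal sketch. The abstract does not list the actual features, the clustering algorithm, or the selection rule, so everything specific here is an assumption made for illustration: the three features (text length, whether the text sits inside a link, punctuation count), the use of plain k-means, the "largest mean text length" selection rule, and the names `LeafTextCollector`, `kmeans`, and `extract_main_text`. Treating every text run as its own leaf is only a crude stand-in for the DOM normalization of mixed nodes, and the final clean-up of the target class is not modeled.

```python
from html.parser import HTMLParser
import random


class LeafTextCollector(HTMLParser):
    """Collects text leaves with a small, hypothetical feature vector each."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current position
        self.leaves = []  # (text, [length, in_link, punctuation_count])

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIP for t in self.stack):
            return
        in_link = 1.0 if "a" in self.stack else 0.0
        punct = float(sum(text.count(c) for c in ".,;:!?"))
        self.leaves.append((text, [float(len(text)), in_link, punct]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def extract_main_text(html, k=2):
    """Returns the texts of the cluster whose center has the largest text length."""
    parser = LeafTextCollector()
    parser.feed(html)
    parser.close()
    texts = [t for t, _ in parser.leaves]
    points = [v for _, v in parser.leaves]
    if len(points) < k:
        return texts
    labels, centers = kmeans(points, k)
    target = max(range(k), key=lambda c: centers[c][0])
    return [t for t, lab in zip(texts, labels) if lab == target]


sample = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/c">Contact</a></div>
<p>The first paragraph of the article is long, contains several clauses, and uses
ordinary punctuation, which is typical of the main content of a news page.</p>
<p>The second paragraph continues the story with further detail, more sentences,
and additional punctuation. It is also much longer than any navigation label.</p>
<div id="footer">Copyright 2011 Example Corp</div>
</body></html>
"""

for block in extract_main_text(sample):
    print(block)
```

Because the raw text length dominates the other two features in this sketch, a real implementation would likely scale the features before clustering; that choice, like the features themselves, is not specified in the abstract.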
Keywords/Search Tags: web page noise reducing, text extraction, feature vector, cluster, DOM tree