Web Page Noise Removal and Classification Algorithms

Posted on: 2009-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: C X Liu
Full Text: PDF
GTID: 2208360272957584
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the amount of information available online has grown rapidly. The Internet offers abundant information sources, but it also makes it hard to find the information we need quickly. To use these sources efficiently, we need to classify them.

This paper studies the key technologies of Web page categorization, focusing on Web page preprocessing and classification algorithms.

The key issue in Web page preprocessing is removing noise from Web pages, such as advertisements, navigation bars, and copyright notices, so that the main content can be extracted. Using information about the blocks in a Web page, such as their size and position, together with other features of the page, we propose a block-analysis-based approach with an automatically adjusted threshold for eliminating noise content.

Based on the relations between feature words, the feature combination algorithm merges feature words that make the same contribution to classification into a single pattern, and the pattern is then used as the basic feature dimension. We tested this algorithm in our Web page classifier. The results show that it works well on skewed data sets and also improves the performance of the classifier.

Many categorization algorithms have been devised in the text categorization field, and KNN and SVM are generally considered to perform better than the others. Combining KNN with SVM, we propose an algorithm called SVM-KNN, which improves classifier performance by revising classification results according to feedback from the predicted probability of each result.

Finally, we report experiments on Chinese Web page categorization that evaluate the block-analysis-based, auto-adjusted threshold approach and the SVM-KNN algorithm. The experimental results verify the validity of the approaches presented in this paper.
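The abstract does not give the details of SVM-KNN, so the following is only a minimal sketch of one way a probability-feedback combination could work, not the method from the thesis. It assumes that a primary classifier (standing in for the SVM) returns class probabilities, and that a prediction is handed to KNN when the top probability falls below a threshold. The names `knn_predict`, `svm_knn_predict`, `toy_proba`, the threshold of 0.7, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def svm_knn_predict(primary_proba, train, query, threshold=0.7, k=3):
    """Keep the primary prediction when it is confident, otherwise defer to KNN.

    primary_proba is any callable returning {label: probability}; here it
    stands in for an SVM with probability outputs (an assumption).
    """
    proba = primary_proba(query)
    best_label = max(proba, key=proba.get)
    if proba[best_label] >= threshold:
        return best_label
    return knn_predict(train, query, k)


def toy_proba(x):
    """Hypothetical stand-in for SVM probabilities: logistic of the first feature."""
    p = 1.0 / (1.0 + math.exp(-x[0]))
    return {"sports": p, "finance": 1.0 - p}


# Toy labelled feature vectors (illustrative only, not data from the thesis).
train = [
    ((0.2, 1.0), "finance"),
    ((0.3, 0.9), "finance"),
    ((0.1, 1.1), "finance"),
    ((3.0, 0.0), "sports"),
    ((2.5, 0.1), "sports"),
]

confident = svm_knn_predict(toy_proba, train, (3.0, 0.0))  # primary label kept
revised = svm_knn_predict(toy_proba, train, (0.2, 1.0))    # low confidence, KNN decides
```

In this sketch the second query gets a weak "sports" score from the stand-in classifier, so the decision is passed to its nearest neighbours, which are all labelled "finance". The threshold controls how often the primary result is revised.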
Keywords/Search Tags: Webpage categorization, Webpage noise, noise elimination, feature combination, SVM-KNN