
Improved Feature Selection Methods For Web Pages Based On DIV Iterative Search And Information Gain

Posted on: 2016-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: D Liu
GTID: 2308330461484242
Subject: Electronic and communication engineering
Abstract/Summary:
The number of web pages on the Internet is growing exponentially, which poses a great challenge for Internet information processing. Automatic web page classification makes acquiring and processing information more effective and convenient. This paper first introduces text classification and, building on it, gives the general procedure for web page classification. It then introduces the key techniques used in web page classification, including web crawling, web page denoising, Chinese word segmentation, and stopword removal. It also introduces several classical methods of feature selection and feature weight calculation.

To obtain effective content for web page classification, this paper proposes a method for extracting the principal content of a web page based on DIV iterative search. First, it gives a web template detection method based on the similarity of structure and content. It then extracts the main content by iteratively searching DIV blocks according to the detected template and predefined rules. The experimental results show that the proposed methods extract the main web content effectively.

To reduce the dimension of the feature space and improve the accuracy of web page classification, this paper proposes improvements to existing feature selection algorithms. Because existing feature selection algorithms usually ignore the associations between words, this paper gives an association-based feature selection method: it clusters the feature set according to the correlation between words and keeps the cluster centers as candidate features to reduce redundancy.
Because feature selection methods usually lack class distinction, this paper also gives a feature selection method based on class distinction: starting from the information gain (IG) scores, it selects feature words for each class separately and takes the distribution of words between and within classes into account. It also uses a scale factor to reduce the impact of cases in which a word does not occur.

Finally, this paper introduces the support vector machine (SVM) used for web page classification, analyzing why SVMs are suitable for web page classification and describing the classical SVM tool LIBSVM. Experiments are carried out to verify the proposed methods, and the results show that the work in this paper improves the performance of web page classification.
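
For readers unfamiliar with the IG score that the class-distinction method starts from, the following is a minimal sketch of the standard information gain of a term, treating each document as a set of words. It is not the improved algorithm described above: the per-class selection, the between-class and within-class distribution terms, and the scale factor are not specified in this abstract and are left out. The function names `entropy` and `information_gain` and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(docs, labels, term):
    """Standard IG of a term: H(C) - P(t) H(C|t) - P(not t) H(C|not t).

    docs   -- list of sets of words, one set per document (assumed format)
    labels -- list of class labels, aligned with docs
    """
    n = len(docs)
    if n == 0:
        return 0.0
    all_counts = Counter(labels)
    # Class counts among documents that contain the term.
    with_term = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    # Class counts among documents that do not contain the term.
    without_term = all_counts - with_term
    n_with = sum(with_term.values())
    n_without = n - n_with
    return (
        entropy(all_counts.values())
        - (n_with / n) * entropy(with_term.values())
        - (n_without / n) * entropy(without_term.values())
    )


# Toy corpus (hypothetical): two classes, two documents each.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"stock", "bank"},
]
labels = ["sports", "sports", "finance", "finance"]

# Rank the vocabulary by IG, highest first.
vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy corpus, "goal" and "stock" each separate the two classes perfectly and so rank at the top, while a word that appears in only one document scores lower. A global ranking like this is the kind of baseline that a per-class selection step would then refine.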
Keywords/Search Tags: web page classification, information gain, association, class distinction, support vector machine