Improved Feature Selection Methods For Web Pages Based On DIV Iterative Search And Information Gain

Posted on:2016-11-25

Degree:Master

Type:Thesis

Country:China

Candidate:D Liu

Full Text:PDF

GTID:2308330461484242

Subject:Electronic and communication engineering

Abstract/Summary:

The number of web pages on the Internet is growing exponentially, which poses a great challenge for Internet information processing. Automatic web page classification makes acquiring and processing information more effective and convenient. This paper first introduces text classification and, based on it, gives the general procedure for web page classification. It then introduces the key techniques used in web page classification, including web crawling, web page denoising, Chinese word segmentation, and stopword removal. In addition, it introduces several classical methods of feature selection and feature weight calculation.

To obtain effective content for web page classification, this paper proposes a method for extracting the main content of a web page based on DIV iterative search. First, it gives a web template detection method based on the similarity of structure and content. It then extracts the main content by iteratively searching DIV blocks according to the detected template and predefined rules. The experimental results show that the proposed methods extract the main web content effectively.

To reduce the dimension of the feature space and improve the accuracy of web page classification, this paper proposes improvements to existing feature selection algorithms. Because existing feature selection algorithms usually ignore the associations between words, this paper proposes a feature selection method based on association. It first clusters the feature set according to the correlation between words and keeps the cluster centers as candidate features to reduce redundancy.
Because feature selection methods usually lack class distinction, this paper also proposes a feature selection method based on class distinction. Starting from the information gain (IG) scores, it selects feature words for each class separately and takes into account the distribution of words both between and within classes. It also uses a scale factor to reduce the influence of cases in which a word does not occur.

Finally, this paper introduces the support vector machine (SVM) used for web page classification, analyzing why SVMs are suitable for web page classification and describing the classical SVM tool libsvm. Experiments are carried out to verify the proposed methods, and the results show that the work in this paper improves the performance of web page classification.
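
For readers unfamiliar with information gain, the baseline score that the class-distinction method starts from can be sketched as follows. This is a minimal illustration of the standard IG formula for term presence or absence. It is not the improved algorithm of the thesis: the per-class selection, the between-class and within-class distributions, and the scale factor are not reproduced here. The function names (`entropy`, `information_gain`, `top_k_features`) and the representation of each document as a set of words are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of a term: H(C) - H(C | term present or absent).

    docs   -- list of sets of words (hypothetical representation)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    # Class counts among documents that contain / do not contain the term.
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)

    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional


def top_k_features(docs, labels, k):
    """Rank the vocabulary by IG (ties broken alphabetically), keep k words."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]
```

A word that appears in every document of one class and in no document of the others receives the maximum score, while a word spread evenly across classes scores zero. Such a global score says nothing about which class a word favors, which is the gap the class-distinction method described above is meant to address.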

Keywords/Search Tags:

webpage classification, information gain, association, class distinction, support vector machine

Related items

1	Web Pages Classification Based On Active Learning Support Vector Machine Learning
2	CARSVM: Classification by integrating class association rules and support vector machine
3	Research On Support Vector Machine Classification Algorithm For Multi-class Texts
4	Research And Implementation Of Web Page Classification Based On CNN And SVM
5	Support Vector Machine And Its Applications
6	Research On Robust Least Square One-class Support Vector Machines
7	Research And Application Of Multi-Class Classification On Support Vector Machine
8	Research On Text Classification Method Based On Support Vector Machine
9	Multi-class Classification Algorithm Research Based On Fuzzy Support Vector Machines
10	The Research Of Classification Algorithm Based On Support Vector Machine