Research On Feature Selection In Web Page Classification

Posted on: 2018-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: L K Liu
Full Text: PDF
GTID: 2428330569485293
Subject: Electronics and Communications Engineering
Abstract/Summary:
Feature reduction is a key technology in web page classification systems, and a good feature reduction method is an effective way to classify web pages efficiently. Feature selection is an effective way to reduce the dimensionality of the feature space, and how well it selects feature words directly affects classification performance. In practical web page classification, the data set is limited and imperfect. Existing feature selection methods select feature words unsatisfactorily, which leads to low MicroF1 (micro-averaged F1) values for web page classification.

In this paper, the traditional CHI (chi-square test) feature selection method is analyzed in detail. The analysis finds that the traditional CHI method has the following drawbacks when reducing the features of different data sets: (1) the low-frequency word defect; (2) negative correlation between feature words and categories; (3) susceptibility to the balance and completeness of the data set. Existing improvements have solved the first two problems, but the method is still affected by the balance and completeness of the data set. To address these defects, this paper introduces word vectors to improve the CHI feature selection method. The main ideas of the improvement are as follows. First, the CHI formula is improved using the frequency of words within categories, and negative correlations between feature words and categories are disregarded, to address the low-frequency word defect and the negative-correlation problem of the traditional CHI method. Second, the improved CHI method is used to select feature words from different data sets, and word vectors are then used to expand the selected subset of feature words, so that the selection results are less susceptible to the balance and completeness of the data set.

Different feature selection methods are used to select feature words in order to verify the validity and feasibility of the improved CHI method proposed in this paper. MicroF1 is used to evaluate the classification performance on all web pages in the test set. The experimental results show that the proposed CHI feature selection method achieves better classification performance on different data sets.
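
For readers unfamiliar with the two quantities named in the abstract, the following is a minimal sketch of the textbook chi-square statistic for a (word, category) pair and of micro-averaged F1. It is not the improved formula proposed in the thesis, which the abstract does not give. The function names and the counts a, b, c, d are illustrative assumptions: a is the number of documents in the category that contain the word, b the number outside the category that contain it, c the number in the category that lack it, and d the number outside the category that lack it.

```python
def chi_square(a, b, c, d):
    """Textbook CHI score for one (word, category) pair from document counts.

    a: in category, contains word      b: not in category, contains word
    c: in category, lacks word         d: not in category, lacks word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # word or category is absent or universal: no evidence
    return n * (a * d - c * b) ** 2 / denom


def is_negatively_correlated(a, b, c, d):
    """The sign of (a*d - c*b) is lost when it is squared in chi_square."""
    return a * d - c * b < 0


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # A word that appears only inside the category and a word that appears
    # only outside it receive the same CHI score.
    print(chi_square(10, 0, 0, 10), chi_square(0, 10, 10, 0))
    print(is_negatively_correlated(0, 10, 10, 0))
    print(micro_f1(["a", "b", "a"], ["a", "a", "a"]))
```

Because the difference a*d - c*b is squared, a word that occurs only outside a category scores as highly as one that occurs only inside it, which is one way to read drawback (2). The statistic also counts documents rather than occurrences, so how often a word appears within a document plays no role in the score.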
Keywords/Search Tags: Web page classification, Feature reduction, Feature selection, Distributed representation