Research And Realization Of Term Selection In Chinese Web Page Classification Based On VSM

Posted on: 2013-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhou
Full Text: PDF
GTID: 2248330374451570
Subject: Signal and Information Processing
Abstract/Summary:
With the development of information technology, automatic web page classification has become one of the most active research topics in the Web field, and it is widely used in information retrieval, information filtering, and other areas. Feature selection is an important step in automatic web page classification: it selects, from the original feature space, the terms that best distinguish categories, which reduces the dimensionality of the web page text vector space and improves the efficiency and accuracy of classifiers.

This paper describes each step of automatic Chinese web page classification and implements web page cleaning, Chinese word segmentation, stop-word removal, feature term selection, feature term weighting, vector space model generation, and other preprocessing operations. The key research in this paper concerns feature selection methods based on statistical learning: document frequency (DF), chi-square statistics (CHI), and information gain (IG). Experiments comparing the classification performance of these three methods show that CHI-based feature selection is superior to DF and IG, and that the DF method matches the CHI method at a particular number of feature terms. Although the classification accuracy of the IG method falls short of DF and CHI, its classification stability equals that of CHI and is better than that of DF.

Based on an analysis of the traditional feature selection methods, this paper proposes improvements that address their deficiencies.

Because the traditional DF method is biased toward globally high-frequency terms, the feature terms it selects are unevenly distributed among categories, which leads to low classification performance for some categories. This paper proposes a new feature selection method based on the relative DF within each category. The modified DF method first selects terms within each category and then merges the local feature terms.

The traditional chi-square method relies heavily on terms that carry high concentration information but have low document frequency and low representativeness. Once the number of feature terms grows beyond a certain point, this reliance causes a sharp drop in classification performance. This paper proposes another new feature selection method that combines a DF threshold with the traditional CHI method. The new CHI method removes terms with high global DF and terms with low local DF in their respective categories, which reduces the traditional CHI method's heavy reliance on low-frequency terms.

To address the traditional IG method's poor classification performance, this paper improves it overall by merging concentration information, dispersion information, and term frequency into the evaluation function of the traditional IG method. In addition, this paper uses local IG feature selection in place of the traditional approach of selecting the terms with the maximum evaluation value among all categories.

With the above modified methods implemented, the program generates vector space models (VSM) for both the training and the testing web pages and uses them as the input of a classifier. Based on the results of extensive experiments, this paper concludes that all of the above modified feature selection methods improve the performance of the classification system.
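The idea of local selection followed by merging can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes documents are already segmented into term lists, uses the standard textbook 2x2 contingency-table form of the chi-square statistic, and takes "relative DF" to mean the fraction of a category's documents that contain the term; the abstract does not give the exact definition. All function names (`document_frequencies`, `chi_square`, `relative_df`, `select_local`) are hypothetical.

```python
from collections import Counter, defaultdict

def document_frequencies(docs, labels):
    """Count, per category, how many documents contain each term."""
    df = defaultdict(Counter)   # category -> term -> document count
    n_docs = Counter(labels)    # category -> number of documents
    for terms, label in zip(docs, labels):
        df[label].update(set(terms))  # set(): count a term once per document
    return df, n_docs

def chi_square(term, category, df, n_docs):
    """Textbook chi-square score of a term for one category (assumed form)."""
    n = sum(n_docs.values())
    a = df[category][term]                                       # in category, has term
    b = sum(df[o][term] for o in n_docs if o != category)        # elsewhere, has term
    c = n_docs[category] - a                                     # in category, lacks term
    d = n - a - b - c                                            # elsewhere, lacks term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def relative_df(term, category, df, n_docs):
    """Assumed definition: share of the category's documents containing the term."""
    return df[category][term] / n_docs[category]

def select_local(docs, labels, k, score):
    """Pick the top-k terms in each category by `score`, then merge the local sets."""
    df, n_docs = document_frequencies(docs, labels)
    selected = set()
    for category in n_docs:
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(df[category],
                        key=lambda t: (-score(t, category, df, n_docs), t))
        selected.update(ranked[:k])
    return selected

# Toy corpus: ASCII tokens stand in for segmented Chinese words.
docs = [["football", "match"], ["football", "player"],
        ["stock", "market"], ["stock", "match"]]
labels = ["sports", "sports", "finance", "finance"]
features = select_local(docs, labels, k=1, score=relative_df)
```

Because each category contributes its own top-k terms before the merge, no category is left without representative features, which is the imbalance the modified DF method is meant to address. Passing `chi_square` as `score` gives a local CHI selection under the same assumptions.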
Keywords/Search Tags: Web page classification, Term selection, Vector space model (VSM), Information gain, Chi-square statistics