Research And Application On Feature Selection Algorithms Based On Term Distributions In Text Categorization

Posted on: 2017-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: J Guo
GTID: 2348330536476785
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the number of electronic documents has increased drastically in recent years, and it has become difficult for people to find what they really want in such massive data. How to organize and manage this data efficiently has therefore become an important problem, and text categorization is the key to solving it: it helps people locate the information they need efficiently and accurately. This thesis introduces the basic procedure of text categorization in detail, including text preprocessing, text representation, feature selection, feature weighting and classification algorithms. It then focuses on feature selection and proposes two new feature selection algorithms.

(1) A feature selection algorithm based on term frequency is proposed. An analysis of existing feature selection algorithms shows that the DF (document frequency), IG (information gain) and MI (mutual information) methods rely almost entirely on document frequency. In fact, term frequency also has a great influence on feature selection, yet few effective methods have so far been proposed from the perspective of term frequency. This thesis therefore proposes a feature selection algorithm based on term frequency, in which term frequency and the inter-class and intra-class distributions of terms are all considered together. The corresponding results show that the proposed algorithm achieves high performance and is an effective feature selection algorithm.

(2) A feature selection algorithm based on the relative contribution of terms is proposed. The DF, IG and t-Test methods tend to select high-frequency terms as features, and they perform well. The CTD and SCIW methods take category information into account, and they also achieve good accuracy. It follows that both high-frequency terms and the category information of those terms are important factors in improving classification performance. This thesis therefore proposes a feature selection algorithm based on the relative contribution of terms, in which both term frequency and the relative contribution of terms are taken into account. The experimental results show that the proposed algorithm achieves good classification performance in terms of precision, recall, macro-F1 and accuracy.
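
The distinction drawn above between document frequency and term frequency can be illustrated with a small sketch. This is not the algorithm proposed in the thesis, whose scoring formulas are not given in the abstract; it only shows, on a made-up toy corpus, how the two counts differ and how term frequency can be split by class. The function names, documents and labels are all assumptions introduced for illustration.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df

def term_frequency(docs):
    """Count, for each term, its total occurrences over all documents."""
    tf = Counter()
    for tokens in docs:
        tf.update(tokens)  # every occurrence counts
    return tf

def class_term_frequency(docs, labels):
    """Total term occurrences, kept separately for each class label."""
    per_class = {}
    for tokens, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(tokens)
    return per_class

# Hypothetical toy corpus: already tokenized documents with class labels.
docs = [
    ["goal", "goal", "goal", "match"],
    ["match", "team"],
    ["stock", "market", "match"],
]
labels = ["sport", "sport", "finance"]

df = document_frequency(docs)
tf = term_frequency(docs)
by_class = class_term_frequency(docs, labels)

for term in ("goal", "match"):
    print(term, "DF =", df[term], "TF =", tf[term],
          "TF per class =", {c: counts[term] for c, counts in by_class.items()})
```

In this toy corpus, "goal" occurs in only one document but three times in total, all within a single class, while "match" occurs once in every document of both classes. A criterion based on document frequency alone would rank "match" above "goal", even though "goal" is the more class-specific term, which is the kind of information a term-frequency and class-distribution view can expose.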
Keywords/Search Tags:Text categorization, Feature selection, Term frequency, Relative contribution