The Research Of Feature Selection Method In Text Classification Based On Triple-Play

Posted on: 2014-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: X W Gan
GTID: 2268330422964524
Subject: Software engineering
Abstract/Summary:
With the development of computer technology, the spread of the Internet, and the continued progress of network convergence, more and more information is available, and most of it is in the form of text. New technology makes life richer, but the sheer volume of information also makes it hard for users to pick out what is useful to them. Research on information classification is therefore increasingly important: it addresses the problem of information disorder and lets users locate the information they need accurately and easily.
The central difficulty in automatic text categorization is the high dimensionality of the feature vector space. Feature selection is a common way to reduce the dimension of that space, and the choice of feature selection method has a large effect on classification results. Mutual information and the feature-frequency method TFIDF are two of the better-performing methods. Mutual information measures the degree of association between a feature and a category, while the feature-frequency method measures how often a feature occurs in the text. Both have shortcomings. Mutual information relies too heavily on low-frequency words, so a significant amount of useful information is lost when features are ranked and selected. In addition, when the mutual information value of a feature term is large, other features correlated with that term may be selected as well, which produces redundant features. TFIDF takes into account neither the distribution of a feature across the categories nor its distribution among the texts within a category.
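The two traditional scores described above can be sketched as follows. This is a minimal illustration that assumes the common textbook definitions, MI(t, c) = log(P(t, c) / (P(t) P(c))) estimated from document counts and TFIDF = tf × log(N / df); the abstract does not state which exact formulas the thesis uses, and the function names and toy corpus are hypothetical.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens with one label.
docs = [
    ["football", "match", "goal"],
    ["football", "team", "goal", "goal"],
    ["stock", "market", "football"],
    ["stock", "market", "price"],
]
labels = ["sports", "sports", "finance", "finance"]


def mutual_information(docs, labels, term, category):
    """Assumed textbook form: log(P(t, c) / (P(t) * P(c))) from document counts."""
    n = len(docs)
    # Documents that contain the term AND belong to the category.
    both = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    term_df = sum(1 for d in docs if term in d)       # documents containing the term
    cat_n = sum(1 for y in labels if y == category)   # documents in the category
    if both == 0:
        return float("-inf")  # the term never co-occurs with the category
    return math.log((both * n) / (term_df * cat_n))


def tf_idf(docs, term, doc_index):
    """Assumed textbook form: relative term frequency times log(N / df)."""
    doc = docs[doc_index]
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


if __name__ == "__main__":
    for term in ("match", "goal", "football"):
        print(term, mutual_information(docs, labels, term, "sports"))
    print("tf-idf of 'goal' in document 1:", tf_idf(docs, "goal", 1))
```

In this toy corpus, "match" occurs in a single document yet receives the same mutual information score as "goal" and a higher score than the more widespread "football", which illustrates the low-frequency bias. The `tf_idf` function never looks at `labels`, which illustrates why TFIDF alone ignores how a feature is distributed across categories.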
Earlier researchers have proposed improvements that address these shortcomings of mutual information and feature frequency, and this thesis builds on those improved schemes. It introduces an adjustment factor to compensate for the deficiencies of the mutual information method, and it incorporates the idea of the information entropy of feature frequency into the TFIDF method. The two improved methods are then combined to give a new combined improved feature selection method. Finally, a Chinese text classification system is designed and used in experiments to test the effectiveness and feasibility of the combined method. Feature words are selected with the combined improved method, the traditional mutual information method, and the traditional TFIDF method respectively, and the selected feature words are used for text classification. Recall, precision, and the F1 value serve as the three evaluation criteria for analyzing the classification results. A chart-based comparison shows that the combined improved feature selection method classifies text better than the traditional mutual information and TFIDF methods and is therefore effective.
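The three evaluation criteria named above can be computed per category as in the sketch below. It assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean); the function name and the sample labels are hypothetical and are not results from the thesis.

```python
def precision_recall_f1(y_true, y_pred, category):
    """Per-category precision, recall, and F1 under the standard definitions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == category and p == category)  # correct hits
    fp = sum(1 for t, p in pairs if t != category and p == category)  # false alarms
    fn = sum(1 for t, p in pairs if t == category and p != category)  # misses
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative labels only.
    y_true = ["sports", "sports", "sports", "finance", "finance"]
    y_pred = ["sports", "sports", "finance", "sports", "finance"]
    print(precision_recall_f1(y_true, y_pred, "sports"))
```

Running the same function over the predictions produced with each feature selection method gives the per-category figures that a comparison chart would plot.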
Keywords/Search Tags: Text classification, Feature selection, Mutual information, Feature frequency