
Research On Chinese Text Classification And Its Applications

Posted on: 2011-12-16, Degree: Master, Type: Thesis
Country: China, Candidate: J Mei, Full Text: PDF
GTID: 2178360302988234, Subject: Computer application technology
Abstract/Summary:
With the development and maturity of information technology, especially Internet-related technology, people can obtain more and more information. Faced with this deluge, people want fast, accurate and comprehensive access to information, yet information is stored in unpredictable ways and therefore appears disorderly. As a key technology for processing and organizing large volumes of text data, text classification can solve the problem of information disorder to a great extent, and it has practical significance for the efficient management and effective use of information. This thesis studies several problems in text classification. With the improvement of classification performance as its main thread, it analyses in depth the key technologies of text classification, including the text representation model, text preprocessing, feature selection, feature weighting, classification methods and classification performance evaluation. It then proposes new methods for feature selection and for feature weighting. The main research contents are as follows:

(1) To address the high-dimensional feature space and the feature redundancy found in text classification, this thesis proposes a two-step feature selection method that combines the ECBF algorithm with a feature selection method based on feature distribution, the latter also proposed in this thesis. The distribution-based method handles the high-dimensional feature space by filtering the feature set in a single pass. The ECBF algorithm gives a reasonable measure of the degree of redundancy among features and is used to handle feature redundancy. The two-step method can therefore both select features suitable for text classification and remove a large number of redundant features, which improves the performance of the text classifier.

(2) To address feature weighting in text classification, the thesis first analyses in detail the TF-IDF method, the most classical and widely used feature weight estimation method. TF-IDF considers only a feature's ability to distinguish one text from another and does not take into account its ability to distinguish one class from another. By analysing the characteristics of Naïve Bayes classification and of the TF-IDF method, this thesis proposes an improved feature weight estimation method that assigns each feature a weight according to its ability to distinguish classes.

In summary, this thesis proposes one method for each of two aspects, feature selection and feature weighting, and these methods improve text classification performance to different degrees.
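The limitation of TF-IDF described in (2) can be seen in a minimal sketch of the classical formula. This is not the thesis's improved method, which the abstract does not specify. The function name `tf_idf`, the toy corpus, and the particular variant (relative term frequency times the natural log of N/df) are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses relative term frequency times log(N / df), one common
    TF-IDF variant (an assumption; the abstract names no variant).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus, made up for illustration. Class labels are never passed in.
docs = [
    ["stock", "market", "price"],
    ["stock", "fund", "price"],
    ["match", "team", "price"],
]
weights = tf_idf(docs)
```

The function takes only the documents, so the weight of a term depends on how it is spread across texts and never on which classes those texts belong to. A term such as "price" that occurs in every document gets weight zero, while "stock" is weighted the same whether or not the two documents containing it share a class.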
Keywords/Search Tags: text classification, feature selection, feature weighting, two-step feature selection, TF-IDF, Naïve Bayes