
Text Categorization And Feature Dimension Reduction Research

Posted on: 2013-10-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y X Liao    Full Text: PDF
GTID: 1228330395989256    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, especially the popularization of the Internet, the volume of information has grown explosively, and there is a strong need for technology that can organize and manage it efficiently. As a key technology for processing and organizing vast amounts of text data, text classification can resolve information chaos to a great extent. It has very practical significance for the efficient management and effective utilization of information, and has gradually become an important research direction in the field of data mining. Text classification is now widely applied in many fields, such as information filtering, information retrieval, word sense disambiguation, news distribution, mail classification, digital libraries and text databases, and great progress has been made. More and more scholars have committed themselves to research on text classification, and many novel methods and techniques have emerged. However, text classification also faces unprecedented challenges, leaving broad room for further research in both theory and practice.

This thesis first describes the research background, the research significance and the research status of text classification at home and abroad. It then covers the concepts of text classification, text preprocessing, text representation models, feature selection, feature weighting, classification methods and classification performance evaluation. On this basis, the text classifier and feature dimension reduction technology are studied in depth. The main research contents of this thesis are as follows.

(1) A text classifier based on the cloud model (CMTC) is proposed. First, the parameter σ is introduced into the CMTC classifier to solve the problem that the traditional cloud-model-based classifier cannot be used for text classification because of the sparse feature space. The relation between the parameter σ and the classification performance is then analyzed through experiments, and a proper σ value is selected accordingly. Experimental results show that the CMTC classifier handles the imbalanced dataset Reuters10 (a subset of Reuters-21578) better than SVM and KNN; in particular, its maximum Macro_F1 is 5.06% higher than that of KNN and 6.19% higher than that of SVM. When tested on the Fudan Chinese dataset, the classification performance of the CMTC classifier is at least equal to, and sometimes better than, that of the KNN classifier, and is greatly improved compared with that of the SVM classifier.

(2) A feature selection method based on the backward cloud model is proposed. First, the model of each feature in each class is estimated according to the theory of the backward cloud model, and the distinction of each feature between different classes is calculated. Features with larger between-class distinction are selected, and the frequency of the features is also taken into account. The method is simple, with low time and space complexity.
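The abstract does not give the formulas of this method, but the standard backward cloud generator estimates a feature's digital characteristics (Ex, En, He) from samples. The Python sketch below illustrates one plausible reading: estimate Ex for each feature in each class, score a feature by the spread of those per-class expectations, and scale by document frequency. The function names, the max-minus-min distinction measure and the frequency factor are illustrative assumptions, not the thesis's exact definitions.

```python
import numpy as np

def backward_cloud(samples):
    """Estimate the cloud-model digital characteristics (Ex, En, He)
    from 1-D samples with the standard backward cloud generator
    (no certainty degree): Ex = mean, En = sqrt(pi/2)*mean|x - Ex|,
    He = sqrt(max(S^2 - En^2, 0))."""
    x = np.asarray(samples, dtype=float)
    ex = x.mean()
    en = np.sqrt(np.pi / 2.0) * np.abs(x - ex).mean()
    he = np.sqrt(max(x.var(ddof=1) - en ** 2, 0.0))
    return ex, en, he

def feature_distinction(weights, labels):
    """Hypothetical scoring: spread of per-class Ex values for each
    feature, combined with document frequency; higher = more
    class-distinctive.  weights: (n_docs, n_features) term-weight
    matrix, labels: class id per document."""
    classes = np.unique(labels)
    n_features = weights.shape[1]
    ex = np.zeros((classes.size, n_features))
    for ci, c in enumerate(classes):
        rows = weights[labels == c]
        for f in range(n_features):
            ex[ci, f], _, _ = backward_cloud(rows[:, f])
    distinction = ex.max(axis=0) - ex.min(axis=0)   # between-class spread
    doc_freq = (weights > 0).mean(axis=0)           # frequency of the feature
    return distinction * doc_freq

# usage (hypothetical): keep the top-k scoring features
# scores = feature_distinction(X_train, y_train)
# selected = np.argsort(scores)[::-1][:2000]
```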
The experimental results show that the performance of the proposed feature selection method is comparable with that of IG (information gain) and higher than that of WET (weight of evidence) and MI (mutual information).

(3) A strong class-related feature selection method for imbalanced datasets is proposed. First, the four basic information elements that make up traditional feature selection methods are analyzed, and a new measurement of strong class information is proposed. Based on it, a new feature selection method applicable to imbalanced text classification is put forward. The method considers both strong class information and term frequency, which improve the classification performance of the minority classes and the majority classes respectively. The experimental results on Reuters10 show that, when using SVM, the Micro_F1 of the new feature selection method is 2.12% higher than that of IG, 1.91% higher than that of CHI and 1.91% higher than that of DFICF, while the Macro_F1 is 1.21% higher than that of IG, 1.55% higher than that of CHI and 1.14% higher than that of DFICF. When using the Naive Bayes classifier, the Micro_F1 of the new method is 1.08% higher than that of IG, 1.76% higher than that of CHI and 0.79% higher than that of DFICF, and the Macro_F1 is 0.75% higher than that of IG, 2.85% higher than that of CHI and 0.41% higher than that of DFICF.

(4) A feature extraction method based on Sprinkling is proposed. First, both the local weight and the global weight of features are considered. Then the membership degrees of samples are considered, where the membership degree information is defined with a Descending Half Cauchy distribution. Finally, the feature set is augmented with one artificial term corresponding to each class label, and the weight of the artificial term is adjusted to improve classification performance; the influence of this weight on text classification performance is also discussed. The results show that the classification performance of the new method reaches a maximum of 94.22%, which is 1.71% higher than that of the original Sprinkling method, when the weight of the augmented feature is 2. The classification performance of the new method is thus improved to some degree.
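The membership degrees from the Descending Half Cauchy distribution and the local/global weighting are omitted here; the helper below is a minimal Python sketch of only the class-label "sprinkling" step under that simplification. The function name `sprinkle` and the fixed per-document weight are illustrative assumptions (the abstract reports its best result at a weight of 2), not the thesis's exact procedure.

```python
import numpy as np

def sprinkle(X_train, y_train, X_test, weight=2.0):
    """Append one artificial class-label term per class to the feature set.
    Each training document gets `weight` in the column of its own class;
    test documents (whose class is unknown) get zeros in those columns."""
    classes = np.unique(y_train)
    extra_train = np.zeros((X_train.shape[0], classes.size))
    for ci, c in enumerate(classes):
        extra_train[y_train == c, ci] = weight      # sprinkled class term
    extra_test = np.zeros((X_test.shape[0], classes.size))
    return np.hstack([X_train, extra_train]), np.hstack([X_test, extra_test])

# usage (hypothetical): augment, then apply the usual feature
# extraction and classifier on the widened matrices
# Xtr_aug, Xte_aug = sprinkle(Xtr, ytr, Xte, weight=2.0)
```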
Keywords/Search Tags: Text classification, Cloud model, Feature selection, Imbalanced text, Feature extraction