Research On Some Problems In Text Classification

Posted on: 2010-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 1118360272995657
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology and the spread of the Internet, large amounts of information can be acquired conveniently and quickly. However, quickly and accurately finding the needed information in this vast ocean of information has become a practical problem, so the organization, management and efficient use of massive information urgently need to be addressed. At present, most information takes the form of text, and using it effectively requires efficient and reasonable classification. Text classification has therefore become a key technology for processing large volumes of text and has gradually become an important research direction in data mining.

Text classification is the process of assigning a given text to one or several predefined classes according to its content. As a key technology for processing and organizing large volumes of text data, it can largely resolve the problem of information disorder and help users accurately locate and route the information they need. It is of practical significance for the efficient management and effective use of information. Text classification has developed considerably and is widely applied in fields such as information filtering, information organization and management, information retrieval, word sense disambiguation, mail classification, news distribution, digital libraries and text databases.

With the extensive application and continuous development of text classification, more and more scholars are devoting their attention to it. Novel text classification methods keep emerging, and new text classification systems appear constantly. At the same time, text classification faces unprecedented challenges.
Text classification must not only make full use of the theory and methods of data mining but also cope with the imprecision and uncertainty of text data sets. Consequently, in both theory and practice, there is considerable room for further research on text classification.

This thesis first introduces the research background and significance of text classification and presents in detail the state of research in China and abroad. It then describes the concept of text classification, the definitions of single-label and multi-label text classification, and the characteristics and process of text classification. On this basis, several problems in text classification are studied. With improving classification performance as the main thread, the thesis analyses in depth the key technologies of text classification, including the text representation model, text preprocessing, feature selection, feature weighting, classification methods and classification performance evaluation. It further proposes novel methods for feature selection, feature weighting and classification. The main research contents are as follows:

(1) To address the high-dimensional feature space and feature redundancy in text classification, this thesis proposes a feature selection method based on maximal marginal relevance. The method selects features according to a formula that combines the χ² statistic with maximal marginal relevance. The χ² statistic is an effective feature selection method and is used to deal with the high-dimensional feature space. Maximal marginal relevance denotes the degree of correlation between the feature under consideration and the features already selected, and it provides a reasonable measure of redundancy among features.
It is used to deal with the problem of feature redundancy. The proposed method can therefore not only select features suitable for text classification but also eliminate many redundant features, which improves the performance of the text classifier. Comparison experiments on CHI, IG and the proposed method were then carried out with Naïve Bayes, Rocchio and kNN classifiers on the Reuters-21578 and OHSCAL standard text data sets. The experimental results show that the proposed feature selection method is more effective than the traditional CHI and IG feature selection methods and that it significantly improves the classification performance of the Naïve Bayes, Rocchio and kNN classifiers. The maximum micro-averaged F1 values of the three classifiers approach or even exceed the micro-averaged F1 of an SVM classifier that uses all features directly.

(2) For feature weighting in text classification, the thesis first analyses in detail TF-IDF, the most classical and common feature weight estimation method. TF-IDF considers only the frequency of a feature and its distribution over the whole sample set and does not introduce the available decision information into the feature weight, which limits the improvement in classification accuracy. The thesis then analyses the rough set theory proposed by Pawlak and finds that decision information can be introduced into the feature weight through the concept of quality of approximation of classification. However, computing this quality requires the data in the decision table to be discrete. To avoid the adverse effects of inappropriate discretization, this thesis proposes the concept of feature importance based on real rough set theory.
With this concept, the decision information that a feature carries for classification can be introduced directly into the feature weight, which reflects the importance of the feature to classification more objectively. Based on this concept, a feature weighting method based on feature importance is proposed. Next, comparison experiments on TF-IDF, RW and the proposed feature weighting method were carried out on two standard text data sets, Reuters-21578 Top10 and WebKB. The texts in both data sets were weighted by each of the three methods. Plots of the spatial distribution of the weighted samples show qualitatively that the proposed method makes samples of the same class more compact and samples of different classes more separated. Computing the total within-class scatter and the between-class scatter of the Fisher linear discriminant shows quantitatively that the proposed method decreases the total within-class scatter of the sample set and increases its between-class scatter. These two analyses show that the proposed method significantly improves the spatial distribution of the samples and simplifies the mapping from samples to classes. Finally, the effects of TF-IDF, RW and the proposed feature weighting method were compared on the same two data sets using Naïve Bayes, kNN and SVM classifiers. The experimental results show that the proposed method improves the macro-averaged precision, macro-averaged recall and macro-averaged F1 of classification.

(3) For rule extraction in text classification, this thesis proposes a rule extraction method based on multi-population collaborative optimization.
The method uses information entropy to generate the initial populations. Based on the average information of a feature, the information entropy method computes the probability of inserting that feature into the current rule, which effectively reduces rule extraction time. The method then applies multi-population collaborative optimization to evolve the current population. Multi-population collaborative optimization is inspired by the natural phenomenon in which different species learn from one another's strengths to offset their own weaknesses and coevolve through information exchange. The method consists of several common populations and one excellent population in a computing environment. The common populations compete with one another for computing resources. At the same time, each common population contributes the excellent individuals obtained through evolution to form the excellent population, and can in turn obtain excellent individuals from the excellent population to improve its own quality. Multi-population collaborative optimization thus improves optimization efficiency through mutual competition and the sharing of excellent individuals among populations. Finally, comparison experiments on CN2, FDT, Ant-Miner, LSVM and the proposed rule extraction method were carried out on three standard text data sets, Reuters, 20 Newsgroups and Web. The experimental results show that the proposed method extracts fewer rules and that these rules have high accuracy and short average length. In addition, the proposed method requires little time, so rule extraction is fast.
The proposed rule extraction method is therefore suitable for large-scale data sets.

In summary, this thesis addresses three aspects of text classification, namely feature selection, feature weighting and classification method, and proposes for them, respectively, a feature selection method based on maximal marginal relevance, a feature weighting method based on feature importance and a rule extraction method based on multi-population collaborative optimization. The three methods improve text classification performance to different degrees. The research in this thesis therefore both broadens the research approaches to text classification and contributes to improving its performance.
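
The feature selection idea in item (1) can be illustrated with a minimal sketch. The abstract does not state the thesis's actual formula, so everything below is an assumption made for illustration: the classic maximal marginal relevance form (lam * relevance - (1 - lam) * redundancy), the χ² statistic scaled to [0, 1] as the relevance term, cosine similarity between feature columns as the redundancy term, and the hypothetical names `chi_square_scores`, `mmr_select` and `lam`.

```python
import numpy as np

def chi_square_scores(X, y):
    """Chi-square statistic of each term (presence/absence) against the
    class labels, taking the maximum over classes (one common convention)."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    n = X.shape[0]
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        in_c = (y == label).astype(float)
        a = X.T @ in_c              # docs in the class that contain the term
        b = X.T @ (1 - in_c)        # docs outside the class that contain it
        c = (1 - X).T @ in_c        # docs in the class without the term
        d = (1 - X).T @ (1 - in_c)  # docs outside the class without the term
        num = n * (a * d - b * c) ** 2
        den = (a + c) * (b + d) * (a + b) * (c + d)
        chi = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        scores = np.maximum(scores, chi)
    return scores

def mmr_select(X, y, k, lam=0.5):
    """Greedy selection of k features (assumed MMR form, not the thesis's)."""
    X = np.asarray(X, dtype=float)
    rel = chi_square_scores(X, y)
    if rel.max() > 0:
        rel = rel / rel.max()       # scale relevance to [0, 1]
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0         # avoid division by zero for empty columns
    Xn = X / norms
    sim = Xn.T @ Xn                 # cosine similarity between feature columns
    selected = [int(np.argmax(rel))]
    candidates = set(range(X.shape[1])) - set(selected)
    while candidates and len(selected) < k:
        cand = sorted(candidates)
        # Redundancy of a candidate = its highest similarity to any selected feature.
        redundancy = sim[np.ix_(cand, selected)].max(axis=1)
        mmr = lam * rel[cand] - (1 - lam) * redundancy
        best = cand[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: feature 1 duplicates feature 0, feature 2 is weaker but distinct,
# feature 3 is unrelated to the class.
X_toy = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(mmr_select(X_toy, y_toy, k=2))  # → [0, 2]
```

On this toy data, ranking by χ² alone would keep features 0 and 1, which carry identical information, whereas the redundancy penalty leads the sketch to keep features 0 and 2. The outcome depends on the assumed value of `lam`: a larger value weights relevance more heavily and can bring the duplicate back.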
Keywords/Search Tags: text classification, feature selection, feature weighting, rule extraction, maximal marginal relevance, feature importance, multi-population collaborative optimization