
Studies On Some Essential Problems In Automatic Text Categorization

Posted on: 2005-12-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F X Song
Full Text: PDF
GTID: 1118360125953580
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Some essential problems in learning-based automatic text categorization (TC) are studied in this dissertation, and a series of research results have been obtained. The main points are summarized as follows.

Performance evaluation for text categorization

After studying the characteristics of multi-label text categorization, the importance of using performance evaluation metrics properly is emphasized. The strengths and weaknesses of conventional metrics such as the Break-even Point, the F1 value, the recall-precision curve, and the ROC curve are discussed in turn. Two novel metrics, the ROC curve of the Rate of Rejecting the True vs. the Rate of Receiving the Fault and the Risk Balance Value, are proposed in this dissertation; both are easy to calculate and to interpret.

Text representation

It is well known that the performance of a text categorization system is not simply a matter of the learning algorithm; text representation factors are also at work. Five such factors are considered in this dissertation: stop-word removal, word stemming, indexing, weighting, and normalization. Statistical analysis of extensive experimental results shows that normalization always improves classifier performance significantly, while the effects of the other factors are smaller than expected. Contrary to common sense, removing stop words from the vocabulary is not helpful, if not outright harmful.

The character n-gram (CNG) is a language-independent kind of text representation. Due to weaknesses such as heavy data noise, high computational cost, and a tendency to overfit, it is generally believed that CNG cannot compete with the prevailing text representation approach, the bag of words (BOW).
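As a concrete reference for the multi-label evaluation discussed above, the following is a minimal sketch of micro- and macro-averaged F1, one of the conventional metrics mentioned. The function names and label sets are illustrative, not taken from the dissertation's experiments.

```python
# Micro/macro F1 for multi-label categorization: each document carries
# a set of true labels and a set of predicted labels.

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def multilabel_f1(y_true, y_pred, labels):
    # Per-label contingency counts (tp, fp, fn) over all documents.
    counts = {c: [0, 0, 0] for c in labels}
    for true, pred in zip(y_true, y_pred):
        for c in labels:
            if c in pred and c in true:
                counts[c][0] += 1
            elif c in pred:
                counts[c][1] += 1
            elif c in true:
                counts[c][2] += 1
    # Micro: pool counts across labels; macro: average per-label F1.
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    return micro, macro
```

Micro-averaging is dominated by frequent categories, while macro-averaging weights every category equally, which is one reason a single headline number can mislead.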
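The contrast between CNG and BOW indexing can be sketched as follows; the helper names are illustrative, and a real system would add weighting and normalization on top of these raw counts. Note that CNG needs no word segmentation, which is what makes it attractive for Eastern languages.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams (language-independent)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def bag_of_words(text):
    """Count whitespace-delimited tokens (requires segmentation)."""
    return Counter(text.split())
```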
With the development of computer hardware and the advent of high-performance learning algorithms such as support vector machines, and given the difficulty of word segmentation in Eastern languages, it is necessary to reexamine the representation ability of CNG. Statistical analysis of the results of extensive experiments conducted on the benchmark dataset Reuters-21578 shows that CNG is not significantly inferior to BOW.

Improving text representation by means of a feature weight vector (FWV) is a common, naive idea in the text categorization field, but the experimental results on FWV reported in the TC literature are inconsistent. To explain these earlier results, the impact of FWV on the performance of the Bayesian classifier is studied theoretically. Surprisingly, it turns out that FWV cannot improve the performance of the Bayesian classifier.

Text feature selection

In text categorization one is usually confronted with feature spaces of 10,000 dimensions or more, often exceeding the number of available training samples. To make the use of conventional learning methods possible, feature selection is generally indispensable. Five novel feature selection measures are presented in this dissertation: low loss dimensionality reduction (LLDR), relative frequency difference (RFD), Bayesian rule (BR), F-value (FV), and Fisher discriminant (FD). Extensive experimental results show that LLDR and RFD are at least as good as, or even better than, Mutual Information and the Chi-square Statistic, the two best conventional feature selection measures.

Text classifier

The classic linear Fisher discriminant (LFD) for binary classification finds the projection weights that map the sample data onto a line such that, along that line, the within-class variance is minimized while the between-class variance is maximized. Maximization of the Fisher discriminant function becomes an ill-posed problem when the within-class scatter matrix is singular.
Thus, how to deal with the singularity of this matrix becomes one of the basic tasks of linear discriminant analysis (LDA). Unlike conventional LDA, the large margin linear projection (LMLP) classifier presented in this dissertation, which is also rooted in the linear Fisher discriminant, takes full advantage of the singularity of wi...
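Under the standard textbook definitions (not the dissertation's own code), the LFD projection direction and the singularity problem can be sketched as follows. The direction is w ∝ Sw⁻¹(m1 − m2); when there are more features than samples, as is typical in TC, Sw is singular and a remedy such as the pseudo-inverse (or, as proposed here, the LMLP classifier) is needed.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Binary linear Fisher discriminant direction w ~ Sw^-1 (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: summed outer products of centered samples.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # pinv degrades gracefully when Sw is singular (d > n case).
    return np.linalg.pinv(Sw) @ (m1 - m2)
```

Projecting both classes onto w separates their means along a single line, which is exactly the quantity the Fisher criterion maximizes.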
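For reference, the Chi-square Statistic used above as a conventional feature-selection baseline can be sketched in its standard 2x2 contingency-table form; this is the textbook formula, not the dissertation's implementation, and the variable names are illustrative.

```python
# a: docs in the category containing the term
# b: docs outside the category containing the term
# c: docs in the category lacking the term
# d: docs outside the category lacking the term

def chi_square(a, b, c, d):
    """Chi-square score of one term with respect to one category."""
    n = a + b + c + d
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0
```

Terms are then ranked by their maximum (or average) score over all categories, and only the top-scoring terms are kept as features.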
Keywords/Search Tags: pattern recognition, text categorization, evaluation measure, text representation, feature selection, classifier