
Text Classification Study Based On Cross Cover Algorithm

Posted on: 2008-08-16, Degree: Master, Type: Thesis
Country: China, Candidate: J B Li
GTID: 2178360215496461, Subject: Computer application technology
Abstract/Summary:
Automatic text classification has attracted growing attention over the last 10 years. It is a technology that assigns unstructured text to one or more predefined categories according to its content. With large quantities of electronic text now prevalent, readers need effective classification to speed up searching and reading. Although many techniques and algorithms are already used in automatic text classification, their effects are far from fully explored, and there is still plenty of room for improvement. In addition, new classification methods remain to be studied in depth. For the automatic classification of Chinese text in particular, comparatively little development work has been done, and well-regarded Chinese text classifiers are fewer still.

The text classifier is vital both to the study of algorithms and to the classification results. Before a learning algorithm and a classification system can be applied to text, the text has to be transformed into a suitable representation that captures its semantic content to some extent. Accordingly, Chinese text classification can be divided into compilation of text data, Chinese word segmentation, dimensionality reduction of the high-dimensional original feature space, selection of classifiers, evaluation of classification results, and so on.

This paper covers the following:
1. An introduction to the relevant concepts of text classification and its existing methods;
2. To extract useful information from the word segmentation results, this paper applies different feature dimensionality reduction methods to them: Mutual Information (MI), Correlation Coefficient, Document Frequency, and Expected Cross Entropy (ECE).
The experiments demonstrate that the Correlation Coefficient method is the most effective, followed by the Expected Cross Entropy (ECE) and Mutual Information (MI) methods, while Document Frequency performs worst.

This paper also carries out experiments comparing the cross cover algorithm with the SVM method as classifiers. The results show that the cross cover algorithm works very well as a classifier for Chinese text, given a suitable dimension and a suitable feature dimensionality reduction method.

This paper has done some work on Chinese text classification, but there is still room for improvement. Further study of Chinese text classification may therefore proceed along the following three lines:
1. The text representation model adopted here is the vector space model, which draws on computational linguistics and uses a concept space in place of the semantic space, taking no account of the meanings of Chinese words. The ICTCLAS word segmentation results, provided by the Institute of Computing Technology of the Chinese Academy of Sciences, are used in the Chinese text classification. Future work can study how to improve the precision of the word segmentation;
2. Improving the cross cover algorithm to raise its classification accuracy;
3. The present classification system is flat. We may consider a hierarchical classification structure, moving classification from a flat scheme to a multi-level one, in order to greatly improve the accuracy of the classification algorithm and speed up text classification.
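
The four feature scoring methods compared in the abstract can be illustrated with a minimal Python sketch. It assumes common textbook definitions (document-level counts, MI and Correlation Coefficient taken as the maximum over categories), which may differ from the exact formulas used in the thesis. The function name `feature_scores` and the toy corpus are hypothetical, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter, defaultdict


def feature_scores(docs, labels):
    """Score every term by DF, MI, Correlation Coefficient and ECE.

    docs   : list of token lists (text assumed to be segmented already)
    labels : list of category names, parallel to docs
    """
    n = len(docs)
    docs_per_class = Counter(labels)
    df = Counter()                      # number of documents containing a term
    df_by_class = defaultdict(Counter)  # same count, split by category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[term][label] += 1

    scores = {}
    for term, n_t in df.items():
        p_t = n_t / n
        mi_values, cc_values, ece_sum = [], [], 0.0
        for label, n_c in docs_per_class.items():
            a = df_by_class[term][label]  # in category, contains term
            b = n_t - a                   # outside category, contains term
            c = n_c - a                   # in category, lacks term
            d = n - n_t - c               # outside category, lacks term
            p_c = n_c / n
            if a > 0:
                ratio = (a / n_t) / p_c   # P(c|t) / P(c)
                mi_values.append(math.log(ratio))
                ece_sum += (a / n_t) * math.log(ratio)
            denom = math.sqrt((a + c) * (b + d) * (a + b) * (c + d))
            if denom > 0:
                cc_values.append(math.sqrt(n) * (a * d - c * b) / denom)
        scores[term] = {
            "df": n_t,
            "mi": max(mi_values),
            "cc": max(cc_values, default=0.0),
            "ece": p_t * ece_sum,
        }
    return scores


# Toy corpus: two categories, two documents each (illustrative data only).
docs = [["ball", "team"], ["ball", "goal"], ["stock", "market"], ["stock", "team"]]
labels = ["sports", "sports", "finance", "finance"]
scores = feature_scores(docs, labels)

# Keep the two highest-scoring terms under the Correlation Coefficient.
top_terms = sorted(scores, key=lambda t: scores[t]["cc"], reverse=True)[:2]
print(top_terms)
```

In this sketch, dimensionality reduction amounts to keeping only the top-ranked terms under the chosen score and building the vector space representation from those terms alone.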
Keywords/Search Tags: Feature Dimension, Text Classification, Cross Cover Algorithm