
A Study On Key Issues Of Automated Text Categorization For Chinese Documents

Posted on: 2005-05-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D J Xue
Full Text: PDF
GTID: 1118360152968077
Subject: Computer Science and Technology
Abstract/Summary:
1. Constructed a large-scale document set consisting of 71,674 Chinese documents in 55 categories. With a multinomial Bayesian classifier, examined the performance of Chinese character unigram and bigram features in text categorization. The F1-measure of bigram features is 7.7% higher than that of unigram features, and also exceeds that of combined indexing with both unigrams and bigrams.

2. Proposed the concepts of categorization capacity and representation capacity of features. Categorization capacity ensures the classifier distinguishes documents among categories, while representation capacity ensures the categorization is carried out with respect to document content. According to their contributions to categorization, classified features into three types: strong information features, weak information features, and irrelevant features. Proposed a new feature selection criterion that strikes a good balance between the categorization capacity and representation capacity of features. With 70,000 selected features, its F1 is better than those of two baseline criteria by 3.1% and 5.8%, respectively.

3. Found that a large number of highly overlapped bigrams and highly biased bigrams exist in the bigram feature set. Put forward a novel feature reduction method that raises δ-degree overlapped bigrams into their corresponding trigrams. Proposed two methods to tackle the highly biased bigrams: one deletes the σ-degree biased bigrams from the feature set directly, while the other degrades them to unigrams through their important characters. After feature selection, these methods further achieve a feature-reduction aggressivity of 6.2%, and of 11% without sacrificing performance. By integrating four dimension-reduction methods (including tf), brought forward a multi-step reduction strategy in which the last two steps together gain a 26% aggressivity without performance loss.

4. Attacked the problem of feature weighting in two opposite directions. The first is the sophisticated direction: proposed two novel feature weighting methods that combine several statistics, one of which is brought forward in the thesis as well. On the 70,000-feature set selected with the proposed criterion, their F1 is 5.7% higher than that of a well-known weighting scheme, and 3% higher on a feature set of the same size formed with another criterion. The second is the simple direction: put forward the binary weighting method BW, which depends on large feature sets. To deal with the categorization uncertainty that BW constantly faces on small feature sets, the thesis further proposed BW with numeric weighting smoothing (BW+NWS). This weighting scheme improves categorization performance considerably and is independent of the complexity of numeric weighting methods. On the 70,000-feature set formed with the proposed criterion, BW+NWS reaches an impressive F1 of 97.7%, 16.6% higher than the baseline weighting scheme.

5. Investigated the performance of Chinese word features in text categorization in detail, and compared them systematically with character bigram features. The conclusions drawn for bigram features fit Chinese word features as well. Bigram features outperform word features because they share the advantages of both Chinese word indexing and character n-gram indexing, and are statistically well qualified.
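The character unigram/bigram indexing and binary weighting (BW) ideas above can be sketched briefly. This is a minimal illustration, not the thesis's implementation: the sample document and toy vocabulary are assumptions, and the real feature sets contain on the order of 70,000 entries.

```python
from collections import Counter

def char_ngrams(text, n):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

doc = "文本分类研究"  # hypothetical sample: "text categorization research"

unigrams = Counter(char_ngrams(doc, 1))  # single-character features
bigrams = Counter(char_ngrams(doc, 2))   # two-character features

# Binary weighting (BW): represent a document by feature presence alone.
# The vocabulary here is a toy assumption, not a selected feature set.
vocab = sorted(set(char_ngrams(doc, 2)) | {"中文", "分词"})
bw_vector = [1 if g in bigrams else 0 for g in vocab]
```

Bigram indexing needs no word segmentation, which is why it inherits the advantages of both word indexing and character n-gram indexing for Chinese text.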
Keywords/Search Tags: Chinese text categorization, bigram feature, word feature, feature selection, feature weighting