With the rapid development of the national information industry, the amount of Chinese text available online keeps growing. Conventional algorithms cannot meet the demands of current Chinese text classification tasks, such as high dimensionality, large data volume, and the need for readable results. It is therefore necessary to develop a classification system for Chinese text.

A recently proposed method, associative text classification, offers high classification efficiency and good readability, and has given good results when applied to Chinese text categorization. However, because it is based on the confidence-support framework, this method cannot find rules with low confidence and has some inherent defects.

To find useful rules with low support, the correlation between the items of a rule is studied. A correlation rule mining algorithm is proposed that takes the correlation of items into account. This algorithm can find rules with low confidence but high correlation, and has more practical significance than traditional association rules. A new method that uses correlation rules to classify Chinese text is proposed to improve the accuracy and efficiency of classification. Two new algorithms, PCM and NCM, are devised; they use the lower and upper bounds of the Phi correlation coefficient to generate all candidate negative and positive correlation items and to reduce the explosive search space. Negative and positive correlation rules are then mined using a reliability measure.

In accordance with the linguistic characteristics of Chinese words, a prefix-hash-tree data structure is designed to convert Chinese documents into transaction data. An algorithm that classifies Chinese text using the correlation rules is proposed, and a prototype Chinese text classification system is designed to test the algorithm.

In the experiment, the People's Daily corpus is used to test the classifier. The corpus contains 10 categories, including environment, computer, and politics, with 2,815 files in all and a total word count of 17.7 M. The results show that the system is efficient and accurate in both Chinese text transformation and classification.
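
As a rough illustration of the Phi-based pruning described above, the sketch below computes the Phi correlation coefficient of two items from their supports and derives lower and upper bounds from the marginal supports alone. The function names `phi` and `phi_bounds` are hypothetical. The bounds are the standard ones obtained by plugging the smallest and largest possible joint support into the Phi formula; whether PCM and NCM use exactly these bounds is an assumption, since the abstract does not state them.

```python
import math


def phi(supp_a: float, supp_b: float, supp_ab: float) -> float:
    """Phi correlation coefficient of items A and B.

    supp_a, supp_b: fraction of transactions containing A, B.
    supp_ab: fraction of transactions containing both.
    """
    denom = math.sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    if denom == 0:
        # An item present in all or none of the transactions carries
        # no correlation information; returning 0.0 is a convention here.
        return 0.0
    return (supp_ab - supp_a * supp_b) / denom


def phi_bounds(supp_a: float, supp_b: float) -> tuple[float, float]:
    """Lower and upper bounds of Phi given only the marginal supports.

    For fixed marginals, Phi increases with the joint support, so the
    extremes are reached at the smallest and largest feasible joint support.
    """
    min_joint = max(0.0, supp_a + supp_b - 1.0)
    max_joint = min(supp_a, supp_b)
    return phi(supp_a, supp_b, min_joint), phi(supp_a, supp_b, max_joint)


if __name__ == "__main__":
    lower, upper = phi_bounds(0.8, 0.2)
    print(f"phi lies in [{lower:.2f}, {upper:.2f}]")
```

A candidate pair whose upper bound falls below a positive-correlation threshold, or whose lower bound lies above a negative-correlation threshold, can be discarded without counting its joint support, which is one way such bounds shrink the search space.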