Chinese Text Categorization Based On Correlation Rules Mining

Posted on:2008-02-09

Degree:Master

Type:Thesis

Country:China

Candidate:T T Zheng

Full Text:PDF

GTID:2178360272468105

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of China's information industry, more and more Chinese text is available on the net. Conventional algorithms cannot meet the demands of current Chinese text classification tasks, such as high dimensionality, large data volume, and the need for readable results. It is therefore necessary to develop a classification system for Chinese text.

A recently proposed method, associative text classification, offers high classification efficiency and good readability, and has given good results on Chinese text categorization. However, because it is built on the support-confidence framework, this method cannot find rules with low confidence and has some inherent defects.

To find useful rules with low support, the correlation between the items of a rule is studied. A correlation rule mining algorithm is proposed that takes the correlation of items into account. This algorithm can find rules with low confidence but high correlation, and these rules have more practical significance than traditional association rules. A new method that uses correlation rules to classify Chinese text is proposed to improve the correctness and efficiency of classification. Two new algorithms, PCM and NCM, are devised; they use the lower and upper bounds of the Phi correlation coefficient to generate all candidate negatively and positively correlated itemsets and to reduce the explosive search space. Negative and positive correlation rules are then mined using a rule reliability measure.

In accordance with the linguistic characteristics of Chinese words, a prefix-hash-tree data structure is designed to convert Chinese documents into transaction data. An algorithm that classifies Chinese text using the correlation rules is proposed, and a prototype Chinese text classification system is designed to test it.

In the experiment, the People's Daily corpus is used to test the classifier. The corpus covers 10 categories, including environment, computer, and politics, with 2815 files in all and a total of 17.7 M words. The results show that the system is quite efficient and accurate in Chinese text transformation and classification.
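
The abstract does not give formulas, so the following is only an illustrative sketch of the standard Phi correlation coefficient for two items, computed from their supports, together with the lowest and highest value Phi can take once the two individual supports are fixed. The function names (`phi`, `phi_bounds`, `supports`) and the toy transactions are assumptions made for this example; they are not the PCM or NCM algorithms of the thesis.

```python
import math


def phi(supp_a, supp_b, supp_ab):
    """Standard Phi coefficient of two binary items, given their supports."""
    denom = math.sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    if denom == 0:
        raise ValueError("supports must lie strictly between 0 and 1")
    return (supp_ab - supp_a * supp_b) / denom


def phi_bounds(supp_a, supp_b):
    """Lower and upper bound of Phi when only the item supports are known.

    The joint support can be no smaller than max(0, supp_a + supp_b - 1)
    and no larger than min(supp_a, supp_b); Phi grows with the joint
    support, so plugging in these extremes gives the two bounds.
    """
    lowest_joint = max(0.0, supp_a + supp_b - 1.0)
    highest_joint = min(supp_a, supp_b)
    return (phi(supp_a, supp_b, lowest_joint),
            phi(supp_a, supp_b, highest_joint))


def supports(transactions, a, b):
    """Supports of item a, item b, and the pair, over a list of word sets."""
    n = len(transactions)
    supp_a = sum(1 for t in transactions if a in t) / n
    supp_b = sum(1 for t in transactions if b in t) / n
    supp_ab = sum(1 for t in transactions if a in t and b in t) / n
    return supp_a, supp_b, supp_ab


# Hypothetical toy data: each document is reduced to a set of words.
docs = [{"stock", "market"}, {"stock", "market"}, {"river"}, {"river"}]
s_a, s_b, s_ab = supports(docs, "stock", "market")
value = phi(s_a, s_b, s_ab)
low, high = phi_bounds(s_a, s_b)
```

Because the bounds depend only on single-item supports, a pair whose upper bound is already below a positive threshold (or whose lower bound is above a negative one) can be discarded without counting its joint support, which is the general kind of pruning the abstract attributes to the bounds.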

Keywords/Search Tags:

Chinese text classification, Correlation rules, Phi correlation coefficient, Rule reliability measure, prefix-hash-tree

Related items

1	Research On Correlation Rules Mining Algorithm Based On Matrix
2	Decision Tree Algorithm Based On Correlation Feature Weighting Choose Academic Relationship Classification Rule Extraction
3	Research On A Rule-Based Approach To Network Security Event Correlation
4	Text Classification Algorithm Based On Attributes Correlation
5	Differentially Private Decision Tree Based On Pearson’s Correlation Coefficient
6	Research On The Technologies Of Association Rules
7	Research On The Sequential Pattern Mining Algorithms Using Prefix-tree Structure
8	Associated With Technology-based Chinese Text Classification
9	Research And Implementation Of Feature Selection In Chinese Text Classification
10	Communications Network Alarm Correlation Rules Mining Method