Font Size: a A A

Classification Association Rule Induction Algorithm And Applied Research

Posted on:2006-04-21Degree:DoctorType:Dissertation
Country:ChinaCandidate:X Y XuFull Text:PDF
GTID:1118360155459087Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the rapid development of social informationization, the amount of data increases exponentially. Extracting useful knowledge from large amount of data is significantly important in today's information era. Association rule mining is one of the key tasks of data mining. In particular, the class association rule (CAR), which can be used not only for concept description but also for class prediction and decision-making, has an important role in data mining. And the classification based on class association rules has attracted wide attention from academic research community and industry field. Since the first algorithm of Classification Based on Association rules (CBA) was introduced in 1998, the algorithm design of associative classification and its application have been very active. So far, it is widely accepted that, on the whole, associative classification achieves higher accuracy than traditional decision tree algorithm but it has some deflects such as slow speed, huge memory usage and too many generated association rules that make the classification model complex. In this dissertation, the existing induction algorithms of class association rules are anaylized, the definitions related to associative classification and their formal descriptions are given, a new function based on data mining standard for evaluating a rule's equality in associative classification is proposed, a novel idea for mining the main points of knowledge guided by atomic CARs is introduced and a new classifying technology CAAR ( that is, Classification based on Atomic Association Rules) is created. CAAR can throughly solve the problems of associative classification such as low-efficiency, high cost of memory usage and complex classifying model. Our proposed CAAR algorithm is applied to supervised classification of image content and large-scale data mining, which is very effective. The creative points of this dissertation research are described in following five aspects: 1) A confidence-dominated function based on weighted sum of confidence and support for evaluating classifying rule's equality is proposed for the first time. At present, in the field of machine learning, the widely used standard for assessing a classifying rule is based on the product of sensitivity and specificity, which is traditionally called a Gold Standard. But after analysing the sensitivity and specificity by means of rule's confidence and support, we found that the traditional standard is not as good as our new function which is the weighted sum of rule's confidence and support. To verify our proposition in a huge search space of class association rules, the genetic algorithm was used for experiments. The results show that the newly proposed function for rule's equality assessment significantly outperforms the traditional one. 2)A new idea to discover the main points of knowledge (DMPK) is first introduced in this thesis. The technology to mine Main Points of Knowledge (MPKs) can fast discover the non-complete, non-accurate and descriptive class knowledge. MPKs extracted by one-scan of dataset consist of accurate atomic-CARs, non-complete and non-accurate compound-CARs of which the confidence and support can be estimated only with corresponding bounds. For a given compound rule, if the upper bounds of both confidence and support minus their corresponding lower bounds are less than a given constant, the measurements of the rule are definite; if the the lower bounds of both confidence and support are greater than their corresponding thresholds, the existence of the rule is certain. So compound-CARs can be divided into three categories: (1)Certain-Type: those compound rules with certain existence and definite measurements; (2)Semi-Certain-Type: those with certain existence but without definite measurements; (3) Uncertain_Type:those without certain existence. For an uncertain-type rules,a probability is computed to measure the rule's existence. The DMPK method can provide service of exploratory knowledge discovery guided by users'interests. It can fast find out the main points of knowledge for concept description or for partial classification which can be further improved for complete and accurate classification with the help of algorithm design strategy. 3) A novel classification algorithm is created, that is, Classification based on Atomic Association Rules (CAAR). Different from the DMPK approach mentioned above, CAAR only mines atomic CARs (class association rules) for classifier construction which can avoid the "Combination Explosion Effect"of rule generation in associative classification. In CAAR algorithm, partial classification based on atomic CARs and a "Simple First"strategy are used so that the problem of low efficiency of associative classification is throughly solved. Algorithm analysis and a large amount of experimental results show that CAAR is greatly faster than CBA which is considered as a baseline algorithm of associative classification. Moreover, the atomic characteristics of a rule in CAAR classification can effectively reduce the over-fitting of classification model learning and raise the robustness in the situation of real-world application with many missing values of attributes.
Keywords/Search Tags:data mining, machine learning, genetic algorithm, class association rules, classification, main points of knowledge, classification based on atomic association rules, self-adaptive confidence threshold, relative support threshold, large-scale data mining
PDF Full Text Request
Related items