
Research On Instance-Oriented Classification Performance Evaluation And Credible Classifiers

Posted on: 2022-08-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Yu
GTID: 1488306728982389
Subject: Computer software and theory
Abstract/Summary:
Classification is one of the most important methods in machine learning and data mining. Existing evaluation metrics such as error rate and recall are commonly used to evaluate classifier performance. These metrics aggregate over the whole set of instances, so the characteristics of individual instances are not reflected in the final result. However, some application scenarios care more about a classifier's ability to classify a single unknown event than about a statistical measurement over a large number of events. It is therefore necessary to measure classifier performance from the perspective of the individual event (instance). This dissertation develops a novel instance-oriented classifier evaluation method and theory based on the classification difficulty of instances, aiming to establish evaluation criteria for “good” classifiers (credible classifiers) and to use the insights of the new evaluation model to guide the study of theories and methods for credible classifiers. The specific research content is as follows:

1. An instance-oriented classification performance measure. Performance evaluation is central to classification. Existing evaluation methods are almost all statistical measures taken from the perspective of the classifier, and they ignore the characteristics of each instance. In this dissertation, a criterion for “good” classifiers is proposed from the perspective of “using instance data to evaluate classifiers”. First, the concept of the classification difficulty of an instance, an inherent characteristic of the instance itself, is proposed from a statistical perspective. Then, an instance-oriented evaluation metric called degree of credibility (Cr) is proposed based on the classification difficulty of instances. It is an average measure of the credibility of each instance to a classifier. Cr matches the intuition that the lower the probability of misclassifying relatively easy instances, the more credible the classifier. In addition, based on the new evaluation theory, the concept of an acceptable classifier is proposed to judge whether a trained model and its parameter set reach an excellent level relative to the current state of the art, without comparing against other classifiers one by one. Experimental results show that Cr can effectively measure the credibility of classifiers and is a good complement to the traditional classifier evaluation system, and that the acceptable classifier concept also benefits model selection and model training.

2. A progressive classification algorithm based on the classification difficulty of instances. Most classifiers focus on correctly classifying “hard” instances in order to achieve higher accuracy, even though these “hard” instances may be outliers or noise. Paying excessive attention to them may confuse the classifier and cause overfitting. In fact, the difficulty of instances plays a vital role in improving the generalization and credibility of classification, yet existing classifiers almost always ignore this information. Therefore, the effect of classification difficulty on the base learners is investigated under the ensemble learning framework, forming a progressive learning process from easy to difficult. This yields a novel and credible ensemble learning algorithm based on the classification difficulty of instances, named Boosting with Instance Difficulty Invariance (BIDI). The BIDI algorithm conforms to the cognitive law that easy instances are misclassified with a lower probability than difficult ones. Experimental results demonstrate the generalization and credibility of BIDI on real-world tasks such as spam classification, credit card fraud detection, and medical diagnosis.

3. Credible classification algorithms based on the classification difficulty of instances. Traditional classification algorithms generally treat all instances equally, although it is of great significance to consider classifier performance from the perspective of each instance. For example, most decision tree algorithms assume that all instances in the data set have the same degree of confidence, so they use the same generation and pruning strategies for all training instances. In fact, instances with a greater degree of confidence are more useful than those with a lower degree of confidence in the same data set. The support vector machine (SVM) algorithm likewise contains the implicit assumption that different types of errors have the same cost. However, some practical applications, such as credit card fraud detection, oil spill detection, and cancer diagnosis, usually have the characteristic that “the misclassification costs of different instances are not the same”. Therefore, when training decision tree and SVM models, different instances should be treated differently according to their characteristics. Based on the classification difficulty of instances, this dissertation studies the influence and significance of the degree of confidence on the CART algorithm and of the misclassification cost of instances on the SVM algorithm, and then proposes the C?CART algorithm and the objective-cost-sensitive SVM algorithm (OCS-SVM). Experiments show that both C?CART and OCS-SVM avoid overfitting to a certain extent and significantly improve generalization performance and degree of confidence.

4. Calculation rules for the approximate difficulty of instances. The classification difficulty of instances, defined from a statistical perspective, is related to the current development level of classifier technology. Since the difficulty value of each instance is not explicit but embedded in the instance distribution, computing the classification difficulty of instances is relatively expensive. To make the approach practical, this computational cost must be reduced. Note that the method for calculating the approximate difficulty of each instance is not unique and needs to be chosen according to the specific data distribution. This introduces a new research direction: the computation of the approximate classification difficulty of instances. In this dissertation, two low-cost rules for calculating the approximate classification difficulty of instances are proposed as substitutes for the exact classification difficulty. Experiments show that the two approximate calculation methods can indeed express the information of classification difficulty to a certain extent, while greatly improving the efficiency of calculating the difficulty values.
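The abstract does not give the formulas for classification difficulty or Cr, so the following Python sketch is only an illustration of the general idea, not the dissertation's method. It assumes, hypothetically, that the difficulty of an instance can be approximated by the fraction of a pool of reference classifiers that misclassify it, and it uses a made-up score (`easiness_weighted_score`) that penalizes errors on easy instances more heavily than errors on hard ones. All function names and the toy data are assumptions introduced here.

```python
from typing import List, Sequence


def instance_difficulty(pool_preds: Sequence[Sequence[int]],
                        y_true: Sequence[int]) -> List[float]:
    """Assumed proxy: share of pool classifiers that misclassify each instance."""
    n_models = len(pool_preds)
    return [
        sum(preds[i] != label for preds in pool_preds) / n_models
        for i, label in enumerate(y_true)
    ]


def easiness_weighted_score(preds: Sequence[int],
                            y_true: Sequence[int],
                            difficulty: Sequence[float]) -> float:
    """Hypothetical credibility-style score (not the dissertation's Cr).

    Each error is weighted by the instance's easiness (1 - difficulty),
    so mistakes on easy instances lower the score the most.
    """
    weights = [1.0 - d for d in difficulty]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every instance is maximally hard; nothing to credit
    penalty = sum(w for w, p, y in zip(weights, preds, y_true) if p != y)
    return 1.0 - penalty / total


if __name__ == "__main__":
    y_true = [0, 1, 1, 0]
    # Predictions of four hypothetical reference classifiers.
    pool = [
        [0, 1, 1, 1],
        [0, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
    ]
    difficulty = instance_difficulty(pool, y_true)

    # Two classifiers with the same accuracy (3 of 4 correct):
    wrong_on_hard = [0, 1, 1, 1]  # misses the hardest instance
    wrong_on_easy = [1, 1, 1, 0]  # misses the easiest instance
    print(difficulty)
    print(easiness_weighted_score(wrong_on_hard, y_true, difficulty))
    print(easiness_weighted_score(wrong_on_easy, y_true, difficulty))
```

In this toy setup the two classifiers are indistinguishable by error rate, yet the one that fails on the easy instance receives the lower score, which is the kind of distinction an instance-oriented measure is meant to expose.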
Keywords/Search Tags:Performance measure, Classification difficulty, Model selection, Credible classifier, Generalization