Cost Sensitive Data Mining Based On Support Vector Machines: Theories And Applications

Posted on: 2007-12-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: E H Zheng
Full Text: PDF
GTID: 1118360182490570
Subject: Control Science and Engineering
Abstract/Summary:
Data mining, which emerged in the late 1980s, has made great strides and is expected to continue to flourish. It is the process of extracting hidden knowledge from large volumes of raw data. Interest in data mining theory and applications has grown in recent years because huge amounts of data are widely available and there is a pressing need to turn such data into useful information and knowledge. However, most of the data mining literature ignores every type of cost (unless accuracy is interpreted as a cost measure), even though costs arise in many real-world applications such as medical diagnosis and fraud detection. Algorithms that do not take these costs into account perform poorly in such settings, which motivates cost sensitive data mining. Cost sensitive data mining is defined as the problem of learning, from a given training set, a decision model that minimizes the expected total cost. For example, in medical diagnosis, the cost of erroneously diagnosing a patient as healthy may be much greater than that of mistakenly diagnosing a healthy person as sick, because the former kind of error may result in the loss of a life.
Support vector machines (SVM) are a new class of data mining algorithms motivated by the structural risk minimization (SRM) induction principle of statistical learning theory. SVM capture the main insight of statistical learning theory (to obtain a small risk (cost) function, one needs to control both the training error and the model complexity) and show better generalization ability than data mining algorithms based on the empirical risk minimization (ERM) principle. SVM have proven effective in many practical applications. However, like traditional algorithms, SVM are not cost sensitive.
This thesis proposes several algorithms for cost sensitive data mining based on SVM, with the aim of developing practical and near-optimal algorithms for learning how to solve cost sensitive data mining tasks.
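The definition above (choose the decision that minimizes expected cost) can be made concrete for the two-class medical example with a short worked derivation. The notation is an assumption introduced here for illustration and does not come from the thesis: p is the estimated probability that the patient is sick, C_FN is the cost of calling a sick patient healthy, C_FP is the cost of calling a healthy patient sick, and correct decisions are assumed to cost nothing.

```latex
% Assumed notation (not from the abstract):
%   p    = P(sick | x), C_FN = false-negative cost, C_FP = false-positive cost,
%   correct decisions cost 0.
\begin{align*}
R(\text{predict healthy} \mid x) &= p \, C_{FN}, \\
R(\text{predict sick} \mid x)    &= (1 - p) \, C_{FP}, \\
\text{predict sick} \iff (1 - p)\, C_{FP} \le p \, C_{FN}
  &\iff p \ge \frac{C_{FP}}{C_{FP} + C_{FN}}.
\end{align*}
```

With equal costs the threshold is 1/2, which is the usual accuracy-oriented rule. With purely illustrative costs C_FN = 10 and C_FP = 1, the threshold drops to 1/11 (about 0.09), so a patient is flagged as sick even when the estimated probability of disease is fairly low.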
In detail, the major contributions of this dissertation are as follows:
1. Based on SVM, bounds on the SV number (and rate) and on the BSV number (and rate) are first proposed and proved, and these bounds are then extended to the positive class and the negative class separately. Second, it is shown that the SV rate and BSV rate of the positive class are higher than those of the negative class. Third, it is shown that the positive class yields poorer classification and predictive accuracy than the negative class. Experiments on the German credit and Heart disease data sets support these hypotheses and conclusions.
2. A novel cost sensitive data mining method for support vector machine classifiers with reject cost and unbalanced misclassification cost (SVM-RMC) is proposed based on the SRM principle. In SVM-RMC, the decision function, including the rejection region, is determined by the learning algorithm during the training phase of the classifier. To implement SVM-RMC, a novel formulation of the training problem is developed, together with a specific algorithm to solve it. Experimental results on several artificial and benchmark data sets show that SVM-RMC reduces the total cost and improves classification reliability.
3. A novel general cost sensitive data mining (G-CSDM) algorithm for making an arbitrary classifier cost sensitive is proposed; it wraps probability estimation and a cost minimizing procedure around the classifier. A particular implementation based on SVM, called CS-SVM, is provided. Experimental results on artificial and benchmark data sets show that CS-SVM reduces the total misclassification cost.
4. To overcome the overfitting caused by noise in the training data set, a noise cost model based on the k nearest neighbors (KNN) algorithm in feature space is presented and applied to the SVC and SVR algorithms, yielding an SVC algorithm with noise cost (SVC-NC) and an SVR algorithm with noise cost (SVR-NC). Experimental results show that both SVC-NC and SVR-NC largely reduce the effect of training set noise on the learned model and have better generalization ability.
5. Under some restrictions, the functional equivalence between SVM and a kind of FIS is established, and MBFIS-SRM and MBFIS-SRM-MC are then devised based on the SRM principle. In MBFIS-SRM and MBFIS-SRM-MC, the number of rules and the rule base are generated automatically by the algorithm. Experimental results on a few benchmark data sets show that MBFIS-SRM has better generalization ability and that MBFIS-SRM-MC reduces the average test misclassification cost.
6. Several data mining process models and data mining software packages are introduced, and general data mining patterns and technologies are reviewed. A novel bidirectional feedback data mining process model (BFDM) is proposed, based on an understanding of both data mining process models and the metallurgical process industry. A novel data mining software system, DMP, is implemented, in which several data mining models for the metallurgical process industry are built to solve real industrial problems.
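Contributions 2 and 3 both turn on choosing the decision with the lowest expected cost, optionally allowing a rejection when every class decision is too costly. The following is a minimal sketch of that general idea, not the thesis's SVM-RMC or CS-SVM algorithms: SVM-RMC learns its rejection region during training, whereas this sketch applies a cost rule after the fact to class probabilities that some classifier is assumed to have produced already. The names `REJECT`, `expected_costs`, `min_cost_decision`, and `COST`, as well as all cost values, are hypothetical.

```python
from typing import List, Optional, Sequence

REJECT = -1  # hypothetical label meaning "withhold the decision"


def expected_costs(probs: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> List[float]:
    """Expected cost of predicting each class j, given class probabilities.

    cost[i][j] is the cost of predicting class j when the true class is i.
    """
    n_classes = len(probs)
    return [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]


def min_cost_decision(probs: Sequence[float],
                      cost: Sequence[Sequence[float]],
                      reject_cost: Optional[float] = None) -> int:
    """Return the class with the lowest expected cost, or REJECT if cheaper."""
    costs = expected_costs(probs, cost)
    best = min(range(len(costs)), key=costs.__getitem__)
    if reject_cost is not None and reject_cost < costs[best]:
        return REJECT
    return best


# Illustrative costs only: class 0 = healthy, class 1 = sick.
# Missing a sick patient (true 1, predicted 0) costs 10; a false alarm costs 1.
COST = [[0.0, 1.0],
        [10.0, 0.0]]

if __name__ == "__main__":
    # Healthy is more likely, yet predicting "sick" has the lower expected cost.
    print(min_cost_decision([0.8, 0.2], COST))
    # With a fixed rejection cost of 0.5, the same case is rejected instead.
    print(min_cost_decision([0.8, 0.2], COST, reject_cost=0.5))
```

Because the rule only needs class probabilities and a cost matrix, it can be wrapped around any classifier that outputs probability estimates. An SVM's raw decision values are not probabilities, so a calibration step would be needed first; which calibration the thesis uses is not stated in the abstract.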
Keywords/Search Tags: Data mining, Cost sensitive data mining, Support vector machines