Classification Methods For Class-imbalanced Datasets Of Unequal Misclassification Costs And Their Applications

Posted on:2013-08-21

Degree:Doctor

Type:Dissertation

Country:China

Candidate:M Z Tang

GTID:1228330374488149

Subject:Control Science and Engineering

Abstract/Summary:

Classification of class-imbalanced datasets is an active research topic in machine learning and pattern recognition. Class-imbalanced datasets arise in many real-world engineering domains, such as fault detection for non-ferrous metallurgical processes and network intrusion detection. Most existing classification algorithms aim to minimize the misclassification rate and assume that the classes in the training set are balanced and that misclassification costs are equal. When applied to class-imbalanced problems, these algorithms often over-learn the majority class and under-learn the minority class, which degrades the overall performance of the trained classifiers.

To deal with class imbalance, unequal misclassification costs, noise features and expensive labeling costs, several new and more efficient classification methods for class-imbalanced datasets with unequal misclassification costs are proposed. The efficiency and advantages of the proposed methods are illustrated by simulation case studies on practical datasets from non-ferrous metallurgical processes. The main content and innovative achievements are as follows:

The characteristics of the operational pattern set for non-ferrous metallurgical processes are studied in the second chapter. The problems that arise in these processes, namely class imbalance, unequal misclassification costs, noise features and expensive labeling costs, are analyzed in detail. The assumptions of operational pattern classification are described. Finally, the flow chart of operational pattern recognition for non-ferrous metallurgical processes is given.

To address class imbalance and unequal misclassification costs in the dataset, a cost-sensitive probabilistic neural network (CS-PNN) is proposed in the third chapter.
The poor minority-class performance on class-imbalanced datasets of probabilistic neural networks based on kernel density estimation and on Gaussian mixture density functions is analyzed. CS-PNN is then obtained by introducing a cost-sensitive mechanism. The proposed method is applied to classifying the operational pattern set of the copper flash smelting process. Experimental results show that the proposed method increases the recognition rate of the fault class and the excellent class of operational patterns and reduces the average misclassification cost.

To address extreme class imbalance in the dataset, support vector data description (SVDD) using a sliding window and particle swarm optimization is proposed in the fourth chapter. The kernel parameter of SVDD is optimized using particle swarm optimization. The size of the training set is controlled by the size of the large sliding window, and the test error on the small sliding window is used to adjust the size of the large window adaptively. The proposed method is applied to classifying the operational pattern set of the copper converter smelting process. Experimental results show that the proposed method effectively identifies the fault-class operational patterns of the copper converter smelting process.

To address class imbalance and noise features in the dataset, cost-sensitive support vector machines using particle swarm optimization and cost-sensitive support vector machines with margin calibration using simultaneous optimization are proposed in the fifth chapter. First, the standard support vector machine, the cost-sensitive support vector machine and the cost-sensitive support vector machine with margin calibration are introduced and compared, and complete solution algorithms for the three methods are given. A continuous version of the particle swarm optimization algorithm is used to optimize the kernel parameter, misclassification cost parameters and margin parameters of the cost-sensitive support vector machines.
A discrete version of particle swarm optimization is used to select features. The proposed method is applied to classifying artificial datasets and the operational patterns of the alumina evaporation process. Experimental results show that the proposed method effectively identifies the excellent-class and fault-class operational patterns of the alumina evaporation process and selects appropriate features of the operational patterns.

To address class imbalance, unequal misclassification costs and expensive labeling costs in the dataset, a cost-sensitive support vector machine based on uncertainty sampling with a self-training method is proposed in the sixth chapter. The uncertainty of an unlabeled sample is defined, and unlabeled samples with high uncertainty are selected to be labeled. The labeled samples are used to train three cost-sensitive support vector machines. Two of them are used to predict the class label of each unlabeled sample; if their predictions are consistent, the unlabeled sample is added to the training set. The third cost-sensitive support vector machine is then retrained as the final classifier. Probably approximately correct (PAC) learning theory is used to analyze the self-training method. The proposed method is applied to classifying the operational patterns of the copper flash smelting process. Experimental results show that the proposed method reduces the labeling cost and the average misclassification cost.
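
As an illustration of the cost-sensitive mechanism described for the third chapter, the following is a minimal Python sketch, not the dissertation's implementation. It assumes a Gaussian Parzen-window density estimate per class and a minimum-expected-cost decision rule; the function names, the smoothing width `sigma`, the class labels and the cost values are all hypothetical choices made for this example.

```python
import math


def parzen_density(x, samples, sigma):
    """Gaussian Parzen-window estimate of p(x | class) from one class's samples."""
    d = len(x)
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    total = 0.0
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / (len(samples) * norm)


def cost_sensitive_pnn_predict(x, train, cost, sigma=0.5):
    """Return the label with the minimum expected misclassification cost.

    train: dict mapping label -> list of feature vectors
    cost:  dict mapping (true_label, predicted_label) -> cost
    """
    n_total = sum(len(v) for v in train.values())
    # Unnormalised posterior: class prior times class-conditional density.
    score = {
        c: (len(v) / n_total) * parzen_density(x, v, sigma)
        for c, v in train.items()
    }
    # Expected cost (risk) of predicting each candidate label.
    risk = {
        pred: sum(cost[(true, pred)] * score[true] for true in train)
        for pred in train
    }
    return min(risk, key=risk.get)


# Hypothetical toy data: 10 majority ("normal") and 2 minority ("fault") samples.
train = {
    "normal": [[i / 10.0] for i in range(10)],  # 0.0, 0.1, ..., 0.9
    "fault": [[2.0], [2.2]],
}
# Assumed cost tables: zero on the diagonal, (true, predicted) off the diagonal.
equal_cost = {
    ("normal", "normal"): 0.0, ("normal", "fault"): 1.0,
    ("fault", "normal"): 1.0, ("fault", "fault"): 0.0,
}
unequal_cost = dict(equal_cost)
unequal_cost[("fault", "normal")] = 10.0  # missing a fault is assumed 10x worse

print(cost_sensitive_pnn_predict([1.4], train, equal_cost))
print(cost_sensitive_pnn_predict([1.4], train, unequal_cost))
```

In this toy set the query point 1.4 receives more kernel mass from the majority class, so equal costs yield `normal`; raising the assumed cost of a missed fault to 10 shifts the decision to `fault`, which is the kind of trade-off between recognition rate and average misclassification cost that the abstract describes.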

Keywords/Search Tags:

class-imbalanced problem, cost-sensitive probabilistic neural network, support vector domain description, uncertainty sampling, particle swarm optimization, cost-sensitive support vector machine, non-ferrous metallurgical process

Related items

1	Research Of Learning Methods On Single-class Support Vector Machine
2	Support Vector Machine Based Classification Algorithms Research For Imbalanced Data
3	Investigation Of Parameters Optimization And Solution Method For Cost-sensitive Support Vector Machine And Its Application
4	Research On Several Problems In Support Vector Machine And Support Vector Domain Description
5	Support Vector Regression Machine Theory And Its Industrial Application
6	Support Vector Data Description And Support Vector Machine And Their Applications
7	Support Vector Machine With Input Uncertainty And Its Application To Bioinformatics
8	Research On ICS Intrusion Detection Methods Based On One Class Support Vector Machine
9	The Parameter Optimization Of Support Vector Machine Based On Improved Particle Swarm Optimization And Its Application
10	Parameters Optimization Of Support Vector Machine Based On Improved Quantum Particle Swarm Optimization Algorithm