Classification Methods For Class-imbalanced Datasets Of Unequal Misclassification Costs And Their Applications

Posted on: 2013-08-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Z Tang
GTID: 1228330374488149
Subject: Control Science and Engineering
Abstract/Summary:
Classification of class-imbalanced datasets is an active topic in machine learning and pattern recognition. Class-imbalanced datasets arise in many real-world engineering domains, such as fault detection in non-ferrous metallurgical processes and network intrusion detection. Most existing classification algorithms aim to minimize the misclassification rate and assume that the classes in the training set are balanced and that misclassification costs are equal. When applied to class-imbalanced problems, these algorithms often over-learn the majority class and under-learn the minority class, which degrades the overall performance of the trained classifiers.

To deal with class imbalance, unequal misclassification costs, noise features and expensive labeling costs, several new and more efficient classification methods for class-imbalanced datasets with unequal misclassification costs are proposed. The efficiency and advantages of the proposed methods are illustrated by simulation case studies on practical datasets from non-ferrous metallurgical processes. The main content and innovative achievements are as follows:

The characteristics of the operational pattern set for the non-ferrous metallurgical process are studied in the second chapter. The problems arising in this process, namely class imbalance, unequal misclassification costs, noise features and expensive labeling costs, are analyzed in detail. The assumptions of operational pattern classification are described. Finally, the flow chart of operational pattern recognition for the non-ferrous metallurgical process is given.

To address class imbalance and unequal misclassification costs in the dataset, a cost-sensitive probabilistic neural network (CS-PNN) is proposed in the third chapter.
The poor minority-class performance on class-imbalanced datasets of the probabilistic neural network based on kernel density estimation, and of the probabilistic neural network based on a Gaussian mixture density function, is analyzed. CS-PNN is then obtained by introducing a cost-sensitive mechanism. The proposed method is applied to classifying the operational pattern set of the copper flash smelting process. Experimental results show that the proposed method increases the recognition rate of the fault class and the excellent class of operational patterns and reduces the average misclassification cost.

To address extreme class imbalance in the dataset, support vector data description (SVDD) using a sliding window and particle swarm optimization is proposed in the fourth chapter. The kernel parameter of SVDD is optimized using particle swarm optimization. The size of the training set is controlled by the size of the large sliding window. The test error on the small sliding window is used to adaptively adjust the size of the large window. The proposed method is applied to classifying the operational pattern set of the copper converter smelting process. Experimental results show that the proposed method effectively identifies fault-class operational patterns of the copper converter smelting process.

To address class imbalance and noise features in the dataset, cost-sensitive support vector machines using particle swarm optimization, and cost-sensitive support vector machines with margin calibration using simultaneous optimization, are proposed in the fifth chapter. First, the standard support vector machine, the cost-sensitive support vector machine and the cost-sensitive support vector machine with margin calibration are introduced and compared, and complete solution algorithms for these three methods are given. A continuous version of the particle swarm optimization algorithm is used to optimize the kernel parameter, misclassification cost parameters and margin parameters of the cost-sensitive support vector machines.
The discrete version of particle swarm optimization is used to select features. The proposed method is applied to classifying artificial datasets and operational patterns of the alumina evaporation process. Experimental results show that the proposed method effectively identifies excellent-class and fault-class operational patterns of the alumina evaporation process and selects appropriate features of the operational patterns.

To address class imbalance, unequal misclassification costs and expensive labeling costs in the dataset, a cost-sensitive support vector machine based on uncertainty sampling with a self-training method is proposed in the sixth chapter. The uncertainty of an unlabeled sample is defined. Unlabeled samples with high uncertainty are selected to be labeled. The labeled samples are used to train three cost-sensitive support vector machines. Two of these machines are used to predict the class label of each remaining unlabeled sample. If the two predictions agree, the unlabeled sample is added to the training set. The third cost-sensitive support vector machine is retrained as the final classifier. Probably approximately correct (PAC) learning theory is used to analyze the self-training method. The proposed method is applied to classifying operational patterns of the copper flash smelting process. Experimental results show that the proposed method reduces the labeling cost and the average misclassification cost.
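
The two selection steps of the sixth-chapter method (picking high-uncertainty samples for manual labeling, and accepting a pseudo-label only when two classifiers agree) can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: the abstract does not state how uncertainty is defined, so the sketch assumes it is the closeness of a sample to the decision boundary (small absolute decision value). The names `Scorer`, `label_of`, `most_uncertain` and `agreed_pseudo_labels` are hypothetical, and each classifier is represented only by a function returning a signed decision value.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
# Hypothetical stand-in for a trained cost-sensitive SVM: maps a sample to a
# signed decision value whose sign gives the predicted class.
Scorer = Callable[[Sample], float]


def label_of(score: float) -> int:
    """Map a signed decision value to a class label in {-1, +1}."""
    return 1 if score >= 0 else -1


def most_uncertain(pool: Sequence[Sample], scorer: Scorer, k: int) -> List[int]:
    """Return indices of the k pool samples with the smallest |decision value|.

    Assumption: uncertainty is taken to be closeness to the decision boundary.
    """
    ranked = sorted(range(len(pool)), key=lambda i: abs(scorer(pool[i])))
    return ranked[:k]


def agreed_pseudo_labels(
    pool: Sequence[Sample], scorer_a: Scorer, scorer_b: Scorer
) -> List[Tuple[int, int]]:
    """Return (index, label) pairs for samples on which both classifiers agree."""
    accepted = []
    for i, x in enumerate(pool):
        label_a = label_of(scorer_a(x))
        label_b = label_of(scorer_b(x))
        if label_a == label_b:
            accepted.append((i, label_a))
    return accepted
```

In a full loop, the indices returned by `most_uncertain` would go to a human annotator, the pairs returned by `agreed_pseudo_labels` would be appended to the training set, and a third classifier would then be retrained on the enlarged set.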
Keywords/Search Tags: class-imbalanced problem, cost-sensitive probabilistic neural network, support vector domain description, uncertainty sampling, particle swarm optimization, cost-sensitive support vector machine, non-ferrous metallurgical process