Neural Network Based Classification Methods For Imbalanced Datasets

Posted on: 2016-02-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z P Yang
Full Text: PDF
GTID: 1228330467476668
Subject: Computer application technology
Abstract/Summary:
The classification of imbalanced data is one of the most important problems in machine learning and has attracted wide attention from scholars in China and abroad. Imbalanced datasets arise in many real applications, such as gene expression analysis, credit transactions and medical diagnosis. When trained on imbalanced datasets, most existing classifiers neglect the minority classes in order to obtain high overall classification accuracy. The problem that needs to be solved is how to change the data distribution of a dataset and improve the classification accuracy of the minority classes while maintaining overall classification performance.

This thesis analyzes why the performance of traditional classifiers is hindered, compares several existing methods, and puts forward several novel neural-network-based classification methods that address both the datasets and the learning algorithms. The main contributions of this thesis are as follows.

(1) After analyzing the influence of imbalanced datasets on the performance of the back-propagation (BP) algorithm, this thesis presents an active under-sampling algorithm. It automatically removes majority-class samples that are far from the decision boundaries, which reduces the degree of imbalance between classes while keeping the density distribution of the whole training dataset approximately unchanged. The experimental results show that, compared with existing under-sampling methods, the approach effectively improves the accuracy on minority classes while maintaining overall performance.

(2) Traditional sampling methods always lead to class overlap on imbalanced data, and existing data cleaning methods often delete non-noise samples. A borderline noise factor (BNF) is proposed based on outlier detection technology and the sampling method, and a BNF-based data cleaning algorithm is given.
The experimental results show that combining the cleaning algorithm with sampling is effective in correcting class imbalance and overlap and in improving the performance of the BP algorithm.

(3) After studying how imbalanced datasets affect the performance of the extreme learning machine (ELM), a novel QPSO-ELM algorithm is put forward. It uses Quantum-behaved Particle Swarm Optimization (QPSO) to optimize the structure of the ELM, achieves a good balance between the empirical and the structural risk, and adopts G-mean as the fitness function. The experimental results show that the new algorithm achieves good performance with an optimal structure on imbalanced datasets.

(4) When learning from imbalanced datasets, the traditional ELM treats the costs of misclassifying examples of different classes as equal, so the accuracy on minority classes is low. A new ELM algorithm that adopts new weight values is proposed, which is more suitable for imbalanced datasets than the traditional ELM. In addition, because the random selection of input weights and hidden biases usually produces redundant hidden nodes, an ELM tends to increase the structural complexity of the network and harm its scalability. Adaptive pruning algorithms are proposed to remove this hidden-node redundancy, using two pruning criteria: the orthogonal projection distances and the norms of the output vectors of hidden nodes, respectively. The experimental results show that these algorithms are suitable for balanced datasets and have good generalization performance.

(5) Gene expression data is characterized by class imbalance, high dimensionality and a small number of samples. IIC (Information Index to Classification) is adopted as the criterion for selecting genes, and PCA is then used to reduce the dimension. Finally, different methods are applied to real gene expression datasets such as the colon, leukemia, SRBCT and protein datasets.
The experimental results show that these algorithms for imbalanced data can improve the classification accuracy on gene expression datasets.

In sum, this thesis proposes several learning methods, based on balancing the training datasets and on modifying standard learning algorithms, which improve the classification performance of neural networks on imbalanced UCI datasets and gene expression data.
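
As an illustration of two ideas named in contributions (3) and (4), the following minimal Python sketch computes the G-mean (the geometric mean of per-class recalls) and trains a class-weighted ELM. It is not the thesis's algorithm: the abstract does not state which weight values are used, so the inverse-class-frequency weights, the sigmoid activation, the regularization constant `C`, and the function names `g_mean`, `train_weighted_elm` and `predict_elm` are all assumptions made for this example. The QPSO structure search and the pruning criteria are omitted.

```python
import numpy as np


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (0 if any class is fully missed)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(classes)))


def _hidden(X, W, b):
    # Sigmoid hidden layer with fixed random parameters (assumed activation).
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def train_weighted_elm(X, y, n_hidden=20, C=1.0, seed=0):
    """Class-weighted ELM: random hidden layer, weighted ridge solution for beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)

    # Input weights and biases are drawn at random and never trained.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = _hidden(X, W, b)

    # One-vs-rest targets in {-1, +1}.
    T = 2.0 * np.eye(len(classes))[inverse] - 1.0

    # Assumed weighting: each sample is weighted inversely to its class size,
    # so every class contributes equally to the squared-error term.
    w = len(y) / (len(classes) * counts[inverse])

    HtW = H.T * w  # equals H^T diag(w)
    beta = np.linalg.solve(np.eye(n_hidden) / C + HtW @ H, HtW @ T)
    return W, b, beta, classes


def predict_elm(model, X):
    W, b, beta, classes = model
    scores = _hidden(np.asarray(X, dtype=float), W, b) @ beta
    return classes[np.argmax(scores, axis=1)]
```

In a QPSO-ELM style search, a quantity such as `g_mean(y_val, predict_elm(model, X_val))` on held-out data could serve as the fitness of a candidate network structure; how the thesis encodes particles and balances empirical against structural risk is not described in the abstract.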
Keywords/Search Tags: Neural networks, Imbalanced datasets, Classification, Extreme learning machine, Quantum-behaved particle swarm optimization, Hidden-node selection