
Research On Imbalance Data Classification Based On Hybrid Model

Posted on: 2019-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Li
GTID: 2428330566495920
Subject: Signal and Information Processing
Abstract/Summary:
With the advent of the big data era, every industry is continuously generating large amounts of data, and the problems of analyzing and processing it have become prominent. How to extract value from big data has therefore become a focus of both academia and industry. Big data is not only large but often imbalanced: the number of samples in the normal (majority) classes differs greatly from the number in the anomalous (minority) classes. Traditional analysis methods handle imbalanced data poorly. This thesis therefore exploits the strength of mixture models in describing data distributions to study the problem of imbalanced data classification. The main contents are as follows:

(i) A GMM-Naïve Bayes algorithm based on the Gaussian mixture model (GMM) is proposed for imbalanced data classification. The improvement is made at the data-processing level: the main work is the design of an oversampling algorithm based on GMM. In other words, the original minority samples are modeled with a GMM, and the trained GMM is then sampled to obtain new minority-class samples. This addresses the problem that traditional oversampling algorithms do not study the attributes of the sample set in depth, and the new minority samples improve the classification of imbalanced data.

(ii) Using the Gaussian mixture model, an ensemble one-class learning method is proposed that addresses imbalanced data classification at the learning-algorithm level. Specifically, the GMM is combined with the Support Vector Data Description (SVDD) algorithm: the majority-class samples are first clustered by the GMM, an SVDD one-class classifier is then trained on each cluster, and finally the cluster-based one-class classifiers are integrated. The algorithm describes multi-modal, multi-cluster samples more accurately, which improves the classifier's performance on imbalanced data.

(iii) When clustering with a GMM, the number of clusters in the minority-sample distribution must be specified in advance, and the classification results are sensitive to this value. This thesis therefore proposes a sampling algorithm based on the Dirichlet process mixture model (DPMM). First, with the Gaussian-inverse-Wishart distribution as the prior in the Dirichlet process model, the cluster assignments of the minority samples are initialized by the Chinese restaurant process (CRP) and then iteratively updated with the collapsed Gibbs sampling algorithm, which trains a Dirichlet process Gaussian mixture model (DPGMM) that reflects the distribution of the minority data. Finally, the trained DPMM is sampled to obtain new minority samples. In this way, the classification of minority-class samples in imbalanced datasets is optimized.
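The GMM-based oversampling idea (model the minority samples with a GMM, then sample the trained GMM) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes one-dimensional data, a plain EM fit with quantile initialization, and hypothetical function names (`fit_gmm_1d`, `sample_gmm_1d`) chosen only for this example.

```python
import math
import random


def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture to xs with plain EM (illustrative only)."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumption: initialise the means at evenly spaced quantiles of the data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    spread = sum((x - centre) ** 2 for x in xs) / n + 1e-6
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of every component for every sample.
        resp = []
        for x in xs:
            dens = [w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, xs)) / nj + 1e-6
    return weights, means, variances


def sample_gmm_1d(params, n_new, seed=1):
    """Draw n_new synthetic minority samples from the fitted mixture."""
    weights, means, variances = params
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        # Pick a component by its mixing weight, then draw from that Gaussian.
        j = rng.choices(range(len(weights)), weights=weights)[0]
        new_samples.append(rng.gauss(means[j], math.sqrt(variances[j])))
    return new_samples


# Toy minority class with two modes (made-up data for the demonstration).
data_rng = random.Random(42)
minority = ([data_rng.gauss(-3.0, 0.5) for _ in range(30)]
            + [data_rng.gauss(4.0, 0.8) for _ in range(30)])
gmm_params = fit_gmm_1d(minority, k=2)
synthetic = sample_gmm_1d(gmm_params, n_new=200)
print(len(synthetic), "synthetic minority samples drawn")
```

Because the new samples are drawn from the fitted density rather than interpolated between neighbours, they follow both modes of the minority class. In practice the synthetic samples would be appended to the training set before fitting the classifier (Naïve Bayes in the thesis).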
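The cluster-then-describe structure of the ensemble one-class method can also be sketched. The code below is a deliberately simplified stand-in, not SVDD: it assumes the majority samples have already been split into clusters (the thesis uses a GMM for this), and it replaces each SVDD classifier with a hypothetical centroid-and-radius sphere (`fit_sphere`). Only the ensemble rule is illustrated: a sample counts as majority if any per-cluster description accepts it.

```python
import math


def fit_sphere(points, quantile=0.95):
    """Describe one cluster by its centroid and a radius covering most points.

    Simplified stand-in for SVDD, which instead solves an optimisation
    problem for a minimal enclosing sphere (usually in a kernel space).
    """
    dim = len(points[0])
    centre = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    dists = sorted(math.dist(p, centre) for p in points)
    radius = dists[min(len(dists) - 1, int(quantile * len(dists)))]
    return centre, radius


def is_majority(x, spheres):
    """Ensemble rule: accept x if any per-cluster description contains it."""
    return any(math.dist(x, centre) <= radius for centre, radius in spheres)


# Two majority clusters, assumed to come from a prior clustering step.
cluster_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
cluster_b = [(10, 10), (11, 10), (10, 11), (11, 11)]
spheres = [fit_sphere(cluster_a), fit_sphere(cluster_b)]

print(is_majority((0.5, 0.5), spheres))    # inside the first cluster
print(is_majority((5.0, 5.0), spheres))    # in the gap between the clusters
```

A single description fitted to all majority samples at once would have to enclose the gap between the two clusters, so a point such as (5, 5) would be accepted as majority. Describing each cluster separately avoids this, which is the motivation given for the multi-cluster design.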
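The CRP initialization used in the DPMM-based algorithm can be sketched as follows. This shows only the initial cluster assignment: each sample joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The function name `crp_assign` and the parameter name `alpha` are assumptions made for this example, and the subsequent collapsed Gibbs updates with the Gaussian-inverse-Wishart prior are not shown.

```python
import random


def crp_assign(n, alpha, seed=0):
    """Assign n samples to clusters by a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values tend to
    open more clusters. Returns (labels, counts).
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[j] = number of samples currently in cluster j
    for _ in range(n):
        # Existing cluster j has weight counts[j]; a new cluster has weight alpha.
        weights = counts + [alpha]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[j] += 1     # join an existing cluster
        labels.append(j)
    return labels, counts


labels, counts = crp_assign(n=50, alpha=1.0)
print("clusters opened:", len(counts))
print("cluster sizes:", counts)
```

The number of clusters is not fixed in advance but emerges from the process, which is why a Dirichlet process model removes the need to specify the cluster count that a plain GMM requires.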
Keywords/Search Tags:Gaussian Mixture Model, One-class Classification, Support Vector Description, Dirichlet Process Mixture Model, Imbalanced Dataset