Research And Application Of Imbalanced Data Processing Algorithm

Posted on: 2020-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
GTID: 2428330590979101
Subject: Electronic and communication engineering
Abstract/Summary:
In recent years, with the development of computer science and electronic communication technology, we have entered the era of big data. The explosive growth in the amount and variety of raw data has created an urgent need for data processing technology in all walks of life, and has also provided tremendous opportunities for the development of data mining and machine learning. Traditional algorithms typically assume a balanced class distribution in the data set and an equal cost for every misclassification. In many real-world situations, however, the data to be processed are imbalanced; examples include fingerprint recognition, face recognition, and facial age estimation. Research on classification algorithms for imbalanced data has therefore become a hot topic in machine learning and data mining. This thesis studies imbalanced data processing algorithms, and the work covers the following three aspects.

First, when dealing with imbalanced data, traditional algorithms usually take only the spatial distribution of the data into consideration and ignore spatial distance. To address this shortcoming, a novel ensemble method based on K-means and an improved MaxDistance rule is proposed. The method combines the spatial distribution and spatial distance characteristics of the original data, and transforms a two-class imbalanced problem into a balanced one without losing any useful information or adding any artificial data. Experimental results show that, compared with existing methods for two-class imbalanced data, the proposed method performs better on the same public standard data sets.

Secondly, an under-sampling method that combines feature weighting with clustering is proposed, called the Uscfk algorithm. To improve classification performance on imbalanced data, the method increases the weight of features that have a large impact on the classification result and decreases the weight of features that have a small impact. It can then be used together with the K-means algorithm to sample, from each class, the data most suitable for classification. Specifically, the method optimizes the way feature weights are assigned, so that the samples most conducive to the classification decision are selected. On this basis, a novel classification model for imbalanced data is constructed from the combination of the feature weight assignment method and the clustering method. An experiment on the KEEL data sets verified that the proposed method improves classification performance on imbalanced data.

In the last part of this thesis, the proposed model is tested on the public standard wine data set. Compared with traditional algorithms, the proposed algorithm effectively improves classification accuracy on imbalanced data, and it also performs well when applied to wine classification.
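The idea of combining feature weights with K-means under-sampling can be illustrated with a short sketch. The abstract does not say how Uscfk computes its weights or picks its samples, so everything specific here is an assumption made for illustration: the Fisher-style weight score, the weighted Euclidean distance, the choice of the real sample nearest each cluster centre, and the function names `feature_weights`, `weighted_dist`, and `kmeans_undersample` are all hypothetical and are not taken from the thesis.

```python
import math
import random


def feature_weights(majority, minority):
    """Assumed weighting: a Fisher-style score per feature, normalised to sum to 1.

    Features whose class means differ strongly relative to their variance
    receive a larger weight; this rule is an illustration, not the thesis rule.
    """
    scores = []
    for a, b in zip(zip(*majority), zip(*minority)):  # iterate feature columns
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        var_a = sum((v - mean_a) ** 2 for v in a) / len(a)
        var_b = sum((v - mean_b) ** 2 for v in b) / len(b)
        scores.append((mean_a - mean_b) ** 2 / (var_a + var_b + 1e-12))
    total = sum(scores) or 1.0
    return [s / total for s in scores]


def weighted_dist(x, y, weights):
    """Euclidean distance in which each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, x, y)))


def kmeans_undersample(majority, n_keep, weights, iters=20, seed=0):
    """Cluster the majority class into n_keep groups and keep one real sample
    per non-empty group (the member closest to the group's centre).

    Returns at most n_keep rows, all taken unchanged from `majority`.
    """
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(majority, n_keep)]

    def assign():
        # Assign every majority row to its nearest centroid.
        clusters = [[] for _ in centroids]
        for row in majority:
            nearest = min(range(len(centroids)),
                          key=lambda c: weighted_dist(row, centroids[c], weights))
            clusters[nearest].append(row)
        return clusters

    for _ in range(iters):
        for c, members in enumerate(assign()):
            if members:  # an empty cluster keeps its previous centre
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    kept = []
    for c, members in enumerate(assign()):
        if members:
            kept.append(min(members,
                            key=lambda row: weighted_dist(row, centroids[c], weights)))
    return kept
```

Calling `kmeans_undersample(majority, len(minority), feature_weights(majority, minority))` would reduce the majority class to roughly the size of the minority class using only original samples, so no artificial data are created. A fixed iteration count is used instead of a convergence test to keep the sketch short.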
Keywords/Search Tags: machine learning, imbalanced data, clustering method, ensemble method, sampling method