
Research On Parallel Random Forest And Fuzzy C-Means Algorithm For Imbalanced Data

Posted on: 2020-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Duan
Full Text: PDF
GTID: 2428330620462263
Subject: Information and Communication Engineering
Abstract/Summary:
Imbalanced data is now common in big data analysis and machine learning tasks. Fundamental machine learning algorithms are designed for balanced data, and their precision decreases dramatically when the input is imbalanced. How to improve ordinary algorithms and extend their scope of application to imbalanced data is therefore a question worth studying. Apache Spark, a parallel computing framework for big data, is widely used in the Internet industry at present, and parallelized machine learning algorithms are more practical than serial ones. Different algorithms need different improvements to become applicable to imbalanced data. Two representative algorithms, random forest for classification tasks and fuzzy c-means for clustering tasks, are discussed and improved with different strategies. The main work of this thesis is as follows:

(1) An improved random forest algorithm is proposed to overcome the disadvantages of existing ones: high complexity, low parallelism and low scalability. By applying the synthetic minority over-sampling technique (SMOTE), the improved random forest ensures that the imbalance within each sample is relieved, whereas the imbalance is aggravated in the traditional algorithm. The simple voting method of the ordinary random forest is replaced by weighted voting, which takes the accuracy on out-of-bag (OOB) samples into consideration and increases the influence of minority samples. The binning strategy of the Apache Spark Machine Learning Library is followed, and multi-way decision trees take the place of binary decision trees. Experiments in a Spark environment show that, with imbalanced input data, the improved algorithm is superior to other algorithms in time efficiency, recall and F1 score. The improved algorithm also scales well.

(2) Existing fuzzy c-means (FCM) algorithms for imbalanced data require heuristic knowledge and have low accuracy on non-spherical data. A parallelized kernel fuzzy c-means (KFCM) algorithm is therefore proposed to overcome these disadvantages. The improved algorithm uses a two-phase clustering strategy. First, the data in each partition are clustered by KFCM; the resulting center points differ from partition to partition. Next, the center points from all partitions are collected on one machine and clustered by weighted kernel fuzzy c-means (wKFCM) to recover precision. Experiments on artificial data sets in a Spark environment demonstrate that the improved algorithm is well suited to clustering imbalanced, non-spherical data. The experimental results also show that, while remaining scalable, the improved algorithm based on the two-phase clustering strategy is more accurate than traditional algorithms.

(3) To address the weakness of a single classification algorithm in processing imbalanced data, a hybrid framework combining clustering and classification algorithms is proposed. The framework overcomes most of the intrinsic characteristics of imbalanced data, and experimental results demonstrate that it achieves higher accuracy than the previous approach.
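The two-phase clustering strategy in item (2) can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: the abstract does not give the kernel, the update rules or the weighting scheme, so the Gaussian kernel, the standard KFCM update with kernel-induced distance 1 - K(x, v), the use of each local center's total membership mass as its weight in the second phase, and all function names (`weighted_kfcm`, `two_phase_kfcm`) are assumptions made for illustration. Spark partitions are simulated here by splitting a NumPy array.

```python
import numpy as np


def gaussian_kernel(X, V, sigma):
    """Return K with K[i, k] = exp(-||x_k - v_i||^2 / (2 * sigma^2))."""
    sq_dist = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def memberships(X, V, m, sigma):
    """Fuzzy memberships from the kernel-induced distance 1 - K(x, v)."""
    dist = np.maximum(1.0 - gaussian_kernel(X, V, sigma), 1e-12)
    inv = dist ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)  # each column sums to 1


def weighted_kfcm(X, c, weights=None, m=2.0, sigma=2.0,
                  n_iter=100, tol=1e-6, seed=0):
    """Kernel fuzzy c-means; unit weights give plain KFCM (assumed form)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    V = X[rng.choice(n, size=c, replace=False)]  # initial centers from data
    for _ in range(n_iter):
        U = memberships(X, V, m, sigma)
        # Each point contributes in proportion to weight * u^m * K(x, v).
        coef = w[None, :] * U ** m * gaussian_kernel(X, V, sigma)
        V_new = (coef @ X) / coef.sum(axis=1, keepdims=True)
        converged = np.abs(V_new - V).max() < tol
        V = V_new
        if converged:
            break
    return V, memberships(X, V, m, sigma)


def two_phase_kfcm(partitions, c_local, c_global, **kwargs):
    """Phase 1: KFCM per partition. Phase 2: wKFCM on the collected centers."""
    local_centers, masses = [], []
    for part in partitions:
        V, U = weighted_kfcm(part, c_local, **kwargs)
        local_centers.append(V)
        masses.append(U.sum(axis=1))  # assumed weight: membership mass
    return weighted_kfcm(np.vstack(local_centers), c_global,
                         weights=np.concatenate(masses), **kwargs)


# Hypothetical imbalanced data: 400 majority points and 40 minority points.
rng = np.random.default_rng(1)
majority = rng.normal(0.0, 1.0, size=(400, 2))
minority = rng.normal(6.0, 1.0, size=(40, 2))
data = rng.permutation(np.vstack([majority, minority]))
partitions = np.array_split(data, 4)  # stands in for Spark partitions
centers, U = two_phase_kfcm(partitions, c_local=3, c_global=2)
```

Only the small set of local centers reaches the second phase, which is what keeps the collection step cheap. The kernel width `sigma` is a free parameter in this sketch and would need to match the scale of the data.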
Keywords/Search Tags: imbalanced data, random forest, fuzzy c-means, Apache Spark