Research On Class Imbalance Based On Spark

Posted on: 2021-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: W J Zhu
Full Text: PDF
GTID: 2428330614460431
Subject: Computer technology
Abstract/Summary:
With the development of information technology, large amounts of data from all walks of life are being stored and accumulated, and society has entered an era of information explosion. Knowledge is power in the big data world; however, extracting rules from this data is a major challenge for traditional machine learning algorithms. On the one hand, the complexity of the data, such as class imbalance, makes analysis more difficult. On the other hand, these algorithms cannot meet the scalability requirements of distributed platforms. To address these problems, this thesis makes the following contributions:

(1) A cost-sensitive C4.5 decision tree ensemble algorithm for class imbalance based on Spark (referred to below as CSCES) is proposed. All positive class examples are aggregated and broadcast to each partition using Spark's broadcast mechanism, so that the class imbalance ratio decreases in every partition. A C4.5 decision tree is then trained in every partition in parallel. In each iteration, misclassified and correctly classified samples are assigned different costs in order to improve the accuracy of the next round of classification. Finally, the sub-classifiers trained in the partitions are integrated into the final classifier. Experiments show that the algorithm is effective, efficient, and scalable.

(2) Because in some datasets the positive class samples are too few or the class imbalance ratio is too large, the SMOTE and Tomek Link algorithms are implemented on the Spark platform. SMOTE increases the number of positive class samples to enrich the decision region. The Tomek Link algorithm deletes pairs of samples from different classes that are closest to each other in feature space, reducing the overfitting risk caused by class overlap. The experimental results show that sampling with SMOTE and Tomek Link improves the performance of CSCES.

(3) Preprocessing of credit card fraud data and training of the CSCES model are completed on the Spark platform, and good results are obtained.
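
The partition rebalancing step in (1) can be illustrated with a minimal sketch in plain Python rather than Spark. It is not the thesis's implementation: the function names `imbalance_ratio` and `rebalance_partitions`, the toy data, and the round-robin split of negative samples are all assumptions made for illustration, and the `positives` list merely stands in for a Spark broadcast variable.

```python
from typing import List, Tuple

# (features, label); label 1 marks the positive (minority) class.
Sample = Tuple[List[float], int]


def imbalance_ratio(samples: List[Sample]) -> float:
    """Negatives per positive; infinite when no positive is present."""
    positives = sum(1 for _, label in samples if label == 1)
    negatives = len(samples) - positives
    return negatives / positives if positives else float("inf")


def rebalance_partitions(samples: List[Sample], num_partitions: int) -> List[List[Sample]]:
    """Give every partition a full copy of the positive class.

    The `positives` list plays the role of the broadcast variable. Spreading
    negatives round-robin is an assumption made for this sketch; on a cluster
    each partition would hold whatever the partitioner assigned to it.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] != 1]
    partitions: List[List[Sample]] = [[] for _ in range(num_partitions)]
    for index, sample in enumerate(negatives):
        partitions[index % num_partitions].append(sample)
    # Each partition now holds its share of negatives plus all positives.
    return [positives + part for part in partitions]


# Toy data (hypothetical): 1000 negatives and 10 positives.
data: List[Sample] = (
    [([float(i)], 0) for i in range(1000)] + [([float(i)], 1) for i in range(10)]
)
parts = rebalance_partitions(data, num_partitions=10)
print(imbalance_ratio(data))                # ratio over the whole dataset
print([imbalance_ratio(p) for p in parts])  # ratio inside each partition
```

With this toy data the ratio falls from 100 negatives per positive over the whole dataset to 10 per positive inside each of the 10 partitions, which is the effect the abstract attributes to broadcasting the positive class. A decision tree trained per partition would then see a much less skewed sample.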
Keywords/Search Tags: class imbalance, Spark, C4.5 decision tree, ensemble learning