
Imbalanced Data Mixed Sampling Algorithm And Its Application In Customer Churn Prediction

Posted on: 2022-11-15    Degree: Master    Type: Thesis
Country: China    Candidate: S Ye    Full Text: PDF
GTID: 2518306770971859    Subject: Enterprise Economy
Abstract/Summary:
In the context of economic globalization, competition between domestic and foreign enterprises has become increasingly intense, which raises the cost of acquiring new customers. Reducing the loss of long-term customers can effectively increase a company's profits, so companies are shifting their focus from acquiring new customers to retaining their existing customer base. Accurate churn prediction helps companies identify customers who are likely to churn and thereby reduce losses, and it is therefore considered a marketing priority.

In many industries, however, customer churn is a low-probability event: the number of customers who stay with a company far exceeds the number who leave. Customer churn prediction can therefore be treated as a classification problem on imbalanced data, in which churned customers form the minority class and retained customers form the majority class. Traditional classification algorithms do not handle imbalanced data sets well, so potential churners are not detected accurately. This causes losses for the company, weakens its competitiveness in the industry, and harms its long-term development. An accurate customer churn prediction system is therefore very important for enterprises.

To improve the recognition rate of traditional classification algorithms on the minority class, and thereby improve the results of the customer churn prediction system, this dissertation approaches the problem from the direction of data sampling and proposes three hybrid sampling algorithms. The main research results are as follows:

(1) A hybrid sampling algorithm based on SMOTE is proposed to address the tendency of the classical SMOTE algorithm to marginalize the sample distribution and to introduce noise when synthesizing new samples. The algorithm first divides the training set into a majority-class data set and a minority-class data set according to the sample labels, and then applies a clustering-based undersampling algorithm to the majority-class data set. It uses the number of minority-class samples to compute the value of K, performs K-Means clustering on the majority-class samples, and replaces each majority-class cluster with the samples closest to its cluster center. Because fewer minority-class samples then need to be synthesized, fewer noisy samples are generated. Next, a triangle midline oversampling algorithm is applied to the minority-class samples; it restricts the synthetic samples to the interior of a triangle, which better addresses the marginalization of the sample distribution. Compared with various sampling algorithms, this algorithm achieves better results on a public customer data set.

(2) A hybrid sampling algorithm based on DBSCAN clustering is proposed to address intra-class imbalance among minority-class samples and the poor quality of the majority-class samples retained by undersampling algorithms. The algorithm first applies a data-density-based undersampling algorithm to the majority-class samples so that the retained samples are more informative. It then performs DBSCAN clustering on the minority class to remove noise and outliers and to group the remaining minority-class samples into clusters of different densities. By computing the density of each cluster, it divides the minority-class clusters into dense clusters and sparse clusters, assigning a higher sampling rate to sparse clusters and a lower sampling rate to dense clusters. Synthesizing new samples within the clusters brings the number of samples in different clusters toward balance, which addresses the intra-class imbalance. The experimental results show that the recognition accuracy of this algorithm is higher than that of other sampling algorithms on a public imbalanced customer data set.

(3) A hybrid sampling algorithm using K-nearest neighbors is proposed. The algorithm first applies the safe-region undersampling algorithm proposed in this chapter to the majority-class samples in order to delete noisy data and some uninformative samples. It then computes the Euclidean distance from each minority-class sample to all training samples, finds the K samples closest to it, and uses the classes of these K samples to divide the minority class into a boundary region and a safe region. Because samples at the boundary have a greater effect on classification, they are given a higher sampling ratio. During synthesis, the interpolation strategy proposed in this chapter lets majority-class samples also participate in generating new samples, which addresses the loss of quality in new samples caused by an overly uniform way of synthesizing them. Compared with other sampling algorithms on a public imbalanced customer data set, this algorithm achieves better results.
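The K-nearest-neighbor partition described in (3) can be illustrated with a minimal Python sketch. The abstract does not state the exact rule for deciding which region a sample belongs to, so the rule used here (a minority-class sample is in the boundary region if at least one of its K nearest neighbors belongs to the majority class) is an assumption. The function name `split_minority`, its parameters, and the toy data are all hypothetical and are not taken from the thesis.

```python
import math

def split_minority(X, y, k=5, minority_label=1):
    """Return (boundary, safe): index lists of minority-class samples."""
    boundary, safe = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Euclidean distance from sample i to every other training sample
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Assumed rule: any majority-class neighbour marks a boundary sample
        if any(y[j] != minority_label for j in neighbours):
            boundary.append(i)
        else:
            safe.append(i)
    return boundary, safe

# Toy data: a tight minority cluster, plus one minority point near the majority
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (1.0, 1.0),
     (1.1, 1.0), (1.0, 1.1), (1.2, 1.2), (1.3, 1.0)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0]

boundary, safe = split_minority(X, y, k=3)
print("boundary:", boundary)
print("safe:", safe)
```

In a full pipeline the boundary indices would receive the higher sampling ratio mentioned above. The interpolation strategy itself is not described in the abstract and is not sketched here.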
Keywords/Search Tags:Customer churn prediction, Imbalanced dataset, SMOTE algorithm, Mixed sampling algorithm