Research On Unbalanced Dataset Classification Algorithm And Its Application In E-commerce Consumer Purchasing Behavior Prediction

Posted on: 2021-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lin
GTID: 2517306302954299
Subject: Economic statistics
Abstract/Summary:
E-commerce consumer behavior prediction is a specific application of statistical analysis and data mining in Internet big data marketing. One of the main problems currently facing the field is data imbalance. Traditional data mining algorithms take prediction accuracy as the training objective and often lose their usefulness on unbalanced data, especially extremely unbalanced data: the majority class tends to be predicted accurately, while the minority class is predicted poorly. For example, in an extremely unbalanced data set with a positive-to-negative sample ratio of 1:99, a model only needs to label every sample as negative to reach 99% prediction accuracy. Although the overall accuracy is very high, the accuracy on the positive examples is 0%, and the model has no practical value. Solving the data imbalance problem in e-commerce consumer behavior prediction therefore has important practical significance for model performance.

The main work of this paper is to find ways to solve the data imbalance problem in e-commerce consumer behavior prediction. In statistics and data mining, data imbalance is mainly addressed from three directions: (1) rebalancing the data; (2) improving traditional data mining or statistical machine learning algorithms; (3) combining the two. This paper studies the problem mainly from the perspective of point (3).

(1) The e-commerce consumer behavior data is oversampled and rebalanced. The SMOTE algorithm and the improved Banding-Edge SMOTE algorithm are each used to resample the original samples, and the predictive ability of the models trained under the two sampling methods is compared and analyzed. The empirical research shows that the Banding-Edge SMOTE algorithm is more suitable for e-commerce consumer behavior prediction.

(2) This paper introduces direct optimization of the F1score into e-commerce consumer behavior prediction. At present, mainstream algorithms usually have their own objective functions, and the F1score is used only as an evaluation metric during model tuning. This paper improves existing algorithms by introducing F1hingeloss, so that the algorithm optimizes the F1score as a direct target. The prediction performance of the random forest algorithm, the adaptive boosting algorithm, and the modified versions of the two algorithms is compared. The empirical research shows that the random forest algorithm and the adaptive gradient boosting algorithm modified with F1hingeloss perform better on unbalanced e-commerce consumer behavior prediction.

(3) This paper introduces the stacking mechanism into e-commerce consumer behavior classification and prediction. Taking the probability outputs of the models trained and optimized above, this paper proposes a fusion and prediction scheme based on stacking. Experiments compare the predictive performance of the individual models with that of the integrated model obtained by fusing multiple models. The empirical study shows that further integrating the best-performing models improves the predictive performance of the integrated model.

Although some results have been achieved, many areas can still be improved:

(1) The data description and analysis stage does not extract and mine feature variables thoroughly enough. Future work can go further here and provide more effective feature variables for model training and optimization, which helps ensure the stability and effectiveness of the prediction model.

(2) This paper only compares the SMOTE resampling method, which generates new samples from the minority class samples, with the improved Banding-Edge SMOTE sampling method proposed in this paper. Other possible improved resampling methods have not been compared, and a comparative study can be carried out in the future. In addition, as an effective sampling method, Banding-Edge SMOTE can be further combined with the algorithms that optimize the F1score, and the ability of the resulting hybrid algorithm to overcome unbalanced data problems can be examined.

(3) In the study of the random forest and adaptive gradient boosting algorithms improved with F1hingeloss, no experiment on optimal parameter tuning was carried out. In theory, better parameters may exist that would let the model achieve better prediction performance; this work is left for future research. Moreover, F1hingeloss as a loss function plays a more direct and effective role in optimizing the F1score, and further research can explore the classification ability of other classification algorithms after F1hingeloss is integrated into them.

(4) The model integration stage only integrates the few models that were compared and optimized in this paper. The advantage of stacking lies in combining strong models that differ substantially from one another. Future work can therefore study how other models perform on unbalanced e-commerce consumer behavior data and pass the best of them to the stacking stage for fusion, which is expected to further improve the overall prediction performance and stability of the integrated model. This is left for subsequent research.
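
The 1:99 example in the abstract can be checked numerically. The sketch below is an illustration only, not code from the thesis: the function names are hypothetical, and the data is a synthetic set of 1 positive and 99 negative labels scored against a model that predicts "negative" for every sample.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Share of true positives that the model finds (0.0 if there are none)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true, y_pred):
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic 1:99 data set and a model that always answers "negative".
y_true = [1] * 1 + [0] * 99
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("recall on positives:", recall(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```

Accuracy rewards the all-negative model, while recall on the positive class and the F1score both expose that it never finds a buyer, which is why the F1score is the more informative target here.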
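
The resampling step can be illustrated with the basic SMOTE idea of generating a new minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows plain SMOTE only. The Banding-Edge SMOTE variant is not described in this abstract and is not reproduced here, and the function name, parameters, and toy points are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    minority: list of equal-length tuples of floats (needs at least 2 points).
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: four points in the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for point in smote(minority_points, n_new=3, k=2, seed=42):
    print(point)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.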
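
The abstract does not give the definition of F1hingeloss, so the following is only one possible way to turn the F1score into a loss using hinge terms, not the thesis's formulation. It assumes real-valued scores where a score above 0 means "positive", bounds the true-positive count from below and the false-positive count from above with hinge losses, and plugs both bounds into F1 = 2·tp / (|positives| + tp + fp). All names are hypothetical.

```python
def f1_from_scores(y_true, scores):
    """Exact F1 when a sample is predicted positive iff its score > 0."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s > 0)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s > 0)
    n_pos = sum(y_true)
    denom = n_pos + tp + fp  # equals 2*tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hinge_f1_loss(y_true, scores):
    """Assumed surrogate: 1 minus a hinge-based lower bound on F1."""
    n_pos = sum(y_true)
    # For a positive sample, 1 - max(0, 1 - s) never exceeds the 0/1
    # "correctly predicted positive" indicator, so the sum bounds tp from below.
    tp_lower = sum(
        1 - max(0.0, 1 - s) for t, s in zip(y_true, scores) if t == 1
    )
    tp_lower = max(0.0, tp_lower)  # tp is never negative
    # For a negative sample, max(0, 1 + s) is at least the 0/1
    # "wrongly predicted positive" indicator, so the sum bounds fp from above.
    fp_upper = sum(max(0.0, 1 + s) for t, s in zip(y_true, scores) if t == 0)
    denom = n_pos + tp_lower + fp_upper
    f1_lower = 2 * tp_lower / denom if denom else 0.0
    return 1.0 - f1_lower


labels = [1, 1, 0, 0, 0]
scores = [2.0, 0.5, -2.0, -0.5, 0.3]
print("exact F1:", f1_from_scores(labels, scores))
print("surrogate loss:", hinge_f1_loss(labels, scores))
```

Since F1 rises with the true-positive count and falls with the false-positive count, substituting the two bounds yields a value no larger than the exact F1, so lowering this loss pushes the F1score itself upward instead of a proxy such as accuracy.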
Keywords/Search Tags:e-commerce, classification model, machine learning, unbalanced data, re-sampling