
The Abnormal User Identification In Communication Operators Based On Unbalanced Data

Posted on: 2023-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhou
GTID: 2558307100977469
Subject: Applied statistics
Abstract/Summary:
In the "Internet Plus" environment, communication operators attract consumers by issuing benefits or coupons. At the same time, a group of "econnoisseurs" has emerged, causing heavy system load and lost profits for operators. It is therefore necessary to analyze the consumption behavior of such abnormal users. Analyzing mobile user consumption data yields both the behavioral characteristics of abnormal users and a user identification model, which helps operators take blocking measures in time, avoid capital loss and resource occupation, and form an effective early warning system.

The data in this thesis comes from the China Mobile Big Data Application Innovation Competition and includes 433,413 users and 18 variables. First, to deal with missing values, the categorical variables are filled and the continuous variables are discretized. On this basis, a new feature is introduced and all categorical features are encoded. Then, the behavioral characteristics of the econnoisseurs are analyzed. After the mRMR method and the random forest algorithm are used for feature selection, the threshold shifting method is used to reduce the impact of class imbalance. Subsequently, a logistic regression model is established. Through this model, the important behavioral characteristics of this group are identified and identification rules are summarized. Using these rules, the users in the data are preliminarily filtered.

Finally, this thesis establishes an identification model with higher accuracy and precision based on the filtered data. The class-imbalance problem is studied at the data level and at the algorithm level. At the data level, a variety of resampling methods are used, but the results show that model performance does not improve significantly. At the algorithm level, the EasyEnsemble algorithm is used to generate 10 class-balanced subsets of data. After a model is trained on each subset, a combination model is built through the Bagging ensemble algorithm. The classification performance of decision tree, random forest, LightGBM, and their combination models is compared and analyzed. The results show that the LightGBM model and its combination model perform better, with F1 values of 94.86% and 94.62% respectively, and that the LightGBM algorithm requires less training time, which can effectively meet actual business needs.
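The two class-imbalance techniques named in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the thesis's implementation: the function names, the toy label list, and the averaging rule used to combine models are assumptions. The threshold shown is the common threshold-moving rule that replaces 0.5 with the positive-class share of the training data, which the abstract does not confirm as the exact rule used.

```python
import random


def shifted_threshold(n_pos, n_neg):
    """Threshold moving (assumed rule): predict positive when
    p / (1 - p) > n_pos / n_neg, i.e. when p > n_pos / (n_pos + n_neg)."""
    return n_pos / (n_pos + n_neg)


def easy_ensemble_subsets(labels, n_subsets=10, seed=0):
    """Build class-balanced index subsets in the EasyEnsemble style.

    Each subset keeps every minority sample (label 1) and adds an equally
    sized random draw, without replacement, from the majority (label 0).
    Assumes the minority class is no larger than the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]


def average_vote(probabilities, threshold=0.5):
    """Combine per-model probabilities for one sample by simple averaging."""
    return int(sum(probabilities) / len(probabilities) > threshold)


# Toy data (hypothetical): 5 abnormal users among 100.
labels = [1] * 5 + [0] * 95
subsets = easy_ensemble_subsets(labels)
```

In practice, one classifier (for example a LightGBM model) would be trained on each of the 10 subsets, and their predicted probabilities combined per user with something like `average_vote`.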
Keywords/Search Tags: Abnormal user, Imbalanced data, LightGBM, Ensemble learning