
Research On Potential Home Broadband User Identification Problem With Large Scale Imbalanced Datasets

Posted on: 2020-11-12 | Degree: Master | Type: Thesis
Country: China | Candidate: S Q Lin
GTID: 2428330572979113 | Subject: Electronics and Communications Engineering
Abstract/Summary:
In the era of big data, data mining has become a necessary means of improving the core competitiveness of companies in all industries. In the telecommunications industry, operators with abundant data resources need data mining technology to improve their market competitiveness, for example by capturing the characteristics of target groups to achieve precision marketing. Potential home broadband user identification is a representative precision marketing problem. Operators hope to identify potential home broadband users by analyzing terminal data, but traditional classification algorithms cannot meet practical needs because the classes are imbalanced. This thesis analyzes the problem of potential home broadband user identification and studies imbalanced binary classification on a real dataset provided by a domestic telecommunications operator.

First, the thesis analyzes the characteristics of the dataset and the difficulties of classifying it, and designs a hybrid algorithm based on the needs of the home broadband application scenario. Because the provided datasets are high-dimensional, the robust MEM binary classification model is adopted as the core classifier of the algorithm. The kernel function method is also introduced to construct a kernelized MEM model for non-linear problems. Because the application has strong real-time requirements, the algorithm has two stages, and each stage combines the MEM model with a different method for handling imbalance. In the offline stage, an efficient MEM algorithm is combined with the SMOTE algorithm to build an initial classifier. In the online stage, an online learning framework is proposed in which a different-cost strategy is incorporated into the MEM algorithm to handle imbalanced learning in an online manner, and the initial model is updated with real-time data. In experiments on datasets provided by the operator and on multiple open datasets, the proposed algorithm achieves better results, with improvements in performance metrics such as F1-measure and precision mostly between 5% and 21%.

Second, the thesis proposes a fast non-linear MEM model based on the random Fourier feature method for large-scale imbalanced datasets. Random Fourier features are used to approximate the kernel function in the non-linear MEM model, which yields an explicit feature mapping. In the new feature space the problem can be handled by a linear model, which greatly reduces the computational complexity. Experiments on multiple groups of datasets show that the classification performance of the fast non-linear MEM model is generally slightly lower than that of the non-linear MEM model, by about 4%. However, the fast model has a clear advantage in computing speed: for a dataset of 30000 samples, the acceleration factor can reach 30, and the larger the dataset, the greater the speed advantage.
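
The random Fourier feature idea summarized above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2), and the function names `rff_map` and `rbf_kernel`, the parameter values, and the synthetic data are all hypothetical choices made for illustration. The sketch only shows that inner products of the explicit features approximate the kernel; the MEM classifier itself is not reproduced here.

```python
import numpy as np

def rff_map(X, n_features=2000, gamma=0.05, seed=0):
    """Explicit random Fourier feature map for an assumed RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For exp(-gamma * ||x - y||^2), frequencies are drawn from N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, gamma=0.05):
    """Exact RBF kernel matrix, used only to check the approximation."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

# Synthetic data (hypothetical, not the operator's dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))

Z = rff_map(X)                      # explicit features, shape (200, 2000)
approx_error = np.abs(Z @ Z.T - rbf_kernel(X)).max()
print("max absolute kernel approximation error:", approx_error)
```

Because `Z` is an explicit feature matrix, any linear classifier can be trained on it directly, so the n-by-n kernel matrix never has to be formed. This is one plausible reason the speed advantage would grow with dataset size, as the abstract reports.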
Keywords/Search Tags: Binary Classification, Imbalanced Datasets, Online Learning, MEM, Random Fourier Feature