
Research On Imbalanced Data Processing Algorithm And Application In IPTV

Posted on: 2019-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: B Zang
Full Text: PDF
GTID: 2428330566495929
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of science and technology, the variety and volume of data have exploded, and data mining and machine learning now play a major role in data analysis and processing. Data mining sits at the intersection of database technology and machine learning: it uses database technology to manage massive data, and machine learning and statistical analysis to analyze it. Data classification is an important research topic within data mining. In practical applications, imbalanced data are common, yet most traditional algorithms assume a balanced class distribution and equal misclassification costs. The theory and application of imbalanced data processing have therefore become a focus of research in machine learning and data mining. This thesis studies imbalanced data processing algorithms from a theoretical perspective and applies them to fault prediction for IPTV users. The work covers three aspects.

First, in imbalanced data the number of minority samples is far smaller than the number of majority samples, yet the KNN algorithm gives minority samples the same weight as majority samples. To address this problem, a method is proposed that increases the weight of minority samples based on their local distribution. Experimental results show that, compared with the original algorithm, the proposed algorithm substantially improves classification performance on imbalanced data.

Second, based on the distribution characteristics of imbalanced datasets, a clustering-based sampling method is proposed to reduce the imbalance so that the training datasets are approximately balanced. Specifically, because the Relief-F algorithm is biased toward majority samples, it is improved to increase the sampling rate of minority samples. K-Means clustering is then carried out using the improved feature weights, and a number of balanced training subsets are constructed by sampling. Experimental results verify the performance advantages of the ensemble algorithm based on feature selection and cluster sampling.

Third, in a real IPTV scenario, the ensemble algorithm based on feature selection and cluster sampling is combined with the improved KNN algorithm based on the local distribution of minority samples to build a fault prediction model for IPTV users. Specifically, the KPI data of the IPTV system are first analyzed and preprocessed, and feature selection is performed; balanced datasets are then formed by the clustering-based sampling method; finally, an improved KNN algorithm based on feature weights is used as the base classifier. Experimental results show that the proposed fault prediction model for IPTV users effectively improves prediction accuracy.
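The idea of giving minority samples a larger vote in KNN can be illustrated with a minimal sketch. The abstract does not specify how the thesis derives weights from the local distribution of minority samples, so the sketch below swaps in a simpler stand-in: inverse class frequency. The function name `weighted_knn_predict`, its parameters, and the toy data are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def weighted_knn_predict(train_X, train_y, query, k=3, class_weights=None):
    """Classify `query` by a class-weighted vote among its k nearest neighbors."""
    if class_weights is None:
        # Assumption: inverse class frequency as a stand-in weighting scheme.
        # The thesis instead uses the local distribution of minority samples.
        counts = Counter(train_y)
        class_weights = {label: len(train_y) / n for label, n in counts.items()}

    # Keep the k training points closest to the query (Euclidean distance).
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]

    # Each neighbor votes with the weight of its class.
    votes = Counter()
    for _, label in neighbors:
        votes[label] += class_weights[label]
    return max(votes, key=votes.get)


# Toy data: eight majority points (label 0) and two minority points (label 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 2), (2, 1), (1, 2),
     (3, 3), (5, 5)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

plain = weighted_knn_predict(X, y, (2.4, 2.4), k=3, class_weights={0: 1.0, 1: 1.0})
weighted = weighted_knn_predict(X, y, (2.4, 2.4), k=3)
```

With equal weights, the two majority neighbors outvote the single minority neighbor; with the stand-in weights, the minority neighbor's vote dominates. How aggressive the weighting should be is a design choice that trades minority recall against false alarms.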
Keywords/Search Tags:Machine Learning, Imbalanced Data, KNN, Ensemble Method, IPTV