
A Hubness-aware Ensemble Learning Algorithm For High-dimensional Imbalanced Data Classification

Posted on: 2021-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wu
GTID: 2518306122468844
Subject: Computer technology
Abstract/Summary:
With the continuous development of data collection technology, data collected in practical applications increasingly exhibit large scale, high dimensionality, and class imbalance. Learning from high-dimensional imbalanced data is common in many important applications and poses a severe challenge to traditional data mining and machine learning algorithms. Existing methods usually apply dimensionality reduction to cope with the curse of dimensionality and then apply traditional class-imbalance learning techniques to handle the imbalance. However, dimensionality reduction may discard a large amount of useful information, and losing information about the minority class in imbalanced data makes classification errors more likely. Hubness is an inherent phenomenon of high-dimensional space: some samples appear very frequently (or very rarely) among the k nearest neighbors of other samples. Starting from the hubness phenomenon, this thesis studies the two major problems of high-dimensional imbalanced data, namely the curse of dimensionality and the imbalanced class distribution, and addresses them from a new perspective. The main work comprises the following three aspects:

(1) To address imbalanced learning in high-dimensional space, an ensemble classification algorithm based on hubness and oversampling, HIBoost, is proposed. The algorithm takes into account the hubness aspect of the curse of dimensionality, that is, the existence in high-dimensional space of distinctive points (hubs and anti-hubs) that appear frequently (or rarely) among the k nearest neighbors of other points. For the hubs and anti-hubs that arise in high-dimensional space, the algorithm introduces an influence factor that limits the growth of their weights during weight updating, reducing the risk of overfitting when the member classifiers are trained. For the class imbalance problem, the algorithm uses SMOTE to balance the training data in each iteration, reducing the prediction bias of the member classifiers. The experimental results show that HIBoost outperforms typical ensemble classification algorithms on the main evaluation metrics.

(2) To address the overfitting and running cost of HIBoost when the number of classifiers is large, an ensemble classification algorithm based on hubness and cluster sampling, HUSBoost, is proposed. For the hubs that are ubiquitous in high dimensions, different weight factors are applied to majority-class and minority-class samples during weight updating to slow the excessive growth of their weights, alleviating the negative impact of "bad hubs" on the classification decisions of the member classifiers. To handle the imbalanced class distribution, the algorithm adopts clustering-based undersampling: the majority class is divided into several clusters by the k-hub clustering technique, and representative majority-class samples are then selected from each cluster to form a balanced class distribution. Experiments show that HUSBoost outperforms several typical ensemble algorithms.

(3) Based on the two algorithms above, a lightweight intelligent medical diagnosis prototype system is designed and implemented. The main work includes architecture design, database storage, model training and iteration, and interface encapsulation, and the system is deployed on the WeChat Mini Program platform.
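The hubness notion used throughout the abstract can be made concrete with a small sketch. The code below is an illustration only, not the thesis's implementation: it counts the k-occurrence N_k(x) of each sample, that is, how many other samples list it among their k nearest neighbors. The function names, the Euclidean metric, and the hub threshold (`hub_factor * k`) are assumptions made for this example; the abstract does not specify how hubs and anti-hubs are thresholded.

```python
import math
import random


def k_occurrence(points, k):
    """Return N_k for each point: how many other points have it among
    their k nearest neighbors (Euclidean distance, an assumed metric)."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Distances from point i to every other point (self excluded).
        dists = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        dists.sort()  # ties are broken by the smaller index
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def hubs_and_antihubs(counts, k, hub_factor=2.0):
    """Label hubs and anti-hubs from k-occurrence counts.

    Assumption for illustration: a hub has N_k > hub_factor * k and an
    anti-hub has N_k == 0. The thesis may use a different criterion.
    """
    hubs = [i for i, c in enumerate(counts) if c > hub_factor * k]
    antihubs = [i for i, c in enumerate(counts) if c == 0]
    return hubs, antihubs


if __name__ == "__main__":
    rng = random.Random(0)
    # 200 synthetic samples in 50 dimensions.
    data = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(200)]
    counts = k_occurrence(data, k=5)
    hubs, antihubs = hubs_and_antihubs(counts, k=5)
    print("hubs:", len(hubs), "anti-hubs:", len(antihubs))
```

Because every sample contributes exactly k neighbor slots, the counts always sum to n * k, so the mean k-occurrence is k; hubness shows up as a skewed distribution around that mean. A boosting scheme of the kind described in (1) and (2) could use such counts to decide which samples' weights to damp.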
Keywords/Search Tags: Hubness, Class imbalance, High dimension, Sampling, Ensemble learning