
Hubness-based Measure For High-dimensional And Imbalanced Data Classification

Posted on: 2018-03-20    Degree: Master    Type: Thesis
Country: China    Candidate: J M Li    Full Text: PDF
GTID: 2348330542459884    Subject: Software engineering
Abstract/Summary:
With the development of data acquisition technology and the wide application of the Internet, more and more application data exhibit a high-dimensional trend. Compared with low-dimensional data, high-dimensional data pose serious challenges for many machine learning methods and tasks. In addition, much real-world high-dimensional data is also imbalanced, as in text classification, biomedicine, and image classification. At present, studies of the k-nearest-neighbor (kNN) algorithm for high-dimensional imbalanced data are rare. We analyze hubness, a phenomenon specific to high-dimensional data that exhibits a characteristic bias under imbalanced class distributions, and study hubness-based algorithms for high-dimensional imbalanced data. The main work comprises the following two aspects.

(1) To address the two challenges of high dimensionality and imbalanced class distribution in high-dimensional imbalanced data, we propose HWNN, a k-nearest-neighbor classification algorithm based on hubness and class weighting. The algorithm uses each sample's k-occurrence distribution as its support for each class during prediction, which reduces the potential negative impact of hubness on kNN classification in high-dimensional imbalanced data. In addition, class weighting assigns minority-class samples a higher weight, increasing the share of the minority class in the k-occurrences of all samples and thereby improving prediction accuracy on minority-class samples. Experiments on 16 imbalanced UCI datasets show that HWNN outperforms several other k-nearest-neighbor classification algorithms.

(2) Building on HWNN, to address the over-weighting and under-weighting caused by its global class weights, we propose HDWNN, a k-nearest-neighbor algorithm based on hubness and dynamic weighting. We take into account how each sample's local environment affects the support it provides to each class, and introduce a dynamic weighting factor based on the relationship between the test sample and the class distribution. For each sample that can provide support, we use conventional kNN to estimate how often that sample acts as a correct neighbor, and use this as a per-class confidence during prediction. This reduces the support given to classes with higher error rates and mitigates both over-weighting and under-weighting. Experiments on 16 imbalanced UCI datasets show that HDWNN outperforms several other k-nearest-neighbor classification algorithms, including HWNN, on the MG and MAUC measures.
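To make the HWNN idea above concrete, the following is a minimal, hypothetical sketch (not the thesis's actual implementation) of a hubness- and class-weighted kNN. The function name `hwnn_predict`, the inverse-frequency class weights, and the smoothing constant are all assumptions for illustration; only the core mechanism (neighbors vote with their per-class k-occurrence profiles, re-weighted toward the minority class) follows the abstract.

```python
import numpy as np


def hwnn_predict(X_train, y_train, X_test, k=5):
    """Sketch of hubness- and class-weighted kNN in the spirit of HWNN.

    For each training sample, count its k-occurrences per class: how often
    it appears among the k nearest neighbors of other training points,
    split by the class of the querying point (leave-one-out). A neighbor
    then votes with this per-class profile rather than its own label, and
    classes are re-weighted by inverse frequency (an assumed scheme) so
    the minority class gains influence.
    """
    n = len(X_train)
    classes = np.unique(y_train)
    # assumed class weights: inverse class frequency favors the minority class
    counts = np.array([(y_train == c).sum() for c in classes], dtype=float)
    class_w = counts.max() / counts

    # per-class k-occurrence counts over the training set (leave-one-out)
    occ = np.zeros((n, len(classes)))
    for i in range(n):
        d = np.linalg.norm(X_train - X_train[i], axis=1)
        d[i] = np.inf  # exclude the query point itself
        nbrs = np.argsort(d)[:k]
        ci = int(np.searchsorted(classes, y_train[i]))
        occ[nbrs, ci] += 1  # these neighbors each gained one occurrence

    # small smoothing (assumed constant), then class weighting
    support = (occ + 0.1) * class_w

    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - np.asarray(x), axis=1)
        nbrs = np.argsort(d)[:k]
        # sum the neighbors' weighted k-occurrence profiles per class
        preds.append(classes[np.argmax(support[nbrs].sum(axis=0))])
    return np.array(preds)
```

On a toy imbalanced two-cluster problem, a query near either cluster is assigned to that cluster's class; the class weighting keeps the 8-sample minority cluster from being drowned out by the 40-sample majority.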
Keywords/Search Tags: Hubness, high-dimensional and imbalanced data distribution, curse of dimensionality, data classification, k-nearest neighbor, k-occur