
Hubness-based Measure For High-dimensional And Imbalanced Data Classification

Posted on: 2018-03-20    Degree: Master    Type: Thesis
Country: China    Candidate: J M Li    Full Text: PDF
GTID: 2348330542459884    Subject: Software engineering
Abstract/Summary:
With the development of data acquisition technology and the wide application of the Internet, more and more application data exhibit a high-dimensional trend. Compared with low-dimensional data, high-dimensional data pose serious challenges for many machine learning methods and tasks. In addition, much real-world high-dimensional data is also imbalanced, as in text classification, biomedicine, and image classification. At present, studies of the k-nearest-neighbor (kNN) algorithm for high-dimensional imbalanced data are rare. We analyze hubness, a phenomenon specific to high-dimensional data that exhibits a characteristic bias under imbalanced class distributions, and study hubness-based algorithms for high-dimensional imbalanced data. The main work comprises the following two aspects.

(1) To address the two challenges of high dimensionality and imbalanced class distribution in high-dimensional imbalanced data, we propose HWNN, a k-nearest-neighbor classification algorithm based on hubness and class weighting. The algorithm uses each sample's k-occurrence distribution as its support for each class during prediction, which reduces the potential negative impact of hubness on kNN classification in high-dimensional imbalanced data. In addition, class weighting assigns minority-class samples a higher weight, increasing the share of the minority class in the k-occurrences of all samples and thereby improving prediction accuracy on minority-class samples. Experiments on 16 imbalanced UCI datasets show that HWNN outperforms several other k-nearest-neighbor classification algorithms.

(2) Building on HWNN, to address the over-weighting and under-weighting caused by its global class weights, we propose HDWNN, a k-nearest-neighbor algorithm based on hubness and dynamic weighting. We take into account how each sample's local environment affects the support it provides to each class, and introduce a dynamic weighting factor based on the relationship between the test sample and the class distribution. For each sample that can provide support, we use conventional kNN to estimate how often that sample acts as a correct neighbor, and use this as a per-class confidence during prediction. This reduces the support given to classes with higher error rates and mitigates both over-weighting and under-weighting. Experiments on 16 imbalanced UCI datasets show that HDWNN outperforms several other k-nearest-neighbor classification algorithms, including HWNN, on the MG and MAUC measures.
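To make the HWNN idea above concrete, the following is a minimal, hypothetical sketch (not the thesis's actual implementation) of a hubness- and class-weighted kNN. The function name `hwnn_predict`, the inverse-frequency class weights, and the smoothing constant are all assumptions for illustration; only the core mechanism (neighbors vote with their per-class k-occurrence profiles, re-weighted toward the minority class) follows the abstract.

```python
import numpy as np


def hwnn_predict(X_train, y_train, X_test, k=5):
    """Sketch of hubness- and class-weighted kNN in the spirit of HWNN.

    For each training sample, count its k-occurrences per class: how often
    it appears among the k nearest neighbors of other training points,
    split by the class of the querying point (leave-one-out). A neighbor
    then votes with this per-class profile rather than its own label, and
    classes are re-weighted by inverse frequency (an assumed scheme) so
    the minority class gains influence.
    """
    n = len(X_train)
    classes = np.unique(y_train)
    # assumed class weights: inverse class frequency favors the minority class
    counts = np.array([(y_train == c).sum() for c in classes], dtype=float)
    class_w = counts.max() / counts

    # per-class k-occurrence counts over the training set (leave-one-out)
    occ = np.zeros((n, len(classes)))
    for i in range(n):
        d = np.linalg.norm(X_train - X_train[i], axis=1)
        d[i] = np.inf  # exclude the query point itself
        nbrs = np.argsort(d)[:k]
        ci = int(np.searchsorted(classes, y_train[i]))
        occ[nbrs, ci] += 1  # these neighbors each gained one occurrence

    # small smoothing (assumed constant), then class weighting
    support = (occ + 0.1) * class_w

    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - np.asarray(x), axis=1)
        nbrs = np.argsort(d)[:k]
        # sum the neighbors' weighted k-occurrence profiles per class
        preds.append(classes[np.argmax(support[nbrs].sum(axis=0))])
    return np.array(preds)
```

On a toy imbalanced two-cluster problem, a query near either cluster is assigned to that cluster's class; the class weighting keeps the 8-sample minority cluster from being drowned out by the 40-sample majority.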
Keywords/Search Tags: Hubness, high-dimensional and imbalanced data distribution, curse of dimensionality, data classification, k-nearest neighbor, k-occur