Research On The Re-sampling Technology Of Data Mining For High-dimensional Imbalanced Dataset

Posted on:2017-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

Full Text:PDF

GTID:2348330482484843

Subject:Software engineering

Abstract/Summary:

Imbalanced datasets are widespread in data mining. A dataset is imbalanced when the number of samples differs greatly between classes, and because of this gap standard classification algorithms perform poorly. High dimensionality is another unavoidable problem in data mining: the way datasets are gathered and processed produces many attributes, and a large number of attributes may carry only a small amount of information, which complicates mining. Together, imbalance and high dimensionality hinder data analysis and knowledge discovery, so research on such datasets has received increasing attention. With the rapid development of computer technology, classification based on data mining and machine learning has become a means of fast decision-making, accurate judgment, and effective support for enterprises and organizations. Imbalanced datasets with high-dimensional attributes appear widely in computer science, bioinformatics, economics, and other application fields. For imbalanced data, the minority class is usually the one of interest; for high-dimensional data, the concern is the interference that many low-information attributes cause in machine learning. Handling such datasets is therefore particularly important. This thesis first introduces imbalanced datasets and high-dimensional attributes and reviews the progress made by researchers in China and abroad. It then describes the influence of high-dimensional imbalanced datasets on data mining, the commonly used processing methods, and the standard evaluation metrics for classification performance on imbalanced datasets.
A cluster-boundary re-sampling strategy based on the DBSCAN algorithm combined with a support vector machine (SVM) can effectively address data imbalance, but it performs poorly when the imbalanced dataset has too many attributes. A dimensionality reduction method based on the signal-to-noise ratio is therefore added to the strategy to handle high-dimensional imbalanced data. In addition, the SOM algorithm and a sampling method based on data generation are introduced; combining the two principles yields a novel SOM-based re-sampling method for imbalanced datasets, and a strategy based on Relief and SOM is put forward. Finally, the two strategies are applied to DNA microarray data, an application of data mining with practical significance. The experiments verify that the two strategies deal with high-dimensional imbalanced data effectively and that they are feasible for processing DNA microarray data.
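
The abstract does not give the formula behind its signal-to-noise dimensionality reduction. As a rough illustration only, the sketch below assumes the definition commonly used for gene ranking in two-class microarray data: for each attribute, the absolute difference of the class means divided by the sum of the class standard deviations. The function names `snr_scores` and `select_top_k`, the use of the population standard deviation, and the handling of zero spread are all assumptions made for this example, not details taken from the thesis.

```python
from statistics import mean, pstdev


def snr_scores(X, y, positive=1):
    """Score each attribute by |mean_pos - mean_neg| / (sd_pos + sd_neg).

    X is a list of rows (lists of floats); y holds one class label per row.
    This formula is an assumed, commonly used definition, not the thesis's own.
    """
    pos = [row for row, label in zip(X, y) if label == positive]
    neg = [row for row, label in zip(X, y) if label != positive]
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row in pos]
        b = [row[j] for row in neg]
        diff = abs(mean(a) - mean(b))
        spread = pstdev(a) + pstdev(b)
        if spread > 0:
            scores.append(diff / spread)
        else:
            # Assumption: with zero spread in both classes, a nonzero mean gap
            # separates the classes perfectly; otherwise the attribute is useless.
            scores.append(float("inf") if diff > 0 else 0.0)
    return scores


def select_top_k(X, y, k, positive=1):
    """Keep the k highest-scoring attributes, preserving their column order."""
    scores = snr_scores(X, y, positive)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in X]
    return keep, reduced
```

Calling `select_top_k(X, y, k)` returns the indices of the retained attributes together with the reduced rows, which could then be passed to a re-sampling step. Reducing the attributes first keeps distance-based steps such as clustering from being dominated by low-information attributes.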

Keywords/Search Tags:

imbalanced data, high dimension, re-sampling, self-organizing map

Related items

1	A Hubness-aware Ensemble Learning Algorithm For High-dimensional Imbalanced Data Classification
2	Research On Hybrid Sampling Of Imbalanced Data Based On Data Distribution
3	Research On Classification Method Of High-dimensional Class-imbalanced Data Sets Base On SVM
4	Research On Imbalanced Dataset Classification Algorithm Based On Sampling
5	Research On Approach For Classification Of Imbalanced Data Sets With High Density
6	Data Distribution-driven Adaptive Hybrid Sampling Method For Imbalanced Data Processing
7	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
8	The Research Of Imbalanced Data Based On Oversampling Technique
9	Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application
10	Research And Application Of Imbalanced Data Processing Algorithm