Research On The Re-sampling Technology Of Data Mining For High-dimensional Imbalanced Dataset

Posted on: 2017-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2348330482484843
Subject: Software engineering
Abstract/Summary:
An imbalanced dataset, in which the number of samples differs greatly between classes, is a common form of data in data mining. Because of this gap, standard classification algorithms perform poorly. High dimensionality is another unavoidable problem in data mining: the way datasets are gathered and processed often yields a large number of attributes that together carry little information, which hampers mining. Class imbalance and high dimensionality both complicate data analysis and knowledge discovery, so research on such datasets has received growing attention. With the rapid development of computer technology, classification based on data mining and machine learning has become a tool that helps enterprises and organizations make fast decisions and accurate judgments. High-dimensional imbalanced datasets appear widely in computer science, bioinformatics, economics, and other application fields. With imbalanced data, the minority class is usually the one of interest; with high-dimensional data, the concern is the interference that many low-information attributes cause in machine learning. Processing such datasets properly is therefore particularly important.

This thesis first introduces imbalanced datasets and high-dimensional attributes and reviews the progress of research on such datasets in China and abroad. It then describes the influence of high-dimensional imbalanced datasets on data mining, the commonly used processing methods, and the standard evaluation metrics of classification performance on imbalanced datasets.
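The abstract does not name the evaluation metrics it uses. As a minimal sketch, assuming the metrics commonly reported for imbalanced classification (sensitivity, specificity, G-mean, and F-measure), the following shows why plain accuracy is misleading on such data. The function name `imbalance_metrics` and the `minority` parameter are illustrative and do not come from the thesis.

```python
import math

def imbalance_metrics(y_true, y_pred, minority=1):
    """Confusion-matrix metrics, treating `minority` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    tn = sum(1 for t, p in pairs if t != minority and p != minority)

    # Guard every ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the minority class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + sensitivity
    f_measure = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": f_measure,
    }

# A classifier that always predicts the majority class on a 95:5 dataset.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
result = imbalance_metrics(y_true, y_pred)
```

In this example the accuracy is high even though no minority sample is detected, while the G-mean and F-measure both fall to zero, which is the behavior that motivates such metrics.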
A cluster-boundary re-sampling strategy based on the DBSCAN algorithm combined with a support vector machine (SVM) can effectively solve the data imbalance problem, but it performs poorly when the imbalanced dataset has too many attributes. Dimensionality reduction based on the signal-to-noise ratio is therefore added to the strategy to handle high-dimensional imbalanced data. In addition, the self-organizing map (SOM) algorithm and sampling methods based on data generation are introduced; combining the two principles yields a novel SOM-based re-sampling method for imbalanced datasets, and a strategy based on Relief and SOM is put forward. Finally, the two strategies are applied to DNA microarray data, where the use of data mining methods is of practical significance. The experiments verify that both strategies deal effectively with high-dimensional imbalanced data and that they are feasible for processing DNA microarray data.
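The abstract does not give the exact signal-to-noise formula. A minimal sketch of the dimensionality-reduction step, assuming the two-class criterion widely used for microarray gene ranking, |mean_a - mean_b| / (std_a + std_b), might look like this. The names `snr_scores` and `top_k_features` are hypothetical, and the thesis may define the ratio differently.

```python
from statistics import mean, stdev

def snr_scores(X, y):
    """Score each attribute by |mean_a - mean_b| / (std_a + std_b).

    Assumes exactly two classes and at least two samples per class
    (statistics.stdev raises StatisticsError otherwise).
    """
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("snr_scores expects exactly two classes")
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == labels[0]]
        b = [row[j] for row, lab in zip(X, y) if lab == labels[1]]
        gap = abs(mean(a) - mean(b))
        spread = stdev(a) + stdev(b)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # Zero spread in both classes: perfectly separating or constant.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores

def top_k_features(X, y, k):
    """Return the indices of the k attributes with the highest score."""
    scores = snr_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

# Attribute 0 separates the classes; attribute 1 has equal class means.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 3.0], [5.0, 2.0], [5.1, 4.0]]
y = [0, 0, 0, 1, 1]
scores = snr_scores(X, y)
selected = top_k_features(X, y, 1)
```

Keeping only the top-ranked attributes before re-sampling is one way to limit the influence of low-information attributes on the later clustering and classification steps.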
Keywords/Search Tags: imbalanced data, high dimension, re-sampling, self-organizing map