Research On Measures Of Geometrical Complexity In Imbalanced Classification Problems And Its Application

Posted on:2013-06-17

Degree:Master

Type:Thesis

Country:China

Candidate:K Liu

Full Text:PDF

GTID:2248330371481338

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

Classification is one of the key problems in pattern recognition, machine learning and data mining. Many algorithms have been designed for classification tasks: some pre-process the data, while others learn the classifiers. Since statistical learning theory was established in the 1990s, classification algorithms have been studied thoroughly. At the same time, an important and urgent problem has gradually emerged: for a given data set, how can the most appropriate algorithm be selected from the many available, without resorting to the traditional trial-and-error approach? For imbalanced data, this problem is even harder to solve.

Building on an in-depth discussion of T. K. Ho's theory of data complexity, this thesis addresses the above problem with the help of classification complexity and measures of data characteristics. A heuristic classification learning framework for imbalanced data is proposed, which outperforms the trial-and-error mechanism. The main contributions are as follows.

Firstly, the state-of-the-art techniques in classification complexity, data characteristic measures and imbalanced data learning are reviewed. It is noted that, in the field of classification learning, there are few guidelines for choosing a suitable algorithm from the many available (Chapter 1).

Secondly, based on the geometric complexity of data, a heuristic framework for classification learning is proposed. With the help of the data complexity measures, algorithm selection becomes guided and cost-effective (Chapter 2).

Thirdly, to verify the adaptability of the proposed heuristic framework to imbalanced data, a rigorous statistical experiment is designed and performed. The experimental results demonstrate that the geometric complexity measures of data are seriously affected by the IR (Imbalance Ratio) and cannot be used directly on imbalanced data sets (Chapter 3).

Fourthly, by modifying and improving the existing data complexity measures, a set of complexity measures for imbalanced data is proposed. Evaluation experiments are performed on both artificial and real-world data sets. The experimental results show that the proposed measures are invariant to the IR (Chapter 4).

Finally, focusing on imbalanced data classification, the proposed complexity measures are adopted as guidelines for choosing suitable data pre-processing algorithms, such as over-sampling and under-sampling. From the experimental results, several meaningful conclusions are drawn on how to select the sampling methods and sampling ratios (Chapter 5).
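
The abstract does not define the measures it uses, so the following is only an illustrative sketch of the quantities it names, not the thesis's method. It assumes the common definition of the imbalance ratio (largest class size divided by smallest class size), uses the maximum Fisher's discriminant ratio as one example of a geometric complexity measure from the data complexity literature, and shows plain random under-sampling as a stand-in for the sampling step. The function names `imbalance_ratio`, `fisher_ratio` and `random_undersample` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def imbalance_ratio(labels):
    """Assumed definition: largest class size divided by smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def fisher_ratio(X, y, eps=1e-12):
    """Maximum Fisher's discriminant ratio over all features (two classes).

    Per feature: (mean_a - mean_b)^2 / (var_a + var_b).
    A larger value suggests the classes are easier to separate on some feature.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = classes
    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row, lab in zip(X, y) if lab == a]
        col_b = [row[j] for row, lab in zip(X, y) if lab == b]
        num = (mean(col_a) - mean(col_b)) ** 2
        den = pvariance(col_a) + pvariance(col_b)
        best = max(best, num / (den + eps))  # eps guards against zero variance
    return best


def random_undersample(X, y, ratio=1.0, seed=0):
    """Randomly drop samples from larger classes until each has at most
    ratio * (minority size) samples. Assumes ratio >= 1."""
    rng = random.Random(seed)
    counts = Counter(y)
    limit = int(ratio * min(counts.values()))
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        if len(idx) > limit:
            idx = rng.sample(idx, limit)
        keep.extend(idx)
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: four majority samples (class 0) and two minority samples (class 1).
X = [[0.0, 1.0], [1.0, 1.5], [0.5, 0.5], [1.5, 2.0], [5.0, 1.0], [6.0, 1.2]]
y = [0, 0, 0, 0, 1, 1]

ir_before = imbalance_ratio(y)
f1_before = fisher_ratio(X, y)
X_bal, y_bal = random_undersample(X, y, ratio=1.0)
ir_after = imbalance_ratio(y_bal)
f1_after = fisher_ratio(X_bal, y_bal)
```

Comparing `f1_before` with `f1_after` on data sets with different imbalance ratios is one simple way to observe how a conventional complexity measure shifts when the class proportions change, which is the kind of sensitivity the abstract reports for Chapter 3.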

Keywords/Search Tags:

Classification Complexity Level, Geometrical Data Complexity, Algorithm Selection, Imbalanced Data, AUC, Over-sampling, Under-sampling

Related items

1	Analysis On Sampling Complexity Of Association Rule Mining
2	The Effects Of Data Imbalance On The Performance Of Data Complexity Measures
3	Research On Imbalanced Dataset Classification Algorithm Based On Sampling
4	Research On Classification Method Of Imbalanced Data Set Based On Improved Sampling Strategy
5	The Research Of Imbalanced Data Classification
6	The Algorithm Research Of Associative Classification And Classification Based On Imbalanced Data
7	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
8	Imbalanced Data Classification Algorithm Based On Unsupervised Intelligent Under Sampling Method
9	Research On Hybrid Sampling Of Imbalanced Data Based On Data Distribution
10	Research On The Classification Algorithm Of Imbalanced Data Sets