
Research On Measures Of Geometrical Complexity In Imbalanced Classification Problems And Its Application

Posted on: 2013-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: K Liu
Full Text: PDF
GTID: 2248330371481338
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Classification is one of the key problems in pattern recognition, machine learning and data mining. Many algorithms have been designed for classification tasks: some pre-process the data, while others learn the classifiers. Since statistical learning theory was established in the 1990s, classification algorithms have been studied thoroughly. At the same time, an important and urgent problem has gradually emerged: for a given data set, how can the most appropriate algorithm be selected without resorting to the traditional trial-and-error approach? For imbalanced data, this problem is especially difficult.

Building on an in-depth discussion of T. K. Ho's theory of data complexity, this thesis addresses the problem with the help of classification complexity and measures of data characteristics. A heuristic classification learning framework for imbalanced data is proposed, which outperforms the trial-and-error mechanism. The main contributions are as follows.

First, the state-of-the-art techniques in classification complexity, data characteristic measures and imbalanced data learning are reviewed. It is noted that in the field of classification learning there are few guidelines for choosing a suitable algorithm from the many available (Chapter 1).

Second, based on the geometric complexity of data, a heuristic framework for classification learning is proposed. With the help of the data complexity measures, algorithm selection becomes guided and cost-effective (Chapter 2).

Third, to verify the adaptability of the proposed heuristic framework to imbalanced data, a rigorous statistical experiment is designed and performed. The experimental results demonstrate that the geometric complexity measures of data are seriously affected by the IR (Imbalance Ratio) and cannot be used directly on imbalanced data sets (Chapter 3).

Fourth, by modifying and improving the existing data complexity measures, a set of complexity measures for imbalanced data is proposed. Evaluation experiments are performed on both artificial and real-world data sets. The experimental results show that the proposed measures are invariant to the IR (Chapter 4).

Finally, focusing on imbalanced data classification, the proposed complexity measures are adopted as guidelines for choosing suitable data pre-processing algorithms, such as over-sampling and under-sampling. From the experimental results, several meaningful conclusions are drawn about selecting sampling methods and sampling ratios (Chapter 5).
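As a rough illustration of the quantities the abstract refers to, the sketch below computes an imbalance ratio and one classical geometric complexity measure on toy data. It is not the thesis's implementation and does not reproduce its modified measures. It assumes the IR is the majority class size divided by the minority class size, and it uses the maximum Fisher's discriminant ratio (the F1 measure in Ho and Basu's set) as a representative measure. The function names and the data are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def imbalance_ratio(labels):
    """Assumed definition: largest class size / smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def fisher_ratio(samples, labels):
    """Maximum Fisher's discriminant ratio over features (two classes).

    Per feature: (mean_a - mean_b)^2 / (var_a + var_b). A larger value
    means the classes are easier to separate along that feature.
    """
    class_a, class_b = sorted(set(labels))
    best = 0.0
    for j in range(len(samples[0])):
        a = [x[j] for x, y in zip(samples, labels) if y == class_a]
        b = [x[j] for x, y in zip(samples, labels) if y == class_b]
        gap = (mean(a) - mean(b)) ** 2
        spread = pvariance(a) + pvariance(b)
        if spread == 0:
            # No within-class variance: perfectly separable if means differ.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


# Hypothetical toy data: 2 minority points, 4 majority points, 2 features.
toy_samples = [(0.0, 1.0), (2.0, 3.0), (4.0, 2.0), (6.0, 4.0), (8.0, 2.0), (10.0, 4.0)]
toy_labels = ["min", "min", "maj", "maj", "maj", "maj"]

print("IR:", imbalance_ratio(toy_labels))
print("F1:", fisher_ratio(toy_samples, toy_labels))
```

Recomputing such a measure after over-sampling or under-sampling the same data is one simple way to observe how sensitive it is to the IR.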
Keywords/Search Tags: Classification Complexity Level, Geometrical Data Complexity, Algorithm Selection, Imbalanced Data, AUC, Over-sampling, Under-sampling