The Effects Of Data Imbalance On The Performance Of Data Complexity Measures

Posted on: 2017-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Jia
Full Text: PDF
GTID: 2308330485978401
Subject: Control Science and Engineering
Abstract/Summary:
Classification is the foundation of pattern recognition, machine learning, and data mining, and classification problems are commonplace in daily work and life. In recent years, more and more researchers have taken up data classification, and the existing literature shows a steady stream of new and improved algorithms, covering both data preprocessing and classification learning. Since statistical learning theory was established in the early 1990s, research on classification algorithms has been abundant, but one prominent problem has gradually emerged: when a particular data set must be classified in practice, how should the most appropriate algorithm be selected from the many available? For imbalanced data, this problem is even harder to solve.
Building on an in-depth discussion of T. K. Ho's theory of data complexity, this thesis approaches the problem through classification complexity. It proposes data complexity measures based on geometric statistics theory and information theory and applies them to both simulated and real data sets, obtaining a series of conclusions about how these measures behave on unbalanced data. These conclusions offer important guidance for choosing a classifier for unbalanced real data.
The main contributions are as follows:
First, the thesis reviews Chinese and English references published in recent years, mainly covering classification, geometric complexity, data complexity measures, and unbalanced data classification, and summarizes and analyzes the research status of these topics (Chapter 1).
Second, it discusses data complexity measures and the classification of unbalanced data. Starting from the notion of the degree of data complexity, it explains the relationship between data confusion and complexity. For unbalanced data classification, it presents the research status and discusses how an unbalanced class distribution affects pattern classification (Chapter 2).
Third, to study how well data complexity measures adapt to unbalanced data, it introduces the measures in detail, proposes measures based on geometric statistical theory and information theory, improves and extends these indicators to suit different types of data sets, and compares the improved variants of each indicator to select the best one (Chapter 3).
Fourth, to test the effectiveness of a new learning algorithm or to assess new indicators, it runs experiments on artificially generated simulation data sets and on real data sets. Because simulated data are controllable and real data are credible, the thesis combines the two kinds of data sets (Chapter 4).
Finally, it applies the two families of complexity measures to the simulation data sets to derive rules and conclusions about data complexity measures on unbalanced data, and then uses the real data sets to verify them. The final experimental results show that the conclusions are valid and effective and that they can guide algorithm selection for unbalanced data sets (Chapter 5).
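The abstract does not say which measures the thesis defines, so the following is only an illustrative sketch of the general idea of computing a complexity measure on simulated data at several class ratios. It uses the maximum Fisher's discriminant ratio (commonly called F1) from Ho and Basu's set of measures as a stand-in. The helper names `fisher_ratio` and `simulate`, the Gaussian class model, the shift of 2.0, and the sample sizes are all assumptions made for this example, not details taken from the thesis.

```python
import random
from statistics import fmean, pvariance


def fisher_ratio(class_a, class_b):
    """Maximum over features of (mean_a - mean_b)^2 / (var_a + var_b).

    class_a and class_b are lists of equal-length feature tuples.
    Larger values mean at least one feature separates the classes well.
    """
    best = 0.0
    # zip(*rows) transposes a list of rows into per-feature columns.
    for feat_a, feat_b in zip(zip(*class_a), zip(*class_b)):
        spread = pvariance(feat_a) + pvariance(feat_b)
        if spread == 0:
            continue  # constant feature in both classes: no information
        gap = (fmean(feat_a) - fmean(feat_b)) ** 2
        best = max(best, gap / spread)
    return best


def simulate(n_major, n_minor, shift=2.0, seed=0):
    """Hypothetical two-class Gaussian data; only feature 0 is informative."""
    rng = random.Random(seed)
    major = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_major)]
    minor = [(rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_minor)]
    return major, minor


if __name__ == "__main__":
    # Same class geometry, progressively fewer minority samples.
    for n_minor in (500, 50, 5):
        major, minor = simulate(500, n_minor)
        print(f"majority=500 minority={n_minor} F1={fisher_ratio(major, minor):.3f}")
```

Because the class geometry is held fixed while only the minority sample size changes, any movement in the printed value comes from estimating the minority-class mean and variance from fewer points. That is one simple way to probe how sensitive a measure is to imbalance, in the spirit of the simulation experiments the abstract describes.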
Keywords/Search Tags: Data Complexity Measures, Unbalanced Data, Simulation Data, Algorithm Selection, Data Geometrical Complexity