Research On Several Methods Of Data Classification In Adapting To Bad Data

Posted on: 2013-03-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J L Li
GTID: 1228330374986905
Subject: Computer software and theory
Abstract/Summary:
Some characteristics of data can degrade data categorization, such as noise, density variance between clusters, class imbalance, and differing variances across dimensions. Research on classification approaches that adapt to bad data is therefore valuable in both theory and practice. Although existing approaches such as DBSCAN and Trimmed k-means can deal with bad data of certain characteristics, expecting a single general approach that adapts to all kinds of bad data is unrealistic. Designing interference-resistant approaches targeted at specific data characteristics has therefore become the common view.

Inspired by molecular kinetic theory, and taking into account neighbourhood information, cluster density variance, and distances in both the original and the iterating spaces, this dissertation proposes a Molecular dynamics-like Data clustering Approach. Similarly considering neighbourhood information and/or feature variance, it also designs an Ellipsoid-plane classification Approach and improves the KDE-based classification approach. Besides being adaptable to noise and to large density variance between clusters, the new clustering approach automatically finds possible clusters without a preset number of clusters. It also solves the "Black Hole" problem encountered by the gravitational model.

The KDE-based data classification algorithm is widely used in different applications. On class-imbalanced data, it tends to misclassify minority-class data into the majority class. To enable the approach to cope with class-imbalanced data, and to remain effective even when the imbalance is acute, this dissertation proposes an improvement that adds a small-searching-interval smooth factor to the approach. Experimental results showed the effectiveness of the improvement.

In the class prediction phase, classification methods such as the KDE-based approach may compute over the whole data set, so the computation cost in this phase is rather high. To reduce prediction cost and to make the classification model incorporate variance information on the feature dimensions, a new Ellipsoid-plane classification approach is proposed in this dissertation. It is a two-stage supervised method that uses an elliptic surface and a plane as reference surfaces for classification. Because computation in the classifying phase involves only the test point and the reference surfaces, the cost in this phase is lower than that of the distance-based k-nn method and the KDE-based approach. Moreover, the ellipsoid-plane classification approach also strengthens the neighbourhood principle.

Besides theoretical analysis, the approaches above are compared experimentally with other existing methods. The experiments confirm the correctness of the theoretical derivations and provide a new and valuable exploration in bad data classification.
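
To make the class imbalance problem concrete, here is a minimal sketch of a generic prior-weighted classifier built on one-dimensional Gaussian kernel density estimates (assuming KDE here means kernel density estimation). It is not the dissertation's algorithm: the abstract does not specify the small-searching-interval smooth factor, so it is not reproduced. The function names, the toy data, and the bandwidth are all hypothetical choices for illustration.

```python
import math


def gaussian_kde_density(x, samples, bandwidth):
    """Average of 1-D Gaussian kernels centred on each training sample."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)


def kde_classify(x, classes, bandwidth=1.0, use_priors=True):
    """Return the label with the largest (optionally prior-weighted) density.

    classes maps each label to a list of 1-D training values.
    Every training sample is visited, so one query costs O(n).
    """
    n_total = sum(len(values) for values in classes.values())
    best_label, best_score = None, -1.0
    for label, samples in classes.items():
        score = gaussian_kde_density(x, samples, bandwidth)
        if use_priors:
            # Weight by class frequency; this favours the majority class.
            score *= len(samples) / n_total
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: 200 majority points on [0, 2), 5 minority points near 3.
majority = [i * 0.01 for i in range(200)]
minority = [2.8, 2.9, 3.0, 3.1, 3.2]
data = {"majority": majority, "minority": minority}

# The query lies closer to the minority cluster than to the majority cluster.
query = 2.6
with_priors = kde_classify(query, data, bandwidth=0.5, use_priors=True)
without_priors = kde_classify(query, data, bandwidth=0.5, use_priors=False)
print(with_priors, without_priors)
```

In this toy setting the prior-weighted rule assigns the query to the majority class even though it sits next to the minority cluster, while the unweighted rule assigns it to the minority class. The loop over every training sample also shows why prediction cost grows with the size of the data set, which is the cost the ellipsoid-plane approach is meant to avoid.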
Keywords/Search Tags: data classification, data clustering, pattern recognition, molecular kinetic theory, bad data