Research On Several Methods Of Data Classification In Adapting To Bad Data

Posted on:2013-03-16

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J L Li

Full Text:PDF

GTID:1228330374986905

Subject:Computer software and theory

Abstract/Summary:

Several characteristics of data can degrade data categorization, such as noise contamination, density variance between clusters, class imbalance, and differing variances across dimensions. Research on classification approaches that adapt to bad data is therefore valuable in both theory and practice. Although existing approaches such as DBSCAN and Trimmed k-means can handle bad data with certain characteristics, expecting a single general approach to adapt to every kind of bad data is unrealistic. Research on interference-resistant approaches tailored to specific data characteristics has therefore become the common view.

Inspired by molecular kinetic theory, and taking into account neighbourhood information, cluster density variance, and distances in both the original and the iterating spaces, this dissertation proposes a Molecular Dynamics-like Data Clustering Approach. Likewise drawing on neighbourhood information and/or feature variance, it designs an Ellipsoid-plane classification approach and improves the KDE-based classification approach. Besides adapting to noise and to large density variance between clusters, the new clustering approach automatically finds possible clusters without a preset cluster number. It also solves the "Black Hole" problem encountered by the gravitational model.

KDE-based (kernel density estimation) data classification is widely used across applications. On class-imbalanced data, however, it tends to misclassify minority-class data into the majority class. To enable the approach to cope with class-imbalanced data, and to remain effective even when the imbalance is acute, this dissertation proposes an improvement that adds a small-searching-interval smooth factor to the approach. Experimental results showed the effectiveness of the improvement.

In the class prediction phase, classification methods such as the KDE-based approach may have to compute over the whole data set, so the computation cost of this phase is rather high. To reduce prediction cost and to make the classification model incorporate variance information on the feature dimensions, this dissertation proposes a new Ellipsoid-plane classification approach. It is a two-stage supervised method that uses an elliptic surface and a plane as reference surfaces for classification. Because computation in the classifying phase involves only the test point and the reference surfaces, its cost in this phase is lower than that of the distance-based k-NN method and the KDE-based approach. Moreover, the Ellipsoid-plane classification approach also strengthens the neighbourhood principle.

Besides theoretical analysis, the approaches above are compared experimentally with other existing methods. The experiments confirm the correctness of the theoretical derivations and provide a new and valuable exploration of bad data classification.
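The class-imbalance problem described above for KDE-based classification can be illustrated with a minimal sketch. It shows only the baseline behaviour, assuming a 1-D Gaussian kernel, a fixed bandwidth, and a Bayes-style decision rule that weights each class density by its frequency. The names `gaussian_kde`, `kde_classify`, and `use_priors`, as well as the toy data, are hypothetical. The abstract does not specify the small-searching-interval smooth factor, so it is not implemented here; dropping the priors is an ordinary re-weighting used for contrast, not the dissertation's improvement.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def kde_classify(x, classes, bandwidth, use_priors=True):
    """Pick the label whose (optionally prior-weighted) density at x is largest."""
    total = sum(len(samples) for samples in classes.values())
    scores = {}
    for label, samples in classes.items():
        score = gaussian_kde(samples, bandwidth)(x)
        if use_priors:
            # Weight by class frequency, as in a Bayes decision rule.
            score *= len(samples) / total
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: 101 majority points spread evenly over [-2, 2],
# 5 minority points clustered in [2.0, 2.8].
classes = {
    "majority": [-2.0 + 0.04 * i for i in range(101)],
    "minority": [2.0, 2.2, 2.4, 2.6, 2.8],
}

query = 2.2  # lies inside the minority cluster
with_priors = kde_classify(query, classes, bandwidth=0.5, use_priors=True)
without_priors = kde_classify(query, classes, bandwidth=0.5, use_priors=False)
print(with_priors, without_priors)
```

With these toy numbers the prior-weighted rule assigns the query to the majority class even though it sits among the minority samples, because the sheer count of majority points outweighs the local minority density. Note also that every prediction sums over all training samples, which is the prediction-phase cost the abstract attributes to KDE-based methods.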

Keywords/Search Tags:

data classification, data clustering, pattern recognition, molecular kinetic theory, bad data

Related items

1	Scientific Data Mining System Of Classification And Clustering Applications
2	Research On Nominal Data Clustering/Classification Algorithms With Their Applications In Anomaly Detection
3	Research And Application Of Data Classification Based On The Theory Of Belief Functions
4	The Research On The Method Of QAR Data Organization Based On Data Warehouse And The Similarity Measurement Of Clustering Pattern
5	The Algorithm Research On Clustering Analysis Of The Pathological Data Based On Hypothesis Oriented Classification
6	Data Fusion And Data Mining Theory Applied Research In Target Recognition
7	Theory and application of a microclustering tool for exploratory data analysis in pattern recognition systems
8	Research And Application Of Data Analysis System For Large Scale Real Time Power Data
9	Study On Clustering Algorithm And Its Applications
10	Research Of Data Fusion Algorithm Based On Clustering D-S Evidence Theory