The Research Of Multi-clusters IB Algorithm For Imbalanced Data Set

Posted on:2016-08-29

Degree:Master

Type:Thesis

Country:China

Candidate:P Jiang

Full Text:PDF

GTID:2308330461451142

Subject:Computer software and theory

Abstract/Summary:

Information Bottleneck (IB) is a well-known and widely used unsupervised data organization technique based on rate-distortion theory. The IB method extracts patterns from data by compressing the data into a “bottleneck variable” while preserving as much as possible of the mutual information with the relevant variable. The study of imbalanced data is a challenging research field and one of the most interesting and important problems in machine learning, data mining, and pattern recognition. A data set is imbalanced when one class, called the minority class, has significantly fewer samples than the other classes, called the majority classes. Many traditional pattern recognition algorithms do not work well on class-imbalanced data sets with skewed distributions: they often produce clusters of relatively uniform size even when the clusters in the input data vary in size, a phenomenon called the “uniform effect”. The original IB method shows the same tendency on imbalanced data sets, so its clustering results are not ideal.
To solve this problem, this thesis proposes a Multi-clusters Information Bottleneck (Mc IB) algorithm for imbalanced data. The Mc IB algorithm reduces the skewness of the data distribution by borrowing the idea of under-sampling: it divides the imbalanced data set into multiple clusters of relatively uniform size. The algorithm consists of three steps. First, a division criterion is proposed to determine the sampling ratio parameter. Second, the Mc IB algorithm performs a preliminary analysis of the data to generate reliable multi-clusters. Finally, the Mc IB algorithm merges clusters into larger ones according to the similarity between clusters, so that each actual cluster is represented by a group of multi-clusters, and obtains the final clustering result.
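The merging step can be illustrated with a small sketch. The abstract does not say which similarity measure Mc IB uses, so the code below assumes the merge cost of the standard agglomerative IB method: the mutual information lost when two clusters are merged, which equals their combined weight times the weighted Jensen-Shannon divergence of their conditional distributions. The function names and the toy joint distribution are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """I(T;Y) in nats for a joint distribution p(t, y) given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    p_t = joint.sum(axis=1, keepdims=True)   # cluster marginal p(t)
    p_y = joint.sum(axis=0, keepdims=True)   # feature marginal p(y)
    mask = joint > 0                         # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_t @ p_y)[mask])))

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(joint, i, j):
    """Information lost by merging clusters i and j (assumed similarity measure)."""
    joint = np.asarray(joint, dtype=float)
    p_i, p_j = joint[i].sum(), joint[j].sum()          # cluster weights (must be > 0)
    w_i, w_j = p_i / (p_i + p_j), p_j / (p_i + p_j)    # relative weights
    cond_i, cond_j = joint[i] / p_i, joint[j] / p_j    # p(y | cluster)
    mix = w_i * cond_i + w_j * cond_j                  # p(y | merged cluster)
    js = w_i * kl(cond_i, mix) + w_j * kl(cond_j, mix)
    return (p_i + p_j) * js

def merge_rows(joint, i, j):
    """Joint distribution after clusters i and j are replaced by their union."""
    joint = np.asarray(joint, dtype=float)
    keep = [r for r in range(joint.shape[0]) if r not in (i, j)]
    return np.vstack([joint[keep], joint[i] + joint[j]])

# Toy joint distribution p(cluster, feature): clusters 0 and 1 look alike,
# cluster 2 is different (made-up numbers, for illustration only).
joint = np.array([
    [0.30, 0.05],
    [0.25, 0.05],
    [0.02, 0.33],
])

loss_01 = mutual_information(joint) - mutual_information(merge_rows(joint, 0, 1))
print(merge_cost(joint, 0, 1), loss_01)   # the two values agree
print(merge_cost(joint, 0, 2))            # merging dissimilar clusters costs more
```

Under this assumption, a merging procedure would repeatedly merge the pair with the smallest cost, so sub-clusters that came from the same actual cluster are joined first.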
Compared with under-sampling approaches, the Mc IB algorithm avoids losing important information from the majority class. Compared with the IB method, the Mc IB algorithm effectively solves the “uniform effect” problem of the IB method on imbalanced data sets. The experimental results show that the Mc IB algorithm can effectively mine the patterns residing in imbalanced data sets, produces high-quality clustering results, and performs better than other common clustering algorithms. The proposed Mc IB algorithm can be applied in many fields, such as clustering analysis, anomaly detection, and information retrieval. In addition, the Mc IB algorithm is applicable to a wider range of data sets than the original IB method, and it offers IB theory a new approach to analyzing imbalanced data sets.
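The “uniform effect” can be made concrete by comparing how unevenly sized the true classes are with how unevenly sized the clusters returned by an algorithm are. The sketch below uses the coefficient of variation of group sizes for this comparison. That measure, the function name `size_cv`, and the sample counts are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def size_cv(labels):
    """Coefficient of variation (sample std / mean) of the group sizes in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return float(counts.std(ddof=1) / counts.mean())

# Hypothetical ground truth: one majority class and two minority classes.
true_labels = np.repeat([0, 1, 2], [900, 60, 40])
# Hypothetical clustering showing the uniform effect: three near-equal clusters.
uniform_clusters = np.repeat([0, 1, 2], [340, 330, 330])

cv_true = size_cv(true_labels)        # large: the class sizes are highly skewed
cv_found = size_cv(uniform_clusters)  # small: the cluster sizes are nearly uniform
print(cv_true, cv_found)
```

A clustering whose size variation is far below that of the true classes has flattened the imbalance, which is the behavior the abstract attributes to the original IB method.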

Keywords/Search Tags:

Information Bottleneck method, Imbalanced data, Multi-clusters, Cluster merging

Related items

1	Research On Initial Cluster Centers Choice Algorithm And Clustering For Imbalanced Data
2	An Automatic Method To Determine The Number Of Clusters Based On Multi-Validity Indices
3	Research On Multi-way Information Bottleneck Method
4	Algorithms Implementation Of Determining The Number Of Clusters And Initial Cluster Centers For Mixed Data
5	Research Of Multi-class Imbalanced Data Classification Method
6	Multi-feature Clustering Based On Multivariate Information Bottleneck Method
7	Research On Clustering Methods For The Data With Large Number Of Clusters
8	GMM Trees And Forests:Hierarchical Algorithms For Estimating The Number Of Clusters In High Dimensional Complex Data
9	An Improved LFCM Algorithm Based On Iterated Entropy Weight And Its Application In Imbalanced Data Sets
10	Research Of Imbalanced Datasets Preprocessing Combined With Clustering