The Research of Multi-clusters IB Algorithm for Imbalanced Data Set

Posted on: 2016-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: P Jiang
Full Text: PDF
GTID: 2308330461451142
Subject: Computer software and theory
Abstract/Summary:
Information Bottleneck (IB) is a well-known and widely used unsupervised data organization technique based on rate distortion theory. The IB method extracts patterns from data by compressing the data into a "bottleneck variable" while preserving as much as possible of the mutual information between the bottleneck variable and the relevant variable. The study of imbalanced data is a challenging research field and one of the most interesting and important problems in machine learning, data mining and pattern recognition. A data set is imbalanced when one class, called the minority class, has significantly fewer samples than the other classes, called the majority classes. Many traditional pattern recognition algorithms do not work well on class-imbalanced data sets with skewed distributions. They often produce clusters of relatively uniform size even when the true clusters in the input data vary in size, which is called the "uniform effect".
The original IB method shows the same uniform effect on imbalanced data sets, so its clustering results are not ideal. To solve this problem, this paper proposes a Multi-clusters Information Bottleneck (Mc IB) algorithm for imbalanced data. Borrowing the idea of under-sampling, the Mc IB algorithm reduces the skewness of the data distribution by dividing the imbalanced data set into multiple clusters of relatively uniform size. The algorithm consists of three steps. First, a dividing measurement standard is proposed to determine the sampling ratio parameter. Second, the Mc IB algorithm performs a preliminary analysis of the data to generate reliable multi-clusters. Last, the Mc IB algorithm merges clusters into larger clusters according to the similarity between them, so that each actual cluster is represented by a group of multi-clusters, and thereby obtains the final clustering result.
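The abstract does not specify which similarity measure drives the cluster-merging step, so the following is only a minimal sketch under a stated assumption: it uses the merge cost of the standard agglomerative IB method (the cluster weights times the weighted Jensen-Shannon divergence between the clusters' conditional distributions), which equals the mutual information lost by a merge. The function names (`mutual_information`, `merge_cost`, `merge_closest`) and the cluster representation as `(weight, p(y|t))` pairs are hypothetical and are not taken from the thesis.

```python
import math


def mutual_information(joint):
    """I(T;Y) in bits for a joint distribution given as a list of rows p(t, y)."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0:
                mi += p * math.log2(p / (row[i] * col[j]))
    return mi


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def merge_cost(w1, p1, w2, p2):
    """Mutual information lost by merging two clusters.

    w1, w2 are the cluster probabilities p(t); p1, p2 are p(y|t).
    Assumption: agglomerative-IB cost, (w1 + w2) * weighted JS divergence.
    """
    w = w1 + w2
    pi1, pi2 = w1 / w, w2 / w
    mix = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    return w * (pi1 * kl(p1, mix) + pi2 * kl(p2, mix))


def merge_closest(clusters):
    """Merge the pair of clusters with the smallest merge cost.

    clusters is a list of (weight, p_y_given_t) pairs; a new list is returned.
    """
    _, i, j = min(
        (merge_cost(*clusters[i], *clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    (w1, p1), (w2, p2) = clusters[i], clusters[j]
    w = w1 + w2
    merged = (w, [(w1 * a + w2 * b) / w for a, b in zip(p1, p2)])
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Two small clusters with similar p(y|t) and one dissimilar cluster.
example = [(0.3, [0.9, 0.1]), (0.3, [0.88, 0.12]), (0.4, [0.1, 0.9])]
after_merge = merge_closest(example)
```

Calling `merge_closest` repeatedly until the desired number of clusters remains gives a simple bottom-up merging loop. Whether the Mc IB algorithm uses this cost or a different similarity measure is not stated in the abstract.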
Compared with under-sampling approaches, the Mc IB algorithm avoids losing important information from the majority class. Compared with the IB method, the Mc IB algorithm effectively solves the "uniform effect" problem that the IB method exhibits on imbalanced data sets. The experimental results show that the Mc IB algorithm effectively mines the patterns in imbalanced data sets, produces high-quality clustering results, and performs better than other common clustering algorithms. The proposed Mc IB algorithm can be applied in many fields, such as clustering analysis, anomaly detection and information retrieval. In addition, the Mc IB algorithm applies to a wider range of data sets than the original IB method, and it offers IB theory a new approach to analyzing imbalanced data sets.
Keywords/Search Tags: Information Bottleneck method, Imbalanced data, Multi-clusters, Cluster merging