Research Of Fuzzy Clustering Method On Imbalanced Dataset And Its Application In Metagenomic Contigs Binning

Posted on: 2017-04-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Liu
GTID: 1310330512958036
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
An imbalanced dataset is one in which properties such as class size, sample count, and sample density vary greatly between classes. Such datasets are common in practice, for example medical disease datasets, network intrusion datasets, and metagenomic datasets. Traditional unsupervised classification methods, such as the fuzzy c-means method (FCM), cluster imbalanced datasets poorly. To date, most studies on imbalanced datasets have focused on supervised learning. Improving the clustering performance of traditional unsupervised methods on imbalanced datasets is therefore important: it advances the study of imbalanced data and extends the range of applications of these methods.

This dissertation focuses on FCM, studies several key problems that arise when clustering imbalanced data, and applies the resulting methods to metagenomic contig binning. The contributions are as follows:

(1) A new FCM based on a cluster volume constraint

Traditional FCM performs poorly on imbalanced datasets because it uses a sum-of-squares objective function, and equalizing the cluster volumes yields a smaller value of that objective. As a result, some samples from a majority class are incorrectly assigned to an adjacent minority class. To solve this problem, this dissertation proposes a new fuzzy c-means method based on a cluster volume constraint, with a new objective function that takes the volume of each cluster into account. The new objective function permits minority clusters to exist, which improves the clustering performance of FCM on imbalanced data.
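As a point of reference, the sketch below implements the standard (traditional) FCM updates and reports each cluster's volume, taken here as the sum of the membership values of all samples to that cluster. It is not the volume-constrained method proposed in the dissertation, whose objective function is not given in this abstract. The 1-D data, the initial centers, the fuzzifier `m = 2.0`, and all function names are illustrative assumptions.

```python
import math

def memberships(points, centers, m=2.0):
    """Standard FCM membership update; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        # Clamp distances so a point sitting on a center does not divide by zero.
        d = [max(math.dist(x, v), 1e-12) for v in centers]
        rows.append([1.0 / sum((dk / dj) ** power for dj in d) for dk in d])
    return rows

def update_centers(points, u, m=2.0):
    """Standard FCM center update: mean of the points weighted by u**m."""
    dims = len(points[0])
    centers = []
    for k in range(len(u[0])):
        w = [row[k] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * x[a] for wi, x in zip(w, points)) / total
            for a in range(dims)
        ))
    return centers

def fcm(points, centers, m=2.0, max_iter=200, tol=1e-9):
    """Alternate the two updates until the centers stop moving."""
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return memberships(points, centers, m), centers

def cluster_volumes(u):
    """Cluster volume: sum of the memberships of all samples to each cluster."""
    return [sum(row[k] for row in u) for k in range(len(u[0]))]

# Hypothetical imbalanced 1-D data: 90 majority samples and 10 minority samples.
majority = [(-3.0 + 6.0 * i / 89,) for i in range(90)]
minority = [(4.5 + i / 9,) for i in range(10)]
points = majority + minority

u, centers = fcm(points, [(0.0,), (5.0,)])
volumes = cluster_volumes(u)
print("centers:", centers)
print("cluster volumes:", volumes)
```

Comparing the printed volumes with the true class sizes (90 and 10) gives a rough picture of how far standard FCM pulls the partition toward equal volumes on this toy data.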
Cluster volume is the sum of the membership values of all samples to a cluster, and it can be used to measure the size of that cluster.

(2) A new global fuzzy c-harmonic means method based on a cluster volume constraint

Traditional FCM is sensitive to the initial cluster centers. To solve this problem, this dissertation proposes a global fuzzy c-means method based on the c-harmonic means method and the method proposed in (1). The global method is insensitive to the initial cluster centers and can be used to cluster imbalanced datasets.

(3) A new fuzzy clustering validity index for imbalanced datasets

Determining the number of clusters is an important problem in unsupervised machine learning, and FCM requires this number to be specified in advance. A common approach is to run FCM several times with different cluster numbers and to select one of the resulting partitions with a predefined function called a clustering validity index (CVI). Existing CVIs typically evaluate a clustering result by its intra-cluster compactness and inter-cluster separation. In an imbalanced dataset, however, differences in cluster size distort the compactness measure. This dissertation therefore designs a new fuzzy compactness measure that takes cluster volume into account and combines it with a traditional separation measure to form a new CVI, which evaluates clustering results effectively on both imbalanced and balanced datasets.

(4) Metagenomic contig binning based on imbalanced data analysis

Metagenomics uses next-generation sequencing technology to obtain genetic material directly from the environment, without laboratory cultivation. Unlike the data from traditional sequencing methods, raw metagenomic sequencing data contain a large number of short DNA reads from multiple species, which can be assembled into longer DNA sequences according to their overlaps. These longer sequences are called contigs in bioinformatics.
Binning metagenomic contigs according to their species of origin is a very important step. However, because of factors such as uneven abundance ratios and differing genome lengths between species, the number of contigs belonging to each species usually varies greatly, so a metagenomic contig dataset is an imbalanced dataset. Classifying contigs efficiently remains a current research difficulty.

To improve the binning accuracy of metagenomic contigs, the methods above are applied to cluster analysis of the contigs. First, a range for the number of species in a metagenome is determined from the genome lengths of known bacteria and the average coverage of the metagenome. Then, 4-mer frequencies are extracted from the DNA contigs and normalized to form feature vectors for classification. Finally, the global fuzzy c-harmonic means method and the new fuzzy CVI are used to bin the contigs. Compared with existing unsupervised binning methods, the proposed method achieves better performance on metagenomic datasets.

In summary, this dissertation presents a systematic study of unsupervised classification for imbalanced datasets and proposes a set of methods covering initialization, clustering, and the clustering validity index. Applying these methods to metagenomic contig binning yields better performance.
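The 4-mer feature extraction step mentioned above can be illustrated with a minimal sketch. The abstract does not say how the frequencies are normalized, so the code assumes the simplest choice, dividing each count by the total number of valid 4-mers so that the vector sums to 1. Skipping windows that contain ambiguous bases such as `N`, and the names `KMERS`, `INDEX`, and `tetranucleotide_frequencies`, are also assumptions made for illustration.

```python
from itertools import product

# All 4**4 = 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def tetranucleotide_frequencies(contig):
    """Return a 256-dimensional vector of normalized 4-mer frequencies."""
    counts = [0] * len(KMERS)
    seq = contig.upper()
    # Slide a window of width 4 along the contig, one base at a time.
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # skip windows with non-ACGT characters
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        # Contig too short or fully ambiguous: return an all-zero vector.
        return [0.0] * len(KMERS)
    # Assumed normalization: relative frequency, so the vector sums to 1.
    return [c / total for c in counts]

# Hypothetical toy contig; real contigs are far longer.
features = tetranucleotide_frequencies("ACGTACGTNACGT")
print(len(features), features[INDEX["ACGT"]])
```

Each contig then becomes one fixed-length feature vector, which is the form a clustering method such as FCM expects as input.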
Keywords/Search Tags:cluster analysis, imbalanced data, fuzzy c-means method, initialization, clustering validity index, metagenomic contigs binning