
Research on Mathematical Models for Data Clustering and Applications

Posted on: 2008-12-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Chen
Full Text: PDF
GTID: 1118360272957302
Subject: Light Industry Information Technology and Engineering
Abstract/Summary:
Bioinformatics is an interdisciplinary field spanning computational molecular biology and computer science. With the rapid development of data mining, biological technologies are reshaping human society. This dissertation studies mathematical models for data clustering and their applications in bioinformatics. The main content, contributions, and innovations are described below.

(1) Dense regions in microarray data. A dense region is a statistically significant subset of the data. It can identify similar subgroups of genes or samples, and it can also help remove outliers and abnormal data. After studying the properties of dense regions, we classify them according to those properties and give the corresponding algorithms. Biological meaning can also be found from their distribution properties. Two experiments are reported. The first dataset contains 30 beta-mannanase samples whose product concentrations range from low to high; the algorithm identifies subsets of samples and data patterns simultaneously. The second dataset is a microarray of gene expression during the yeast cell cycle, on which the algorithm also works well. A comparison with four other clustering algorithms on two synthetic datasets demonstrates its usefulness.

(2) Detecting modules in gene networks. Genes and their protein products carry out cellular processes in the context of functional modules, so identifying these modules is critical to understanding gene network structure. We combine a dissimilarity measure with a clustering method, giving a new way to identify modules. Furthermore, building on the topological overlap matrix (a method that has been verified in many biological applications), we propose a new generalized method combined with bidirectional hierarchical clustering. It is mainly applied to module detection and to measurement between nodes, where it performs better than other between-node measures. We also give a proof supporting its application. Finally, the ordinary topological overlap matrix can find small modules, whereas bidirectional hierarchical clustering based on the generalized topological overlap matrix works well for finding large modules in gene networks.

(3) Ensemble methods for high-dimensional clustering based on random projection. We explore how to employ ensemble methods to solve high-dimensional data clustering problems. We investigate three different approaches to constructing ensembles based on randomized dimension reduction; in particular, we employ a new clustering method based on the OPTOC algorithm. The results demonstrate that random projection is an effective approach for generating cluster ensembles for high-dimensional data, and that its efficacy is attributable to its ability to produce diverse base clusterings. We then employ a graph-based approach that transforms the problem of combining clusterings into a bipartite graph problem. Comparisons of the bipartite approach with three existing approaches show that it achieves the best overall performance.

(4) A scale-dependent model for clustering. We employ a model for clustering a set of high-dimensional data into homogeneous clusters that are well separated from each other. A novel feature of this model is that it allows the user to directly control the scale of the clusters. This is realized by formulating the clustering problem as an optimization problem. We study some properties of homogeneity and separation defined on pairwise similarity measured by the Pearson correlation coefficient; in particular, we use Renyi's entropy as the index of homogeneity and separation. In this setting, the algorithm performs better than typical hierarchical and partitional algorithms. Experimental results on synthetic, biological, and image data demonstrate the usefulness of the proposed model.

Finally, the dissertation concludes with a summary and puts forward some problems for future study.
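
As an illustration of the topological overlap measure referred to in part (2), the following is a minimal sketch of the standard, unweighted form only: TOM(i, j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where a_ij is the adjacency entry, l_ij counts the neighbors shared by nodes i and j, and k_i is the degree of node i. The generalized variant and the bidirectional hierarchical clustering proposed in the dissertation are not specified in this abstract and are not reproduced here. The function names and the small example network are hypothetical, chosen only for illustration.

```python
def topological_overlap(adj):
    """Standard topological overlap for an unweighted, undirected network.

    adj is assumed to be a symmetric 0/1 adjacency matrix (list of lists)
    with a zero diagonal. This is the ordinary measure, not the
    generalized one proposed in the dissertation.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        tom[i][i] = 1.0  # a node overlaps completely with itself
        for j in range(i + 1, n):
            # number of neighbors shared by nodes i and j
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            # denominator is at least 1, so no division by zero
            denom = min(degree[i], degree[j]) + 1 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


def tom_dissimilarity(tom):
    """Turn overlap into a dissimilarity (1 - TOM) for use in clustering."""
    return [[1.0 - value for value in row] for row in tom]


# Hypothetical network: a triangle (nodes 0, 1, 2) plus node 3 linked to node 2.
example = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
overlap = topological_overlap(example)
distance = tom_dissimilarity(overlap)
```

In this toy network the triangle nodes overlap fully with one another, while node 3 overlaps only partially with nodes 0 and 1. The resulting dissimilarity matrix is the kind of input a hierarchical clustering routine could take to group nodes into modules.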
Keywords/Search Tags:Bioinformatics, gene, data mining, cluster, topological overlap matrix, random projection