Researches On Mathematical Models For Data Clustering And Applications

Posted on:2008-12-02

Degree:Doctor

Type:Dissertation

Country:China

Candidate:S Chen

Full Text:PDF

GTID:1118360272957302

Subject:Light Industry Information Technology and Engineering

Abstract/Summary:

Bioinformatics is an interdisciplinary field combining computational molecular biology and computer science. With the rapid development of data mining, biological technologies are reshaping human society. This dissertation studies mathematical models for data clustering and their applications in bioinformatics. Its main content, contributions and innovations are as follows:

(1) Dense regions in microarray data. A dense region is a statistically significant subset of the data. It can identify subgroups of similar genes or samples and, at the same time, filter out outliers and abnormal data. After studying the properties of dense regions, we classify them according to those properties and give the corresponding algorithms. Biological meaning can also be inferred from their distribution. Two experiments are reported. The first dataset contains 30 beta-mannanase samples whose product concentration ranges from low to high; the algorithm identifies subsets of samples and data patterns simultaneously. The second dataset is a microarray of gene expression during the yeast cell cycle, on which the algorithm also works well. A comparison with four other clustering algorithms on two synthetic datasets demonstrates its usefulness.

(2) Detecting modules in gene networks. Genes and their protein products carry out cellular processes in the context of functional modules, so identifying these modules is critical to understanding gene network structure. We combine a dissimilarity measure with a clustering method, giving a new way to identify modules. Furthermore, building on the topological overlap matrix (a method that has been verified in many biological applications), we propose a new generalized measure combined with bidirectional hierarchical clustering. It is applied mainly to module detection and to measuring the relationship between nodes, and it performs better than other node-to-node measures. We also give a proof of its applicability. Finally, the ordinary topological overlap matrix can find small modules, whereas bidirectional hierarchical clustering based on the generalized topological overlap matrix works well for finding large modules in gene networks.

(3) Ensemble methods for high-dimensional clustering based on random projection. We explore how ensemble methods can be used to solve high-dimensional data clustering problems. We investigate three different approaches to constructing ensembles based on randomized dimension reduction; in particular, we employ a new clustering method based on the OPTOC algorithm. The results show that random projection is an effective approach for generating cluster ensembles for high-dimensional data and that its efficacy is attributable to its ability to produce diverse base clusterings. We then employ a graph-based approach that transforms the problem of combining clusterings into a bipartite graph formulation. Comparisons of the bipartite approach with three existing approaches show that it achieves the best overall performance.

(4) A scale-dependent model for clustering. We employ a model for clustering a set of high-dimensional data into homogeneous clusters that are well separated from each other. A novel feature of this model is that it allows the user to directly control the scale of the clusters, which is realized by formulating the clustering problem as an optimization problem. We study some properties of homogeneity and separation defined on pairwise similarity measured by the Pearson correlation coefficient; in particular, we use Renyi's entropy as the index of homogeneity and separation. In this setting, the algorithm performs better than typical hierarchical and partitional algorithms. Experimental results on synthetic, biological and image data demonstrate the usefulness of the proposed model.

Finally, the dissertation concludes with a summary and puts forward some problems for future study.
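For readers unfamiliar with the topological overlap matrix mentioned in item (2), the following is a minimal sketch of the commonly used definition for an unweighted, undirected network: the overlap of nodes i and j is (number of shared neighbours + a_ij) / (min(k_i, k_j) + 1 - a_ij), where a_ij is the adjacency entry and k_i the degree of node i. This illustrates only that standard measure, not the generalized measure or the bidirectional hierarchical clustering proposed in the dissertation. The function name and the toy network are assumptions made for illustration.

```python
def topological_overlap(adj):
    """Standard topological overlap matrix for an unweighted, undirected network.

    adj is a symmetric 0/1 adjacency matrix (list of lists) with a zero diagonal.
    This is an illustrative sketch, not the dissertation's generalized measure.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        tom[i][i] = 1.0  # a node overlaps fully with itself by convention
        for j in range(i + 1, n):
            # Count neighbours that i and j have in common.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            # The denominator is at least 1, so no division by zero can occur.
            denom = min(degree[i], degree[j]) + 1 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical toy network: a triangle (nodes 0, 1, 2) with node 3 attached to node 2.
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
toy_tom = topological_overlap(toy_adj)

# 1 - overlap gives a dissimilarity that a hierarchical clustering method can use.
toy_dissimilarity = [[1.0 - value for value in row] for row in toy_tom]
```

In the toy network, nodes 0 and 3 are not directly connected but share neighbour 2, so their overlap is 0.5, while the triangle nodes overlap fully with one another.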

Keywords/Search Tags:

Bioinformatics, gene, data mining, cluster, topological overlap matrix, random projection

Related items

1	Research On Several Key Technologies Of Gene Expression Data Mining
2	Application Of Artificial Neural Network In Research In Bioinformatics
3	Cluster Analysis And Its Application In Gene Expression Data
4	Applications Of Data Mining Techniques To Text Classification And Bioinformatics
5	Research And Implement Of Gene-Gene Relations Mining System Based On Biomedical Literature
6	Research On Classification Of Gene Expression Data Based On Adjacency Matrix Decomposition
7	Research On Analysis Of Gene Expression Profile Data In Bioinformatics
8	Research On A Few Key Issues In Bioinformation Data Mining And Its Application
9	Construction Of Gene Expression Data Mining Models
10	Research On Clustering Methods For Analyzing Overlapping Local Gene Expression Patterns