Research On Biomedical Data Clustering

Posted on: 2013-01-10  Degree: Doctor  Type: Dissertation
Country: China  Candidate: T Bai
GTID: 1118330371982884  Subject: Computer application technology
Abstract/Summary:
Biomedical informatics is the science of biomedical information: the storage, retrieval, and optimal use of biomedical data and knowledge. It draws on biomedicine and modern information technology, especially the foundations of computer science and the applied sciences. One of its core research areas is biomedical data mining. Biomedical experiments produce vast amounts of data, but little of the information implicit in them is known. Biomedical data mining can uncover implicit information about diseases, which is important for improving treatment, medication, and other aspects of care. An urgent problem in the field, however, is that the data generated by biomedical experiments are often not purely numeric but also contain categorical attributes. How to handle the varied data types that arise in biomedical data mining is therefore an active research topic. This dissertation studies the clustering problem in biomedical data mining. It proposes clustering algorithms for different types of data and applies them to biomedical data with good results. The main contributions are as follows:

(1) A new co-regulated gene clustering algorithm, CORE (Clustering of co-Regulated gene), is proposed that takes into account all four kinds of co-regulatory relationships between genes. Most existing clustering methods are limited to mining co-expression from gene microarray data; that is, they can find only co-expression relationships between genes. Recently, a small number of studies have begun to investigate co-regulatory relationships between genes, but they can find only one or a few kinds of regulatory relationship and ignore the information carried by the other kinds.
According to existing research, there are four main co-regulatory relationships between genes: positive and negative co-regulation, differential-expression regulation, time-delayed regulation, and full-time versus part-time regulation. This dissertation presents CORE to cluster co-regulated genes with all four relationships in mind. CORE discovers the four relationships in two parts. First, a similarity measure is proposed that can describe all four regulatory relationships. Then, an improved k-means algorithm is built on this similarity measure, with an adaptive termination condition, based on probability theory, that targets the characteristics of co-regulated genes. Experiments on yeast gene expression data demonstrate the effectiveness and efficiency of CORE for clustering co-regulated genes. Specifically, CORE grouped the yeast genes into seven clusters based on their expression data. The expression curves of the seven cluster centers show significantly different trends, indicating that CORE successfully separates genes with different expression patterns. By searching the genomic information in the SCPD database, we found that genes regulated by the same regulatory factor are mostly assigned to the same cluster, and that the expression data of these genes reflect the four regulatory relationships (positive and negative, differentially expressed, delayed, and part-time), indicating that the algorithm mines all four relationships well.
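
To illustrate how a single similarity score can cover more than plain co-expression, the sketch below scores two expression profiles by the largest absolute Pearson correlation over small time shifts. This is not the CORE similarity measure, which the abstract does not specify; the function names and the `max_lag` parameter are assumptions made for illustration. The absolute value treats positive and negative co-regulation alike, and the shifts tolerate time-delayed regulation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_abs_similarity(x, y, max_lag=2):
    """Hypothetical similarity: largest |r| over shifts of up to max_lag time points."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            # compare x[t + lag] with y[t]
            a, b = x[lag:], y[:len(y) - lag]
        else:
            # compare x[t] with y[t + |lag|]
            a, b = x[:len(x) + lag], y[-lag:]
        if len(a) >= 3:  # skip overlaps too short to correlate
            best = max(best, abs(pearson(a, b)))
    return best
```

With this score, a profile and its mirror image (negative co-regulation) or a profile and a copy delayed by one time point both come out close to 1, whereas plain Pearson correlation would report the first pair as strongly dissimilar.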
We also found that the sets of regulatory factors associated with the genes that CORE groups together mostly belong to the same group of regulatory factors, which confirms from another angle how comprehensive the algorithm's mining capability is. Comparing CORE with Cluster 3.0 combined with the Pearson correlation coefficient, the most widely used similarity metric for co-regulated genes, we found that CORE has a distinct advantage in both mining accuracy and running efficiency. In summary, CORE shows good prospects for mining co-regulated genes.

(2) A new global k-modes clustering algorithm (GKM) is proposed to overcome the deficiencies of traditional categorical data clustering algorithms. The algorithm randomly selects a sufficiently large number of initial modes to cover the global distribution of the sample, and then uses an elimination evaluation function to iteratively remove redundant modes (clusters). Throughout the iterative process, the complexity of GKM remains linear. Experiments on UCI data sets show that, by randomly selecting a larger number of initial modes, GKM obtains very good clustering results without relying on prior knowledge of the data distribution. Experiments with different initial modes show that GKM with the proposed elimination criterion obtains better clustering results than random methods. Comparative experiments on UCI data sets show that GKM achieves higher clustering accuracy than other categorical data clustering algorithms (KM, NKM, and FKM). We then apply GKM to medical data sets and evaluate the effectiveness and consistency of the clustering results.
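
The idea of starting with more modes than needed and then eliminating redundant ones can be sketched as follows. This is an illustrative sketch only: the abstract does not give GKM's elimination evaluation function, so the criterion used here (drop the mode whose removal increases the total mismatch cost the least) and all function names are assumptions, and this simple version makes no attempt to keep the overall complexity linear.

```python
import random
from collections import Counter


def mismatches(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(u != v for u, v in zip(a, b))


def assign(data, modes):
    """Index of the nearest mode for every record."""
    return [min(range(len(modes)), key=lambda j: mismatches(x, modes[j]))
            for x in data]


def update_modes(data, labels, modes):
    """Recompute each mode as the per-attribute most frequent value of its cluster."""
    new_modes = []
    for j, old in enumerate(modes):
        members = [x for x, lab in zip(data, labels) if lab == j]
        if not members:
            new_modes.append(old)  # keep the mode of an empty cluster unchanged
            continue
        new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*members)))
    return new_modes


def cost(data, modes):
    """Total mismatch count between each record and its nearest mode."""
    return sum(min(mismatches(x, m) for m in modes) for x in data)


def oversample_and_eliminate(data, k, n_init, seed=0, iters=10):
    """Start from n_init random modes and eliminate down to k (assumed criterion)."""
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(data)), n_init)  # n_init distinct records
    while True:
        for _ in range(iters):  # ordinary k-modes refinement
            modes = update_modes(data, assign(data, modes), modes)
        if len(modes) <= k:
            return modes, assign(data, modes)
        # assumed criterion: remove the mode whose removal raises the cost least
        drop = min(range(len(modes)),
                   key=lambda j: cost(data, modes[:j] + modes[j + 1:]))
        modes = modes[:drop] + modes[drop + 1:]
```

Records are tuples of categorical values, and `n_init` must not exceed the number of distinct records.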
Experimental results show that GKM achieves better clustering results on the medical data sets, which indicates its application prospects in medicine.

(3) A new mixed-attribute data clustering method is proposed and applied to cancer subtyping. Cancer subtyping is a core issue in cancer research. Accurate subtyping of cancer patients can guide more refined treatment and medication. The main method for subtyping cancer patients is clustering. Real cancer data often include mixed types of data, such as clinical data, gene expression data, copy number variation data, and mutation data. Existing cancer subtyping methods can cluster only gene expression data, so their results cannot reflect the full range of patient information. To combine the genomic and clinical data of cancer patients, we propose a hybrid data clustering algorithm, GKP. We apply the algorithm to TCGA, the largest international database of cancer samples, to cluster gene expression data, clinical data, copy number variation data, methylation data, and somatic mutation data. Patients are divided into four subtypes, which we validate against four main physiological indicators. The experimental results show that the four subtypes are clearly distinguished on all four validation indicators. This shows that patients of different subtypes have different physiological and prognostic characteristics, which offers a new approach to the study of cancer subtyping.
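
The keywords below list the k-prototypes algorithm, whose standard dissimilarity for mixed data adds a squared Euclidean distance on numeric attributes to a weighted count of mismatches on categorical attributes. The sketch below shows that standard measure only; how GKP extends it is not described in the abstract, and the function names, the weight `gamma`, and the example attribute values are assumptions for illustration.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma times categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical


def nearest_prototype(x_num, x_cat, prototypes, gamma=1.0):
    """Index of the closest (numeric, categorical) prototype for one record."""
    return min(
        range(len(prototypes)),
        key=lambda j: mixed_dissimilarity(
            x_num, x_cat, prototypes[j][0], prototypes[j][1], gamma
        ),
    )


# Hypothetical record: two numeric measurements and two categorical attributes.
d = mixed_dissimilarity([1.0, 2.0], ["mutated", "stage2"],
                        [1.0, 4.0], ["wildtype", "stage2"], gamma=0.5)
```

Here `d` is 4.5: the numeric part contributes (2.0 - 4.0)^2 = 4.0 and the single categorical mismatch contributes 0.5 × 1. The weight `gamma` controls how much categorical attributes count relative to numeric ones, so numeric attributes are usually normalized before it is chosen.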
Keywords: Clustering, Biomedical data, Cancer subtype, K-means algorithm, K-modes algorithm, K-prototypes algorithm