
Clustering Analysis Algorithm Applied In Analysis Of Gene Expression Data

Posted on: 2013-06-09  Degree: Master  Type: Thesis
Country: China  Candidate: J Sun  Full Text: PDF
GTID: 2248330362971848  Subject: Computer application technology
Abstract/Summary:
With the development of the Human Genome Project, tens of thousands of genes have been identified and the volume of gene sequence data has grown massively. Data, however, is not the same as information or knowledge; it is only their source. How to extract useful knowledge from large amounts of gene expression data with automatic analysis tools has therefore drawn increasing attention to data analysis methods and tools. Data mining has been widely applied to many aspects of gene expression profiling and has achieved considerable success. It extracts hidden, previously unknown, and potentially useful information and knowledge from large databases drawn from practical applications. As a new technology, data mining gives biologists an effective method and tool for analyzing data and a powerful means of analyzing gene expression data. Its methods and tools include classification and prediction, clustering analysis, association analysis, sequence analysis and time-series analysis, and outlier analysis.

As an effective data analysis tool, cluster analysis has been widely applied in image processing, information retrieval, data mining, and other fields. The main reasons for using clustering algorithms on gene expression data are the huge volume of that data and the relatively small number of genes whose biological function is known. Cluster analysis divides a group of samples into several subclasses according to their degree of similarity. The basic idea is to group like with like, so that differences within a class are as small as possible and differences between classes are as large as possible.

This paper introduces two similarity measures used by clustering algorithms, Euclidean distance and the Pearson correlation coefficient, and puts forward a proportional similarity measure. It also introduces two kinds of clustering validity evaluation, external and internal.
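To make the two named similarity measures concrete, here is a minimal sketch of Euclidean distance and the Pearson correlation coefficient for a pair of expression profiles. The function names and the sample profiles are hypothetical and are not taken from the thesis. The proportional similarity measure is omitted because the abstract does not define it.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles: gene_b has the same shape as gene_a at 10x the
# magnitude; gene_c is gene_a reversed.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print(euclidean_distance(gene_a, gene_b))   # large: magnitudes differ
print(pearson_correlation(gene_a, gene_b))  # close to 1: same shape
print(pearson_correlation(gene_a, gene_c))  # close to -1: opposite trend
```

The example shows why the choice of measure matters: Euclidean distance is sensitive to the magnitude of expression, while Pearson correlation compares only the shape of the profiles.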
The paper examines three classic algorithms: hierarchical clustering, K-means clustering, and self-organizing map (SOM) clustering. Depending on the linkage criterion, hierarchical clustering is divided into four variants, and the clustering validity of these four variants is compared under the two similarity measures. Using Euclidean distance and different numbers of experimental iterations, K-means and SOM clustering achieve better correct rates of gene clustering and better clustering validity. After comparing the advantages and disadvantages of the three algorithms, the paper proposes an improved algorithm based on hierarchical clustering and SOM clustering. According to the experimental data, the improved algorithm overcomes the defects of the original methods to some degree and shows advantages of its own.
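As an illustration of how the linkage criterion changes hierarchical clustering, the following is a minimal sketch of agglomerative clustering under Euclidean distance. The abstract does not name its four linkage types, so the three shown here (single, complete, average) are an assumption based on common practice. The function names and sample data are hypothetical, and this is not the improved algorithm the paper proposes.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, each given as a list of point indices."""
    pair = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(pair)
    if linkage == "complete":    # farthest pair of members
        return max(pair)
    if linkage == "average":     # mean over all member pairs
        return sum(pair) / len(pair)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, linkage="average"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical two-condition expression profiles forming two obvious groups.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]]

for method in ("single", "complete", "average"):
    print(method, agglomerative(profiles, 2, linkage=method))
```

On well-separated data like this, all three linkages recover the same two groups. Differences between linkage criteria generally appear on noisier or elongated clusters.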
Keywords/Search Tags: gene expression data, data mining, clustering analysis, validity