Application And Research Of The Fuzzy C-Means Clustering In Gene Express Data

Posted on: 2013-02-05 | Degree: Master | Type: Thesis
Country: China | Candidate: J H Zhang
GTID: 2218330374962973 | Subject: Biological Information Science and Technology
Abstract/Summary:
The rapid development of microarray technology has produced microarray data at an exponentially growing rate. Without effective ways to process it, much of this potentially useful data risks becoming a "data disaster" or useless "data rubbish". Microarray data are massive and high-dimensional, with few samples, noise, heavy contamination, and a high mutation rate. Extracting important biological information from such data and putting the results of the analysis to practical use is therefore a major challenge. Fuzzy C-means clustering is convenient for this kind of analysis, especially when prior knowledge is absent or scarce, and its theory and application have become a hot topic in bioinformatics research in recent years. This thesis studies the most popular objective-function-based fuzzy C-means clustering algorithm in depth, improves it in light of the characteristics of gene expression data, and applies it to gene expression data analysis. The main work and contributions are as follows:
First, the characteristics of gene expression data are taken into account during pre-processing, especially during data screening. By combining the experimental conditions under which the data were acquired with the biological meaning and statistical significance of two gene expression indicators, DETECTION P_VALUE and ABS_CALL, a new coarse screening method is proposed. Building on previous research, a "three-step" data filtering method is then put forward.
Second, after a careful study of the theory and research status of the fuzzy C-means clustering algorithm, and considering both its shortcomings and the characteristics of gene expression data, the previously proposed weighted fuzzy C-means clustering algorithm is adopted. Drawing on the dimensionality reduction property of principal component analysis, the author proposes a new weight determination method based on compensating for the lost information.
Third, the fuzzy C-means clustering algorithm is particularly sensitive to its initial parameters, such as the number of clusters C and the initial cluster centers, which leads to unstable clustering results. On the basis of previous research, the author first redefines the number of clusters C, avoiding the blindness of choosing it at random, and then selects the initial cluster centers by hierarchical clustering. Finally, with the initial cluster centers either selected randomly or initialized in this way, the standard fuzzy C-means clustering method and the improved algorithm are both applied to bronchial epithelial cell samples exposed, for varying lengths of time, to smoke from different brands of cigarette. The experiments show that the improved algorithm not only obtains better clustering results but also converges faster.
Fourth, a reasonable biological interpretation of the gene expression clustering results is given.
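For readers unfamiliar with the baseline method, the following is a minimal sketch of the standard fuzzy C-means iteration (alternating membership and center updates), not the improved algorithm of the thesis. The function names, the fuzzifier default m = 2, the tolerance, and the optional `centers` argument (for supplying initial centers, for example from a hierarchical clustering step) are illustrative assumptions.

```python
import numpy as np

def memberships(X, centers, m):
    """Membership matrix U (n_samples x c) for the given centers."""
    # Squared Euclidean distance from every sample to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d2 = np.fmax(d2, 1e-12)  # guard against division by zero
    # u_ik is proportional to d_ik^(-2/(m-1)); each row is normalized to sum to 1.
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, centers=None, seed=0):
    """Standard fuzzy C-means. Returns (centers, U, iterations used).

    centers: optional initial centers (c x n_features). If None, c samples
    are drawn at random (assumption: random-sample initialization).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if centers is None:
        centers = X[rng.choice(X.shape[0], size=c, replace=False)]
    else:
        centers = np.asarray(centers, dtype=float)
    for it in range(1, max_iter + 1):
        Um = memberships(X, centers, m) ** m
        # Each center is the mean of the samples weighted by u_ik^m.
        new_centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # stop once the centers no longer move
            break
    return centers, memberships(X, centers, m), it
```

Passing explicit `centers` corresponds to the "initialized cluster centers" setting described above, while leaving it as None corresponds to random selection. Comparing the returned iteration counts for the two settings is one simple way to observe the effect of initialization on convergence.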
Keywords/Search Tags: gene chip, gene expression data, fuzzy C-means clustering