The Research On Imputation Algorithm Of Missing Values For Gene Expression Data

Posted on: 2006-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: T Yang
Full Text: PDF
GTID: 2178360185465382
Subject: Computer software and theory
Abstract/Summary:
DNA microarray technology makes it possible to monitor the expression levels of thousands of genes simultaneously under different conditions. How to analyse the resulting data is one of the most active problems in bioinformatics. However, for various reasons, gene expression microarray experiments often produce multiple missing values, which may affect downstream analysis. Many algorithms for gene expression analysis have great difficulty handling missing values and may produce incorrect results because of only a few missing entries. Missing value estimation is therefore an important preprocessing step in bioinformatics data mining.

Weighted K-nearest-neighbors imputation is a classical algorithm for gene expression data, but it does not take correlations between genes into account. In this paper, a new imputation method based on the Mahalanobis distance is proposed to estimate missing values in gene expression data sets. The nearest neighbors are chosen on the basis of Mahalanobis distances between genes, which exploit the correlations between genes, and their weight factors are then determined by Shannon entropy. This algorithm selects nearest neighboring genes and their corresponding weight factors more accurately, so it estimates missing microarray data more accurately under a variety of conditions.

The Fuzzy C-Means algorithm (FCM) is a widely used clustering algorithm that has recently been applied to the analysis of gene expression data. We apply it to missing value estimation because it may handle co-expression and correlations among multiple genes well. An imputation method based on the Fuzzy C-Means algorithm (FCMimpute) is developed to estimate missing values in microarray data. In FCMimpute, clustering parameters are determined adaptively for each data set, the data are then clustered with FCM, and missing entries are estimated from the clustering results.
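The Mahalanobis-based neighbor selection described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `mahalanobis_knn_impute`, the genes-by-conditions layout, the use of a pseudo-inverse covariance estimated from complete genes, and the choice of `k` are all assumptions. The abstract does not say how Shannon entropy produces the weight factors, so the sketch plainly substitutes inverse-distance weights.

```python
import numpy as np

def mahalanobis_knn_impute(data, k=3):
    """Fill NaNs in a genes x conditions matrix from Mahalanobis-nearest complete genes.

    Assumes every gene has at least one observed value and that several
    genes have no missing values at all (they serve as neighbour candidates).
    """
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]   # candidate neighbour genes
    for i in np.where(np.isnan(data).any(axis=1))[0]:
        miss = np.isnan(data[i])
        obs = ~miss
        # Covariance between the observed conditions, estimated from complete genes.
        cov = np.atleast_2d(np.cov(complete[:, obs], rowvar=False))
        cov_inv = np.linalg.pinv(cov)              # pseudo-inverse for stability
        diff = complete[:, obs] - data[i, obs]
        # Squared Mahalanobis distance of each candidate to the target gene.
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        dist = np.sqrt(np.maximum(d2, 0.0))
        nearest = np.argsort(dist)[:k]
        # Assumption: inverse-distance weights stand in for the entropy-based weights.
        w = 1.0 / (dist[nearest] + 1e-12)
        w /= w.sum()
        filled[i, miss] = w @ complete[nearest][:, miss]
    return filled
```

Calling `mahalanobis_knn_impute(matrix, k=10)` on a matrix with `NaN` entries returns a copy in which only the missing entries have been replaced; observed values are left untouched.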
Experimental results show that FCMimpute is a feasible and efficient way to estimate missing values in gene expression data. In addition, because its clustering parameters are determined adaptively, FCMimpute can improve clustering correctness. FCMimpute is therefore an effective imputation method that can generate reliable imputed values.

The Fuzzy C-Means clustering algorithm is very sensitive to the situation of the...
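As a rough illustration of clustering-based imputation in the spirit of FCMimpute, the sketch below runs fuzzy c-means using only observed entries and fills each missing entry with the membership-weighted average of the cluster centers. This is an assumption-laden sketch, not the thesis algorithm: the function name `fcm_impute`, the partial-distance treatment of missing entries, and the fixed `n_clusters` and fuzzifier `m` are hypothetical choices, whereas the thesis determines its clustering parameters adaptively by a procedure the abstract does not describe.

```python
import numpy as np

def fcm_impute(data, n_clusters=2, m=2.0, n_iter=100, seed=0):
    """Fill NaNs in a genes x conditions matrix using fuzzy c-means memberships."""
    data = np.asarray(data, dtype=float)
    mask = ~np.isnan(data)                  # True where a value was observed
    x = np.where(mask, data, 0.0)           # zeros are placeholders, always masked out
    rng = np.random.default_rng(seed)
    u = rng.random((data.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)       # random initial memberships
    n_obs = np.maximum(mask.sum(axis=1), 1)
    for _ in range(n_iter):
        um = u ** m
        # Center update: weighted mean over observed entries only.
        centers = (um.T @ x) / np.maximum(um.T @ mask.astype(float), 1e-12)
        # Partial squared distance over observed entries, rescaled to full length.
        diff = (x[:, None, :] - centers[None, :, :]) * mask[:, None, :]
        d2 = (diff ** 2).sum(axis=2) * (data.shape[1] / n_obs)[:, None]
        d2 = np.maximum(d2, 1e-12)
        # Standard FCM membership update.
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    estimate = u @ centers                  # membership-weighted center values
    return np.where(mask, data, estimate)
```

Because the estimate is a soft combination of all centers, a gene that sits between clusters receives a blended value rather than the value of a single hard-assigned cluster, which is the property that motivates using a fuzzy method here.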
Keywords/Search Tags: Gene Expression Data, Missing Value, Mahalanobis Distance, Fuzzy C-Means Algorithm, Iterated Local Search