Research On Methods Of Data Processing And Analysis For Microarrays

Posted on: 2012-02-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Zhang
GTID: 1100330335450231
Subject: Computer Science and Technology
Abstract/Summary:
Advances in biotechnology have steadily deepened our understanding of ourselves. The Human Genome Project (HGP) in particular delivered the most important raw material of all: the sequence that encodes life. Sequencing now makes the complete genome of almost any organism readily obtainable, yet the enormous size and complexity of genomes mean that we are still far from actually cracking the genomic code. The coding, transcription, and regulation mechanisms behind genomes remain to be uncovered. Sequencing technology has advanced to the point where large centers such as the Joint Genome Institute (JGI) of the Department of Energy can sequence a prokaryotic genome within one day, so the most critical task for researchers is now to reveal the complicated rules hidden in massive omics data. Bioinformatics is the research area that combines biology and computer science to crack the code of life.
In this dissertation, bioinformatics methods are used to analyze genetic information; specifically, we focus on the analysis of gene expression profiles. Microarrays are a powerful tool for high-throughput measurement of gene expression, and more and more groups employ them in valuable areas such as cancer research and agricultural research. We study noise control in microarrays and significance analysis of gene expression. We propose a method for labeling error detection based on the effect of data perturbation on a regression model, and a method for detecting differentially expressed genes that accounts for overall variability, and we apply both to human cancer and crop microarrays.
Mislabeled samples can seriously degrade classification accuracy, especially for supervised learning procedures. Effective methods for detecting labeling errors are therefore necessary to improve the analysis of microarray data.
Many approaches have been proposed for detecting labeling errors in settings where the number of features is smaller than the number of samples. Most of them, however, are unsuitable for microarray data, which is characterized by high dimensionality and small sample size. Some studies have tried to identify mislabeled samples specifically in microarray datasets, but these methods were mostly applied to only a single dataset. Malossini et al. (2006) proposed two data perturbation methods for labeling error detection, the CL-Stability algorithm and the LOOE-Sensitivity algorithm. CL-Stability resembles a voting procedure: a sample is considered a suspect if the number of dissenting votes against its original label exceeds a threshold. LOOE-Sensitivity flips the labels of samples and tries to identify mislabeled samples from the results obtained with those labels flipped. However, it fails to measure the effect of the perturbation on the classifier, which can lead to poor performance.
In this dissertation, the perturbing influence value (PIV) is defined to measure the effect of data perturbation on the regression model. Based on the PIV, the Column Algorithm (CAPIV) and the Row Algorithm (RAPIV) are proposed, each adopting a different perspective on the perturbing influence. RAPIV, however, is affected by the mislabeled samples themselves, which introduce errors into the calculation of the PIV. Its threshold is fixed at 0, which cannot compensate for these errors. We argue that the thresholds should change with the number of mislabeled samples, and we determine the relationship between them. Based on this idea of threshold adjustment, RAPIV with Parameter Adjustment (PA-RAPIV) and CAPIV with Parameter Adjustment (PA-CAPIV) are proposed.
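The voting idea attributed to CL-Stability above can be illustrated with a minimal sketch. This is not the algorithm of Malossini et al. nor any of the PIV-based methods: the nearest-centroid classifier, the single-label-flip perturbation, the 0.5 threshold, and the names `nearest_centroid_predict` and `voting_suspects` are all assumptions chosen only to make the example self-contained.

```python
from statistics import fmean


def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label (0 or 1) whose class centroid is closest to query."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        if not rows:
            continue  # a class can vanish after a flip in very small datasets
        centroid = [fmean(col) for col in zip(*rows)]
        dist = sum((q - c) ** 2 for q, c in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def voting_suspects(samples, labels, threshold=0.5):
    """Flag samples whose label is contradicted by most perturbed classifiers.

    For each sample i, every other sample j has its label flipped in turn
    (the assumed perturbation). A classifier is trained without i and asked
    to predict i; each prediction that disagrees with labels[i] counts as
    one dissenting vote.
    """
    n = len(samples)
    suspects = []
    for i in range(n):
        dissent = 0
        for j in range(n):
            if j == i:
                continue
            train_x = [samples[k] for k in range(n) if k != i]
            train_y = [
                (1 - labels[k]) if k == j else labels[k]
                for k in range(n) if k != i
            ]
            if nearest_centroid_predict(train_x, train_y, samples[i]) != labels[i]:
                dissent += 1
        # Suspect when the share of dissenting votes exceeds the threshold.
        if dissent / (n - 1) > threshold:
            suspects.append(i)
    return suspects


# Hypothetical toy data: two well-separated groups; sample 3 sits in the
# first group but carries the label of the second.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2]]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
print(voting_suspects(samples, labels))
```

On this toy data only sample 3 is flagged. Raising `threshold` trades recall for precision, which mirrors the behavior the abstract reports for CL-Stability.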
To improve RAPIV further, the Progressive Row Algorithm based on the Perturbing Influence Values (PRAPIV) is proposed, which adds a progressive correction procedure. We apply the proposed methods, together with the simple SVM method and the CL-Stability algorithm, to six artificial datasets and five microarray datasets; the microarray data come from crops and human cancers. Experimental results show that PRAPIV increases precision while achieving high recall, and offers a better balance between precision and recall than the other methods.
The recall of the simple SVM method is high, but its precision is low. SVM is a good classifier, but its classification accuracy cannot reach 100%, and the samples it misclassifies become false positives in labeling error detection. Compared with SVM, CL-Stability has the advantage of drawing on more classification results, generated by the leave-one-out method: a sample is flagged as a mislabeling suspect only when those results show statistical significance. This limits the number of false positives, but it also leaves some mislabeled samples undetected; the high precision of CL-Stability comes at the expense of recall. CAPIV and RAPIV retain the high recall of SVM, but their precision is still very low, because the mislabeled samples make the calculation of the TIV and IIV values imprecise. PA-RAPIV improves on RAPIV, but it is limited by the fact that the number of mislabeled samples is unknown.
PRAPIV overcomes this deficiency of CAPIV and RAPIV by progressively correcting the suspects, so it achieves both high precision and high recall.
The usual way to select the genes that are significantly expressed in a condition relative to the control is statistical testing. There are two kinds of statistical test, parametric and non-parametric. Parametric tests such as the t-test and the Welch test assume that expression values follow a normal distribution; non-parametric tests such as the permutation test and the Wilcoxon rank-sum test make no assumption about the distribution. Either way, these methods consider only a single dataset and ignore how the genes are expressed in other datasets. Our goal is to develop a reliable method for identifying differentially expressed genes in microarrays that takes into account the overall variability of each gene across a wide range of studies.
In this dissertation, a coefficient of variation based on the mean absolute deviation (MADCV) and an improved version (IMADCV) are defined to measure the variability of gene expression levels. The results show that non-classified coefficients such as CV and MFC perform better than classified coefficients such as FC and the t statistic on datasets with few samples. Compared with existing methods such as the t-test, SAM, and RankProd, MADCV and IMADCV perform better whether the number of samples in the dataset is large or small.
A coefficient of variation alone cannot detect differentially expressed genes precisely, because it is difficult to choose a threshold for the coefficient. Methods that use the overall collection of datasets can reflect the distribution of the coefficient of variation. Three such methods are proposed: overall distribution, overall outlier, and overall permutation.
We apply them to human and crop microarray data. The experimental results show that the overall permutation method based on IMADCV achieves more than 90% for both precision and recall.
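The abstract does not give the formula for MADCV or IMADCV. As a rough illustration of a coefficient of variation built on the mean absolute deviation, the sketch below assumes the form "mean absolute deviation divided by the absolute mean". That form, the function name `mad_cv`, and the gene names and values are all hypothetical and may differ from the dissertation's definitions.

```python
from statistics import fmean


def mad_cv(values):
    """Mean absolute deviation divided by |mean| (an assumed MADCV-style form)."""
    mean = fmean(values)
    if mean == 0:
        raise ValueError("coefficient is undefined when the mean is zero")
    mad = fmean(abs(v - mean) for v in values)
    return mad / abs(mean)


# Hypothetical expression values for two genes pooled across several datasets.
expression = {
    "gene_stable": [10.0, 10.0, 10.0, 10.0],
    "gene_variable": [5.0, 15.0, 5.0, 15.0],
}

# Rank genes from most to least variable under the assumed coefficient.
ranked = sorted(expression, key=lambda g: mad_cv(expression[g]), reverse=True)
print(ranked)
```

A ranking like this still needs a cut-off to decide which genes count as differentially expressed, which is the threshold-selection problem that the overall distribution, overall outlier, and overall permutation methods are meant to address.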
Keywords/Search Tags: Microarrays, labeling error detection, data perturbation, differentially expressed gene