Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression microarrays

Posted on: 2015-08-21
Degree: Ph.D.
Type: Thesis
University: University of Alberta (Canada)
Candidate: Vaisipour, Saman
Full Text: PDF
GTID: 2470390017997862
Subject: Computer Science
Abstract/Summary:
Gene expression microarrays are widely used to better understand the complex biological mechanisms inside cells. One of the main obstacles to applying statistical learning algorithms to microarray data is the large gap between the number of features (p) and the number of available instances (n), i.e., the "large p, small n" challenge. This thesis explores two ways to deal with this challenge.

One approach is to increase n by combining suitably similar microarray data sets. This is appealing because many microarray studies are now publicly available. The main problem with this approach is the batch effect, i.e., the influence of non-biological factors on expression intensities, which can confound the biological signal. As a result, combining gene expression studies without correcting for batch effects may lead to misleading findings.

This thesis proposes a novel batch correction algorithm, called batch effect correction using canonical correlation analysis (BECCA), that assumes the batch effect is due to additive, independent confounding factors and so uses canonical correlation analysis to separate technical bias from the measured biological signal. We compare BECCA to various existing batch correction algorithms on several real-world gene expression studies and find that BECCA performs comparably. The key advantage of BECCA over other similarly performing algorithms is its flexibility: by changing the values of the BECCA parameters, the user can adjust how much common signal to preserve across the batches and how much batch-related signal to remove from each one.

The second approach to batch correction considers the wisdom of reducing p by selecting a subset of genes. Our experiments suggest that some genes in microarray data sets contain very little biological signal, i.e., including only these genes in the calculations makes all specimens highly correlated, regardless of their tissue of origin or disease state. It is, therefore, desirable to identify and remove these misleading genes before conducting downstream analysis or batch correction. For this purpose, we propose an efficient algorithm that extends the single-study variance-based gene selection method to a multi-study gene selection algorithm. Our empirical results show that this feature selection algorithm outperforms other algorithms in reducing the destructive influence of batch effects.
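
The abstract does not spell out the multi-study gene selection algorithm, so the following Python sketch is only an illustration of the general idea of extending variance-based gene selection across studies, not the thesis's method. It assumes every study is a samples-by-genes matrix with genes in the same column order, ranks genes by variance within each study, and keeps the genes with the highest average rank. The function name `select_genes_multi_study` and the parameter `keep_fraction` are hypothetical.

```python
import numpy as np

def select_genes_multi_study(studies, keep_fraction=0.5):
    """Return sorted column indices of genes to keep.

    studies: list of arrays shaped (n_samples_i, n_genes), same gene order.
    keep_fraction: fraction of genes to keep (hypothetical parameter).
    Assumption: averaging within-study variance ranks is one plausible way
    to combine studies; it is not taken from the thesis.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_genes = studies[0].shape[1]
    if any(s.shape[1] != n_genes for s in studies):
        raise ValueError("all studies must have the same number of genes")

    mean_rank = np.zeros(n_genes)
    for s in studies:
        # Variance is computed within a study, so an additive per-study
        # offset does not change it.
        variances = s.var(axis=0, ddof=1)
        # Double argsort turns values into ranks (0 = lowest variance).
        ranks = variances.argsort(kind="stable").argsort(kind="stable")
        mean_rank += ranks / max(n_genes - 1, 1)
    mean_rank /= len(studies)

    n_keep = max(1, int(round(keep_fraction * n_genes)))
    top = np.argsort(mean_rank, kind="stable")[::-1][:n_keep]
    return np.sort(top)

# Synthetic demo: genes 0-4 vary strongly, genes 5-9 are nearly constant.
rng = np.random.default_rng(0)
scales = np.array([5.0] * 5 + [0.1] * 5)
study_a = rng.normal(0.0, scales, size=(20, 10))
study_b = rng.normal(0.0, scales, size=(15, 10)) + 3.0  # additive batch offset

kept = select_genes_multi_study([study_a, study_b], keep_fraction=0.5)
print(kept.tolist())  # [0, 1, 2, 3, 4]
```

Ranking within each study before combining means that a study-wide shift or scale difference cannot dominate the selection, which is one reason a rank-based combination is a reasonable starting point for multi-site data.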
Keywords/Search Tags:Gene expression, Batch, Microarray, Biological, Data, BECCA