
Research into Biological Omics Dataset Integration Based on the Sparse Partial Least Squares Algorithm

Posted on: 2013-08-22
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
GTID: 2230330371483557
Subject: Computer application technology
Abstract/Summary:
Since the completion of the Human Genome Project (HGP), a range of newly developed omics techniques has brought life science into the era of systems biology. In this era, technological advances make it possible to collect very large quantities of omics data from various analysis platforms, such as transcriptomic, proteomic, or metabolomic data. To better understand the underlying biological mechanisms and the interactions between functional levels, we need integrative approaches that analyze these data jointly while still supporting biological interpretation, so that we can describe and predict the biological function of an organism and the phenotype and behavior of a life process. Omics data types are typically characterized by many variables but not necessarily many samples or observations, so the variables are often linearly correlated. In this high-dimensional setting, it is crucial to select genes, proteins, or metabolites, both to overcome computational limits (from a mathematical and statistical point of view) and to facilitate biological interpretation. Methods based on canonical correlation analysis or partial least squares (PLS) regression are commonly used to integrate omics datasets because of their acceptable computational performance. The currently popular sparse partial least squares (sPLS) regression method is a newer variant of PLS that has a built-in variable selection procedure and performs well in omics data integration and biological interpretation. This thesis studies sPLS and tries to improve it. sPLS is based on lasso penalization and is obtained by applying the penalty within a sparse singular value decomposition. The method can overcome these mathematical and statistical computational limits and enables feasible biological interpretation at a reduced experimental cost.
PLS mitigates the harm of multicollinearity by reducing the dimensionality of the data, and it analyzes the correlation between two datasets in a way that draws on canonical correlation analysis. sPLS filters the important information from the data and selects several components with the best interpretability. A regression model between the two datasets is then built from the selected components. The lasso (least absolute shrinkage and selection operator) penalty sets the coefficients of unimportant variables to zero so that only the main variables remain in the regression model. In this way the lasso yields a sparse solution for omics dataset integration by selecting variables and estimating regression coefficients at the same time. From our study of the lasso penalty, we learned that the lasso selects at most n variables when p > n, where p is the number of variables and n is the sample size. Moreover, if there is a group of variables whose pairwise correlations are very high, the lasso tends to select only one variable from the group and does not care which one. The lasso is therefore not an ideal method for the p > n situation. This thesis uses the elastic net penalty to improve the variable selection step in sPLS. The elastic net is a regularized variable selection method: like a stretchable fishing net that retains "all the big fish", it can select whole groups of correlated variables while still picking out the target variables from among all variables. The elastic net problem is transformed into an equivalent lasso problem on an augmented matrix of the independent-variable dataset, which yields an elastic net soft-thresholding function; variable selection is then carried out by applying this soft-thresholding function to the weight vectors of the datasets. We obtained gene expression data and clinical liver-function data from a rodent liver toxicity study and applied PLS, sPLS, and the improved sPLS method to these two datasets.
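The thresholding step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the toy numbers, and the exact elastic net form (lasso soft-thresholding followed by a 1/(1 + lam2) shrinkage, the form the naive elastic net takes under an orthogonal design) are assumptions made for illustration; the abstract does not state the formula actually used.

```python
import math

def soft_threshold(w, lam):
    """Lasso soft-thresholding: sign(w_j) * max(|w_j| - lam, 0) for each entry."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in w]

def elastic_net_threshold(w, lam1, lam2):
    """Assumed elastic net form: soft-threshold by lam1, then shrink by 1/(1 + lam2)."""
    return [x / (1.0 + lam2) for x in soft_threshold(w, lam1)]

def normalize(w):
    """Rescale a vector to unit Euclidean norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm > 0 else list(w)

def sparse_loading_step(M, v, lam):
    """One sparse-SVD style update of a loading vector: normalize(soft_threshold(M v, lam)).

    M is a hypothetical cross-product matrix (rows: X variables, columns: Y variables)
    and v is the current loading vector for the Y side.
    """
    Mv = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
    return normalize(soft_threshold(Mv, lam))

# Toy weight vector: small entries are set to zero, large ones are shrunk.
weights = [0.9, -0.05, 0.4, -0.6, 0.02]
print(soft_threshold(weights, 0.1))
print(elastic_net_threshold(weights, 0.1, 0.5))

# Toy cross-product matrix: only the first X variable survives the threshold.
M = [[2.0, 0.0], [0.1, 0.0], [0.0, 1.0]]
print(sparse_loading_step(M, [1.0, 0.0], 0.5))
```

In this sketch, variables whose weights fall below the threshold drop out of the component entirely, which is how the penalty performs variable selection while the component is being computed.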
We compared their performance through cross-validation. The experimental results show that introducing variable selection gives more stable and effective predictions, and that sPLS based on the elastic net penalty performs more efficiently than the original sPLS when selecting target variables. This thesis presents only basic research into methods for integrating biological omics data. sPLS provides a very useful tool for integrating two omics datasets and can perform well in biological interpretation. As future research directions, we could use other thresholding rules instead of the soft-thresholding rule, or consider the symmetric analysis version of sPLS. We could also try other penalty functions or improve the existing ones, for example by using an adaptive elastic net penalty. However, because of the complexity of biological processes and the particular nature of high-throughput omics data, much further work is still needed, in both theory and practice, on the statistical properties and the biological interpretation performance of sPLS.
Keywords/Search Tags: Biological omics data, Partial least squares regression, Variable selection, Lasso algorithm, Elastic net algorithm