Objective: With the advancement of biotechnology, vast amounts of genomic data are being generated. Prediction and classification based on these data offer a cost-effective and time-efficient route to early disease screening. However, the relationship between genes and a trait can be very complex: the mapping from genotype to phenotype is not a simple function of individual genes but involves complex interactions among many genes, and should therefore be treated as a nonlinear mapping problem. In this context, it is important to develop powerful and efficient statistical models that can capture potential nonlinear relationships. In this dissertation, we develop a kernel partial least squares (KPLS) based prediction method that combines multiple genomic data sources to provide richer information and improve prediction and classification performance. The proposed method is expected to have better learning capacity and generalization ability.

Methods: First, we construct a classical KPLS model. We then compute a new composite kernel function as a convex combination of multiple kernel functions. Finally, we replace the kernel function in the classical KPLS model with the composite kernel function. All parameters of the composite KPLS model are optimized using a genetic algorithm. By constructing an appropriate composite kernel function, the method can handle prediction or classification problems with a single genomic data source or with multiple genomic data sources. Its performance is demonstrated through simulations and real data analysis.

Results: Extensive simulation studies and real data analysis show that the proposed genetic algorithm based composite KPLS model has the largest Q²_F1 and the smallest root mean squared error of prediction (RMSEP) among the compared methods when predicting a quantitative trait from a single genomic data source.
It also has the largest Youden index and the smallest classification error when classifying triple-negative versus non-triple-negative breast cancer patients using three genomic data sources (microRNA, mRNA, and CNVs) obtained from the TCGA website.

Conclusion: We proposed a composite kernel approach based on the KPLS framework for both prediction and classification. The composite kernel has good learning capacity as well as generalization ability. It can efficiently fuse multiple genomic data sources and obtain improved performance. A genetic algorithm can be used to solve the optimization problem for the kernel parameters and kernel weights.
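
As a minimal illustration of the composite kernel described in the Methods, the following Python sketch forms a convex combination of Gaussian (RBF) kernels, one per data source. The choice of RBF kernels, the function names `rbf_kernel` and `composite_kernel`, the toy data, and the fixed bandwidths and weights are all assumptions made for illustration. In the dissertation these parameters are tuned by a genetic algorithm, which is not shown here.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def composite_kernel(sources, gammas, weights):
    """Convex combination of one RBF kernel per data source.

    sources: list of (n_samples, n_features_k) arrays sharing the same samples.
    gammas:  one RBF bandwidth per source (hypothetical fixed values here).
    weights: non-negative kernel weights that sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    kernels = [rbf_kernel(X, g) for X, g in zip(sources, gammas)]
    return sum(w * K for w, K in zip(weights, kernels))


# Random toy stand-ins for three data sources measured on the same 20 samples.
rng = np.random.default_rng(0)
sources = [rng.normal(size=(20, p)) for p in (50, 100, 30)]
K = composite_kernel(sources, gammas=[0.01, 0.005, 0.02], weights=[0.5, 0.3, 0.2])
```

Because a convex combination of positive semidefinite kernels is itself positive semidefinite, the resulting matrix `K` can be used wherever a single kernel matrix is expected, for example in place of the kernel in a KPLS model.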