Application of the Partial Least Squares Method in Cancer Microarray Gene Expression Profile Analysis

Posted on: 2010-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Jin
GTID: 2144360275475655
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Gene microarray technology has revolutionized the way cancer genes are monitored. Because it can measure thousands of genes at once, it greatly improves efficiency and reduces the cost of experiments, and it has become one of the most powerful and versatile tools available in cancer research. However, the technology has also raised new problems. In a microarray setting, the number of variables (number of genes, p) is much larger than the sample size (number of cases, n), which is known as the "large p, small n" problem, and the relationships among the variables are very complicated. Traditional predictive models, such as the multiple linear regression model, the logistic model and the Cox model, perform poorly or cannot be fitted at all. For data with a similar structure in chemometrics, the partial least squares (PLS) method has proved to be a useful tool and is used as a predictive regression modeling method. PLS is a popular method for "soft modeling", that is, modeling without the assumptions required by ordinary least squares, such as the absence of multicollinearity among the independent variables. With the development of bioinformatics, PLS was introduced into gene expression microarray data analysis in the early 2000s. It is characterized by high computational and statistical efficiency and by visualization of the data structure, and it offers great flexibility and versatility.

Objective: In this research, we introduced PLS into cancer gene expression microarray data analysis and discussed its applications to such data. First, we used the VIP (Variable Importance in Projection) value to select differentially expressed genes and then examined the selected genes from a biological perspective. Second, given the importance of identifying cytogenetic aberrations, we used the selected genes to detect the cytogenetic aberrations of hepatocellular carcinoma (HCC) with Fisher's exact test. Third, because multi-class classification is important for classifying tumor sub-types, we compared two PLS-based multi-class classification methods in terms of accuracy and reliability.

Methods: ① The VIP value was used to detect differentially expressed genes. ② The genes selected in step one were located on the chromosomes, and the down-regulated and up-regulated genes on each chromosome were counted. Fisher's exact test was then used to detect regions with significant cytogenetic aberrations. ③ Four different methods for selecting significant genes were applied to four real tumor microarray gene expression data sets, and two classification methods were then carried out. Incomplete leave-one-out cross-validation was used to evaluate the gene selection methods and the classification models, and the best gene selection method was carried forward to the next step. With the best gene selection method, total cross-validation was used to evaluate the accuracy and reliability of the two classification methods.

Results: ① Significant genes of HCC could be selected effectively by VIP value. ② Using the algorithm proposed in this research, 15 regions of cytogenetic aberrations were identified, all of which have been confirmed by experiments. Compared with traditional experimental analysis and predictive algorithms, the partial least squares method combined with Fisher's exact test was an effective and simple algorithm for identifying cytogenetic aberrations. ③ Both multi-class classification methods reached good accuracy. Evaluated by incomplete cross-validation, the differentially expressed genes selected by VIP value gave the best results for both classification methods. With the top 200 genes ranked by VIP value and 4 PLS components, both PLS-LDA and PLS-DA performed well, with low error rates. This supported the use of VIP values to select differentially expressed genes in the subsequent analysis. To evaluate the two classification methods, total cross-validation was carried out with different numbers of folds k. The average error rate and the difference between t-LOOCV and t-2-fold CV (total leave-one-out and total 2-fold cross-validation) showed that PLS-DA outperformed PLS-LDA in both accuracy and reliability.
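The VIP-based gene selection in step ① can be illustrated with a minimal sketch. It assumes a single numeric response (PLS1 fitted with the NIPALS algorithm) and uses NumPy with synthetic data. The function names `pls1_vip` and `top_genes_by_vip` are hypothetical, and nothing here is taken from the thesis's own implementation. For more than two classes, a multi-response PLS with a dummy-coded class matrix would be needed instead.

```python
import numpy as np

def pls1_vip(X, y, n_components):
    """VIP scores from a single-response PLS (NIPALS) fit. Illustrative sketch."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)          # center each gene (column)
    y = y - y.mean()                # center the response
    n_genes = X.shape[1]
    W = np.zeros((n_genes, n_components))   # unit-norm weight vectors
    ss = np.zeros(n_components)             # response variation explained per component
    for a in range(n_components):
        w = X.T @ y
        w = w / np.linalg.norm(w)
        t = X @ w                   # score vector for this component
        tt = t @ t
        p_load = X.T @ t / tt       # X loadings
        q = (y @ t) / tt            # y loading
        X = X - np.outer(t, p_load) # deflate X
        y = y - q * t               # deflate y
        W[:, a] = w
        ss[a] = q ** 2 * tt
    # VIP_j = sqrt(p * sum_a(ss_a * w_ja^2) / sum_a(ss_a))
    return np.sqrt(n_genes * (W ** 2 @ ss) / ss.sum())

def top_genes_by_vip(X, y, n_components, k):
    """Indices of the k genes with the largest VIP scores, highest first."""
    vip = pls1_vip(X, y, n_components)
    return np.argsort(vip)[::-1][:k]

# Synthetic data for illustration only: gene 0 drives the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)
vip_demo = pls1_vip(X_demo, y_demo, n_components=2)
print(top_genes_by_vip(X_demo, y_demo, n_components=2, k=3))
```

Because each weight vector has unit length, the squared VIP scores always average to 1, which is why VIP > 1 is a common rule of thumb for an influential variable. The abstract instead ranks genes and keeps the top 200, which corresponds to calling `top_genes_by_vip` with `k=200`.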
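The chromosome-level test in step ② can be sketched in the same spirit. The abstract does not state how the contingency table was constructed, so the sketch assumes a 2x2 table that compares up-regulated and down-regulated selected genes inside one chromosomal region against the rest of the genome. The function name `fisher_exact_two_sided` and the counts are hypothetical and are not data from the thesis.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts for illustration only (assumed table layout):
#                 up-regulated   down-regulated
# in region            12              1
# rest of genome       80            107
p_value = fisher_exact_two_sided(12, 1, 80, 107)
print(f"two-sided p = {p_value:.4g}")
```

A small p-value for a region would flag it as a candidate gain (excess of up-regulated genes) or loss (excess of down-regulated genes). When many regions are tested, a multiple-testing correction would normally be applied, although the abstract does not say whether one was used.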
Keywords/Search Tags: Partial least squares, Cancer, Gene expression microarray, Differentially expressed genes, Cytogenetic aberrations, Multi-classification, Cross validation