
The Application Of Mathematical Modeling And Data Mining Methods In Research Of Colorectal Cancer Metastatic Mechanism

Posted on: 2013-02-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X B Li
GTID: 1114330371984720
Subject: Pathology and pathophysiology
Abstract/Summary:
Colorectal cancer (CRC) is one of the most common types of cancer. In 2007, nearly 1.2 million new cases of CRC were estimated to have been diagnosed globally, and about 630,000 deaths from CRC were estimated to have occurred, accounting for 8 percent of all cancer deaths. The vast majority of CRC deaths are due to metastasis. CRC is highly curable when diagnosed at an early stage, but is less likely to be curable when detected at an advanced stage (when distant metastasis has occurred). When CRC is confined to the colon or rectum, five-year survival may be as high as 90 percent. The five-year survival rate is 68 percent for CRC patients diagnosed at the regional stage, and as low as 11 percent for CRC patients with metastasis. It is estimated that approximately 60% of CRC patients will eventually develop metastasis.

CRC metastasis, as the advanced stage of colorectal tumorigenesis, is a complicated, multi-step biological process. It has rarely been systematically addressed by previous studies, and its molecular mechanism remains far from completely elucidated. Multiple oncogenes and tumor suppressor genes participate in the process of CRC metastasis. Identifying colorectal metastasis genes helps to establish new detection methods, determine optimized clinical therapy protocols, develop targeted therapy drugs and improve the ability to judge prognosis.

Chromosomal abnormalities are usually considered an important feature of cancer cells, and various types of chromosomal abnormalities and aneuploidy are observed in nearly 90% of human malignant tumors. Laboratories and public databases have now accumulated a large amount of chromosomal aberration data, enabling statistical analyses to establish mathematical models. In this study, we summarized several practical applications of mathematical models: the tree model, the Bayesian network model and the multivariate analysis model.
The advantages and disadvantages of these models were compared, and their principles, methods and applications in the study of the molecular mechanism of tumor development were discussed. Because these models come from different mathematical backgrounds, each has inherent advantages and disadvantages, so integrating them may give further insight into tumor development. The findings inferred by these models suggest clues for subsequent molecular biology experiments, enabling a better understanding of the genetic changes that occur in human cancer cells.

To explain the molecular mechanism of CRC development, Vogelstein proposed a classical linear model in 1988, which has been widely recognized by researchers. However, recent studies have revealed the genetic heterogeneity of CRC tumors, challenging the classical pathway model of CRC development. CRC development allows multiple, nonlinear processes rather than a single, linear pathway. Prompted by this, Desper et al. designed oncogenetic tree models to capture the heterogeneity of tumors, aiming to define the patterns of chromosomal abnormality in solid tumors and to establish whether any associations exist between chromosomal aberrations.

In this study, we collected comparative genomic hybridization (CGH) data for a total of 244 CRC cases from 9 published articles and successfully constructed oncogenetic tree models for CRC. We identified the 6 most common gains, in chromosomal regions 7p (37.0%), 7q11-32 (34.8%), 8q (48.3%), 13q (49.1%), 20p (36.1%) and 20q (50.4%), and the 9 most common losses, in 1p13-36 (30.9%), 4p15 (24.3%), 4q33-34 (24.3%), 8p12-23 (50.9%), 15q13-14 (23.5%), 15q24-25 (24.3%), 17p (34.8%), 18p (36.5%) and 18q (61.7%). Based on the analysis of the branching tree and the distance-based trees, we classified sporadic CRC into two distinct groups: one preceding with -8p12-23, and the other with +20q.
The sample-based classification tree also demonstrates that colorectal cancer can be classified into multiple subtypes marked by -8p12-23 and +20q.

By comparing the 15 common chromosomal abnormalities between primary and metastatic colorectal cancer, we identified 5 potential metastatic pathways: (-18q, -18p), (-8p12-23, -4p15, -4q33-34), (+20q, +20p), (+20q, +7p, +7q11-32) and +8q. A data matrix was created based on the distribution of the 15 selected important events, and a classifier built on the data matrix was used to distinguish between primary CRC and CRC metastasis. A feature selection method was applied, and -8p12-23 and +20q were selected as the two most informative events. We counted the occurrences of these two events in primary and metastatic CRC cases and found that 79 of 85 (92.9%) metastatic cases were marked by -8p12-23 or +20q, implying that -8p12-23 and +20q are the two marker events of CRC metastasis.

A multitude of studies have demonstrated the potential of gene expression profiles to classify different tumor types for diagnostic and prognostic purposes. Gene selection is a key technique in classification with microarray data. Guyon et al. proposed the support vector machine recursive feature elimination (SVM-RFE) algorithm, which recursively removes genes based on their weights in support vector machine (SVM) classifiers and classifies the samples with SVM. The SVM-RFE approach to gene selection has recently attracted many researchers. In this study, we propose a novel SVM-RFE-based gene selection algorithm, support vector machine and t statistics recursive feature elimination (SVM-T-RFE), which incorporates Student's t statistics.

We compared the performance of the SVM-T-RFE and SVM-RFE gene selection algorithms on five public microarray datasets. SVM-T-RFE achieved the same performance (100% accuracy) as SVM-RFE, but with smaller subsets of probe sets in the colon (n=5 vs n=9), lymphoma (n=3 vs n=5) and prostate (n=5 vs n=6) datasets.
For the Leukemia and Medulloblastoma datasets, since the SVM-RFE algorithm already obtained 100% prediction accuracy using two (Medulloblastoma) or three (Leukemia) gene probes, there is little room for improvement, and we obtained 100% prediction accuracy using a similar number of gene probes.

We obtained 55 early stage primary CRCs (Pathological Stage: 0 or 1; Group 1), 56 late stage primary CRCs (Pathological Stage: 4; Group 2), and 34 colorectal metastatic cancers (Group 3) from the Gene Expression Omnibus website. The gene expression profiles were obtained using the Human Genome U133 Plus 2.0 Array (Affymetrix, Inc.), which contains 54,675 probe sets. To discover colorectal metastasis-related genes, we classified samples between early and late stage CRCs: Group 1 (n=55) and Group 2 (n=56) were combined as the PRI dataset. Comparisons were also made between late stage CRCs and metastatic samples: Group 2 and Group 3 (n=34) were merged into the META dataset.

The gene selection algorithm yields an ordered feature set in which genes are ranked from high to low scores. We selected a subset of the top 200 ranking probe sets. We used leave-one-out cross-validation (LOOCV) to assess the performance of the classifiers while reducing the number of top ranking probe sets from 200 to 1. The gene subset obtained in the PRI dataset with the SVM-RFE method (PRI-GS-1) achieved 100% accuracy with 12 probes. The SVM-T-RFE method was also applied to the PRI dataset, yielding the PRI-GS-2 subset, which reached 100% accuracy with the smallest number of probes (n=10). In the META dataset, the gene subset obtained with the SVM-RFE method (META-GS-1) achieved 100% accuracy with 10 probes, and the subset obtained with the SVM-T-RFE method (META-GS-2) achieved 100% accuracy with 6 probes.
SVM-T-RFE outperformed SVM-RFE in the PRI and META datasets in terms of classification performance.

The PRI-GS-1 subset contains 20 significant probes (P<0.05), while the PRI-GS-2 subset contains 132 significant probes. The META-GS-1 subset contains 15 significant probes (P<0.05), while the META-GS-2 subset contains 29 significant probes. The SVM-T-RFE method was thus able to detect more differentially expressed genes (DEGs) than the SVM-RFE method.

Published gene expression data from the study of Jorissen et al. were also used to compare the SVM-RFE and SVM-T-RFE algorithms. The dataset contains 364 CRC samples, including 86 Dukes stage A, 94 Dukes stage B, 91 Dukes stage C and 93 Dukes stage D CRCs. All samples were analyzed using the Human Genome U133 Plus 2.0 Array (Affymetrix, Inc.), which contains 54,675 probe sets. The SVM-RFE and SVM-T-RFE algorithms were applied to classify between the 86 Dukes stage A and 93 Dukes stage D samples. SVM-T-RFE reached 100% accuracy with 16 probes, and SVM-RFE reached 100% accuracy with 21 probes. We built classifiers from the 16 probes selected by the SVM-T-RFE algorithm, and the 94 Dukes stage B CRCs were then classified into two groups: good prognosis (stage A-like) and poor prognosis (stage D-like). Kaplan-Meier curves were used to estimate the cumulative distribution of disease-free survival, and patients predicted as stage A-like had a better outcome than patients predicted as stage D-like (log-rank P=.019). The Kaplan-Meier analysis revealed that the 16 probes were associated with poor prognosis in Dukes B patients.

The SVM-T-RFE method was shown to outperform the SVM-RFE method in two respects: first, the highest prediction accuracy is achieved using an equal or smaller number of selected genes; second, more differentially expressed genes are contained in the subsets of selected genes. Our results revealed the predictive power of microarray technologies.
A fraction of the selected genes are associated with CRC development or cancer metastasis, although many others need to be confirmed by further experiments.

In recent years, the rapid development of molecular biology experimental techniques has led to the accumulation of a large amount of data from genome, transcriptome and proteome detection platforms. Previous studies often focus on data from a single platform and rarely address the integration of data from multiple platforms. DNA copy number changes affect oncogenes and tumor suppressor genes. It is commonly agreed that loci of chromosomal loss encompass tumor suppressor genes and that loci of chromosomal gain are relevant to the amplification of oncogenes.

By combining DNA copy number data with microarray data, we used an integration strategy to mine CRC metastasis-related genes. In the PRI dataset, we found that genomic DNA copy number has a direct impact on gene expression values. We selected 161 overlapping probes in the PRI dataset. The SVM-T-RFE algorithm selected the minimum number of genes (n=14) that achieved the highest prediction accuracy (100%). In the META dataset, 70 overlapping probes were selected, and SVM-T-RFE again selected a minimum of 14 genes that achieved the highest prediction accuracy (100%). Our results demonstrate that integration analysis is an effective strategy for mining cancer-associated genes.

Based on the above results, the following conclusions can be drawn:

1. We collected a total of 244 cases of CRC CGH data and constructed oncogenetic tree models for CRC. We identified 15 common chromosomal abnormalities. We classified sporadic CRC into two distinct groups: one preceding with -8p12-23, and the other with +20q.
2. -8p12-23 and +20q are the two marker events of CRC metastasis. We identified 5 potential CRC metastatic pathways: (-18q, -18p), (-8p12-23, -4p15, -4q33-34), (+20q, +20p), (+20q, +7p, +7q11-32) and +8q.
3. We proposed the SVM-T-RFE gene selection algorithm.
The SVM-T-RFE method was shown to outperform the SVM-RFE method in two respects: first, the highest prediction accuracy is achieved using an equal or smaller number of selected genes; second, more differentially expressed genes are contained in the subsets of selected genes.
4. A fraction of the selected genes are associated with CRC metastasis. CRC metastasis involves multiple biological processes. Multiple oncogenes and tumor suppressor genes participate in the process of CRC metastasis.
5. The gene subsets selected by using an integration analysis method are associated with CRC metastasis.
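
The abstract does not spell out how SVM-T-RFE combines SVM weights with Student's t statistics, so the following is only a minimal Python sketch of two ingredients it does name: ranking probes by a two-sample t statistic, and scoring the top-k probes with leave-one-out cross-validation. The function names, the toy expression matrix, the choice of Welch's unequal-variance form of the t statistic, and the nearest-centroid classifier (swapped in for the SVM to keep the example dependency-free) are all assumptions made for illustration, not the dissertation's method.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic for one probe (assumed form of the t score)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_probes(X, y):
    """Return probe indices ordered by decreasing |t| between classes 0 and 1."""
    n_probes = len(X[0])
    scores = []
    for j in range(n_probes):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(n_probes), key=lambda j: scores[j], reverse=True)


def nearest_centroid_predict(X_train, y_train, x, probes):
    """Assign x to the class whose centroid over the chosen probes is closest.

    Hypothetical stand-in for the SVM classifier used in the source.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in probes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with the top-k probes, re-ranked inside each fold."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        probes = rank_probes(X_train, y_train)[:k]
        predicted = nearest_centroid_predict(X_train, y_train, X[i], probes)
        correct += predicted == y[i]
    return correct / len(X)


# Toy data (invented): 6 samples x 3 probes; only probe 0 separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.4],
    [3.0, 5.5, 0.5],
    [3.2, 4.5, 0.1],
    [2.9, 5.0, 0.8],
]
y = [0, 0, 0, 1, 1, 1]

print(rank_probes(X, y))
print(loocv_accuracy(X, y, k=1))
```

Re-ranking the probes inside each fold keeps the held-out sample from influencing which probes are chosen; whether the dissertation ranked probes inside or outside the cross-validation loop is not stated in the abstract.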
Keywords/Search Tags: colorectal cancer, metastasis, mathematical model, tree model, comparative genomic hybridization, microarray, gene selection, integration analysis