Computational Identification And Analysis Of Cancer Biomarkers Based On Expression Data

Posted on: 2016-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z B Cao
GTID: 1228330467495431
Subject: Computer Science and Technology
Abstract/Summary:
Cancer is a complex disease that can arise in many tissues of the human body and is accompanied by numerous changes and mutations in the genome. It is a leading cause of death worldwide. Although scientists and doctors have long been fighting cancer-related diseases, there is still no efficient way to cure cancer. Moreover, cancer is a class of diseases with many types, such as breast cancer, lung cancer and liver cancer. Some cancer types are fatal, such as pancreatic cancer, while others are chronic, such as slow-growing prostate cancer. For a disease this complex and difficult, early detection is becoming increasingly important and necessary: it is the key to longer survival for all major cancer types. The most efficient way to detect cancer at an early stage is to detect the biomarkers of the specific cancer. In recent years, identifying biomarkers from body fluids such as serum, plasma, saliva or urine has become a hot topic in cancer research, because such tests are non-invasive and fit into regular physical examinations.

Recent advances in microarray and sequencing techniques have generated new hope for identifying effective markers for early detection of cancer. More and more expression data can now be acquired easily. Expression data are usually presented as a matrix that contains tens of thousands of genes but only a few samples, so selecting an informative gene subset from such a high-dimensional matrix is often a troublesome problem. Feature selection is a technique that extracts relevant features from the whole feature set and can reduce the influence of redundant and irrelevant features. In general, most feature selection methods fall into three categories: filter methods, wrapper methods, and embedded methods. Filter methods depend only on the intrinsic properties of the data to select relevant features.
In the last decade, feature selection has become an important tool for many bioinformatics applications of microarray data, such as cancer classification, biological network inference, expression correlation analysis and disease biomarker identification.

Although many approaches have been proposed in recent years, very few methods consider the importance of paired samples obtained from the same patients. In addition, although many biomarkers have been reported in cancer-related research, each experiment typically identifies only one or a few biomarkers and involves a single cancer type, and the accuracy and specificity of some biomarkers are not promising. In this work, we therefore apply computational methods to identify and analyze cancer biomarkers based on expression data. The main contributions are as follows:

1. A novel filter feature selection method for paired microarray expression data analysis

The purpose of the proposed method is to identify significant genes from paired gene expression data. The method eliminates irrelevant genes and selects relevant genes by considering the effect of paired samples and the correlation between genes. The method is described as follows:

i. An improved paired t-test is used to calculate the statistical significance of genes between the paired samples. We improve the original paired t-test by using the fold change between normal samples and cancer samples instead of the difference between them.
ii. Statistical significance is measured using the q-value of the False Discovery Rate in place of the original p-value.
iii.
The influence of redundant genes is reduced using the Pearson correlation coefficient.

Performance is measured in four respects:
1) classification capability on single datasets and on multiple datasets, respectively;
2) stability of the gene list, measured as the ratio of common genes to total genes;
3) functional stability, based on statistical analysis of GO term information;
4) comparative functional enrichment analysis.

Six cancer gene expression datasets from GEO are chosen to compare seven methods on the effectiveness and stability of their gene lists. When evaluated on single datasets, the mean accuracies of the method with a Support Vector Machine (SVM) classifier on these datasets for the top 100 genes are 86.00%, 94.78%, 99.77%, 92.90%, 98.66% and 86.60%, respectively. When evaluated on different datasets of the same cancer, the mean accuracies of the method with the SVM classifier for the top 100 genes are 81.85%, 93.92%, 94.27%, 60.63%, 98.65% and 68.89%, respectively. We also evaluate the stability of the gene ranking lists, the functional stability and the functional enrichment. The results of the proposed method are more stable than those of the other methods. The experimental results show that the proposed method is applicable to feature selection for microarray expression data analysis.

2. An integrated computational analysis for microRNA biomarker identification based on microRNA expression data

A comparative analysis is made of the miRNA expression patterns in cancer versus normal samples of eight prevalent cancer types.
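As an illustration of steps i to iii of the filter method in contribution 1, the following is a minimal sketch under stated assumptions, not the dissertation's implementation. It assumes that "fold change" means the per-patient log2 ratio of cancer to normal expression (the abstract does not give the exact formula), uses the Benjamini-Hochberg adjustment as a stand-in for the q-value, and uses a hypothetical correlation threshold of 0.9. All function names are illustrative.

```python
import math
from statistics import mean, stdev


def paired_log_fc_t(normal, cancer):
    """Paired t statistic on per-patient log2 fold changes.

    Assumption: fold change = log2(cancer / normal); values must be
    positive and the log ratios must not all be identical.
    """
    d = [math.log2(c / n) for n, c in zip(normal, cancer)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (stand-in for q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def pearson(x, y):
    """Pearson correlation coefficient of two non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(ranked_genes, expr, threshold=0.9):
    """Keep a gene only if it is not strongly correlated with a kept one.

    ranked_genes: gene names, most significant first.
    expr: dict mapping gene name to its expression values across samples.
    threshold: hypothetical cut-off on |r|, not taken from the source.
    """
    kept = []
    for gene in ranked_genes:
        if all(abs(pearson(expr[gene], expr[k])) < threshold for k in kept):
            kept.append(gene)
    return kept
```

In this sketch the p-value for each gene would come from Student's t distribution with n-1 degrees of freedom (n being the number of patient pairs); that step is left out because the Python standard library has no t distribution.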
The work focuses on two aspects, the identification of special markers for a single cancer type and of common markers across multiple cancers, including: a) identification of differentially expressed miRNAs for each cancer type; b) identification of differentially expressed circulating miRNAs; c) selection of combination markers; and d) enriched pathway analysis of miRNA target genes.

First, for each cancer type, we give a comprehensive statistical analysis of the differentially expressed miRNAs and count the up- and down-regulated miRNAs. For prostate and stomach cancers, up-regulated miRNAs outnumber down-regulated miRNAs; the opposite holds for thyroid and liver cancers. We then analyze the special and the common miRNAs separately.

For the special miRNAs of a given cancer, we first identify the differentially expressed circulating miRNAs. Up-regulated circulating miRNAs again far outnumber down-regulated ones in prostate and stomach cancers, whereas in thyroid and liver cancers the majority are down-regulated. Then the specific miRNA markers for each cancer are identified; hsa-mir-30b is one of the specific miRNAs that are up-regulated in prostate cancer. Combination markers are then selected for each cancer type. We calculate the distinguishing ability of k-miRNA (k=1,2,3,4) combinations in single-dataset and multiple-dataset evaluation, respectively. In liver cancer, the best four-miRNA combination, hsa-mir-(10a+130a+15a+30d), achieves accuracies of 100%, 91.44% and 95.71% in the single-dataset evaluation, the multiple-dataset evaluation and their mean, respectively, which may make it a good indicator for liver cancer.

For the common miRNAs across multiple cancer types, the differentially expressed miRNAs across the eight cancer types are selected first. There are 76, 37 and 14 miRNAs identified with strong differential expression across at least four, five and six kinds of cancers, respectively.
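The "across at least four, five and six kinds of cancers" counts can be illustrated with a small sketch. It assumes each cancer type already has a set of differentially expressed miRNAs and ignores the direction of regulation, which the abstract does not say how to handle; the function name and toy identifiers are hypothetical.

```python
from collections import Counter


def common_across(de_by_cancer, min_cancers):
    """Return miRNAs differentially expressed in at least min_cancers types.

    de_by_cancer: dict mapping a cancer type to its collection of
    differentially expressed miRNA names.
    """
    counts = Counter(m for de in de_by_cancer.values() for m in set(de))
    return sorted(m for m, c in counts.items() if c >= min_cancers)


# Toy input with made-up identifiers, not data from the dissertation.
toy = {
    "cancer_A": {"m1", "m2"},
    "cancer_B": {"m1", "m3"},
    "cancer_C": {"m1", "m2"},
}
print(common_across(toy, 2))
print(common_across(toy, 3))
```

Raising the threshold can only shrink the result, which matches the decreasing counts reported for four, five and six cancer types.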
Some of these miRNAs are up-regulated, such as hsa-mir-21, hsa-mir-93 and hsa-mir-182, and some are down-regulated, such as hsa-mir-195 and hsa-mir-139. Most of these miRNAs have been reported to be relevant to cancer. Further, the circulating miRNAs across multiple cancers in early stage are selected: 39, 17 and 9 circulating miRNAs are found to be differentially expressed across at least four, five and six kinds of cancers, respectively, and most of them have been linked to several kinds of cancers. Then the combination markers for multiple cancer types are measured. The classification capabilities of k-miRNA (k=1,2,3,4) combinations are evaluated among the eight cancer types. The best three- and four-miRNA markers are hsa-mir-(183+21+411) and hsa-mir-(136+182+21+744), which achieve mean classification accuracies of 91.94% and 94.19% across the eight cancers, respectively. Finally, enriched pathway analysis is performed on the 39 common miRNAs across multiple cancer types using the miRSystem database. Biological processes such as cell proliferation, differentiation and migration are involved, and pathways related to specific cancers are enriched, such as the prostate cancer, pancreatic cancer and renal cell carcinoma pathways.
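The k-miRNA combination search described above can be sketched as an exhaustive scan over all combinations of size k, each scored by classification accuracy. This is only an illustration: the abstract does not state the classifier or validation scheme used for the miRNA combinations, so the sketch substitutes a nearest-centroid classifier with leave-one-out validation, and it assumes every class has at least two samples. Function names are hypothetical.

```python
from itertools import combinations


def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    samples: list of dicts mapping miRNA name to expression value.
    labels: class label per sample (for example "cancer" or "normal").
    features: the miRNA names to use.
    """
    n = len(samples)
    correct = 0
    for i in range(n):
        centroids = {}
        for lab in sorted(set(labels)):
            # Class centroid computed without the held-out sample i.
            rows = [samples[j] for j in range(n) if j != i and labels[j] == lab]
            centroids[lab] = [sum(r[f] for r in rows) / len(rows) for f in features]
        point = [samples[i][f] for f in features]
        predicted = min(
            centroids,
            key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
        )
        correct += predicted == labels[i]
    return correct / n


def best_combination(samples, labels, names, k):
    """Return (accuracy, combination) for the best-scoring k-miRNA set."""
    scored = (
        (loo_accuracy(samples, labels, combo), combo)
        for combo in combinations(names, k)
    )
    return max(scored, key=lambda pair: pair[0])
```

The number of candidate sets grows combinatorially with k, which is one plausible reason to stop at k=4 as the abstract does.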
Keywords/Search Tags: Feature Selection, Paired Microarray Data, Differentially Expressed Genes, Cancer Gene/miRNA Biomarker, Special Biomarker, Common Biomarker, Pathway Enrichment