Differential gene expression analysis and tumor sample classification greatly benefit therapeutic development and facilitate the application of precision medicine to patients. However, solid tumor tissue obtained in clinical settings contains non-cancerous cells in and around the tumor, including normal tissue, infiltrating immune cells, stromal cells, and blood vessels. The presence of these normal cells may adversely affect differential expression analysis and tumor sample classification. Therefore, appropriate statistical models that account for tumor purity are urgently needed for both tasks. In this thesis, we performed systematic research on these two problems in computational biology. First, we studied the effect of tumor purity on differential expression analysis. Simulation studies demonstrate that tumor purity has a multiplicative, rather than additive, effect on differential expression. Consequently, both ignoring tumor purity and including it as an additive covariate give biased results. To solve this problem, we designed a method based on a generalized least squares procedure and a Wald test to test the difference between normal and tumor samples for each gene. Analyses of TCGA data demonstrate that, compared with the t-test and limma, the proposed method gives better results in terms of the number of differentially expressed genes detected, the consistency of the test statistics across cancer types, and the functional relevance to each cancer type. Second, we systematically investigated the impact of tumor purity as a confounding factor in unsupervised clustering of tumor samples, and proposed a statistical model to adjust for the purity effect in tumor sample clustering. We first found that, under the traditional k-means and NMF approaches, tumor purity biases the clustering results: samples with
similar purities tend to cluster together, and tumor samples with low purity tend to be misclassified. To overcome this problem, we designed a model-based statistical method, InfiniumClust, for subtype classification based on DNA methylation data. In our method, methylation levels from tumor samples at each CpG site are modeled as a mixture of normal distributions. Parameter estimation and sample clustering are performed through an EM-type algorithm. In simulations, InfiniumClust achieved more robust and accurate results than k-means. When applied to real TCGA tumor samples, InfiniumClust produced the least biased clusters compared with k-means and the well-established NMF method.
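
The multiplicative purity effect described for the differential expression analysis can be illustrated with a small simulation. The sketch below is not the thesis implementation: it assumes a simple mixing model in which a tumor sample with purity p has mean mu + p * delta, uses ordinary least squares as a stand-in for the generalized least squares procedure (equal variances are assumed), and all names (`purity_adjusted_wald`, `mu`, `delta`) are hypothetical.

```python
import numpy as np


def purity_adjusted_wald(tumor, normal, purity):
    """Estimate the pure tumor-vs-normal difference for one gene.

    Assumed model (an illustration, not taken from the thesis):
        normal sample:  mean = mu
        tumor sample i: mean = mu + purity_i * delta
    so purity scales the difference instead of shifting it.
    """
    tumor = np.asarray(tumor, dtype=float)
    normal = np.asarray(normal, dtype=float)
    purity = np.asarray(purity, dtype=float)

    y = np.concatenate([normal, tumor])
    # Column 0: intercept (mu). Column 1: 0 for normals, purity for tumors.
    x_delta = np.concatenate([np.zeros(normal.size), purity])
    X = np.column_stack([np.ones(y.size), x_delta])

    # Ordinary least squares, i.e. GLS with an identity covariance matrix.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (y.size - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)

    delta_hat = beta[1]
    wald = delta_hat ** 2 / cov[1, 1]  # Wald statistic for H0: delta = 0
    return delta_hat, wald


# Simulated data for a single gene (all values are made up for illustration).
rng = np.random.default_rng(0)
n, mu, delta = 200, 5.0, 2.0
purity = rng.uniform(0.3, 0.9, size=n)
normal = mu + rng.normal(0.0, 0.1, size=n)
tumor = mu + purity * delta + rng.normal(0.0, 0.1, size=n)

delta_hat, wald = purity_adjusted_wald(tumor, normal, purity)
naive = tumor.mean() - normal.mean()  # ignores purity entirely
print(f"purity-adjusted estimate: {delta_hat:.3f}")
print(f"naive difference of means: {naive:.3f}")
```

Under this assumed model, the naive difference of means is shrunk toward zero by roughly the average purity, while the purity-scaled regression recovers the difference for pure tumor tissue.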