Several Statistical Issues In The Analysis Of High-dimensional Biological Data

Posted on: 2008-12-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P C Xun
GTID: 1104360215463401
Subject: Epidemiology and Health Statistics

Abstract/Summary:
The development of microarray technologies has revolutionized biomedical research by allowing investigators to measure the expression levels of several thousand genes or proteins simultaneously. Meanwhile, with the rapid progress in molecular medicine and its related disciplines, a large amount of genome sequence data has accumulated. Three statistical issues were studied from an applied point of view: differential expression analysis, discriminant analysis of microarray data, and comparison of candidate partial genomic regions in how well they represent the full-length genome. The details are as follows.

In Section 1, two simulation studies were conducted based on a publicly available microarray data set, the colon data, which contains 2000 human genes. In the first simulation, the data were generated from normal distributions, with variances estimated from the data within groups, under the assumption of independence. In the second, the genes were assumed to be correlated as observed in the real data, and the data were generated from the original data with a resampling technique, the stratified bootstrap. Based on the two simulation studies, four FDR-controlling procedures (BH, BL, BY and ALSU) were evaluated thoroughly:

(1) Under the assumption of independence, the four procedures did control the FDR below the pre-specified level and provided adequate power in most situations. By strength of FDR control, the four procedures rank as BL>BY>BH>ALSU, and by power as ALSU>BH>BY>BL.

(2) When the dependence structure was simulated from the real data, the four procedures either under-controlled the actual FDR or, in most scenarios, over-controlled it at the expense of power; they even lost power when the sample size of each group was below 20.

In addition, a statistical strategy, "feature pre-selection→global test→single variable test→partial multivariable test", was put forward and found to perform well on real data from a study of differential expression of sperm proteins between normal fertile men and asthenozoospermic patients. A combination of ten proteins was identified as a subset of the "truly differentially expressed proteins" between the two groups.

In Section 2, three simulation studies under different levels of "true" prediction error were implemented, and nine methods, including k-fold cross-validation, the bootstrap and the leave-one-out bootstrap, were critically evaluated. When both bias and mean square error (MSE) are considered, 3-fold and 5-fold cross-validation behave robustly across all investigated situations and are preferred.

A discriminant strategy of "feature pre-selection→further dimension reduction→step-wise discriminant selection→model construction→model validation" was presented with the colon data, and its validity was further verified on two more data sets.

In Section 3, statistical methods were developed to compare five candidate partial genomic regions in how well they represent the full-length genome of hepatitis E virus (HEV) for genotyping, based on sequence data from the 71 HEV strains at hand. Three methods were constructed as follows.

Firstly, a modified Korin's statistic was used to compare the candidate fragments with the whole genome, followed by a 50% stratified sampling strategy to validate the stability of the results and the leave-one-out (LOO) method to assess the sensitivity of the procedure to each strain.
Through simulations, it was statistically indicated that fragment Ⅲ might be a representative region of the whole HEV genome for genotyping.

Secondly, the distributions of the eigenvalues of the six similarity matrices were obtained with the LTO method, and the Mahalanobis distance was used directly to measure the differences between each of the five regions and the whole sequence, which also indicated that fragment Ⅲ was the most representative one.

Thirdly, a score statistic was constructed and its empirical distribution under the null hypothesis was obtained through Monte Carlo simulation for testing, which reached the same conclusion from the point of view of statistical inference.

Based on both the full-length sequence and fragment Ⅲ, the same four genotypes were identified by the subsequent phylogenetic analysis, and similar ranges of mean nucleotide difference were observed to differentiate HEV sequences at three levels (genotype, subtype and isolate), which further verified the statistical results.

In conclusion, three suggestions can be made:

(1) The strategy "feature pre-selection→global test→single variable test→partial multivariable test" is a practical one for differential expression analysis of microarray data.

(2) The strategy "feature pre-selection→further dimension reduction→step-wise discriminant selection→model construction→model validation" is not limited to two-class classification but is also applicable to the multi-category case.

(3) The score method is an effective way to compare, statistically, partial genomic regions with the full-length genome of HEV strains for genotyping.

The above strategies and methods answered the biologists' questions quite well and deserve to be explored widely in future work.
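
As an illustration of two of the FDR-controlling procedures named in Section 1, the following is a minimal sketch of the BH (Benjamini-Hochberg) and BY (Benjamini-Yekutieli) step-up rules. It is not taken from the dissertation: the function name `step_up_rejections`, its parameters and the toy p-values are assumptions made for this example, and the BL and ALSU procedures are not shown.

```python
import numpy as np

def step_up_rejections(p_values, alpha=0.05, dependence_correction=False):
    """Boolean mask of hypotheses rejected by a step-up FDR procedure.

    dependence_correction=False: Benjamini-Hochberg (BH).
    dependence_correction=True:  Benjamini-Yekutieli (BY), which divides
    alpha by the harmonic sum 1 + 1/2 + ... + 1/m.
    (Hypothetical helper written for this illustration.)
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if m == 0:
        return np.zeros(0, dtype=bool)
    if dependence_correction:
        alpha = alpha / np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)                       # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest rank meeting its threshold
        reject[order[:k + 1]] = True            # reject everything up to that rank
    return reject

# Toy p-values, invented for the example (not data from the study).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh = step_up_rejections(p_values, alpha=0.05)
by = step_up_rejections(p_values, alpha=0.05, dependence_correction=True)
print("BH rejections:", int(bh.sum()), "BY rejections:", int(by.sum()))
```

On these toy values BY rejects a subset of what BH rejects, which is consistent with the ordering of control strength reported above (BY stricter than BH).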
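
The k-fold cross-validation estimate of prediction error evaluated in Section 2 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the nearest-centroid classifier, the function names and the simulated two-group data are all hypothetical choices for this example and are not the classifiers or data sets used in the dissertation.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test sample to every centroid.
    d2 = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

def kfold_error(X, y, k=5, seed=0):
    """k-fold cross-validation estimate of the misclassification rate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)  # all samples not in this fold
        pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        errors += int(np.sum(pred != y[test_idx]))
    return errors / len(y)

# Simulated data (an assumption for the example): 40 samples, 50 features,
# with the first 5 features shifted in the second group.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X[y == 1, :5] += 3.0

err3 = kfold_error(X, y, k=3)
err5 = kfold_error(X, y, k=5)
print("3-fold error:", err3, "5-fold error:", err5)
```

In a full analysis, any feature pre-selection step would normally be repeated inside each training fold, so that the held-out samples play no part in choosing the features.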
Keywords/Search Tags: microarray data, differential expression, false discovery rate, discriminant analysis, prediction error, statistical strategy, hepatitis E virus, genotyping, full-length genome, partial genomic regions, score method