Phenotype And Microarray Data-based Clustering Analysis Of Genotypes Or Genes

Posted on: 2008-11-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xiao
Full Text: PDF
GTID: 1100360215474523
Subject: Crop Genetics and Breeding
Abstract/Summary:
Segregation analysis (SA) is a statistical genetic method that uses the phenotypes of quantitative traits in a segregating population directly to detect major genes and estimate their effects. It is an important tool for planning further studies such as quantitative trait locus (QTL) mapping or more sophisticated genomic analyses. Under the assumption that major gene effects and polygenic effects are independent, individuals with the same major gene genotype are expected to be normally distributed, so a population containing several major gene genotypes follows a mixture of normal distributions with different means and a common variance. In SA, major gene effects are therefore estimated, and genetic hypotheses tested, by constructing a Gaussian mixture model, estimating its parameters by maximum likelihood (ML), and computing likelihood ratio test (LRT) statistics. However, current SA methods for a single trait typically have low statistical power. In this study, we propose a joint analysis method for multiple traits, multivariate segregation analysis (MSA), which uses the genetic and residual correlations among multiple quantitative traits to detect major genes. The method is expected not only to increase statistical power but also to allow dissection of the genetic architecture underlying the trait complex. In MSA, the observed phenotypes of multiple correlated traits are fitted to a multivariate Gaussian mixture model. The segregation proportions, major gene effects and residual variabilities are estimated under the ML framework via the expectation-maximization (EM) algorithm. Genetic hypotheses about the major genes are tested using LRT statistics. Pleiotropy is distinguished from close linkage by comparing three candidate models using the Bayesian information criterion (BIC).
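The mixture-model fitting described above can be sketched for the simplest case. The following is a minimal single-trait illustration, not the dissertation's multivariate implementation: it assumes an F2 population with fixed 1:2:1 genotype proportions, genotype means m - a, m + d and m + a, and a common residual variance. All function names, starting values and simulated parameter values are hypothetical. The quantile-based starting values place the heterozygote mean between the homozygote means, so this sketch is not suited to overdominant genes without other starting values.

```python
import math
import random

WEIGHTS = (0.25, 0.5, 0.25)  # assumed F2 proportions for genotypes aa : Aa : AA

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(y, means, var):
    """Posterior genotype probabilities per individual, plus the log-likelihood."""
    post, loglik = [], 0.0
    for yi in y:
        dens = [w * normal_pdf(yi, m, var) for w, m in zip(WEIGHTS, means)]
        total = sum(dens)
        post.append([d / total for d in dens])
        loglik += math.log(total)
    return post, loglik

def fit_major_gene(y, max_iter=1000, tol=1e-8):
    """EM for a 3-component normal mixture with fixed weights and common variance."""
    ys = sorted(y)
    n = len(ys)
    # Crude starting values: sample quantiles for the means, total variance.
    means = [ys[n // 8], ys[n // 2], ys[(7 * n) // 8]]
    grand = sum(ys) / n
    var = sum((v - grand) ** 2 for v in ys) / n
    post, loglik = e_step(y, means, var)
    for _ in range(max_iter):
        # M-step: posterior-weighted means, then the pooled (common) variance.
        means = [
            sum(p[k] * yi for p, yi in zip(post, y)) / sum(p[k] for p in post)
            for k in range(3)
        ]
        var = sum(
            p[k] * (yi - means[k]) ** 2 for p, yi in zip(post, y) for k in range(3)
        ) / n
        post, new_loglik = e_step(y, means, var)
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return means, var, loglik

def null_loglik(y):
    """Maximized log-likelihood of a single normal distribution (no major gene)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def simulate_f2(n, m, a, d, sd, seed=1):
    """Hypothetical F2 sample with genotype means m - a, m + d, m + a."""
    rng = random.Random(seed)
    geno_means = (m - a, m + d, m + a)
    return [rng.gauss(rng.choices(geno_means, weights=WEIGHTS)[0], sd) for _ in range(n)]

y = simulate_f2(n=600, m=100.0, a=10.0, d=0.0, sd=2.0)
means, var, loglik = fit_major_gene(y)
a_hat = (means[2] - means[0]) / 2.0               # additive effect
d_hat = means[1] - (means[0] + means[2]) / 2.0    # dominance effect
lrt = 2.0 * (loglik - null_loglik(y))             # major gene vs. no major gene
print(f"a = {a_hat:.2f}, d = {d_hat:.2f}, residual variance = {var:.2f}, LRT = {lrt:.1f}")
```

The posterior probabilities computed in `e_step` are the same quantities that would be used to assign individuals to genotypes. A multivariate version would replace the scalar means and variance with mean vectors and a residual covariance matrix.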
The three models are the complete pleiotropy model, the linkage model, and the non-linkage (independent) model. Two simulation experiments based on the F2 mating design were performed to validate the feasibility of the method. The first investigated the statistical power of MSA, and the accuracy and precision of its estimates of genetic effects and residual variabilities, under varying heritabilities and sample sizes. The second demonstrated the efficacy of MSA in separating pleiotropy from close linkage under varying heritabilities. The simulations showed the following. (1) MSA increases the statistical power of major gene detection because it makes full use of the correlation among traits, whether the major gene affects several of the traits simultaneously or only one of them. (2) MSA improves the precision and accuracy of major gene effect estimates; in general, once the statistical power of major gene detection exceeds 50%, the estimates reach the desired precision and accuracy. (3) MSA effectively separates pleiotropy from close linkage. (4) Although heritability and sample size are both key factors affecting the statistical power of major gene detection, increasing heritability improves power much more than increasing sample size. The method is illustrated with plant height and tiller number in the F2 population of the rice cross Duonieai×Zhonghua 11. The results indicate that the genetic difference in these two traits in this cross involves only one pleiotropic major gene. The additive and dominance effects of the major gene are estimated as -21.3 cm and 40.6 cm for plant height, and 22.7 and -25.3 for tiller number, respectively.
The major gene shows overdominance for plant height and close to complete dominance for tiller number. MSA not only estimates the genetic parameters of the model but also yields the posterior probability that each individual belongs to each major gene genotype. On this basis, we introduce a new method, a model-based unsupervised dynamic clustering method, in which the cluster parameters are estimated by ML via the EM algorithm and individuals are classified according to their Bayesian posterior probabilities. The simulation experiments demonstrated the following. (1) The proposed method estimates the cluster parameters without bias and determines the optimal number of clusters by BIC, which resolves the difficulty of choosing the number of clusters in traditional dynamic clustering methods. (2) The proposed method is more robust than the k-means method and the minimum within-group sum of squares (MinSSw) method. (3) The misclassification rate (MR) can be reduced by using a stricter discrimination criterion. The proposed method was further validated on Fisher's Iris dataset. Its MR was significantly lower than those of the k-means and MinSSw methods, indicating that unsupervised dynamic clustering by maximizing the likelihood function is especially well suited to data generated from Gaussian distributions. DNA microarray technology is the chief tool for functional genomics research in the post-genomic era; it allows the expression levels of thousands of genes to be monitored simultaneously under varying experimental conditions or in different biological tissues.
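The posterior-probability classification rule used by the model-based clustering method, and the effect of a stricter discrimination criterion on the MR, can be illustrated with a small sketch. It rests on assumptions that are not taken from the dissertation: the cluster parameters are treated as already estimated, the data are univariate with two clusters, and "stricter criterion" is interpreted as leaving an individual unassigned when its largest posterior probability falls below a threshold. All names and numbers are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, weights, means, variances):
    """Bayesian posterior probability of each cluster for one observation."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def classify(x, weights, means, variances, threshold=0.0):
    """Most probable cluster, or None if its posterior is below the threshold."""
    post = posteriors(x, weights, means, variances)
    best = max(range(len(post)), key=post.__getitem__)
    return best if post[best] >= threshold else None

def misclassification_rate(data, labels, params, threshold):
    """MR among the individuals that were actually assigned to a cluster."""
    assigned = wrong = 0
    for x, truth in zip(data, labels):
        pred = classify(x, *params, threshold=threshold)
        if pred is not None:
            assigned += 1
            wrong += pred != truth
    return wrong / assigned, assigned

# Hypothetical parameters: mixing weights, cluster means, cluster variances.
params = ([0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
rng = random.Random(7)
labels = [rng.randrange(2) for _ in range(2000)]
data = [rng.gauss(params[1][k], 1.0) for k in labels]

loose_mr, loose_n = misclassification_rate(data, labels, params, threshold=0.0)
strict_mr, strict_n = misclassification_rate(data, labels, params, threshold=0.95)
print(f"no threshold:   MR = {loose_mr:.3f} over {loose_n} individuals")
print(f"threshold 0.95: MR = {strict_mr:.3f} over {strict_n} individuals")
```

Under this interpretation the lower MR comes at the cost of leaving the ambiguous individuals unclassified, so the threshold trades coverage for accuracy.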
Grouping genes with similar expression patterns is called gene clustering, which has proved to be a useful tool for extracting the biological information underlying gene expression data. It is also the most widely used method of microarray data analysis. Depending on whether prior knowledge is used, clustering methods can be classified as unsupervised or supervised. To explore the feasibility of applying the above model-based clustering method to high-dimensional microarray expression data, several typical supervised clustering methods, namely Gaussian mixture model-based supervised clustering, k-nearest-neighbor (KNN), binary support vector machines (SVMs) and multicategory support vector machines (MC-SVMs), were used to classify computer-simulated data, yeast cell cycle microarray data, and microarray data from 60 human cancer cell lines (NCI-60). The numbers of false positives, false negatives, true positives and true negatives, the clustering accuracy, and Matthews' correlation coefficient (MCC) were compared among these supervised methods. The results are as follows. (1) When classifying expression data for thousands of genes, the model-based clustering methods achieve the highest clustering accuracy. Furthermore, when the number of training samples is very small, the model-based supervised method is more accurate than the model-based discrimination method: the latter uses only the information from genes of known function to guide the classification of genes of unknown function, whereas the former uses the information from genes of both known and unknown function. In terms of computational speed, however, the discrimination method is faster than the model-based method.
(2) In general, the MC-SVMs are more robust and more practical: they are less sensitive to the curse of dimensionality, are second only to the model-based method in clustering accuracy on expression data for thousands of genes, and are more robust than the other techniques when only a small number of high-dimensional gene expression samples is available. (3) Among the MC-SVMs, OVO and DAGSVM perform better with large sample sizes, all five MC-SVM methods perform very similarly with moderate sample sizes, and OVR, WW and CS yield better results with small sample sizes. (4) We recommend applying and comparing at least two candidate methods, chosen according to the features of the real data and the experimental conditions, to obtain better clustering results.
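The comparison criteria named above (true and false positives and negatives, accuracy, and MCC) can be computed as in the following sketch for a single class treated as "positive". The function names and the toy labels are hypothetical, and returning 0.0 when the MCC denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math

def confusion_counts(truth, predicted, positive):
    """Count TP, TN, FP and FN for one class treated as the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example: known functional classes versus predicted classes.
truth = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
counts = confusion_counts(truth, predicted, positive="a")
print(counts, accuracy(*counts), round(mcc(*counts), 3))
```

Unlike accuracy, the MCC uses all four cells of the confusion table, which makes it more informative when the classes are of very different sizes.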
Keywords/Search Tags: Multiple correlated quantitative traits, Major gene, Multivariate segregation analysis, Maximum likelihood estimation, EM algorithm, Cluster analysis, Microarray, Supervised clustering