Some Statistical Problems In Longitudinal Analysis Of Genetic Epidemiology: Data Analysis And Dimensionality Reduction

Posted on: 2013-05-08 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Q Q Peng | Full Text: PDF
GTID: 1100330464960892 | Subject: Genetics
Abstract/Summary:
This research applies and develops statistical methods to solve problems that arise in longitudinal studies in genetic epidemiology. For each study considered here, the design and data collection had already been completed routinely, but problems remained at the analysis stage, such as missing data, bias, and the choice of analysis methods. In this thesis, we addressed several of these problems and developed or explored statistical methods to solve them: missing data in longitudinal data analysis, survival bias, and high-dimensional data analysis in longitudinal studies.

Missing data are a common problem in almost every research study, and many methods exist for dealing with them. Different studies may need different methods. In our study, the difficulty with the 7-day neonatal bilirubin data was that the fixed effect of phototherapy was coupled with the missing data problem. We therefore applied mixed models to deal with both problems. A series of mixed models with different time parameters and phototherapy effects was constructed. Based on model selection criteria (-2LL, AIC, AICC and BIC) and a statistic that we developed (T102), the optimal mixed model was selected for missing data imputation and phototherapy effect correction. After imputation and correction, the characteristics of the data were similar to those reported in other studies. Problems of this kind are likely to recur in future work, because unavoidable interventions complicate observational studies. We hope this study will be of help to other related studies.

Survival bias is generally regarded as a deviation from the study design caused by death. In recent years, a few studies have shed light on survival bias in genetic epidemiology, but convenient methods for estimating or adjusting for survival bias in different studies are still lacking.
In our study, we examined the origin of survival bias, building on previous research. We described the relationship between the theoretical population and the realistic population by controlling for all factors other than natural death and death related to the research factor. Taking the odds ratio (OR) as an example, we investigated the relationship between a sample drawn at random from the realistic population and one drawn from the theoretical population. Based on the relationship between the realistic ORE and the theoretical ORT, we constructed a statistic, ORc, to estimate ORT, and we developed a hypothesis test for the significance of ORc. We then applied our method to an association study of UGT1A1 gene polymorphisms and coronary heart disease, to estimate and test the effect of survival bias on that study.

High dimensionality is an important issue in statistical research. Various methods have been developed to handle high-dimensional data, but methods for high-dimensional data in longitudinal studies are scarce. In this study, focusing on two types of high-dimensional longitudinal data, we explored statistical methods for dimensionality reduction and data analysis according to different research objectives.

Surface-enhanced laser desorption/ionization (SELDI) is a proteomics technology. SELDI data are time-related protein expression data, and autocorrelation is an important characteristic of them. However, existing methods for analyzing SELDI data generally neglect it. In this study, SELDI data were obtained from 71 lung adenocarcinoma patients and 24 healthy controls. We constructed a classification model based on dimensionality reduction and feature extraction for the statistical diagnosis of lung adenocarcinoma.
The classification model takes the autocorrelation of the SELDI data into account and gives a reliable and efficient statistical diagnosis of lung adenocarcinoma. Through cross-validation, we further demonstrated that the classification model developed in this thesis outperformed the peak-selection-based methods commonly applied to SELDI data. Our method could be extended to further studies and applied to other high-throughput data.

We also explored statistical methods for multivariable high-dimensional data from longitudinal studies. A high-altitude acclimatization data set was collected, comprising 23 physiological measurements at three time points, 22 genotype polymorphisms in four genes, and several epidemiological factors. In these data, there is not only autocorrelation within each physiological measurement but also correlation among the measurements. We applied a mixed model and a partial least squares path model to analyze the data. The mixed model was used to explore the relationship between changes in a single physiological measurement and epidemiological factors or genotype polymorphisms over three periods of high-altitude acclimatization. The partial least squares path model was used to extract latent variables from the physiological measurements and to explore the relationship between changes in the latent variables and epidemiological factors or genotype polymorphisms. The two models gave similar results.

In this thesis, we addressed several problems that arise in longitudinal studies in genetic epidemiology and applied statistical theory and methods to solve them. This work should help follow-up studies to avoid or mitigate such problems in study design, data collection and data analysis.
Most importantly, the methodology in this thesis should help in dealing with further problems that arise in longitudinal and other epidemiological studies.
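
The model selection step described in the abstract, in which candidate mixed models are compared by -2LL, AIC, AICC and BIC, can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of these criteria. The function names, candidate models and numeric values are hypothetical and are not taken from the dissertation, and the author's own statistic T102 is not reproduced because the abstract does not define it. Conventions for counting parameters (k) and observations (n) differ between software packages, so the numbers are illustrative only.

```python
import math


def information_criteria(neg2ll, k, n):
    """Return -2LL, AIC, AICC and BIC using their standard definitions.

    neg2ll: -2 times the maximized log-likelihood of the fitted model
    k:      number of estimated parameters (counting convention is an assumption)
    n:      number of observations (some packages use the number of subjects)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICC requires n > k + 1")
    aic = neg2ll + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = neg2ll + k * math.log(n)
    return {"-2LL": neg2ll, "AIC": aic, "AICC": aicc, "BIC": bic}


def select_model(candidates, n, criterion="BIC"):
    """Pick the candidate with the smallest value of the chosen criterion.

    candidates: dict mapping a model label to a (neg2ll, k) tuple
    """
    scored = {
        name: information_criteria(neg2ll, k, n)
        for name, (neg2ll, k) in candidates.items()
    }
    best = min(scored, key=lambda name: scored[name][criterion])
    return best, scored


# Hypothetical candidates: (-2LL, number of parameters). Invented values.
example_candidates = {
    "linear_time": (1250.0, 4),
    "quadratic_time": (1200.0, 5),
    "quadratic_time_plus_phototherapy": (1180.0, 6),
}

best_name, table = select_model(example_candidates, n=200, criterion="BIC")
for name, scores in table.items():
    print(name, {key: round(value, 2) for key, value in scores.items()})
print("selected:", best_name)
```

Smaller values indicate a better trade-off between fit and complexity. Because -2LL alone never penalizes extra parameters, it is normally read alongside the penalized criteria and not used on its own.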
Keywords/Search Tags: longitudinal study, missing data, survival bias, high dimensionality, genetic epidemiology