
Framework Construction of 'Comparative Study of Statistics' and Demonstration Research on Regression Analysis

Posted on: 2015-09-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Bao
GTID: 1224330431473907
Subject: Epidemiology and Health Statistics
Abstract/Summary:
【Objective】 Similar statistical methods are often used arbitrarily in practice, without systematic and scientific theoretical study. To address this, the paper aims to establish a platform for comparative research in Statistics, construct a scientific and systematic framework of comparative study of Statistics, and conduct a deep and thorough study of three sub-topics in regression analysis, hoping to set an example for subsequent research. It also invites experienced experts at home and abroad to participate in this research in order to promote the prosperity and development of comparative study of Statistics.
【Content】 First, this study constructs the framework of comparative study of Statistics, which covers almost every aspect of Statistics and emphasizes the comparison of similar statistical ideas, theories and methods.
After the framework has been constructed, three sub-topics in regression analysis are given a deep and thorough comparative study: the comparison of various methods dealing with missing data in repeated measurement data, the comparison of four robust regression methods, and the comparison of six variable selection methods in multiple regression analysis together with their automatic realization in SAS.
In the comparative study of various methods dealing with missing data in repeated measurement data, the deletion method, single imputation method and multiple imputation method are studied and compared. In the research on robust regression methods, the four most commonly used methods, namely the Huber M estimation, LTS estimation, S estimation and MM estimation, are compared with respect to their robustness and efficiency.
In the study of six variable selection methods, the forward, backward, stepwise, R², adjusted R² and Mallows' Cp selection methods are compared; meanwhile, SAS programs are written to automatically perform variable selection using multiple selection methods and output the best-fitting model in multiple linear regression analysis and multiple logistic regression analysis. The feasibility of the programs is tested and verified with real examples.
In addition, the paper drafts invitations in both Chinese and English to invite experienced experts at home and abroad to participate in the study, aiming to gather the wisdom and strength of more experts to promote the prosperity and development of comparative study of Statistics.
【Methods】 To construct a scientific and systematic framework of comparative study of Statistics, this paper collects, studies, organizes and summarizes relevant literature from large databases such as PubMed, Embase, CNKI, Wanfang and VIP. The framework is constructed based on existing statistical knowledge as well as repeated discussions and revisions with department staff and students.
Regarding the comparative study of various methods dealing with missing data in repeated measurement design, the methods are introduced and compared first in principle and then through Monte Carlo simulation. The simulated repeated measurement data have one classification factor and one repeated measurement factor and are analyzed with a mixed linear model. For data with a monotone missing pattern, the effects of the deletion method, mean imputation method, last observation carried forward (LOCF) method, linear regression method, predictive mean matching (PMM) method and propensity score (PS) method are evaluated under three missing mechanisms and five missing rates.
For data with an arbitrary missing pattern, the performance of the deletion method, mean imputation method, LOCF and the Markov chain Monte Carlo (MCMC) method is considered under three missing mechanisms and five missing rates. In both studies, the impact of the number of imputations is assessed as well.
Concerning the comparative analysis of four robust regression methods, they are introduced and compared first in principle and then through Monte Carlo simulation of their robustness and efficiency. A linear model is simulated, and ordinary least squares (OLS) regression and the four robust regression methods are compared in situations where the errors are not normally distributed and where the data are contaminated by outliers from different sources and in different proportions. Their relative efficiencies are also examined, with the efficiency of OLS regression as the reference, under the premise that the Gauss-Markov assumptions are satisfied. Finally, the four robust regression methods are evaluated on both robustness and efficiency.
In the comparative study of six variable selection methods, the forward, backward, stepwise, R², adjusted R² and Mallows' Cp selection methods are compared in principle. SAS programs are then written to perform the selection automatically. To test the feasibility of the programs, they are applied to an analysis of factors influencing aerobic fitness and to a risk factor analysis of laryngocarcinoma exploring the risk factors for death.
In addition, this paper drafts an invitation in both Chinese and English to invite experts at home and abroad to participate in the study.
【Results】 The paper has constructed the framework of comparative study of Statistics and performed a deep and thorough study of three sub-topics in regression analysis.
That is, this paper has explored the merits of various methods dealing with missing data in repeated measurement data, discussed the advantages and disadvantages of four commonly used robust regression methods, summarized the characteristics of six variable selection methods, and written SAS programs to automatically perform variable selection using multiple selection methods in multiple regression analysis. Specifically, the results and main innovations of this paper lie in the following four aspects.
(1) A scientific and systematic framework of comparative study of Statistics has been constructed, including the comparison of statistical ideas, design methods, data collection and organization methods, commonly used statistical analysis methods and statistical applications in special fields.
(2) The comparative study of various methods dealing with missing data in repeated measurement design concludes that under the arbitrary missing pattern (AMP), when the missing mechanism is MCAR or MAR, the results of the deletion method, single imputation (SI) and multiple imputation (MI) are all satisfactory at a low missing rate (10%); as the missing rate increases, the deletion method and SI lose their advantage, and SI performs even worse than deletion. MI still gives satisfactory results, with parameter estimates almost the same at low missing rates and still satisfactory even at a missing rate of 50%. The disadvantage of MI is that it overestimates the variability of the regression coefficients to some extent. In addition, the imputation results do not improve as the number of imputations increases. When the missing mechanism is NMAR, the results of the deletion method, SI and MI are all disappointing.
Under the monotone missing pattern (MMP), when the missing mechanism is MCAR or MAR, none of the deletion method, mean imputation method, LOCF and PS gives satisfactory results, while the results of linear regression and PMM are still good, although they overestimate the variability of the coefficients to a certain degree.
In addition, the imputation results do not improve as the number of imputations increases. Under the NMAR missing mechanism, none of the above methods achieves a desirable result.
(3) The comparative study of the four robust regression methods demonstrates that when the errors are not normally distributed, OLS regression cannot correctly perform parameter estimation or hypothesis testing, and its results are quite unstable; however, the Huber M estimation, LTS estimation, S estimation and MM estimation can effectively resist the influence of non-normally distributed errors. When there are outliers in the data, whether in the Y direction or in the X direction, OLS regression cannot handle the situation, leading to biased estimators. When the outliers lie in the Y direction, all four robust regression methods fit the regression correctly and give robust results. However, when the outliers lie in the X direction, the Huber M estimation loses its robustness, while the LTS estimation, S estimation and MM estimation still fit the model correctly and give robust results. When the outliers lie in both the X and Y directions, the LTS estimation, S estimation and MM estimation still give robust results while the Huber M estimation does not. In short, the Huber M estimation is robust only to outliers in the Y direction, while the LTS estimation, S estimation and MM estimation are robust to outliers in both the X and Y directions.
The comparison of the efficiencies of the four methods shows that when the data satisfy the Gauss-Markov assumptions, the Huber M estimation has the highest relative efficiency (RE), 95% of the efficiency of OLS regression; the MM estimation has the second highest, 85%; next is the S estimation, 75%; the LTS estimation has the lowest RE, only 27%.
Therefore, considering both robustness and efficiency, the MM estimation outperforms the Huber M estimation, LTS estimation and S estimation, and is thus a desirable robust regression method.
(4) In the study of six variable selection methods, their merits are first summarized and compared in principle. SAS programs are then written to automatically perform variable selection by multiple selection methods and output the best-fitting model in multiple linear regression analysis and multiple logistic regression analysis. The programs are applied to the analysis of factors influencing aerobic fitness and to the risk factor analysis of laryngocarcinoma, and find that age, run time per 1.5 km and runpulse are factors that influence aerobic fitness, and that smoking amount, vegetable intake and family history of cancer are risk factors for laryngocarcinoma. Through the application to these real examples, the programs are shown to be feasible.
【Conclusion】 This paper has constructed the framework of comparative study of Statistics, which to some extent draws a blueprint for the future research and development of Statistics. It has also conducted a deep and thorough comparative study of three sub-topics in regression analysis, which has obtained satisfactory results and set a good example for future research. In the comparison of various methods dealing with missing data in repeated measurement design, the methods are compared both in principle and by Monte Carlo simulation, considering different missing patterns, missing mechanisms and missing rates, which makes the conclusions more reliable and provides effective guidance for handling missing data in practical research. The study of the four robust regression methods consists of a comparison in principle and a Monte Carlo simulation analysis, which comprehensively evaluates the robustness and efficiency of the methods, thereby laying the foundation for the popularization and application of robust regression methods.
In addition, the six variable selection methods are compared, and the SAS programs for automatic variable selection in multiple regression analysis are written, tested and verified with practical examples, which helps researchers find the best-fitting model in practice.
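As an illustration of the robust regression finding for Y-direction outliers, the following is a minimal Python sketch, not the dissertation's SAS code, of Huber M estimation for a single-predictor linear model fitted by iteratively reweighted least squares. The tuning constant c = 1.345, the MAD-based scale estimate, the function names and the small hand-made data set are all assumptions chosen for this example; none of them come from the dissertation.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def ols_fit(x, y):
    """Ordinary least squares is the special case of equal weights."""
    return weighted_line_fit(x, y, [1.0] * len(x))


def huber_fit(x, y, c=1.345, iters=100):
    """Huber M estimation via iteratively reweighted least squares.

    c is an assumed tuning constant; the residual scale is re-estimated
    on every iteration from the median absolute deviation (MAD).
    """
    a, b = ols_fit(x, y)  # start from the OLS solution
    for _ in range(iters):
        r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        med = median(r)
        s = median([abs(ri - med) for ri in r]) / 0.6745
        if s == 0:
            break  # degenerate scale, keep the current fit
        # Huber weights: 1 inside the threshold, shrinking outside it
        w = [1.0 if abs(ri) <= c * s else c * s / abs(ri) for ri in r]
        a, b = weighted_line_fit(x, y, w)
    return a, b


# Hypothetical toy data: true line y = 2 + 3x with small fixed noise,
# plus one outlier in the Y direction at x = 8.
X = [float(i) for i in range(11)]
NOISE = [0.2, -0.3, 0.1, 0.3, -0.2, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0]
Y = [2.0 + 3.0 * xi + ni for xi, ni in zip(X, NOISE)]
Y[8] += 40.0

if __name__ == "__main__":
    print("OLS   (intercept, slope):", ols_fit(X, Y))
    print("Huber (intercept, slope):", huber_fit(X, Y))
```

On this toy data set, which has a single Y-direction outlier at a non-extreme X value, the OLS slope is pulled well away from the true value of 3, while the Huber fit stays close to it. The sketch does not cover X-direction outliers, where the abstract reports that the Huber M estimation loses its robustness.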
Keywords/Search Tags: Framework of comparative study of Statistics, Missing data, Robust regression, Variable selection method, Automatic SAS realization