High-throughput Data Modeling and Flexible Statistical Method

Posted on: 2018-10-07
Degree: Ph.D.
Type: Dissertation
University: North Carolina State University
Candidate: Hu, Tao
Full Text: PDF
GTID: 1448390002450948
Subject: Bioinformatics
Abstract/Summary:
With the increasing demand for analyzing large and complex data, it is critical to integrate and explore effective methodologies and computational algorithms in the era of Big Data. Big Data has been described as having the "four V" properties: Volume, Variety, Velocity, and Veracity. This dissertation explores those properties in the context of modern data and develops novel, flexible frameworks that address some drawbacks of traditional statistical methods.

First, a zero-inflated beta-binomial model is described for modeling microbiome count data. This framework introduces zero inflation to account for excess zero counts, and it can handle both discrete and continuous phenotypes as well as other covariates. Thus, association tests between microbiome community composition and phenotypes of interest can be performed. A penalization method is also proposed to predict missing phenotypes from microbiome count data.

Second, three exact testing procedures are proposed for SNP set analysis, which has been widely applied to next-generation sequencing (NGS) data. Compared with single-SNP testing, SNP set analysis can detect signals by examining groups of SNPs together. Traditional testing procedures are based on the asymptotic null distribution of the test statistic, which is not valid for small sample sizes. The proposed exact tests achieve high power even when the sample size is small. To resolve the computational bottleneck of exact testing methods, computationally efficient algorithms are derived and implemented in user-friendly software.

Finally, an approximate Thompson sampling strategy is proposed for streaming data analysis. Streaming data are characterized by continual accumulation and high complexity. The proposed strategy is a model-based planning algorithm designed to balance the exploration-exploitation trade-off that arises from the unknown environment. Approximate Thompson sampling makes it possible to estimate the system dynamics and the optimal policy that yields the best rewards at the same time. Asymptotic convergence rates are derived for the proposed algorithm. Two real-world application examples are given to demonstrate the performance of the algorithm.
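
To illustrate the exploration-exploitation trade-off mentioned above, here is a minimal sketch of textbook Thompson sampling on a Bernoulli bandit with Beta priors. It is not the dissertation's approximate, model-based algorithm, which the abstract does not specify. The function name thompson_sampling, the arm probabilities, and the uniform Beta(1, 1) prior are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Textbook Beta-Bernoulli Thompson sampling (illustrative only).

    true_probs: hypothetical success probability of each arm (unknown
                to the learner; used here only to simulate rewards).
    Returns (pulls per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Beta(1, 1) prior on each arm: an assumed, uniform starting belief.
    alpha = [1] * n_arms  # 1 + observed successes
    beta = [1] * n_arms   # 1 + observed failures
    pulls = [0] * n_arms
    total_reward = 0

    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled rates; the randomness
        # of the draw is what produces exploration.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulate the environment's response.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate posterior update for the chosen arm.
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1
        total_reward += reward

    return pulls, total_reward


if __name__ == "__main__":
    pulls, total = thompson_sampling([0.1, 0.9], n_rounds=2000)
    print("pulls per arm:", pulls)
    print("total reward:", total)
```

As the posterior for the better arm concentrates, that arm is sampled as the maximum more often, so exploration fades without an explicit schedule.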
Keywords/Search Tags: Data, Proposed