
High-Dimensional Inference in Dynamic Environments

Posted on: 2016-06-06
Degree: Ph.D.
Type: Thesis
University: Duke University
Candidate: Hall, Eric C.
GTID: 2478390017482351
Subject: Electrical engineering
Abstract/Summary:
This thesis addresses several challenges left unanswered by classical statistics. The first is the problem of sequentially arriving data in changing environments. The order in which data arrive can be critical to accurate inference, especially when the underlying environment generating the data is itself changing and one wishes to study the mechanisms driving that change. For instance, in a wide range of applications, including video analytics, social networks, microscopy, and astronomy, analysts must cope with large quantities of high-dimensional data generated by an environment that is known to be changing. Classical methods often fail on such data because they ignore the underlying dynamics, neglect physical models of data generation, or fail to scale, both computationally and statistically, to high dimensions. While incorporating time-varying dynamical models can significantly increase statistical accuracy, it also makes learning harder, because there are now two sources of potential error: observational noise and environmental change. Classical methods that do not account for time variation attribute all errors to observational noise, while some techniques, such as those used in autoregressive moving average models, attribute most of the variation to system dynamics. This thesis presents work that learns a system's dynamics and state simultaneously, while avoiding such strong assumptions.

The second challenge crucial to the learning process is handling the recent onslaught of information known as "big data". Big data presents two kinds of challenges, large volume and high dimensionality, and algorithm design must address both. To handle high-volume data, an algorithm must process each observation efficiently enough to complete its calculations in a reasonable amount of time, while high-dimensional data requires efficient inference in a potentially vastly underdetermined system. In both regimes, analysts face the question of how much data must be collected before reliable inference can be guaranteed. Such questions are typically addressed via sample complexity and regret bounds, but the existing literature on these bounds does not account for the sequential nature of, and inherent dependencies among, the observed data. Methods such as the LASSO have become popular over the last decade because their analyses show that the amount of data required scales with the low-dimensional structure on which the signal resides rather than with the larger ambient dimension. Exploiting low-dimensional structure in high-dimensional signals is therefore of paramount importance and should be built into proposed algorithms. Nevertheless, most LASSO analyses do not address the streaming data or temporal dependencies common when tracking dynamic environments.

This thesis presents novel algorithms and analyses that address the issues posed by big data in dynamic environments. The first contribution is a set of methods for analyzing sequential data in a dynamic environment using online optimization. Online optimization schemes are inherently suited to high-volume data: they receive one datum at a time, perform a quick update, and then discard the datum, as the sketch below illustrates.
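As a concrete illustration of this recipe, the following minimal sketch combines a one-datum-at-a-time gradient update with an L1 (LASSO-style) soft-thresholding step for a streaming regression problem. It is written in the spirit of the methods described, not as the thesis's actual algorithm; the function names, step size, and penalty weight are illustrative choices.

```python
import numpy as np

def soft_threshold(v, tau):
    # Elementwise soft-thresholding: the proximal operator of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_sparse_estimates(stream, dim, step=0.1, lam=0.05):
    """Yield running estimates of a sparse, possibly drifting parameter.

    Each item in `stream` is an (x, y) pair assumed to follow
    y = <x, theta_t> + noise. For every observation we take one gradient
    step on the squared loss, soft-threshold to promote sparsity
    (low-dimensional structure), and then discard the datum.
    """
    theta = np.zeros(dim)
    for x, y in stream:
        grad = (x @ theta - y) * x  # gradient of 0.5 * (<x, theta> - y)^2
        theta = soft_threshold(theta - step * grad, step * lam)
        yield theta.copy()

# Toy usage: track a 50-dimensional parameter with 3 active coordinates.
rng = np.random.default_rng(0)
theta_true = np.zeros(50)
theta_true[[3, 17, 42]] = [1.0, -0.5, 2.0]
stream = ((x, x @ theta_true + 0.1 * rng.standard_normal())
          for x in (rng.standard_normal(50) for _ in range(1000)))
for theta_hat in online_sparse_estimates(stream, dim=50):
    pass  # theta_hat is the current streaming estimate
```

Each update touches one observation and then forgets it, so memory and per-step cost are independent of the stream length; the soft-thresholding step is the regularization that encourages a low-dimensional estimate.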
These methods are also adept at processing high-dimensional data, as they incorporate regularization functions that seek low-dimensional structure. The proposed online optimization routines account for time-varying environments with associated regret bounds while making minimal assumptions about the underlying system. Unlike previous work on sequential filtering, these methods do not require the dynamical model to be specified a priori; instead, they have built-in mechanisms to find the best model within a family of candidate dynamics. The next contribution is to use these methods to analyze data generated by a Hawkes process, a model that characterizes complex interactions among a collection of point processes, for instance, neurons firing within a network. With this analysis, network inference on streaming data can be performed with more robustness to poorly specified model parameters than with previously existing methods. Finally, previously unknown sample complexity bounds are established for a discretized version of the Hawkes process, the log-linear Poisson autoregressive process. These results, which draw on recent advances in statistical learning theory, show how the assumption of low-dimensional structure can greatly aid estimation with high-dimensional data.
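To make the discretized model concrete, the sketch below simulates a small log-linear Poisson autoregressive network, assuming the commonly used form X[t+1] ~ Poisson(exp(nu + A X[t])); the toy parameters and the rate clipping are our illustrative choices, not taken from the thesis.

```python
import numpy as np

def simulate_log_linear_par(A, nu, T, rng=None):
    """Simulate a log-linear Poisson autoregressive process (illustrative).

    X[t+1, i] ~ Poisson(exp(nu[i] + (A @ X[t])[i])): each node's count rate
    depends log-linearly on the previous counts of every node, so the
    (typically sparse) matrix A encodes the influence network.
    """
    rng = np.random.default_rng(rng)
    X = np.zeros((T, len(nu)))
    for t in range(T - 1):
        # Clip the log-rate for numerical stability; analyses of this kind
        # of model typically assume the rates stay bounded.
        log_rate = np.clip(nu + A @ X[t], -10.0, 3.0)
        X[t + 1] = rng.poisson(np.exp(log_rate))
    return X

# Toy two-node network: node 0 excites node 1; node 1 mildly self-inhibits.
A = np.array([[0.0, 0.0],
              [0.3, -0.1]])
nu = np.array([-1.0, -2.0])
counts = simulate_log_linear_par(A, nu, T=200, rng=0)
```

Recovering a sparse A from the simulated counts is exactly the regime the sample complexity bounds speak to: the number of observations needed depends on the sparsity of the network rather than on its full ambient dimension.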
Keywords/Search Tags: Data, Dimensional, Inference, Environments, Dynamic, Methods