Robust Significant Feature Detection by Learning Discriminant Boundary in Multi-dimensional Space of Statistical Attributes

Posted on: 2017-06-19
Degree: Ph.D.
Type: Thesis
University: Brandeis University
Candidate: Bei, Yuanzhe
GTID: 2468390014958742
Subject: Computer Science
Abstract/Summary:
This thesis proposes a novel framework to robustly detect significant features by adaptively optimizing the integration of multiple feature scoring metrics. Significant feature detection is a critical process in many kinds of big-data applications. Its main purpose is to mine a complex dataset containing a large number of features and detect those whose "behaviors" differ significantly between conditions. Such features can be genes, genomic methylation states, relationships between linguistic entities, and so on. For example, high-throughput technologies (e.g., microarrays and deep sequencing) have become pervasive in biological and biomedical investigations for simultaneously measuring tens of thousands of genomic features (e.g., genes, RNA splicing, DNA methylation, and mutations). Accurate identification of significant features is essential for designing follow-up experiments. In another scenario, a huge volume of unstructured messages pours from vast numbers of unsynchronized communication threads onto online social platforms, often overwhelming the human users who try to follow them. Automatic discovery of temporal dependencies between messages (one kind of significant feature) can greatly facilitate communication.

A common way to detect significant features is to rank each feature by a score approximating its relevance to the index of interest. For example, in genome-wide data analysis, researchers are interested in detecting genomic features that are differentially expressed between two conditions, usually a target group and a control group. Performing inter-group and intra-group statistical tests is a traditional approach to measuring how significantly each genomic feature is differentially expressed.
Each type of statistical test has its own advantages in characterizing certain aspects of the differences between population means, and each often assumes a relatively simple data distribution (e.g., Gaussian, Poisson, or negative binomial) that the datasets of interest may not satisfy well. It is known that poorly matched assumptions about data distributions can lead to poor results when dealing with complex differential expression patterns. Therefore, it is critical to choose the statistical test, and more generally the feature scoring metric, that suits the underlying data distribution.

This thesis will be composed of two major parts. The first part defines the mathematical model and briefly introduces the algorithm at a high level. To better explain the workflow, we focus on the differential expression problem in genome-wide data analysis described above, yet the framework is not limited to this application. The proposed framework aims to capture differential expression information more comprehensively by learning an optimized integration of multiple statistical attributes, each of which alone has a relatively limited capacity to summarize the observed differential expression information. The problem is thus framed as a learning problem: learn the optimal discriminant boundary in a multi-dimensional space of basic attributes, each of which can be a test statistic or another feature scoring metric. The learning problem is then formulated mathematically as a constrained optimization problem that aims to maximize the number of discoveries under a user-defined false discovery rate (FDR). The FDR is the expected proportion of type I errors (false positives) among all discoveries when conducting multiple comparisons. FDR control is a widely used approach for deciding the cutoff point that separates significant from non-significant features.

We developed an effective algorithm named "Discriminant-Cut" to solve an instantiation of this problem.
Extensive comparisons of Discriminant-Cut with other cutting-edge methods were carried out to demonstrate its robustness and effectiveness. The results showed that combining multiple basic attributes is significantly advantageous for detecting differentially expressed genomic features in genome-wide data analysis. Both synthetic and real-world datasets are used in the comparisons.

In the second part we will extend the framework to another application: automatic inference of conversation structures in online text messages. We plan to analyze short text conversations and frame the problem as a significant feature selection problem, in which each feature is a connection between two randomly chosen messages.

This thesis will also present key implementation details that affect the performance of Discriminant-Cut. For example, we incorporated several heuristics into the implementation of the algorithm that greatly improve its efficiency in practice. We plan to enhance the framework with a parallel computing capability so that it can be deployed on large clusters. In addition, we will develop a "Discriminant-Cut analysis suite" that provides user-friendly GUIs, allowing users not only to analyze their datasets without complex operations and parameter settings, but also to customize their own feature scoring metrics.
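
To make the idea of an FDR-controlled cutoff concrete, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a single list of per-feature p-values. It is not the Discriminant-Cut algorithm, which the abstract describes only at a high level; the function name `benjamini_hochberg` and the example p-values are illustrative assumptions, not taken from the thesis.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of features declared significant at FDR level alpha.

    Standard Benjamini-Hochberg step-up rule: find the largest rank k such
    that the k-th smallest p-value is <= (k / m) * alpha, then declare the
    k features with the smallest p-values significant.
    """
    m = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten features (made up for illustration).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, alpha=0.05)
print(significant)
```

With these made-up values and `alpha=0.05`, only the two features with the smallest p-values pass the cutoff. A single-metric cutoff like this operates on one score per feature, whereas the framework described above seeks a boundary over several scoring metrics at once.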
Keywords/Search Tags: Feature, Detect, Statistical, Data, Framework, Differential expression, Attributes, Multiple