
Machine learning techniques for alleviating inherent difficulties in bioinformatics data

Posted on: 2016-06-05
Degree: Ph.D.
Type: Dissertation
University: Florida Atlantic University
Candidate: Dittman, David J., II
Full Text: PDF
GTID: 1478390017981238
Subject: Computer Science
Abstract/Summary:
In response to the massive amounts of data that make up many bioinformatics datasets, it has become increasingly necessary for researchers to use computers to aid them in their endeavors. Bioinformatics datasets are a challenge to work with because of difficulties such as high dimensionality, class imbalance, noisy data, and difficult-to-learn class boundaries. One potential source of assistance is the domain of data mining and machine learning, a field that focuses on working with large amounts of data and develops techniques to discover new trends and patterns hidden within the data and to increase the capability of researchers and practitioners to work with it. Within this domain there are techniques designed to eliminate irrelevant or redundant features, balance the membership of the classes, handle errors found in the data, and build predictive models for future data.

This dissertation is an in-depth analysis of how the domain of data mining and machine learning is uniquely suited for alleviating the inherent difficulties found within bioinformatics datasets. First, we will present a number of different gene selection techniques in terms of their stability, or robustness. Next, we will present an analysis of the entire process of ensemble gene selection, including different approaches for implementing the ensemble and for aggregating ranked feature lists. We will then provide a framework for using gene selection and classification with the focus of maximizing classification performance while simplifying the machine learning process. Then, we will discuss two new approaches for incorporating ensemble learning along with gene selection, comparing them to the case in which no ensemble learning approach is applied. Lastly, we will give a detailed analysis of the data sampling process for bioinformatics data, including which techniques should be used, when and how they should be applied, and to what extent data sampling should be performed. Overall, this dissertation presents a thorough analysis of how machine learning techniques can alleviate inherent difficulties found in bioinformatics data.
Keywords/Search Tags:Data, Bioinformatics, Machine learning, Inherent difficulties, Techniques, Gene selection