Vectorization Generalizations in Genomics and Transportation

Posted on: 2014-09-29
Degree: Ph.D
Type: Thesis
University: University of Illinois at Chicago
Candidate: Hernandez, Troy A
Full Text: PDF
GTID: 2458390005990401
Subject: Statistics
Abstract/Summary:
The process of transforming a sample to a pair of input and output vectors is sometimes referred to as "vectorization". Those samples and their respective vectorizations are used within various learning algorithms to create a model that makes predictions about unknown output vectors given known input vectors. Finding a good combination of vectorization and algorithm accounts for much of the work in statistical learning applications. This thesis aims to compare, generalize, and improve existing vectorizations within the fields of bioinformatics and transportation.

Many methods have been proposed for the phylogenetic classification of viruses. Performing these classifications in a timely manner is of interest to researchers and to those ensuring national security. While multiple sequence alignment remains the tool of choice for practitioners for reasons of interpretability, alignment-free methods have gained popularity due to the substantial increases in speed they provide.

We first extend the natural vector description of genomes to handle viruses and various issues unique to viral genomes. We provide an alternative definition of the natural vector that is able to handle ambiguous nucleotides. We provide a bound on the distance induced by the natural vector between a genome and a mutation of that genome due to a single-nucleotide polymorphism (SNP).

Applying these methods, we test the ability of the natural vector to accurately classify viruses using the National Center for Biotechnology Information's (NCBI) collection of 2044 virus reference sequences (RefSeq), which covers the range of known viruses derived from all 7 Baltimore classes, 73 families, and 253 genera. We then compare these classification results to those of the predominant method of measuring genome similarity, multiple sequence alignment (MSA).

We then present a new family of alignment-free vectorizations of the genome that maintains the speed of existing alignment-free methods and incorporates the interpretability of sequence alignment. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector.

For the first time, we provide a thorough comparison of 5 popular characterizations of genome similarity using k-nearest neighbor classification, and evaluate these on two collections of viruses. The first is the NCBI RefSeq collection above. This informs us of the quality of the various vectorizations' high-level classifications, i.e., Baltimore class, family, and genus. The second collection comes from the online PAirwise Sequence Classification (PASC) tool and consists of 53 families/genera of curated viruses for a total of 9545 viruses. This collection informs us of the quality of the various vectorizations' low-level classifications, i.e., species. From these classification results we make recommendations for the reclassification of some viruses.

The prediction of bus arrival times is important for users of public transportation. This problem has received some attention, with various authors proposing different vectorizations and different representations of the problem. For example, some propose different models for different times of the day, while others suggest using the same model throughout the day, with the posted schedule as a parameter within the model. We first generalize the vectorizations and representations existing in the literature. We then propose a method of recovering the schedule and show, using 3 weeks of Chicago Transit Authority (CTA) bus data, that the use of this schedule uniformly improves all existing methods.

Lastly, we analyze the data usage incurred by reporting real-time GPS traces. The problem of tracking a GPS device relies upon predicting vehicle location in general, as opposed to predicting vehicle location on fixed routes as above. We propose an online method that uses historical location data. We compare this method of location prediction to commonly used methods using a metric based on the efficiency of mobile data usage. Comparisons of 12 different tracking methods are made on two data sets: the first from Microsoft Research (MSR) and the second from the UIC shuttle. We show that at low error tolerances the methods are equivalent, but at higher error tolerances the proposed method is substantially more efficient.
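
The k-mer-based vectorization described above (k-mer frequencies combined with descriptive statistics of k-mer positions) can be illustrated with a minimal sketch. This is not the thesis's definition: the function names `kmer_position_vector` and `nearest_label`, the choice of statistics (count, mean start position, variance of start position), the handling of ambiguous letters by skipping them, and the use of plain Euclidean distance with a single nearest neighbor are all assumptions made for illustration.

```python
from itertools import product
from math import dist


def kmer_position_vector(seq, k=2, alphabet="ACGT"):
    """Concatenate [count, mean position, position variance] for every k-mer.

    Hypothetical illustration: the statistics chosen here are assumptions,
    not the definitions used in the thesis.
    """
    positions = {"".join(p): [] for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in positions:  # words with ambiguous letters (e.g. N) are skipped
            positions[word].append(i + 1)  # 1-based start position
    vector = []
    for word in sorted(positions):
        pos = positions[word]
        n = len(pos)
        mean = sum(pos) / n if n else 0.0
        var = sum((p - mean) ** 2 for p in pos) / n if n else 0.0
        vector.extend([float(n), mean, var])
    return vector


def nearest_label(query, labelled, k=2):
    """Return the label of the nearest labelled sequence (1-nearest neighbor).

    `labelled` is a list of (sequence, label) pairs; distance is Euclidean.
    """
    q = kmer_position_vector(query, k)
    return min(labelled, key=lambda item: dist(q, kmer_position_vector(item[0], k)))[1]
```

For example, `kmer_position_vector("ACAC", k=1)` places A at positions 1 and 3 and C at positions 2 and 4, so the A block is `[2.0, 2.0, 1.0]` and the C block is `[2.0, 3.0, 1.0]`. In this sketch counts and positional statistics are left on their raw scales; a real comparison across genomes of different lengths would likely need some normalization.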
Keywords/Search Tags: Vector, Methods, Viruses