
Information Extraction for Virus Classification and Robust Dimension Reduction

Posted on: 2015-01-22
Degree: Ph.D
Type: Thesis
University: University of Illinois at Chicago
Candidate: Huang, Hsin-Hsiung
Full Text: PDF
GTID: 2478390017992119
Subject: Statistics
Abstract/Summary:
As technology improves, big data are collected more easily and more frequently. In particular, laboratories are producing large numbers of genome sequences, and both academia and industry are acquiring sound and image files. Many new statistical methods have been developed to cope with the corresponding computational problems. This thesis addresses two of them: comparing multiple-segmented viral genome sequences accurately and quickly, and carrying out principal component analysis on high-dimensional and functional data with outliers. This summary aims to help readers grasp the essence of my study.

This thesis consists of two parts: multiple-segmented virus classification and iterative kernel robust principal component analysis. The two parts share a common characteristic: both encounter computational difficulty due to either large sample sizes or high dimensions, so transformations are needed that reduce the dimension while keeping as much of the original information as possible. This is a "big data" problem, one of the most prominent challenges for statisticians and information scientists.

The first part focuses on classifying multiple-segmented viruses using their genome sequences. Virologists usually compare only a single genome segment of one virus with the corresponding segment of another. There has been a lot of research on sequence comparison approaches. In one of our past studies, we proposed the natural vector method and showed that it classifies single-segmented viruses accurately and quickly. However, it remains an open question how to compare multiple-segmented viruses, especially when they have unequal numbers of segments or when biological information is insufficient. For example, Influenza A virus has 8 segments, but scientists sometimes recover only some of them. If we found only 6 segments from a new Influenza A virus strain, how could we compare it with other viruses? One solution is to compare the viruses segment by segment. However, in order to use the full viral genome information, we would like to compare all segments simultaneously. Currently, the consensus tree method is a widely used tool for combining the phylogenetic trees of the individual segments. However, this approach has not been validated. Therefore, in Chapter 2 we propose a new method to measure the similarity of multiple-segmented viruses accurately and quickly. In Chapter 3, we analyze the newly mutated Influenza A virus H7N9, whose first human infections were reported in 2013. The distance between viral genome sequences has been used for the classification of viruses. The core question is how to measure the similarity of two multiple-segmented viral genome sequences that may contain unequal numbers of segments.

In the second part of this thesis, we review robust kernel principal component analysis and the computational and accuracy problems that may arise with it. In Chapter 4, we propose a novel iterative robust kernel principal component analysis to cope with both of these problems. The computation and storage costs of singular value decomposition (SVD) are proportional to the cube and the square of the sample size, respectively. To overcome these challenges, various kinds of online principal component analysis (PCA) have been developed for sequential data. However, online PCA tends to be influenced by noise or outliers more easily than classical PCA. Meanwhile, a robust PCA using a differentiable weight function has been proposed to counter the effect of outliers in classical PCA and kernel PCA. It is natural to combine the advantages of both methods, so we propose an iterative robust kernel principal component analysis to address the above challenges.

Each chapter begins with a brief outline that presents its main ideas and states its main contributions. The chapters present either background knowledge or our new methods and applications.
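The cost statement in the second part (time growing with the cube and storage with the square of the sample size) can be illustrated with a minimal sketch of plain batch kernel PCA. This is not the iterative robust method proposed in the thesis; it is the standard, non-robust procedure, shown only to make the scaling concrete. The function names, the Gaussian (RBF) kernel, the `gamma` value, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma=0.1):
    """Return the n x n Gaussian (RBF) kernel matrix for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    return np.exp(-gamma * sq_dists)


def batch_kernel_pca(X, n_components=2, gamma=0.1):
    """Plain batch kernel PCA (illustrative; no robustness, no iteration)."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)            # storage grows like n^2
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix
    Kc = J @ K @ J                             # kernel matrix centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)      # time grows like n^3
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.maximum(eigvals[order], 0.0)
    # Projections of the training points onto the leading kernel components.
    scores = eigvecs[:, order] * np.sqrt(lam)
    return scores, K.nbytes


# Simulated data (an assumption for this sketch): 200 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, kernel_bytes = batch_kernel_pca(X)
print(scores.shape, kernel_bytes)
```

Doubling the number of samples quadruples the memory needed for the kernel matrix and multiplies the eigendecomposition time by roughly eight, which is the difficulty that motivates online and iterative variants.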
Keywords/Search Tags:Genome sequences, Principal component analysis, Virus, Information, PCA, Chapter, Classification, Data