Research On Metagenomic Sequence Binning Algorithm Based On Feature Vectors

Posted on: 2016-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 2308330470957806
Subject: Computer application technology
Abstract/Summary:
Metagenomic sequence binning is a fundamental problem in metagenomic studies. Experimental methods are costly and slow when classifying massive numbers of sequences, so computational methods based on binning algorithms have become the trend for binning these sequences. The main binning algorithms can be classified as alignment-based or composition-based. The former need the complete sequence information of known genomes for alignment, whereas the latter need only feature vector information. However, existing composition-based methods may achieve binning accuracy as low as 60% at low taxonomic levels, and their running time becomes unacceptable on the huge volume of metagenomic sequencing data. Composition-based methods are further divided into supervised and unsupervised methods; this work studies the supervised ones. For metagenomic sequencing data, this work designs a feature vector extraction method and proposes a fast and accurate metagenomic binning algorithm for multiple species and low taxonomic levels. The main research content and contributions are as follows:

1. A method of feature vector extraction for metagenomic sequencing data
For metagenomic sequences, we exploit the properties of the transition probability matrix of a Markov model to propose a new feature extraction method. We then obtain the feature vector sets and verify that the feature vectors discriminate among various species. We also apply a dimension reduction method based on mutual information selection to these feature vectors. We then use both the new method and a feature extraction method based on k-mer frequency information with the LIBSVM algorithm to compare their performance. The experimental results show that LIBSVM with the newly proposed method is 2-3% higher in binning accuracy than LIBSVM with k-mer frequency information, and is also about a fourth to a fifth better in binning running time.

2. MarkovBinning, an SVM algorithm based on feature vectors
First, we process the feature vector sets of known species to filter out noisy data. We define a new similarity measure and compute the center feature vector used for filtering. The filtered feature vector sets are taken as training sets for the SVM algorithm. We use a grid search with variable step sizes to speed up the optimization of the penalty coefficient and the kernel function parameter (C, γ). Finally, we conduct a comprehensive comparison of both binning accuracy and execution time between MarkovBinning and existing algorithms, including TACOA, AbundanceBin and MetaCluster. The experimental results show that MarkovBinning outperforms all the other algorithms by 10% in average accuracy, with significantly reduced running time.
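
As an illustration of the feature extraction idea in contribution 1, the sketch below builds a feature vector from the transition probability matrix of a Markov chain over the nucleotides A, C, G and T. It is only a minimal sketch under stated assumptions: the abstract does not give the Markov order, the normalization, or the handling of ambiguous bases, so the function name `markov_features`, the `order` parameter, the row-wise normalization, and the skipping of non-ACGT characters are all hypothetical choices rather than the thesis's actual construction.

```python
from itertools import product

ALPHABET = "ACGT"


def markov_features(seq, order=1):
    """Return the flattened transition-probability matrix of a Markov chain.

    Assumption (not from the thesis): each context of length `order` gets one
    row of four probabilities, normalized so that the row sums to 1.
    """
    seq = seq.upper()
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]
    counts = {ctx: dict.fromkeys(ALPHABET, 0) for ctx in contexts}

    # Count how often each context is followed by each nucleotide.
    for i in range(len(seq) - order):
        ctx, nxt = seq[i:i + order], seq[i + order]
        # Skip windows containing ambiguous bases such as N.
        if ctx in counts and nxt in counts[ctx]:
            counts[ctx][nxt] += 1

    # Normalize each row into probabilities; unseen contexts stay all-zero.
    features = []
    for ctx in contexts:
        total = sum(counts[ctx].values())
        for base in ALPHABET:
            features.append(counts[ctx][base] / total if total else 0.0)
    return features


vec = markov_features("ACGTACGTAC")
print(len(vec))  # 4 contexts x 4 next bases = 16 features for order 1
```

With `order=1` the vector has 16 dimensions, and with `order=2` it has 64. In a supervised setting such vectors, computed for reads of known species, would serve as the training input to an SVM, which is the role the abstract assigns to the feature vector sets.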
Keywords/Search Tags: sequence, feature vector, Markov model, dimension reduction, parameter optimization, SVM algorithm