
Research On Biological High-Throughput Sequencing Fragment Assembly And Molecular Biomarker Detection Algorithms

Posted on: 2016-07-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Wang
GTID: 1108330479478613
Subject: Computer Science and Technology
Abstract/Summary:
In life science research, rapid and accurate acquisition of an organism's genetic information, which is stored entirely in its genome, is of great significance. Sequencing technology has successfully deciphered the complex and varied genomic information of organisms and broadened researchers' field of vision. Recent scientific discoveries resulting from the application of next-generation sequencing (NGS) technologies highlight the striking impact of these fast, inexpensive, massively parallel platforms on genetics. The resulting flood of data poses a new challenge for bioinformatics and creates an urgent demand for efficient algorithms for massive data analysis. Genomes are being sequenced ever faster, and the relevant data are accumulating at a striking rate. In contrast, the growth of computer performance has tended to slow in recent years, and new cloud computing technologies such as MapReduce and Spark have emerged for big data processing. However, these new technologies have not yet been sufficiently applied in bioinformatics research. Therefore, using basic sequence comparison techniques, string and graph algorithms, and MapReduce technology, we study in depth several critical issues in NGS sequence assembly and molecular biomarker detection. The main contributions of this dissertation are as follows.

(1) A clustering method based on MapReduce is proposed for NGS data. Biological sequence clustering is a classical research topic in bioinformatics that supports downstream analysis. We propose a new greedy clustering method based on MapReduce with two new sequence similarity measures. Based on the fact that two similar sequences should share a certain number of k-mers, we present an approximate sequence similarity algorithm that counts shared k-mers, which eliminates unnecessary computation between unrelated sequences.
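The shared k-mer idea can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the function names, the multiset counting rule, and the default values of `k` and `min_shared` are all assumptions chosen for illustration, and the sketch runs on a single machine rather than on MapReduce.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int) -> int:
    """Number of k-mers shared by a and b, counted as a multiset intersection."""
    common = kmer_counts(a, k) & kmer_counts(b, k)  # per-k-mer minimum count
    return sum(common.values())


def may_be_similar(a: str, b: str, k: int = 4, min_shared: int = 3) -> bool:
    """Cheap pre-filter: only pairs passing it would go on to a full alignment.

    k and min_shared are hypothetical defaults, not values from the source.
    """
    return shared_kmers(a, b, k) >= min_shared
```

A pair that fails `may_be_similar` is skipped without any alignment, which is where the saving on unrelated sequences would come from.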
For similar sequences, we present an accurate sequence similarity algorithm based on block alignment extended from shared k-mers, which is also further optimized. With our method, large-scale sequence sets such as ESTs or NGS reads can be clustered efficiently.

(2) A de novo NGS whole-genome assembly method, Seeds Graph, is proposed based on sequence clustering and MapReduce. De novo whole-genome assembly from NGS data is hindered by the large number of reads, their short length, and their higher base-call error rate. Seeds Graph scales to large numbers of sequences through MapReduce. We define a seed structure to represent sequence clusters and build a graph, the Seeds Graph, with seeds as vertices. It uses a series of seeds that denote consensus sequences for reads to overcome the short read length. Because the seed is designed around approximate string matching, it can tolerate a high error rate to a certain degree. With the help of compatible path analysis, Seeds Graph can handle complex paths in the graph.

(3) A genome structural variation biomarker detection method, mdSV, is proposed for multiple-donor NGS data. To obtain a quick view of the relations between phenotype and genotype in organisms, a prevalent strategy is to sequence more individuals at relatively low coverage instead of a single individual at deep coverage. Whole-genome structural variation detection performs poorly on such data. We therefore propose a detection method supporting multiple-donor NGS data, called mdSV, which uses paired-end and split-read techniques together with multiple datasets to detect more structural variations.
With a modified alignment algorithm, mdSV can also predict precise breakpoint positions of structural variations.

(4) Integrating the sequence comparison algorithms in this work, we design and implement software for miRNA biomarker detection from ESTs or NGS data, based on homologous search and an ensemble learning method. In this tool, we search for homologous sequences in ESTs or NGS data using known miRNA precursors as references, then analyze hairpin structure with RNAfold to obtain a rough candidate set of miRNA precursors. Classifying this set is a typical imbalanced classification problem because of the high false positive rate. We propose an ensemble classifier with a voting policy for classifying this set. We choose known miRNA precursors as positive samples and deliberately selected negative samples to train multiple individual classifiers in a manner suited to imbalanced data, and then combine them into a single ensemble classifier. Our software can predict a high-confidence set of miRNA precursors, which can be used for downstream miRNA analysis and detection research.
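One common way to build a voting ensemble on imbalanced data is sketched below. The abstract does not say how the individual classifiers are trained, so everything here is an assumption: each member sees all positives plus one disjoint slice of the negatives, the base learner is a simple nearest-centroid rule, and the feature vectors are hypothetical numeric descriptors of a candidate hairpin.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector], int]


def centroid(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_centroid_classifier(pos: List[Vector], neg: List[Vector]) -> Classifier:
    """Assumed base learner: label 1 if x is at least as close to the positive centroid."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: 1 if sq_dist(x, cp) <= sq_dist(x, cn) else 0


def train_voting_ensemble(pos: List[Vector], neg: List[Vector],
                          n_members: int) -> Classifier:
    """Train one member per disjoint slice of negatives; predict by strict majority."""
    chunks = [neg[i::n_members] for i in range(n_members)]
    members = [train_centroid_classifier(pos, c) for c in chunks if c]

    def predict(x: Vector) -> int:
        votes = sum(m(x) for m in members)
        return 1 if 2 * votes > len(members) else 0

    return predict
```

Splitting the majority class across members keeps each training set closer to balanced, while the vote still draws on every negative sample.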
Keywords/Search Tags: sequence alignment, sequence clustering, genome assembly, structural variation, high-throughput sequencing