Software and Website Development for Data Analysis and Management of DNA Barcoding

Posted on: 2015-06-07
Degree: Ph.D
Type: Thesis
University: The Chinese University of Hong Kong (Hong Kong)
Candidate: Fan, Long
Full Text: PDF
GTID: 2478390017495439
Subject: Bioinformatics
Abstract/Summary:
Species identification based on short sequences of DNA markers, i.e., DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multi-locus barcoding datasets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g., ≥ 5,000 sequences), but its accuracy is a concern, and it has been criticized for its local optimization. More accurate software currently available requires sequence alignment or complex calculations, which are time-consuming for large datasets during data preprocessing or the search stage. Therefore, it is imperative to develop a practical platform for both accurate and scalable species identification for DNA barcoding.

In the first part of this thesis study, I developed VIP Barcoding (Vector-based Identification Platform for DNA Barcoding), user-friendly software with a graphical user interface for rapid DNA barcoding. VIP Barcoding adopts a hybrid, two-stage algorithm. First, an alignment-free, cosine similarity-based composition vector method is used to reduce the search space by screening a reference database. The alignment-based K2P distance nearest neighbor method is then employed to analyze the smaller dataset generated in the first stage. In comparison with other software, I demonstrate that VIP Barcoding has (1) higher accuracy than Blastn and several alignment-free methods, and (2) greater scalability than alignment-based distance methods and character-based methods. These results suggest that the platform can handle both large-scale and multi-locus barcoding data with high accuracy and can contribute to DNA barcoding data analysis.
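The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the VIP Barcoding implementation: plain k-mer frequency vectors stand in for the composition vector method, the sequences are assumed to be already aligned and of equal length, and the names `composition_vector`, `cosine`, `k2p_distance`, `identify`, `K`, and `shortlist_size` are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length; an illustrative value, not a setting taken from the thesis
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
PURINES = {"A", "G"}


def composition_vector(seq):
    """Plain k-mer count vector (a simplified stand-in for a composition vector)."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    return [counts.get(kmer, 0) for kmer in KMERS]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned, equal-length sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return math.inf
    diffs = [(x, y) for x, y in pairs if x != y]
    # A transition keeps the base class (purine<->purine or pyrimidine<->pyrimidine).
    transitions = sum((x in PURINES) == (y in PURINES) for x, y in diffs)
    p = transitions / len(pairs)                 # transition proportion
    q = (len(diffs) - transitions) / len(pairs)  # transversion proportion
    arg1, arg2 = 1 - 2 * p - q, 1 - 2 * q
    if arg1 <= 0 or arg2 <= 0:
        return math.inf  # sequences too divergent for the K2P correction
    return -0.5 * math.log(arg1 * math.sqrt(arg2))


def identify(query, references, shortlist_size=2):
    """Stage 1: cosine screening of all references. Stage 2: K2P nearest neighbor."""
    qv = composition_vector(query)
    ranked = sorted(
        references,
        key=lambda name: cosine(qv, composition_vector(references[name])),
        reverse=True,
    )
    shortlist = ranked[:shortlist_size]
    return min(shortlist, key=lambda name: k2p_distance(query, references[name]))
```

In this sketch the alignment-free stage touches every reference, but only `shortlist_size` candidates reach the alignment-based distance calculation, which is where the reduction in search space comes from.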
VIP Barcoding is free and available at: http://msl.sls.cuhk.edu.hk/vipbarcoding/.

In the second part of my thesis study, with the aim of accelerating VIP Barcoding further, I integrated locality-sensitive hashing (LSH) into VIP Barcoding, yielding LV Barcoding (LSH-based VIP Barcoding). The linear search in the first stage of VIP Barcoding is a brute-force search, which leaves much room for optimization. After this modification, LV Barcoding runs faster than VIP Barcoding: LSH projects each barcode in the reference databases into labeled buckets during data preprocessing, so that the subsequent query stage searches for matches only in buckets containing fewer reference barcodes. LV Barcoding will be released on the same web page as VIP Barcoding.

In the third part of the study, I constructed a region-specific barcoding website to enable species retrieval and identification of fauna in Hong Kong. Hong Kong is a maritime city with a coastline of more than 800 km and harbours a rich diversity of fauna, including many species that are hard for non-specialists to identify. This Hong Kong-oriented DNA barcoding website provides more details about these species (e.g., photographs, information on voucher specimens, literature references, and distribution in Hong Kong and elsewhere). The website provides a framework for a database in which COI barcodes across all common taxonomic groups of macro-fauna can be stored and retrieved in the future, allowing rapid and accurate species identification. It is expected that this website could facilitate effective conservation management of, and public education about, the marine species of Hong Kong.

In sum, I integrated cosine similarity and LSH with the basic approach of the composition vector method to propose a novel algorithm.
Based on the new algorithm, directly accessible software (LV Barcoding) for DNA barcoding data analysis was developed and made freely available to the community. A regional DNA barcoding website for data management and taxonomic identification of the fauna of Hong Kong was also constructed.
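The bucketing step that LSH adds can be sketched as follows. The abstract does not say which LSH family LV Barcoding uses, so this example assumes random-hyperplane hashing, a common choice when similarity is measured by cosine. All names here (`make_hyperplanes`, `signature`, `build_index`, `candidates`, `n_bits`, `seed`) are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vector)) >= 0)
        for plane in hyperplanes
    )


def build_index(reference_vectors, hyperplanes):
    """Preprocessing: place every reference vector into the bucket named by its signature."""
    buckets = defaultdict(list)
    for name, vec in reference_vectors.items():
        buckets[signature(vec, hyperplanes)].append(name)
    return buckets


def candidates(query_vector, buckets, hyperplanes):
    """Query stage: return only the references that share the query's bucket."""
    return buckets.get(signature(query_vector, hyperplanes), [])
```

Vectors separated by a small angle tend to receive the same signature, so a query is compared only against the references in its own bucket rather than against the whole database. A single hash table can miss near neighbors that fall just across a hyperplane; using several independent tables is a common remedy, though whether LV Barcoding does so is not stated in the abstract.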
Keywords/Search Tags: DNA barcoding, Data, Website, Identification, Hong Kong, Software, Management, Species