Genome data modeling and data compression

Posted on: 2008-11-05

Degree: M.S.

Type: Thesis

University: University of Nevada, Reno

Candidate: Radhakrishnan, Radhika

GTID: 2448390005472718

Subject: Computer Science

Abstract/Summary:

Genome data modeling is an important area of research, and different data models have been proposed for representing and storing genome data. Challenges in biological data management include data storage, retrieval, redundancy, and integrity. In this thesis we propose two data models for representing and storing genome sequence data. Instead of storing the whole sequence for each gene separately, these models store common subsequences only once, each with a sequence ID or GenBank identification number. We also store the position number, so that the whole sequence can be retrieved correctly. This would significantly reduce storage space requirements and help maintain data integrity. Our second model also includes a pre-coding routine to further reduce storage requirements. A study of randomness in genome data is also included. Both data models were tested and the results were satisfactory: we were able to compress the sequences when there was a significant amount of commonality, and the retrieval algorithm was able to retrieve them correctly.
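
The storage idea described in the abstract (keep each common subsequence once, and record an ID and a position for every gene that uses it) can be sketched in Python. This is an illustrative sketch under stated assumptions, not the implementation from the thesis: the class name `SequenceStore`, the fixed block length `BLOCK`, and the toy sequences are all hypothetical, and the abstract does not say how common subsequences are identified or what the pre-coding routine does.

```python
from typing import Dict, List, Tuple

# Hypothetical fixed block length; the abstract does not state how
# common subsequences are chosen, so fixed-size blocks are an assumption.
BLOCK = 8


class SequenceStore:
    """Toy store that keeps each distinct subsequence only once."""

    def __init__(self) -> None:
        self.block_ids: Dict[str, int] = {}  # subsequence text -> subsequence ID
        self.blocks: List[str] = []          # subsequence ID -> subsequence text
        # gene ID -> list of (position, subsequence ID) references
        self.genes: Dict[str, List[Tuple[int, int]]] = {}

    def add(self, gene_id: str, sequence: str) -> None:
        """Split a sequence into blocks and store only the unseen ones."""
        refs = []
        for pos in range(0, len(sequence), BLOCK):
            chunk = sequence[pos:pos + BLOCK]
            if chunk not in self.block_ids:
                self.block_ids[chunk] = len(self.blocks)
                self.blocks.append(chunk)
            refs.append((pos, self.block_ids[chunk]))
        self.genes[gene_id] = refs

    def retrieve(self, gene_id: str) -> str:
        """Rebuild the full sequence by ordering references by position."""
        refs = sorted(self.genes[gene_id])
        return "".join(self.blocks[block_id] for _, block_id in refs)

    def stored_bases(self) -> int:
        """Number of bases physically stored across all distinct blocks."""
        return sum(len(block) for block in self.blocks)


# Two made-up sequences that share their first 16 bases.
gene_a = "ATGGCGTACGTTAGCCTAGGATCC"
gene_b = "ATGGCGTACGTTAGCCTTTTAAAA"

store = SequenceStore()
store.add("geneA", gene_a)
store.add("geneB", gene_b)

# Retrieval is lossless, and the shared blocks are stored only once:
# 32 bases are kept instead of the 48 in the two original sequences.
print(store.retrieve("geneA") == gene_a, store.stored_bases())
```

In this toy model the saving comes entirely from the overlap between sequences: with no shared blocks, the references add overhead instead of reducing storage, which is consistent with the abstract's remark that compression was achieved when there was significant commonality.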

Keywords/Search Tags:

Data, Genome, Sequence

Related items

1	A Reusable Visualization Tool Of Genome Sequence Based On MVC Design Pattern
2	Construction Of WEB-based Visualization System On Genome Structure Annotation Data
3	Compression Algorithms On Highly Similar Genome Collections
4	Research And Implementation Of Index Structure Of Biological Sequence
5	Design And Optimization Of Parallel Algorithm For Biological Sequences
6	Study On Algorithms For Identification Of Repeats In Large-scale Genome
7	Graph algorithms for assembling integrated genome maps
8	Research On Automatic Gene Structure Annotation System For Eukaryotic Genomes
9	Distributed Gene Sequence Similarity Calculation Based On Secure Multiparty Computation
10	Integrating experimental high-throughput transcript detection data into probabilistic gene finding