The Research On Similarity Of DNA Sequences Based On Information Discrepancy

Posted on: 2010-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: F Liu
Full Text: PDF
GTID: 2178360275981831
Subject: Information and Communication Engineering
Abstract/Summary:
With the progress of the Human Genome Project (HGP) and of research on gene and protein sequences, databases of molecular sequence and structure data are growing rapidly. The need to analyze and process these data has accelerated the development of bioinformatics. Sequence similarity analysis is one of the most important topics in bioinformatics; it is widely used in gene classification, prediction of sequence structure and function, species phylogeny, and so on. This dissertation analyzes the similarity of DNA sequences, and algorithms for clustering them, based on information discrepancy.

The function of degree of discrepancy (FDOD) is widely used in bioinformatics. Building on a study of FDOD, we propose a new representation, the base-base information set, to characterize DNA sequences, and compute the distance between two sequences with FDOD. The base-base information set comprises the joint probabilities of base pairs separated by a distance of 1 to L, where L is an adjustable parameter. The size of the base-base information set grows linearly with L, whereas the size of the complete information set grows exponentially with subsequence length. We also analyze how the distance changes as L varies. The experimental results show that the distance between two sequences is insensitive to L when the sequences are similar, and that the method is effective for analyzing sequence similarity.

We then analyze the relation between FDOD and Shannon entropy. FDOD measures the change in entropy when two sequences are clustered. We introduce the generalized information distance (GID), which measures the change in total entropy when two sequences are clustered, and modify both FDOD and GID according to sequence length. The modified GID is effective for analyzing sequence similarity whether or not the sequences are closely similar. Finally, we propose a method for directly clustering a group of sequences based on the modified generalized information distance.
The experimental results show that this clustering method is feasible and effective.
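
To make the base-base information set concrete, the following is a minimal Python sketch, not the thesis implementation. It assumes that the set holds, for each gap d from 1 to L, the 16 joint probabilities of the ordered base pairs found d positions apart. It also assumes that the FDOD of two distributions p and q takes the form sum_i [p_i log2(p_i / m_i) + q_i log2(q_i / m_i)] with m_i = (p_i + q_i) / 2, and that the per-gap values are simply summed. The abstract does not state the exact formula or the aggregation, and the names `base_base_information_set`, `fdod_two`, `sequence_distance`, and `max_gap` are hypothetical.

```python
from collections import Counter
from math import log2

BASES = "ACGT"


def base_base_information_set(seq, max_gap):
    """Joint probabilities of base pairs (x, y) separated by d = 1..max_gap.

    Returns a dict keyed by (d, x, y). It always has 16 * max_gap entries,
    so its size grows linearly with max_gap (the parameter L in the text).
    """
    seq = seq.upper()
    info = {}
    for d in range(1, max_gap + 1):
        # Count every pair (seq[i], seq[i + d]) in the sequence.
        counts = Counter(zip(seq, seq[d:]))
        # Only pairs made of A, C, G, T are kept; other symbols are ignored.
        total = sum(counts[(x, y)] for x in BASES for y in BASES)
        for x in BASES:
            for y in BASES:
                info[(d, x, y)] = counts[(x, y)] / total if total else 0.0
    return info


def fdod_two(p, q):
    """Assumed two-distribution discrepancy (see the hedged formula above).

    Terms with zero probability contribute nothing to the sum.
    """
    value = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:
            value += pi * log2(pi / mi)
        if qi > 0:
            value += qi * log2(qi / mi)
    return value


def sequence_distance(seq_a, seq_b, max_gap):
    """Sum the per-gap discrepancy over d = 1..max_gap (assumed aggregation)."""
    a = base_base_information_set(seq_a, max_gap)
    b = base_base_information_set(seq_b, max_gap)
    total = 0.0
    for d in range(1, max_gap + 1):
        keys = [(d, x, y) for x in BASES for y in BASES]
        total += fdod_two([a[k] for k in keys], [b[k] for k in keys])
    return total
```

Under these assumptions, identical sequences are at distance 0, and two sequences with no base pair in common at a given gap contribute 2 bits for that gap. Calling `sequence_distance` for several values of `max_gap` gives a simple way to observe how the distance responds to L.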
Keywords/Search Tags:BB information set, FDOD, generalized information distance, direct clustering, similarity of sequences