The Research On Similarity Of DNA Sequences Based On Information Discrepancy

Posted on:2010-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:F Liu

Full Text:PDF

GTID:2178360275981831

Subject:Information and Communication Engineering

Abstract/Summary:

With the progress of the Human Genome Project (HGP) and of research on gene and protein sequences, databases of molecular sequence and structure data are growing rapidly. The need to analyze and process these data has accelerated the development of bioinformatics. Sequence similarity analysis is one of the most important areas of bioinformatics. It is widely used for classifying genes, predicting the structure and function of sequences, reconstructing the phylogeny of species, and so on.

This dissertation analyzes the similarity of DNA sequences, and algorithms for clustering them, based on information discrepancy.

The function of degree of discrepancy (FDOD) is widely used in bioinformatics. Based on a study of FDOD, we propose a new representation, the base-base (BB) information set, to characterize DNA sequences, and compute the distance between two sequences with FDOD. The base-base information set comprises the joint probabilities of base pairs separated by a distance of 1 to L, where L is an adjustable parameter. The size of the base-base information set increases linearly with L, whereas the size of the complete information set increases exponentially with subsequence length. We also analyze how the distance changes as L varies. The experimental results show that the distance between two similar sequences is insensitive to changes in L, and that the method is effective for analyzing sequence similarity.

We then analyze the relation between FDOD and Shannon entropy. FDOD measures the change in entropy when two sequences are clustered. We introduce the generalized information distance (GID), which measures the change in total entropy when two sequences are clustered, and modify both FDOD and GID to account for sequence length. The modified GID is effective for analyzing sequence similarity whether or not the sequences are closely similar. Finally, we propose a method for directly clustering a group of sequences based on the modified generalized information distance. The experimental results show that the method is feasible and effective.
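
The base-base information set and the FDOD distance described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several details are assumptions: the function names (bb_information_set, fdod, bb_distance) are hypothetical, base pairs are treated as ordered, characters other than A, C, G, T are skipped, logarithms are base 2, and FDOD is taken in its commonly cited two-distribution form, the sum over both distributions of p_k log(p_k / m_k), where m_k is the mean of the two probabilities. The length-based modification and GID mentioned in the abstract are not covered.

```python
from collections import Counter
from math import log2

BASES = "ACGT"


def bb_information_set(seq, L):
    """Joint probabilities of ordered base pairs (x, y) at gaps 1..L.

    Returns a dict keyed by (gap, x, y) with 16 * L entries, so the
    size grows linearly with L.
    """
    seq = seq.upper()
    features = {}
    for d in range(1, L + 1):
        # Count pairs (seq[i], seq[i + d]); skip pairs with non-ACGT symbols.
        counts = Counter(
            (seq[i], seq[i + d])
            for i in range(len(seq) - d)
            if seq[i] in BASES and seq[i + d] in BASES
        )
        total = sum(counts.values())
        for x in BASES:
            for y in BASES:
                features[(d, x, y)] = counts[(x, y)] / total if total else 0.0
    return features


def fdod(p, q):
    """Assumed two-distribution FDOD: sum of p_k*log2(p_k/m_k) + q_k*log2(q_k/m_k)."""
    total = 0.0
    for k in p.keys() | q.keys():
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        m = (pk + qk) / 2
        if pk > 0:
            total += pk * log2(pk / m)
        if qk > 0:
            total += qk * log2(qk / m)
    return total


def bb_distance(a, b, L):
    """Distance between two DNA sequences from their BB information sets."""
    return fdod(bb_information_set(a, L), bb_information_set(b, L))


if __name__ == "__main__":
    s1 = "ACGTACGTTGCA"
    s2 = "ACGTACGATGCA"
    for L in (1, 2, 4):
        print(L, bb_distance(s1, s2, L))
```

Because this form of FDOD is additive over the entries of the information set, applying it to the concatenated per-gap probabilities is the same as summing one discrepancy per gap; how the thesis combines the gaps is not stated in the abstract, so this choice is also an assumption.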

Keywords/Search Tags:

BB information set, FDOD, generalized information distance, direct clustering, similarity of sequences

Related items

1	Research On Topology Relation-based Distance Metric And Clustering Algorithms
2	Similarity Measures In Cluster Analysis And Its Applications
3	The Study Of Sequences With Low(ODD) Even Correlation
4	Research Of Image Recognition Techniques Based On The Semi-supervised Clustering And Generalized Distance Function Learning
5	Information fusion of multiple genomic sensors for clustering and cis-regulatory element identification
6	A Study On Robust Optimization And Diagnostic Analysis Of Multidimensional System Based On MTS
7	Research On The Clustering Analysis Algorithms In Bioinformatics
8	Generalized Information Quality And Its Application
9	Research On Legendre And Jacobi Sequences
10	Similarity Measures And New Clustering Methods For Categorical Sequences