
Study On The Statistical Distribution Of Global Alignment Optimal Scores For Random Protein Sequences

Posted on: 2006-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: H X Pang
Full Text: PDF
GTID: 2120360182970299
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Sequence alignment is widely used in all branches of bioinformatics, from function and structure prediction and database searching to phylogenetic analysis. However, an alignment score alone conveys little information. More important is the inference of biological significance from statistical significance, i.e. the probability of obtaining a score at least as great from random sequences. This research aims to determine the theoretical distribution of optimal global alignment scores for random protein sequences.
Real and unrelated sequences, together with random sequences prepared in five different ways, were used as the reference datasets. Global alignments were carried out with the Needleman-Wunsch algorithm, and the optimal scores were fitted to the gamma, normal and Gumbel distributions.
The real and unrelated sequences (RUS) were extracted from the SCOP database using three criteria: sequence identity less than 10%, E-value greater than 10, and representatives of different folds. The three extracted sequence files were further processed to yield 15 sequence files with different sequence lengths. In addition, 11 pairs of sequences were randomly selected from the SCOP database and randomized in five different ways to serve as the random sequences in this study. The five randomization approaches were: 1) maintaining the sequence length and the average amino acid composition of proteins (ACL); 2) maintaining the sequence length and the amino acid composition of the authentic sequences (CLA); 3) global shuffling or permutation (GS); 4) local shuffling or permutation (LS); 5) simulation of the mutational process (SMP).
Four scoring matrices (PAM120, PAM250, BLOSUM50 and BLOSUM62) were selected, and global alignments were carried out with affine and constant gap penalties respectively, with end gaps penalized.
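As a rough illustration of the GS and LS randomizations and of global alignment scoring, the following Python sketch may help. It is not the thesis code: the function names, the non-overlapping window scheme used for local shuffling, the toy match/mismatch scores (standing in for a PAM or BLOSUM matrix) and the linear gap penalty are all assumptions made for illustration.

```python
import random


def global_shuffle(seq, rng):
    """Permute all residues; length and composition are preserved (GS)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def local_shuffle(seq, window, rng):
    """Permute residues inside consecutive windows (LS).

    Assumption: windows are non-overlapping and the last one may be shorter.
    """
    return "".join(
        global_shuffle(seq[i:i + window], rng)
        for i in range(0, len(seq), window)
    )


def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch recurrence).

    Assumptions: toy match/mismatch scores instead of a substitution
    matrix, and a linear gap penalty. End gaps are penalized because the
    first row and column are initialized with multiples of `gap`.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

With these helpers, a null score sample for one pair of sequences could be gathered as `[nw_score(first, global_shuffle(second, rng)) for _ in range(10000)]`, and then again with the roles of the two sequences swapped.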
For the RUS sequence sets, alignments were carried out between every pair of sequences in each sequence file. For the CLA, GS and LS sequence sets, the first sequence was aligned with 10,000 randomizations of the second, and then vice versa. For the ACL and SMP sequence sets, two sequences were generated at a time and globally aligned, and this process was repeated 10,000 times.
All optimal scores were fitted with the three-parameter generalized gamma distribution, the normal distribution and the Gumbel distribution. The results show that the gamma distribution fitted the scores best in all cases and the Gumbel distribution fitted worst. The normal distribution agreed well with the score frequencies when the shape parameter of the fitted gamma distribution was very large, since in that regime the normal distribution approximates the gamma distribution. We also found that the shape parameter of the gamma distribution increased and the scale parameter decreased gradually as the sequence length increased, and that the shape parameter decreased as the window size increased. However, the choice among the four substitution scoring matrices had only a minor effect on the distribution of global alignment scores.
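One way to see why the three candidate distributions behave differently is to fit each by the method of moments and compare the skewness each can represent. The sketch below is an illustration under stated assumptions, not the fitting procedure used in the thesis: the function names are hypothetical, the gamma is the ordinary shape-location-scale form rather than a generalized gamma, and moment matching stands in for whatever estimator was actually used.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_normal(xs):
    """Return (mean, standard deviation) using population moments."""
    return statistics.fmean(xs), statistics.pstdev(xs)


def fit_gumbel(xs):
    """Moment estimates (location, scale) for the maximum-type Gumbel."""
    mean, sd = fit_normal(xs)
    scale = sd * math.sqrt(6) / math.pi
    return mean - EULER_GAMMA * scale, scale


def fit_gamma3(xs):
    """Moment estimates (shape, location, scale) for a shifted gamma.

    Matches mean, variance and skewness; only defined for samples with
    positive skewness.
    """
    mean, sd = fit_normal(xs)
    if sd == 0:
        raise ValueError("scores are constant")
    skew = sum((x - mean) ** 3 for x in xs) / (len(xs) * sd ** 3)
    if skew <= 0:
        raise ValueError("moment fit needs positive skewness")
    shape = 4 / skew ** 2      # from skewness = 2 / sqrt(shape)
    scale = sd * skew / 2      # from variance = shape * scale**2
    return shape, mean - shape * scale, scale


def gamma_skewness(shape):
    """Skewness of a gamma distribution; tends to 0 as shape grows."""
    return 2 / math.sqrt(shape)
```

A normal distribution has zero skewness and a Gumbel distribution has a fixed skewness of about 1.14, whereas the gamma skewness `2 / sqrt(shape)` adapts to the data and tends to zero as the shape grows. That is consistent with the normal fit agreeing well when the fitted gamma shape is very large.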
Keywords/Search Tags: global alignment, random sequence, gamma distribution, normal distribution