
Guaranteed Similarity Metric Learning Framework For Biological Sequence Comparison

Posted on: 2016-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: K R Hua
Full Text: PDF
GTID: 2180330461966648
Subject: Applied Mathematics
Abstract/Summary:
Biological sequence analysis is an important component of bioinformatics, and biological sequence comparison is a key technique for analyzing and predicting sequence structure, function, and genetic information. Sequence analysis techniques fall mainly into alignment methods and alignment-free methods. Alignment methods generally outperform alignment-free methods in the quality of their results, but their high computational complexity has drawn criticism for inefficiency. Alignment-free methods generally apply statistical analysis to biological sequence data; the well-known k-word methods are an example. An alignment-free procedure generally has two steps: first, construct a numerical characteristic vector for each sequence; second, choose a similarity metric (distance metric) to measure how similar the sequences are.

To improve comparison performance, traditional alignment-free approaches mostly concentrate on improving the numerical characteristic vector of the sequences and pay insufficient attention to the similarity metric. They mostly use traditional measures such as Euclidean distance, Mahalanobis distance, Shannon entropy, relative entropy, information entropy, and K-L divergence. These measures can certainly serve as similarity metrics, and they sometimes perform well. However, they cannot mine information from the data, and they cannot be "made-to-measure" for each distinct data set. With the development of machine learning, such "made-to-measure" metrics have become possible.

This study consists of two parts. First, based on nucleotide triplet codons and their relationship to amino acids, a 3-dimensional graphical representation of protein sequences is outlined. A numerical characterization is then proposed that captures the location, number, and distribution of all 20 amino acids. A similarity/dissimilarity analysis of the ND5 protein sequences of nine species is carried out, and our approach is compared with other recently proposed approaches by the correlation coefficient between each approach's results and the results calculated by ClustalW. Our approach correlates better with ClustalW for all nine species than the other approaches, which suggests better performance.

Second, to overcome the drawbacks of traditional similarity metrics, we learn a similarity metric from each specific data set; machine learning makes it feasible to learn a similarity metric from biological data. Applying the “goodness” similarity theory to Mahalanobis metric learning, we propose a novel framework, guaranteed similarity metric learning (GMSL), which performs alignment of biological sequences in any feature vector space built from numerical characteristic vectors. Experiments on representative data sets and against representative algorithms show that our approach outperforms state-of-the-art numerical characteristic representations of biological sequences and state-of-the-art similarity metric learning algorithms in both accuracy and stability.

We draw the following conclusions:
1. Our numerical features are simple but efficient.
2. Our numerical characterization is more discriminative for protein sequences than the k-word numerical characterization.
3. GMSL can improve both the accuracy and the stability of bio-sequence alignment.
4. Even when given only a very rough numerical characterization, GMSL can still reach the desired result.
5. In cases where other similar algorithms fail, GMSL can still produce comparatively good results.
6. Thanks to its established mathematical foundation, GMSL outperforms other algorithms and guarantees minimal error.
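The two-step alignment-free procedure described in the abstract (build a numerical characteristic vector, then apply a distance metric) can be illustrated with a minimal Python sketch. It is not the thesis's 3-dimensional representation or the GMSL algorithm. It uses plain k-word frequencies as the feature vector, and the function names (`kword_vector`, `euclidean`, `mahalanobis_form`), the DNA alphabet, and the toy sequences are all assumptions made for illustration. The matrix `M` stands in for whatever a metric learning method would learn; with the identity matrix the Mahalanobis-form distance reduces to the Euclidean distance.

```python
from collections import Counter
from itertools import product
import math


def kword_vector(seq, k=2, alphabet="ACGT"):
    """Step 1: relative frequency of every k-word over the alphabet, in a fixed order."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)  # number of overlapping k-words in seq
    return [counts[w] / total for w in words]


def euclidean(x, y):
    """Step 2 (fixed metric): ordinary Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis_form(x, y, M):
    """Step 2 (parameterized metric): sqrt((x - y)^T M (x - y)).

    M is assumed to be a symmetric positive semidefinite matrix given as
    nested lists. In metric learning, M is the quantity fitted to the data;
    here it is simply supplied by the caller.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(max(q, 0.0))  # clamp tiny negative rounding error


# Toy sequences (hypothetical, not from the thesis data).
x = kword_vector("ACGTACGT")
y = kword_vector("AAAACCCC")
identity = [[1.0 if i == j else 0.0 for j in range(len(x))] for i in range(len(x))]

d_euclid = euclidean(x, y)
d_identity = mahalanobis_form(x, y, identity)  # same value as d_euclid
```

Replacing `identity` with a matrix fitted to labeled sequence pairs is where a learned, data-set-specific metric would enter; how GMSL chooses that matrix is not specified in this abstract.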
Keywords/Search Tags:biological sequence, similarity analysis, machine learning, similarity learning