
A Novel Measure For Sequence Comparison On The Basis Of K-word Position

Posted on: 2016-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: J Tang
GTID: 2308330461966649
Subject: Applied Mathematics
Abstract/Summary:
With the continuous development of technology, bioinformatics has emerged as a new discipline that combines the life sciences and computer science. Bioinformatics research includes genome annotation, the collection and management of biomolecular data, molecular evolution, database searching and sequence alignment, protein structure prediction, and the analysis and processing of gene expression data. The main topic of this thesis is the comparison of biological sequences.
Sequence comparison methods fall into two groups: alignment methods and non-alignment methods. Alignment methods came first. When the sequences are short and very similar, alignment gives good results; when the sequences are dissimilar or differ greatly in length, the alignment is unreliable and the results are poor. Alignment is also computationally complex and time-consuming, and in the era of big data, large-scale sequence data are difficult to handle with these traditional methods. Non-alignment methods were developed in response. They draw heavily on other disciplines, such as statistics and machine learning.
The era of big data means facing large volumes of data, and deciding how to process them and how to extract the important information from them is crucial. Machine learning can help with these tasks. Machine learning designs algorithms that let a computer "learn" automatically: the algorithms obtain useful knowledge from known data and then use that knowledge to make predictions about unknown data.
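As a rough illustration of what a non-alignment comparison looks like, the sketch below counts overlapping k-words (substrings of length k) in two DNA sequences and compares the resulting frequency vectors with a Euclidean distance. This is a generic k-word frequency approach, not the position-based measure proposed in the thesis; the function names, the `ACGT` alphabet, and the choice of Euclidean distance are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kword_counts(seq, k):
    """Count every overlapping k-word (substring of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_frequency_vector(seq, k, alphabet="ACGT"):
    """Return relative frequencies of all len(alphabet)**k words, in a fixed order.

    Words containing symbols outside the alphabet (for example N) are ignored.
    """
    counts = kword_counts(seq, k)
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kword_distance(seq_a, seq_b, k):
    """Alignment-free distance: compare the k-word frequency vectors."""
    return euclidean_distance(
        kword_frequency_vector(seq_a, k),
        kword_frequency_vector(seq_b, k),
    )
```

No alignment is computed at any point, so the cost grows linearly with sequence length for a fixed k, and sequences of very different lengths can still be compared. Two identical sequences give a distance of 0.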
Machine learning includes many classic algorithms, such as classification, regression, clustering, and association analysis algorithms, which are very useful for processing and analyzing data. The purpose of this thesis is to extract more phylogenetic information from DNA sequences by proposing a new non-alignment method based on position, using machine learning methods to process the data.
The methods used are: first, cross-validation to select the best model, followed by calculation of the area under the ROC curve (AUC) to evaluate our approach; second, cluster analysis to construct phylogenetic trees; third, a multi-class SVM to classify DNA sequences; fourth, plotting the D(k) feature curves.
The results obtained are: first, for k from 2 to 5, the AUC of our method is always greater than that of the method it is compared against; second, on the first data set the phylogenetic tree is very stable for k = 6, 7 and consistent with the authoritative results, and on the second data set the phylogenetic tree is stable for k = 5, 6, 7, 8, 9 and consistent with the authoritative results; third, the multi-class SVM classifies well, with every AUC larger than 0.95; fourth, the D(k) feature curves show directly that the original data divide into four categories.
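The AUC used for evaluation above can be computed directly from its probabilistic meaning: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a minimal pure-Python version under that definition; the function name `auc` and the 0/1 label encoding are assumptions for the example, and the thesis does not say which implementation it used.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0 (negative) or 1 (positive).
    scores: iterable of numbers, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every positive example is scored above every negative one, and 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of examples, which is fine for small data sets; a rank-based computation would be the usual choice for large ones.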
Keywords/Search Tags: DNA sequence, machine learning, phylogenetic tree, sequence alignment