| With the development of bioinformatics technology, the amount of data used to study biological information has increased sharply. In particular, since second-generation sequencing technologies came into use, the volume of DNA and RNA sequence data has exploded, and analyzing this volume of biological sequence data is challenging. Sequence analysis is the most basic operation in bioinformatics and provides important support for other biological work. Traditional alignment-based sequence algorithms generally have high time and space complexity. How to reduce time and space requirements without degrading accuracy is therefore an important research topic. To address these issues, this thesis proposes two novel algorithms, both of which reduce the time requirement and improve accuracy. Specifically, the thesis makes two main contributions: 1. A weight-based Kendall (WK2) algorithm is proposed to address the high time complexity of existing sequence algorithms. First, a suffix tree is used to extract the characteristics of each sequence. Second, the weight-based Kendall correlation proposed in this thesis is used to calculate the similarity among sequences. WK2 avoids the pitfalls of traditional dynamic programming: its time complexity is O(n log n), where n is the size of the data set, a substantial improvement over the O(n^2) time complexity of existing work. Experimental results show that WK2 outperforms state-of-the-art algorithms. WK2 is also compatible with data sets of different structures. 2. A locality-sensitive hashing (LSH) clustering algorithm based on information entropy is proposed. LSH addresses three shortcomings of existing algorithms: low time efficiency, low accuracy, and clustering results whose biological meaning is difficult to interpret. LSH uses the p-stable-distribution locality-sensitive hashing method to reduce time complexity by finding similar sequences. Position
information entropy is used as the feature vector of the hash function to increase accuracy. The edit distance is applied as the distance measure for evaluating clustering results because it has good interpretability in biology. Standard entropy based on position information serves as the feature vector of the locality-sensitive hash function used to cluster the biological sequences. The experimental results show that the execution time of the LSH algorithm grows linearly with the data set size, and that the algorithm gives competitive results on data sets of different magnitudes. Its effectiveness is verified by experiments on both simulated and real data. Sequence analysis is the basis of bioinformatics. This thesis provides two novel alignment-free algorithms for sequence analysis. Both reduce time and space complexity without degrading performance. They are practical for many biological tasks, such as screening sequences and finding target sequences in huge volumes of data. Besides DNA, RNA, and protein sequences, the two algorithms can also be applied to other sequence-based data, e.g., streaming data from the Internet, and may therefore have considerable potential in the current big-data era. |
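
As an illustration of the p-stable hashing step mentioned in the abstract, the following is a minimal sketch and not the thesis's implementation. It assumes each sequence has already been turned into a fixed-length numeric feature vector (the abstract names position information entropy as the feature, but does not give its definition, so the vectors here are placeholders). The names `make_pstable_hash` and `lsh_buckets`, the bucket width, and the number of hash functions are all hypothetical choices. The Gaussian distribution is used because it is 2-stable, which is the common choice for Euclidean distance.

```python
import math
import random


def make_pstable_hash(dim, width, seed=0):
    """Build one hash h(v) = floor((a . v + b) / width).

    The entries of `a` are drawn from N(0, 1), a 2-stable distribution;
    `b` is a random offset in [0, width). All parameters are illustrative.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(vector):
        projection = sum(ai * vi for ai, vi in zip(a, vector))
        return math.floor((projection + b) / width)

    return h


def lsh_buckets(vectors, num_hashes=4, width=4.0, seed=0):
    """Group vector indices by the tuple of their hash values.

    Vectors that are close in Euclidean distance are likely (not guaranteed)
    to share a bucket; identical vectors always do.
    """
    dim = len(vectors[0])
    hashes = [make_pstable_hash(dim, width, seed + i) for i in range(num_hashes)]
    buckets = {}
    for index, vector in enumerate(vectors):
        key = tuple(h(vector) for h in hashes)
        buckets.setdefault(key, []).append(index)
    return buckets


if __name__ == "__main__":
    # Placeholder feature vectors; real ones would come from the sequences.
    features = [[1.0, 2.0, 0.5], [1.0, 2.0, 0.5], [40.0, -30.0, 12.0]]
    print(lsh_buckets(features))
```

Because each sequence is hashed once and only sequences sharing a bucket need to be compared, the bucketing pass itself is linear in the number of sequences, which is consistent with the linear running time reported above.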
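
The edit distance used to evaluate clustering results can be sketched as follows. This is the standard Levenshtein distance (unit cost for insertion, deletion, and substitution); the abstract does not state which cost model the thesis uses, so that is an assumption. The helper `mean_intra_cluster_distance` is a hypothetical example of how such a distance could score one cluster and is not taken from the thesis.

```python
from itertools import combinations


def edit_distance(s, t):
    """Levenshtein distance between two strings, using two DP rows."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string along the row to save memory
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def mean_intra_cluster_distance(cluster):
    """Average pairwise edit distance within one cluster (hypothetical score)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
    print(mean_intra_cluster_distance(["ACGT", "ACGA", "ACGT"]))
```

A lower average within a cluster means its members differ by fewer insertions, deletions, and substitutions, which maps directly onto mutation events and is what makes the measure easy to interpret biologically.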