
New Methods And Applications Research On Big Data Of Biological Sequence And Structure

Posted on: 2021-12-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Zhao
GTID: 1480306542496564
Subject: Mathematics
Abstract/Summary:
In recent years, as tools for obtaining biological sequences and structures have multiplied, classifying and analyzing biological big data has become one of the most important research areas in bioinformatics. Extracting information from different kinds of data by reasonable and effective methods, comparing the similarity or dissimilarity of sequences or structures, analyzing the evolutionary relationships among species, and determining the class of problematic data have become very important research directions. In addition, predicting the class of new data and inferring their biological functions or properties are also important research problems in taxonomy. Traditional taxonomic methods are usually too complex to handle the classification and phylogenetic analysis of a large number of sequences or of data with complicated structures. This motivates us to propose new methods for sequence and structure analysis.

In this dissertation, we introduce the improved natural vector method for sequence analysis. To further study the distribution of nucleotides and amino acids within a sequence, we first propose the correlation of nucleotides and amino acids and add this feature to the traditional natural vector. The natural vector is a rapid, alignment-free representation of sequences. Each sequence is converted into a vector by extracting information such as the number, average position, and central moments of each nucleotide or amino acid in the sequence. The correspondence between a sequence and its natural vector is one-to-one. The algorithm reflects sequence information accurately and effectively, and performs sequence comparison based on the distances between vectors with low computational complexity. It is a powerful tool for classification and phylogenetic analysis. We combine the bootstrapping method with the natural vector method to calculate confidence probabilities on phylogenetic trees. Using the new method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. In addition, the method is used to resolve some evolutionary relationships of Prochlorococcus, showing that the strains SS120 and MIT9211 do not form a monophyletic clade in the phylogeny of Prochlorococcus. Furthermore, classification on the fungi barcode dataset achieves high and robust accuracy. The reasonable phylogenetic trees of coronaviruses that we obtained further validate the new method.

To further explore how the natural vectors of sequences are distributed in high-dimensional space, we verify that the convex hull constructed from the natural vectors of sequences in one family does not intersect the convex hulls of natural vectors from other families, and we propose the convex hull principle. This principle indicates that sequences with similar distributions should be in the same family. We verify this principle computationally using all available and reliable sequences in protein kinase datasets and human proteins, as well as all DNA barcodes. It also provides a quick way to search for natural vector points that lie within the convex hull of a given class and to discover new sequences. This will open up a new interdisciplinary line of research in biology, mathematics, and computer science.

For structure comparison, we extend the Yau-Hausdorff method, which finds the best match of two protein structures accurately and quickly with low complexity. The method measures the similarity or dissimilarity of structures by reducing the dimension of the calculation without losing any information. It can also infer protein function from structural similarity. The new algorithm is compared with several traditional structure comparison methods to demonstrate the accuracy and stability of our approach.
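
As an illustration of the natural vector representation described in the abstract (number, average position, and central moment of each nucleotide), the following is a minimal Python sketch, not the dissertation's implementation. It assumes the commonly cited 12-dimensional form for DNA with second-order central moments normalized by n_k * n; the abstract does not state the normalization, so that choice is an assumption. The correlation feature of the improved method is not defined in the abstract and is therefore omitted. The function names `natural_vector` and `natural_vector_distance` are hypothetical.

```python
from math import dist


def natural_vector(seq, alphabet="ACGT"):
    """Return [counts..., mean positions..., 2nd central moments...] for seq.

    Positions are 1-based. The moment normalization (n_k * n) is an
    assumption; letters absent from the sequence contribute zeros.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        # 1-based positions at which this letter occurs
        positions = [i for i, ch in enumerate(seq, start=1) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def natural_vector_distance(seq_a, seq_b, alphabet="ACGT"):
    """Euclidean distance between the natural vectors of two sequences."""
    return dist(natural_vector(seq_a, alphabet), natural_vector(seq_b, alphabet))


if __name__ == "__main__":
    print(natural_vector("AACC"))
    print(natural_vector_distance("ACGTACGT", "ACGTTGCA"))
```

No alignment is needed: each sequence is mapped to a fixed-length vector in one pass per letter, and pairwise comparison reduces to a vector distance. A matrix of such distances could then feed a standard tree-building or nearest-neighbor classification step.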
Keywords/Search Tags:Sequence Analysis, Classification and Phylogenetic Analysis, Improved Natural Vector Method, Convex Hull Principle, Protein Structure Comparison