
Exploration And Application Of Visualization And Feature Numeralization For Multiple Sequence Alignments

Posted on: 2023-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhou
GTID: 2530306902992459
Subject: Bioinformatics
Abstract/Summary:
1. Visualization of sequence alignments

Identifying the conserved and variable regions in a multiple sequence alignment (MSA) is critical to understanding gene function. As the sequence-structure-function relationship gains increasing attention in molecular biology, a simple display of nucleotide or protein sequence alignments is no longer sufficient. Although existing visualization tools provide diverse functions for mining various types of information from MSAs, a number of issues remain unresolved. First, it is difficult to capture the molecular characteristics hidden in an MSA by simply displaying nucleotide or protein sequence alignments at the site level; other information, such as residue dominance and residue dependencies, is also helpful when presenting MSA data. Second, because recombination events alter sequence fragments, they require a more intuitive form of presentation. Third, it is still challenging to combine external data with an MSA in an efficient, accessible, and customizable way. Last, visualizing genome alignments involves presenting aligned fragments between species together with rearrangement information, but this is rarely covered by other tools.

To address these issues, we implemented ggmsa, an R package providing a comprehensive set of methods for analyzing and visualizing MSAs, individually or in groups. It includes functions for sequence logos, sequence bundles, stacked sequence alignment visualization, and nucleotide comparative plots. These methods help identify conserved or variable trends and residue-residue dependencies in MSAs, and can be used to look for clues of recombination events. In addition, to explore the correlation between sequences and the corresponding individual phenotypes or other data, ggmsa supports integrated visualization of MSAs, phylogenetic trees, and associated data (e.g., ancestral sequences, expression levels, genome locus structure, molecular functions). This integration relies on the in-house developed packages ggtree and ggtreeExtra and helps reveal underlying evolutionary features. We also designed a new visualization method for genome alignments in Multiple Alignment Format (MAF) to explore patterns of variation within and between species.

2. Representing biological sequences as numerical values

Numerical sequence features are numerical vectors that computers can process directly. They are often used in the prediction and classification of biomolecules. In addition to sequence visualization, this study explores new applications of numerical sequence representations. We designed an R package, UltraPseR, which is a wrapper of UltraPse and contains multiple sequence coding schemes. It transforms the composition and order of nucleotide or protein sequences into fixed-length numerical vectors that can be fed directly into machine learning algorithms. UltraPseR therefore allows users to quickly convert biological sequences into numerical vectors, which can be combined with machine learning algorithms to carry out prediction and classification tasks on biological sequences efficiently. In this study, UltraPseR was applied to Human Leukocyte Antigen (HLA) gene sequences, and a support vector machine was trained on the numerical HLA representations to explore the feasibility of the numerical sequence method for HLA genotyping.
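To make the idea of a fixed-length numerical vector concrete, the following Python sketch encodes a nucleotide sequence as normalized k-mer frequencies. This is only an illustration of a composition-based encoding in general: it is not one of the coding schemes shipped with UltraPse or UltraPseR, and the function name `kmer_composition` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k=2, alphabet="ACGT"):
    """Hypothetical helper: normalized k-mer frequencies as a fixed-length list.

    The vector always has len(alphabet) ** k entries, one per possible k-mer
    in lexicographic order of the alphabet, regardless of sequence length.
    """
    sequence = sequence.upper()
    # Enumerate every possible k-mer so the output length is fixed.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k along the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only windows made entirely of alphabet symbols contribute to the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


vector = kmer_composition("ACGTACGT", k=2)
print(len(vector))  # 4 ** 2 = 16 entries for dinucleotides
```

Because every sequence maps to a vector of the same length, vectors from many sequences can be stacked into a feature matrix and passed to a classifier such as a support vector machine.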
Keywords/Search Tags: Multiple sequence alignment, Visualization, Represent sequences, Phylogeny, Machine learning