
Research On Some Problems Of Data Analysis In Bioinformatics

Posted on: 2018-03-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y T Fan
Full Text: PDF
GTID: 1310330515494270
Subject: Computational Mathematics
Abstract/Summary:
With the completion of the Human Genome Project and the Human Microbiome Project, bioinformatics, which combines biology, computer science, mathematics, and other fields, has become an active research area. Because high-throughput sequencing techniques have remarkably reduced the cost and time of deciphering a genome, the amount of bioinformatics data is growing exponentially. These data help researchers unveil the mysteries of life, but analyzing such massive amounts of data efficiently is a significant challenge. Specifically, this thesis focuses on three aspects: pattern recognition in large bioinformatics data sets; analysis and prediction based on these data; and visualization of the results.

(1) DNA sequence analysis plays a very important role in the study of gene regulation. Motif discovery is a complicated but critical problem in DNA sequence analysis. It has recently attracted much attention, and various algorithms have been proposed. However, most existing methods require the motif length as an input parameter, and this parameter is unknown in practice. To address this issue, this thesis proposes a novel method, called AMDILM, which automatically identifies the optimal motif length. AMDILM iteratively increases the motif length and then detects the optimal motif of optimal length. Specifically, AMDILM adapts a Genetic Algorithm (GA) that uses three operators: Mutation, Addition and Deletion. We also propose a criterion to check the accuracy of each motif. We compare AMDILM with Gibbs Sampling, MEME and Weeder on both synthetic and biological data, and the results show that AMDILM can accurately predict the motif length and discover the optimal motif.

(2) Proteins are vital to an organism's survival and development. A better understanding of the biological functions and action mechanisms of essential proteins is of great importance in diagnosis and pharmaceutical development. Since traditional experimental methods are usually expensive and laborious, computational methods have begun to attract attention. Computational methods aim to use the biological and topological properties of proteins to identify essential proteins. Recently, subcellular localization information has been found to significantly improve the accuracy of essential protein identification. This thesis proposes a novel algorithm, namely SCP, which uses subcellular localization information to predict essential proteins. SCP adapts a modified PageRank to evaluate the significance of each protein, and uses gene expression profiles to calculate the Pearson correlation among proteins within their interaction network. Each protein's significance score is calculated from a combination of its PageRank score and its Pearson correlation score. We conducted several experiments on Saccharomyces cerevisiae datasets to compare SCP with five other popular methods. The results show that SCP outperforms the other five methods in terms of accuracy.

(3) Visualization is an important research method in microbiome studies, since it gives researchers a better understanding of the composition and evolution of biological communities. Because humans have limited ability to understand high-dimensional data, dimension reduction is an indispensable procedure in biological visualization. Multidimensional Scaling (MDS) is a popular dimension reduction method. The basic idea of MDS is to place objects from a higher-dimensional space into a lower-dimensional space while preserving the between-object distances as well as possible. In microbiome analysis, the Unique Fraction Metric (UniFrac) is a method that calculates distances from a phylogenetic tree constructed from microbiome data, and it can provide the distance information for MDS. However, MDS based on UniFrac distances is computationally expensive. This thesis proposes a Laplacian matrix based algorithm, named DRLM, to handle the dimension reduction problem for microbiome data. To evaluate DRLM, we conducted experiments on both synthetic and biological data, and the results show that DRLM not only preserves between-object distances but also improves computational efficiency.
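
As an illustration of the basic MDS idea mentioned in part (3), the following is a minimal sketch of classical (Torgerson) MDS in plain Python. It is not the DRLM algorithm proposed in the thesis, and it does not compute UniFrac distances; it assumes a precomputed, symmetric distance matrix that is roughly Euclidean. The function name `classical_mds` and its parameters are hypothetical, chosen only for this example.

```python
import math


def classical_mds(dist, dims=2, iters=200):
    """Embed n objects in `dims` dimensions from an n x n distance matrix.

    Illustrative classical MDS (not DRLM). Assumes `dist` is symmetric with
    a zero diagonal and roughly Euclidean, so leading eigenvalues are positive.
    """
    n = len(dist)
    # Squared distances, then double centering: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]  # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the (deflated) B.
        v = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))  # clip negative eigenvalues to zero
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate B so the next pass converges to the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Usage: three objects whose pairwise distances form a 3-4-5 triangle.
triangle = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = classical_mds(triangle, dims=2)
```

For this small input the embedded points should reproduce the three side lengths up to rounding. The eigendecomposition of the full n x n matrix is the step that becomes costly as the number of samples grows, which is the kind of expense the abstract refers to; power iteration is used here only to keep the sketch free of external libraries.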
Keywords/Search Tags: Motif discovery, Essential proteins, Microbiome, Visualization, Subcellular localization information, UniFrac distance