
Long Non-coding RNA Identification Utilizing Machine Learning Methods

Posted on: 2019-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Han
GTID: 2428330548461235
Subject: Engineering
Abstract/Summary:
Long non-coding RNAs (lncRNAs), transcripts longer than 200 nucleotides that are unable to encode proteins in the cell, have been at the forefront of research in recent years. Only a small fraction of lncRNAs have been analyzed so far, but they are already known to be involved in a wide range of biological processes. With the rapid development of next-generation sequencing technologies, thousands of transcriptomes have been discovered, providing ever more useful information on lncRNAs. Discovering new lncRNAs has become a fundamental step in lncRNA-related research. Several lncRNA identification tools have been developed, and almost all of them are machine learning-based, which reflects both the importance researchers attach to this problem and the enormous potential of machine learning in biology. This study focuses on machine learning-based lncRNA identification algorithms and can be summarized as follows.

The first part of the study analyzes classic tools, giving an overview of current research progress and presenting the essential information these methods rely on. Several widely used tools are comprehensively analyzed and evaluated, and the merits, drawbacks and application scope of each are summarized to help researchers select the most appropriate tool and obtain the best results under different circumstances.

In the second part of the study, a novel computational lncRNA identification method, Lncident (LncRNA identification), is proposed. Lncident is a support vector machine-based lncRNA identification tool that identifies lncRNAs by calculating k-adjoining bases frequencies on the open reading frame region. Compared with popular tools, Lncident achieves considerably higher accuracy and more robust performance across multiple species. Lncident is published as an R package as well as a web server to maximize its availability. The web server is available at http://csbl.bmb.uga.edu/mirrors/JLU/Lncident/. The R package of Lncident can also be downloaded from the same link.

Almost all lncRNA predictors identify lncRNAs with sequence-derived features only. However, discriminative power is limited when features are extracted from a single perspective. In the third part, a comprehensive feature exploration is conducted and a novel algorithm is proposed for lncRNA identification and analysis. The features of the new algorithm are drawn from three innovative heterologous feature groups, namely Logarithm-Distance of hexamer, multi-scale structural information and Fast Fourier Transformation-based physicochemical property. Nineteen critical heterologous features are obtained using 10-fold cross-validation and feature selection. Five widely used machine learning algorithms (logistic regression, support vector machine, random forest, extreme learning machine and deep learning) are also evaluated to determine the optimal classifier. Experimental results show that the algorithm outperforms several state-of-the-art tools on multiple species and gives the most robust results among them.

In the third part, an integrated package, LncFinder, is also developed. The majority of existing tools for lncRNA identification and analysis cannot be re-trained or tailored for specific species. In addition, the various features employed by different methods reveal sequence properties from different perspectives, but users cannot customize these features. With LncFinder, users can extract various classic features, build classifiers with numerous machine learning algorithms and evaluate feature performance effectively and efficiently. Users can also apply the algorithm designed in the third part to predict lncRNAs. Both an R package and a web server are developed for LncFinder. The web server of LncFinder is available at http://csbl.bmb.uga.edu/mirrors/JLU/lncfinder/LncFinder. The R package of LncFinder has been indexed by CRAN (Comprehensive R Archive Network), the most authoritative and comprehensive R package library. Users can install the R package of LncFinder by entering a single command in R: install.packages("LncFinder"). An appropriate version will then be installed automatically. Users can also download the R package from the canonical index https://cran.r-project.org/package=LncFinder.

In this study, an integrated lncRNA identification platform is developed. Instructive information on lncRNA identification tools and features is summarized and discussed, and two novel lncRNA identification algorithms are proposed. In the comprehensive exploration, various classic features are evaluated and new features are designed from the perspectives of sequence intrinsic information, multi-scale secondary structural information and Fast Fourier Transformation-based physicochemical property. These features are expected to play a positive role in other lncRNA-related research, such as interaction, annotation and evolution. It is anticipated that this study will greatly facilitate lncRNA-related research and the analysis of lncRNA properties.
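
The abstract describes Lncident as computing k-adjoining bases frequencies on the open reading frame region. The Python sketch below illustrates one plausible reading of that idea and is not the thesis's implementation: it assumes the ORF is the longest forward-strand stretch from ATG to an in-frame stop codon, and that the features are relative frequencies of overlapping k-mers inside that stretch. The function names `longest_orf` and `kmer_frequencies` and the toy sequence are hypothetical, introduced only for illustration.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG..stop ORF on the forward strand ('' if none).

    Assumption: only the three forward reading frames are scanned.
    """
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def kmer_frequencies(region: str, k: int = 3) -> dict:
    """Relative frequency of every overlapping k-mer over the ACGT alphabet."""
    counts = Counter(region[i:i + k] for i in range(len(region) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are excluded from the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Toy transcript (hypothetical); a feature vector like this could be fed
# to a classifier such as a support vector machine.
toy = "CCATGGCTGCTTAAGG"
orf = longest_orf(toy)
features = kmer_frequencies(orf, k=2)
```

Fixing the feature vector to all 4^k possible k-mers keeps its length constant across transcripts, which is what a downstream classifier generally needs.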
Keywords/Search Tags: long non-coding RNA, identification, machine learning, classifier, performance evaluation