Function Annotation Of Long Non-coding RNAs Based On Multi-omics Data

Posted on: 2018-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: J C Li
Full Text: PDF
GTID: 2310330536986052
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: With the development of sequencing technology, more and more long non-coding RNAs (lncRNAs) have been identified in mammals, but the functions of most lncRNAs remain unknown. Given the important regulatory roles of lncRNAs in many biological processes, functional prediction of lncRNAs has become a hot topic for biologists and bioinformaticians. Computational prediction is one of the major methods for the functional annotation of lncRNAs. At present, only a limited range of high-throughput data is used for this purpose, of which the co-expression network constructed from expression profiles is the most widely used. This study predicts the functions of lncRNAs from multi-omics data, including epigenetic modification data and transcription factor data, and explores the feasibility and predictive performance of the different data sources.

Methods: This study first constructed a co-expression network and collected epigenetic modification and transcription factor data. Using the support vector machine (SVM) algorithm, which is grounded in statistical learning theory, together with resampling and ensemble methods, we constructed training data sets from the multiple data sources, extracted and selected features, trained and evaluated the models, and predicted the functions of lncRNAs. The predictions were then integrated to obtain the final functions of the lncRNAs. The SVM was implemented with the LIBSVM software package; data preprocessing and other related steps were implemented in Perl and R.

Results: The average AUCs of the three SVM models based on the co-expression network, epigenetic modification data, and transcription factor data were 0.662, 0.638, and 0.609, respectively. The co-expression network model performed best, while the epigenetic and transcription factor data yielded GO term annotations for more lncRNAs: 32, 1,441, and 6,637 lncRNAs obtained GO term annotations from the three data sources, respectively. After integrating the results, 7,036 lncRNAs obtained predicted functions, with about 203 GO annotations predicted per non-coding gene.

Conclusion: In theory, combining ensemble learning with under-sampling avoids the class imbalance problem, improves the performance of the prediction model, reduces information loss, and lowers computing time. Different data sources provide different information for functional prediction. Because biological systems are complex and many biological mechanisms are involved in gene function, no individual data source can provide complete information for function prediction. Integrating multiple data sources can effectively address this problem, and machine learning methods provide an effective tool for gene function annotation. Beyond epigenetic and transcription factor data, more data sources may be integrated into lncRNA functional prediction in the future.
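The under-sampling ensemble and AUC evaluation described in the Methods and Results can be illustrated with a minimal sketch. It is not the thesis implementation: the original work used LIBSVM with Perl and R, whereas this sketch uses standard-library Python and a nearest-centroid scorer as a stand-in for the SVM. All function names, the toy data, and the number of ensemble members are assumptions made for illustration.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_subsets(X, y, n_models, rng):
    """Yield n_models training sets: all positives plus an equal-sized random
    draw of negatives. Assumes positives are the minority class."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    for _ in range(n_models):
        yield pos, rng.sample(neg, len(pos))

def fit_centroid_scorer(pos, neg):
    """Hypothetical stand-in base learner (NOT an SVM): the score is higher
    when a point lies closer to the positive centroid than the negative one."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: sqdist(x, cn) - sqdist(x, cp)

def train_ensemble(X, y, n_models=5, seed=0):
    """Train one base learner per balanced (under-sampled) subset."""
    rng = random.Random(seed)
    return [fit_centroid_scorer(p, n)
            for p, n in balanced_subsets(X, y, n_models, rng)]

def ensemble_score(models, x):
    """Average the decision scores of all ensemble members."""
    return sum(m(x) for m in models) / len(models)

if __name__ == "__main__":
    # Toy imbalanced data: 10 positive genes, 100 negative genes, 2 features.
    rng = random.Random(42)
    X = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]
    X += [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
    y = [1] * 10 + [0] * 100
    models = train_ensemble(X, y)
    print("AUC:", auc([ensemble_score(models, x) for x in X], y))
```

Each ensemble member sees every positive example but a different random subset of negatives, so the class ratio is balanced for training while most negatives are still used across the ensemble. The demo scores the training set itself for brevity; a real evaluation would score held-out genes.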
Keywords/Search Tags: Long non-coding RNA, Gene function prediction, Statistical learning, Support vector machine