The Algorithm Research And Tool Development For MS-based Omics Data Analysis

Posted on: 2020-08-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Tang
GTID: 1360330599452674
Subject: Chemical Engineering and Technology
Abstract/Summary:
Mass spectrometry (MS) is a powerful tool that plays an important role in biomedical research. MS-based omics techniques (e.g., proteomics and metabolomics) have been widely applied to the discovery and development of novel drugs, to the early diagnosis, treatment, and prognosis of diseases, and to the study of the molecular mechanisms underlying disease. Raw MS data contain various types of unwanted variation (including signal drift and experimental and biological variation), which can be removed by data preprocessing (e.g., normalization). However, the various preprocessing methods differ in scope and applicability, and the choice among them influences the downstream statistical analysis. Moreover, MS data are typically high-dimensional with a small number of samples, which often leads to overfitting and critically affects the reliability of the results. It is therefore critical to select accurate and appropriate algorithms for processing and analyzing complex, high-throughput proteomics and metabolomics data. Several key issues remain in omics analysis: (1) insufficient understanding of how to preprocess data in a scientifically sound way, together with non-uniform criteria for evaluating processing methods; (2) underestimation of the importance of feature selection methods in selecting features that are truly biologically relevant; and (3) the irreproducibility of biomarkers identified using existing bioinformatics algorithms.

To address these issues, this study systematically investigated the preprocessing and downstream analysis of MS data, developed online tools that evaluate preprocessing methods using multiple criteria, and developed a novel and stable biomarker identification algorithm.

First, this study integrated methods based on internal standards (IS) and quality control metabolites (QCM), proposed a multifaceted strategy for assessing data preprocessing methods, and developed NOREVA, the first online analysis platform that provides both data preprocessing and its evaluation for MS-based large-scale untargeted metabolomics data. NOREVA not only provides QCM- and IS-based methods but also uniquely supports data normalization after signal drift correction. It is freely accessible at http://server.idrb.cqu.edu.cn/noreva/ and http://idrblab.cn/noreva.

Second, the steps of the processing workflow, including protein quantification metrics, quantification tools, and data preprocessing steps (transformation, standardization, and missing value imputation), were comprehensively compared. The study further proposed a comprehensive score-based ranking strategy to find the best-performing combination of processing methods and built ANPELA (http://idrblab.org/anpela/), an interactive online tool for comprehensive assessment. Compared with existing online tools, ANPELA can automatically detect a variety of data formats generated by popular quantification tools and provides the most complete set of data preprocessing methods. Its unique ability to comprehensively score and rank processing methods in order to identify the best one provides a useful reference and guidance for label-free quantification (LFQ).

Third, this study comprehensively compared 14 popular feature selection methods applied to LFQ studies and demonstrated that the methods differ significantly in predictive classification accuracy and differ greatly in the number of true positive proteins they select. Among these 14 methods, multivariate analysis methods (such as PLS-DA) outperformed the others both in screening true positive proteins and in classification ability. In general, the choice of feature selection method should take into account not only these two factors but also the purpose of the research.

Finally, a novel feature selection algorithm was proposed, developed, and applied to untargeted plasma metabolomic profiling of pituitary tumors to identify highly stable potential biomarkers. The algorithm combines repeated random sampling with a consistency score and evaluates the consistency of differential metabolic features across different data sets. Compared with traditional feature selection methods, this strategy showed superior stability and discriminating ability. Based on the untargeted metabolomics of pituitary adenomas, 45 highly robust differential metabolites associated with pituitary adenoma were identified. These candidate metabolites indicate dysregulated lipid metabolism in pituitary adenomas, and the study provides an important scientific basis for revealing the complex pathological mechanism of pituitary adenomas.

Overall, this study systematically analyzed the preprocessing methods for MS-based proteomics and metabolomics and developed two online platforms (ANPELA and NOREVA) that provide a valuable reference and guidance for preprocessing proteomics and metabolomics data. Moreover, the comprehensive comparison of feature selection methods, in terms of both predictive classification ability and the number of true positive proteins selected, provides a useful reference for choosing a suitable feature selection method to identify accurate and reliable biomarkers. Furthermore, the novel feature selection algorithm developed here provides a reliable algorithmic resource for identifying stable biomarkers.
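
The abstract describes the novel feature selection algorithm only at a high level: repeated random sampling combined with a consistency score. The following is a minimal sketch of that general idea, not the dissertation's actual implementation. The function names, the use of a Welch t statistic to rank features, the 70% subsampling fraction, the number of rounds, and the toy data are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Absolute Welch t statistic for two samples (assumed ranking criterion)."""
    denom = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0


def consistency_scores(case, control, n_rounds=100, frac=0.7, top_k=1, seed=0):
    """Fraction of random subsamples in which each feature ranks in the top_k.

    case, control: lists of samples; each sample is a list of feature values.
    All parameter defaults are hypothetical choices, not values from the source.
    """
    rng = random.Random(seed)
    n_features = len(case[0])
    hits = [0] * n_features
    for _ in range(n_rounds):
        # Draw a random subset of each group without replacement.
        sub_case = rng.sample(case, max(2, int(frac * len(case))))
        sub_ctrl = rng.sample(control, max(2, int(frac * len(control))))
        # Score every feature on this subsample.
        stats = [
            welch_t([s[j] for s in sub_case], [s[j] for s in sub_ctrl])
            for j in range(n_features)
        ]
        # Record which features reach the top_k in this round.
        ranked = sorted(range(n_features), key=lambda j: stats[j], reverse=True)
        for j in ranked[:top_k]:
            hits[j] += 1
    return [h / n_rounds for h in hits]


def make_toy_data(n=20, n_features=5, shift=3.0, seed=1):
    """Synthetic data in which only feature 0 differs between the groups."""
    rng = random.Random(seed)
    control = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    case = [
        [rng.gauss(shift if j == 0 else 0, 1) for j in range(n_features)]
        for _ in range(n)
    ]
    return case, control


case, control = make_toy_data()
scores = consistency_scores(case, control)
print(scores)
```

In this sketch, a feature that is selected in nearly every subsample receives a score close to 1, while features that reach the top only by chance receive scores close to 0. Keeping only high-scoring features is one way to favor stable candidates over ones that depend on a particular sample split.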
Keywords/Search Tags: Proteomics, Metabolomics, Data Processing, Feature Selection, Web-based Tool