| With the rapid advance of high-throughput sequencing technology, the explosive growth of omics big data has greatly contributed to our understanding of cancer at the molecular level. Massive amounts of omics data present new challenges in data processing and analysis. Omics big data is complex, multi-layered, and complementary, so a key goal of analyzing these data is to build effective models that predict phenotypic characteristics and elucidate the biological significance of important biomarkers. Such data consist of high-dimensional, heterogeneous measurements from multiple omics sources with inherently high noise, and many features are redundant or irrelevant to the disease phenotype; as a result, many traditional data analysis methods cannot be applied to them directly. Dimensionality reduction is an effective way to overcome the curse of dimensionality in high-dimensional omics data analysis, and it can significantly reduce computational and memory requirements. It also tends to reduce the risk of model overfitting. In the post-genomic era, machine learning methods have been widely applied for predictive modeling and data mining in bioinformatics. Feature selection is a form of dimensionality reduction in which a subset of relevant features is selected for model building. It has proven efficient and effective for high-dimensional data and has been widely applied to biomarker discovery in bioinformatics. One major merit of feature selection is that it retains the physical meaning of the original features and yields more readable and interpretable models. Because the inputs of traditional machine learning methods are usually vectors, transforming a biological sequence into a vector of numeric features is one of the most challenging issues in computational biology. Using machine learning methods to identify disease-causing
rare variants can lead to a better understanding of the molecular mechanisms underlying a complex trait. Although high-throughput sequencing technology can accurately and comprehensively measure biomolecular properties at different molecular levels, each type of omics data is limited to the biological function of its corresponding molecular level in the biological system. Most existing feature selection algorithms were designed for single-omics data. Because different omics data come from different sources, they follow diverse data distributions, so directly merging multi-omics data further increases the dimensionality and reduces the signal-to-noise ratio. Effective integrative analysis of multi-omics data can therefore make better use of massive amounts of omics data to enable data-driven biomedical research, deepen our understanding of the mechanisms of disease development, and advance new strategies for the early diagnosis, prevention, and adjuvant treatment of diseases. In summary, the main contents of this research are as follows: 1. For the classification of breast cancer histologic grade, a four-step biomarker detection algorithm, named BioDog, was proposed to find methylomic features with good discriminative power for the histologic grade of breast cancer. A correlation bias reduction strategy was applied to handle correlations between features. Histologic type and histologic grade annotations of breast tumors in TCGA, together with gene mutation analysis, were used to investigate how histologic type and grade intersect and how somatic mutations differ across histologic grades. A performance comparison experiment showed that BioDog outperformed 17 existing biomarker recognition algorithms. 2. For the classification of breast cancer intrinsic subtypes, an efficient logistic regression-based multi-omics integrated analysis
algorithm, named ELMO, was proposed to integrate RNA-Seq and DNA methylation data through a four-step feature selection process to detect breast cancer intrinsic subtypes. The experimental results suggested that multi-omics models outperformed single-omics ones, and that pre-processing each omics dataset separately before integrative analysis can improve model performance. A performance comparison experiment showed that ELMO outperformed 19 existing biomarker recognition algorithms. The 42 detected biomarkers showed functional associations with different breast cancer subtypes and good prognosis prediction performance. 3. For rare variant association analysis, this paper presents ZfaParallel, an improved version of the Zoom-Focus Algorithm (ZFA) that supports shared-memory parallel computing to improve the efficiency of the existing ZFA. We also developed a sequential backward search method to further identify rare variants associated with phenotypes, and an updated focus method that allows searching adjacent regions. Experiments demonstrate that the parallelized ZFA method achieves significantly improved computational efficiency. 4. For the prediction of DNase I hypersensitive sites (DHSs), we provide a comprehensive survey of 10 state-of-the-art computational methods for DHS identification in the human genome. These computational methods were discussed and compared in terms of feature extraction, feature selection, classification algorithms, predictive performance evaluation, and practical utility. Subsequently, we developed a novel predictor named SeqRefine to identify DHSs in the human genome based on k-mer features extracted from DNA sequences. Experiments on the benchmark dataset demonstrated that SeqRefine performed significantly better than published methods for identifying DHSs. A user-friendly local software tool has also been developed for DHS prediction. This paper focuses on the
application of machine learning methods for predictive modeling and data mining on omics big data. From the perspective of computational methodology, it explores biomarker identification, integrated analysis of multi-omics data, and the construction of predictive models of phenotypic traits. These methods have substantial practical value because they select relevant features from high-dimensional omics data by eliminating irrelevant and redundant ones and construct effective prediction models of phenotypic traits. |
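
As a rough illustration of the k-mer representation mentioned above for DNA sequences, the following minimal Python sketch maps a sequence to a fixed-length vector of normalized k-mer frequencies. It is not the SeqRefine implementation; the function name `kmer_features`, the default `k=2`, and the choice to skip windows containing ambiguous bases are assumptions made purely for illustration.

```python
from itertools import product


def kmer_features(sequence, k=2, alphabet="ACGT"):
    """Map a DNA sequence to normalized k-mer frequencies.

    Hypothetical helper for illustration only; the feature order is the
    lexicographic order of all len(alphabet)**k possible k-mers.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    # Slide a window of length k over the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # assumption: skip windows with bases such as N
            counts[window] += 1

    total = sum(counts.values())
    # Return all zeros if no valid k-mer was found (e.g. sequence shorter than k).
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    vector = kmer_features("ACGTACGTGG", k=2)
    print(len(vector), vector)
```

A vector of this kind has a fixed length regardless of the sequence length (16 entries for `k=2`), which is what allows sequences to be passed to a conventional classifier and to a subsequent feature selection step.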