| The development of high-throughput sequencing technology (RNA-seq) has driven the accumulation of diverse biomolecular data related to diseases and drugs. The occurrence and progression of complex diseases, such as cancer, involve different types of biomolecules such as genes, non-coding RNAs, and proteins. In drug discovery, biological data such as drug targets and drug side effects keep expanding as new drugs are launched, which has drawn growing attention to drug repurposing, that is, finding new uses for existing drugs. With the launch of different types of biological data platforms, researchers can obtain disease-related and drug-related information from these platforms to support accurate disease diagnosis and drug repurposing research. However, applying multi-source biological data brings new challenges to complex disease diagnosis and drug repurposing. Different types of biological data differ greatly in dimensionality and abundance, and information is lost during data collection. Existing methods for complex disease diagnosis and drug repurposing mostly rely on a single type of data, and research on processing and analyzing multi-source biological data is still lacking, even though introducing such data can help researchers better understand the pathogenesis of diseases and their potential treatments. This thesis therefore conducts in-depth research on multi-source biological data in disease diagnosis and drug repurposing. The main contents are as follows: (1) To address the differences in dimensionality and abundance among multi-source biological data and the class imbalance of cancer samples currently faced in cancer diagnosis, this thesis proposes a hybrid framework combining feature selection and ensemble learning (TSFS-TCEM) for cancer
diagnosis. First, the method integrates two types of omics data, transcriptome data and functional proteomics data, to construct a fusion omics dataset. Second, multiple base models are introduced in both the feature selection stage and the construction of the ensemble learning framework. The three-stage feature selection handles the abundance differences in the fusion omics data while rapidly reducing the dimensionality and preserving the diversity of the selected features. The twice-competitional ensemble learning addresses the overfitting and insufficient generalization that arise when a single model is used on imbalanced data. The experimental results show that TSFS-TCEM classifies imbalanced data more effectively than other methods. In addition, the genes and functional proteins related to breast cancer and lung adenocarcinoma selected by TSFS-TCEM were analyzed, and the results show that the proposed model can effectively identify the causal genes and functional proteins of complex diseases. (2) A single source of biological information can reduce the effectiveness of drug candidate screening for diseases because of noise or information scarcity. To address this problem, this thesis proposes a Bilinear Constraint Multi-Similarities Matrix Factorization (BCMSMF) method, which incorporates multi-source biological information as constraints through a fusion similarity matrix and a concatenated similarity matrix, respectively. First, the drug-disease association matrix is decomposed into a drug-related feature matrix and a disease-related feature matrix by matrix factorization. Second, drug-drug interactions, drug side effects, and other biomedical information are used to compute multiple similarity matrices for drugs and diseases, from which the fusion similarity matrices and the concatenated similarity matrices are obtained. The concatenated similarity matrix retains all the characteristics of the drug or disease, and the fusion similarity matrix can avoid the overfitting problem
caused by missing information in some data. Finally, the concatenated similarity matrices and the fusion similarity matrices of drugs and diseases are incorporated into the BCMSMF model, in which the concatenated similarity matrix and the fusion similarity matrix are optimized simultaneously. The experimental results show that BCMSMF can effectively integrate multi-source biological data for drug repurposing research. In addition, case studies of breast cancer, Parkinson's disease, and Alzheimer's disease show that BCMSMF can effectively predict potential drugs related to each disease. (3) The novel coronavirus disease (COVID-19) has caused great harm worldwide since it was first identified in 2019. To address the lack of effective anti-COVID-19 drugs, this thesis constructs a drug-virus database consisting of 34 human infectious viruses and 210 therapeutic drugs by mining confirmed drug-virus associations in public databases and published literature. Furthermore, this thesis develops a novel nuclear norm minimization method (DRMNN) for drug repositioning analysis on the drug-virus database. First, drug similarity and virus similarity are calculated from drug molecular information and virus gene sequence information. Second, the virus-drug associations, the drug similarity matrix, and the virus similarity matrix are introduced into the DRMNN method. The experimental results show that DRMNN outperforms other methods in both 5-fold cross-validation and local leave-one-out validation experiments. In addition, this thesis analyzes the top 10 potential drugs related to COVID-19 identified by DRMNN, of which 6 have been reported in the relevant literature to have inhibitory effects on COVID-19. Docking experiments involving the viral spike protein and the human ACE2 receptor also provide supporting evidence for the predicted potential effects of these drugs on COVID-19. |
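
As a rough illustration of the nuclear norm minimization idea mentioned for DRMNN, the following Python sketch completes a partially observed matrix by repeatedly shrinking its singular values. It is not the thesis's implementation: the soft-impute style iteration, the block layout that stacks similarities and associations into one matrix, the function names (`svt`, `complete_nuclear_norm`, `build_block_matrix`), the threshold `tau`, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # soft-threshold the singular values
    return (U * s_shrunk) @ Vt


def complete_nuclear_norm(T, mask, tau=1.0, n_iter=200, tol=1e-6):
    """Soft-impute style iteration (an assumed, generic solver).

    Observed entries (mask == True) are taken from T, unobserved entries
    from the current estimate, and the result is shrunk with svt().
    """
    X = np.zeros_like(T, dtype=float)
    for _ in range(n_iter):
        filled = np.where(mask, T, X)
        X_new = svt(filled, tau)
        converged = np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X))
        X = X_new
        if converged:
            break
    return X


def build_block_matrix(S_drug, S_virus, A):
    """Assumed layout: [[S_drug, A], [A.T, S_virus]], with A of shape drugs x viruses."""
    return np.block([[S_drug, A], [A.T, S_virus]])


# Toy data (hypothetical): 3 drugs, 2 viruses.
S_drug = np.array([[1.0, 0.8, 0.1],
                   [0.8, 1.0, 0.2],
                   [0.1, 0.2, 1.0]])
S_virus = np.array([[1.0, 0.6],
                    [0.6, 1.0]])
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

n_drugs = S_drug.shape[0]
T = build_block_matrix(S_drug, S_virus, A)

# Similarities are treated as fully observed; in the association blocks only
# the known associations (ones) are observed, zeros are treated as unknown.
mask = np.ones_like(T, dtype=bool)
mask[:n_drugs, n_drugs:] = A > 0
mask[n_drugs:, :n_drugs] = (A > 0).T

X = complete_nuclear_norm(T, mask, tau=0.5)
scores = X[:n_drugs, n_drugs:]  # predicted drug-virus association scores
```

In this sketch, larger entries of `scores` at positions where `A` is zero would be read as candidate drug-virus associations; ranking them per virus gives a top-k list analogous in spirit to the top 10 analysis described above.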