
Study On Relevant Problems Of Biomedical Data Mining Based On Machine Learning

Posted on: 2021-05-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Wu
GTID: 1364330605469580
Subject: Control theory and control engineering
Abstract/Summary:
There is an urgent need to improve disease prevention, diagnosis, treatment, prognosis and rehabilitation. The Human Genome Project and the Human Microbiome Project have produced a growing body of experimental and clinical evidence on the association between diseases and changes in the microbial flora of the human body, and the post-genome era has yielded massive amounts of protein sequence data. However, in vitro studies of protein function and of disease-microbe associations still cannot meet the demand for rapid mining of the relevant information. It is therefore imperative to develop advanced and efficient computational models that identify the function of unknown proteins and predict human disease-microbe associations. With the rapid development of artificial intelligence, machine learning and data mining theory, and with improvements in computing power, medical data mining has become a hot research topic. Although computational biology has made great progress over the past decade in protein sequence alignment and in the prediction of human disease-microbe associations, much in this area remains unexplored. Based on data mining theory and machine learning theory, this dissertation investigates related problems in protein sequence analysis, therapeutic peptide recognition, and the prediction of human microbe-disease associations. The detailed contents of the research are as follows.

(1) With the development of sequencing techniques, an increasing amount of multi-omics sequencing data has become available. Many researchers have studied the similarity analysis of protein sequences and the identification of the function of unknown sequences. Despite progress in identifying the function of unknown protein sequences, little research has been based on the dynamic and nonlinear characteristics of protein sequences. Here we apply an approach based on graph theory and calculate the spectral radii of amino acids (AAs). A new graphical representation of protein sequences is proposed based on the spectral radii of AAs, and features reflecting the static and dynamic characteristics of proteins are then extracted. To measure similarity, cosine similarity and Gaussian kernel similarity were adopted to calculate the distance between sequences. The proposed method might help to identify unknown protein functions.

(2) In this study, a computational tool (PTPD) based on deep learning theory and the Word2vec method was designed to identify anticancer peptides (ACPs). Based on the co-existence information of k-mers, the embedding vectors of all k-mers were obtained by Word2vec. With these embedding vectors, the peptide sequences were mapped to the input layer. Feature maps were constructed by multiple filters in the convolutional layers. To avoid over-fitting, dropout and pooling operations were included in the framework. A sigmoid function was used to generate the classification probabilities. The proposed model was then validated on two independent datasets, and the results showed the validity of PTPD. It may therefore contribute to the identification and design of new therapeutic peptides.

(3) Understanding the association between microorganisms and diseases is crucial for the study of disease pathogenesis, prevention and treatment. To aid experimental validation, a computational model was proposed to predict novel human disease-related microbes. First, a disease-microbe network was established and an extended random walk algorithm was proposed to obtain the disease-microbe association probability. Second, the optimal model was obtained by the particle swarm optimization algorithm. The results of cross-validation and case studies validated the proposed model and showed that it is effective in identifying disease-related microbes.

(4) Because a single database provides only limited information for the study of human microbe-disease associations, we integrated symptom-based disease features and described the association between diseases and microorganisms by different methods. The prediction model was obtained by a matrix completion algorithm under the supervision of known associations. The results of cross-validation and a case study showed that it is an effective tool. The proposed method will help to identify novel disease-microbe associations and clarify the mechanisms behind microbial therapy, and thus guide the development of medicines and healthy foods.

In summary, this dissertation presents an innovative application of graph theory that calculates protein sequence similarity from the spectral radii of amino acids. Based on Word2vec and deep learning theory, a therapeutic peptide recognition model was constructed. An extended random walk on a heterogeneous network was proposed, and the resulting model, optimized by particle swarm optimization, predicted human disease-microorganism associations. A further model incorporated a variety of information and used the matrix completion algorithm to predict human disease-microorganism associations. The computational models proposed in this dissertation can potentially be applied to the study of protein sequence function and can assist in understanding the relationship between microorganisms and human diseases. This study will hopefully contribute to the early detection and treatment of diseases and promote the discovery and development of novel therapeutic targets.
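The similarity step described in part (1) can be illustrated with a minimal sketch. It assumes each protein sequence has already been reduced to a fixed-length numeric feature vector; the vectors `seq_a` and `seq_b`, the function names, and the bandwidth `sigma` are hypothetical and are not taken from the dissertation, which does not state its kernel parameters here. The Gaussian kernel is written in its common form exp(-||u - v||^2 / (2 sigma^2)).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def gaussian_kernel_similarity(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 * sigma^2)).

    sigma is an assumed bandwidth, not a value from the dissertation.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Hypothetical feature vectors standing in for two protein sequences.
    seq_a = [0.2, 0.5, 0.1]
    seq_b = [0.3, 0.4, 0.6]
    print("cosine:", cosine_similarity(seq_a, seq_b))
    print("gaussian:", gaussian_kernel_similarity(seq_a, seq_b, sigma=1.0))
```

Both measures increase as two vectors become more alike: cosine similarity compares direction only, while the Gaussian kernel depends on Euclidean distance and is therefore sensitive to the scale of the features and to the choice of `sigma`.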
Keywords/Search Tags: Feature extraction, Random walk algorithm, Matrix completion, Deep learning, Human microbe-disease association prediction