Research On The Relationship Between Genes And Diseases Based On Data Mining

Posted on: 2020-02-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J Li
GTID: 1364330623451655
Subject: Computer Science and Technology
Abstract/Summary:
Using advanced data mining methods to analyze the relationship between genes and diseases can help reveal the mechanisms by which diseases arise, and so provide a scientific basis for diagnosis and personalized treatment. However, gene-related data usually has many dimensions and few samples, along with high noise and high redundancy, which makes many otherwise excellent data mining and machine learning methods less effective on gene and disease data. It is therefore necessary to design algorithms suited to the characteristics of the specific gene and disease data at hand. Based on the characteristics of different gene and disease data, this dissertation analyzes the relationship between genes and diseases from three aspects: gene function prediction, characteristic gene selection, and the relationship between miRNAs and diseases. A corresponding data mining method is proposed for each. The main research contents and innovations are as follows:

1) Gene function prediction is in fact a multi-instance multi-label problem. This dissertation uses a machine learning approach to multi-instance multi-label objects in order to annotate unknown gene functions. It combines hierarchical clustering with a multi-label learning framework and proposes a multi-label hierarchical clustering framework based on the gene ontology hierarchy, which turns the multi-instance multi-label problem into a simpler single-instance multi-label problem. The algorithm is based on the correlation between gene expression profiles, supplements the corresponding clustering method according to the maximal correlation of functional classes between genes, and groups genes with similar functions into multi-instance data sets. To verify the algorithm, experiments were run on three yeast expression datasets. The multi-instance multi-label gene function prediction problem is first reduced to a single-instance multi-label problem with the multi-instance hierarchical clustering method based on the gene ontology hierarchy, and then the multi-label K-nearest neighbor algorithm (MLKNN) or the multi-label support vector machine algorithm (MLSVM) is used for modeling and function prediction. The experimental results show that, when the gene function prediction problem is reduced to a single-instance multi-label problem, the proposed algorithm preserves the correlations between genes well and achieves better performance.

2) Feature selection is a dimensionality reduction technique and a necessary preprocessing step for high-dimensional data. Its purpose is to select the smallest feature subset that still represents the full feature set of the high-dimensional data. Gene expression profile data has many dimensions (features) and very few samples, so data mining on such datasets is very likely to suffer from the curse of dimensionality. Feature selection has therefore become an almost mandatory preprocessing step in the analysis of gene expression profile data, and the term "feature gene selection" refers to applying feature selection methods to such data. This dissertation proposes a new method for selecting characteristic genes from gene expression data: a semi-supervised method for maximizing the local margin, referred to as SMLM. The method uses the local structure of the data to construct a local nearest-neighbor graph, and separates the information into within-class and between-class local nearest-neighbor graphs by weighting the edges between pairs of data points. To verify the performance of the proposed algorithm, classification experiments were carried out on four gene expression profile datasets. The experimental results show that SMLM has good stability and classification accuracy.

3) An inductive matrix completion model for miRNA-disease association prediction (IMC-MDA) is developed. Predicting potential miRNA-disease associations helps us understand the pathogenesis of diseases and promotes their treatment. However, using biological assays to identify disease-associated miRNAs is time-consuming, labor-intensive, and untargeted. Existing computational models for predicting miRNA-disease associations also have deficiencies: their accuracy is not ideal, and effective models require negative samples. There is therefore an urgent need for a simple and effective new computational model for predicting disease-associated miRNAs. In IMC-MDA, known miRNA-disease associations are combined with integrated miRNA similarities and disease similarities to calculate a predicted score for each miRNA-disease pair. Under leave-one-out cross-validation (LOOCV), IMC-MDA achieves an AUC of 0.8034, which is better than previous methods. In addition, experiments identified disease-related miRNAs for five major human diseases: colon tumors, kidney tumors, lymphoma, breast tumors, and esophageal tumors.
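
As an illustration of how known associations and similarity matrices can be combined into a score for every miRNA-disease pair, the following is a minimal sketch of inductive matrix completion in general, not the dissertation's IMC-MDA implementation. It assumes the similarity matrices are used directly as feature matrices, fits the low-rank factors by plain gradient descent on a squared loss over all entries, and uses invented toy data. The function name `imc_scores` and the parameters `rank`, `lam`, `lr`, and `n_iter` are hypothetical.

```python
import numpy as np

def imc_scores(A, S_m, S_d, rank=2, lam=0.1, lr=0.01, n_iter=1000, seed=0):
    """Fit W, H so that S_m @ W @ H.T @ S_d.T approximates A.

    A   : (n_mirna, n_disease) known association matrix (1 = known, 0 = unknown)
    S_m : (n_mirna, n_mirna) miRNA similarity, used here as miRNA features
    S_d : (n_disease, n_disease) disease similarity, used here as disease features
    Returns the (n_mirna, n_disease) matrix of predicted scores.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((S_m.shape[1], rank))
    H = 0.1 * rng.standard_normal((S_d.shape[1], rank))
    for _ in range(n_iter):
        # Residual between current predictions and the known associations.
        R = S_m @ W @ H.T @ S_d.T - A
        # Gradients of 0.5*||R||_F^2 + 0.5*lam*(||W||_F^2 + ||H||_F^2).
        grad_W = S_m.T @ R @ S_d @ H + lam * W
        grad_H = S_d.T @ R.T @ S_m @ W + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return S_m @ W @ H.T @ S_d.T

# Toy data (invented for illustration): 4 miRNAs, 3 diseases.
A = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
S_m = np.array([[1.0, 0.8, 0.1, 0.1],
                [0.8, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.2],
                [0.1, 0.1, 0.2, 1.0]])
S_d = np.array([[1.0, 0.2, 0.1],
                [0.2, 1.0, 0.3],
                [0.1, 0.3, 1.0]])

scores = imc_scores(A, S_m, S_d)
print(np.round(scores, 2))
```

In a setting like this, the unknown pairs (zeros in `A`) would be ranked by their predicted score to propose candidate associations. Because the factors act on similarity-derived features rather than on row and column indices, no explicit negative samples are constructed in this sketch.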
Keywords/Search Tags: Gene function annotation, Multi-instance multi-label learning, Feature selection, Gene expression profiling, Matrix completion, miRNA similarity, Disease similarity