Machine Learning Research on Pan-Cancer Gene Pathways and the Regulation of Chromatin Insulators

Posted on: 2020-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: C R Liu
GTID: 2370330578951278
Subject: Systems analysis and integration
Abstract/Summary:
With the development of DNA sequencing technology, there are more and more ways to obtain DNA sequence and gene expression data. The key task of bioinformatics is to extract valuable information from large volumes of biological sequence data. Traditional sequence analysis examines sequence mutations and their positions in the genome through alignment, mapping, and similar methods, while expression analysis examines the differential expression of genes and looks for patterns. However, these methods capture only the properties of the data itself and fail to uncover the rules hidden in it.

In recent years, machine learning has been widely applied to data mining, personalized recommendation, natural language processing, image recognition, and other areas. Through different types of supervision, it weights features and extracts those that generalize well. Previous machine learning analyses of bioinformatics data were mainly problem-oriented and focused on data classification. The algorithms were not tied to biological significance, so a model's generalization performance on biological data could be judged only by classification metrics. In this thesis, two sets of experiments are designed (pan-cancer gene pathway prediction from TCGA gene expression data, and insulator sequence prediction) to mine different kinds of biological data and to corroborate the algorithms' generalization performance through biological significance.

The Cancer Genome Atlas (TCGA) has collected sample data for 33 common cancers from more than 11,000 patients, including expression, mutation, and methylation data. Predicting pan-cancer gene pathways from TCGA gene expression data could enable earlier cancer diagnosis and clarify the relationship between gene expression and the activation of cancer pathways. Insulators play an important role in regulating gene expression. When an insulator is located between an enhancer and a gene, it blocks or reduces the enhancer's activation of that gene's expression. Such elements are therefore significant for gene therapy, where they can prevent genotoxicity and gene mutation and improve safety. Accurately predicting and identifying insulator modules can also cut the cost of experimental verification.

The results of the two experiments are therefore significant. The main contributions of this thesis are:

1) A pan-cancer gene pathway analysis framework, XBPCPA, is proposed. The XGBoost algorithm is used to integrate more than 180 million feature points from more than 9,000 samples and to analyze the effect of pan-cancer gene expression on pathway activation. A threshold hyperparameter is designed to control the classification boundary between positive and negative samples, which addresses sample imbalance in the data and improves the classification metrics AUC and AUPR. Comparison experiments show that the XBPCPA framework generalizes better for cancer pathway prediction.

2) Based on the semi-supervised deep learning algorithm Ladder, an insulator prediction algorithm, Ladder-Seq, is proposed. It addresses the problem of training deep learning models on biological data when only a small number of sequences are labeled. The model modifies the Ladder network with convolution to make it suitable for DNA sequence data, and through model design and parameter optimization it converges well on such data.

3) Based on an in-depth study of how features act in biological data classification tasks, a weight adjustment strategy that relates model weights to biological significance is put forward. In the gene pathway prediction experiment, the nodes of the trees learned by XGBoost are used to represent the relationship between gene expression and pathway activation. In the insulator experiment, the weight matrix of the first-layer convolution kernels of the Ladder network represents motifs in the insulator sequence. In the pan-cancer pathway prediction experiment, a large number of important gene expression features were found, and they are supported by published data. This line of research is of great significance for the early diagnosis of pan-cancer.
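
The abstract describes a threshold hyperparameter that moves the boundary between positive and negative samples to cope with class imbalance, but it does not say how the value is chosen. As a minimal sketch only, assuming the threshold is picked by maximizing F1 over a model's predicted scores (an assumption for illustration, not the XBPCPA procedure), the idea might look like this in Python. The function names and the toy scores and labels are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when samples with score >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each distinct score as a cut-off and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical toy data: 2 positives among 8 samples (imbalanced).
# The scores stand in for predicted probabilities from any classifier.
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45]
labels = [0, 0, 0, 0, 0, 1, 0, 1]

threshold = best_threshold(scores, labels)
predictions = [int(s >= threshold) for s in scores]
```

On this toy data the default cut-off of 0.5 calls no sample positive, while the selected lower cut-off recovers both positives at the cost of one false positive.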
Keywords/Search Tags:TCGA, bioinformatics, XGBoost, ladder, gene pathway, insulator