
Research on Genome Sequence Analysis Based on Information Entropy

Posted on: 2019-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2370330590474191
Subject: Computer technology
Abstract/Summary:
Gene sequence analysis is a basic research topic in genomic informatics and bioinformatics. Over the past twenty years, traditional experimental methods have been able to analyze genomic sequences accurately, but they are time-consuming, costly, and dependent on the actual experimental environment, so the use of computational methods to study genome sequences has become inevitable. With the development of genome sequencing technologies, the volume of biological data is increasing year by year at both the species level and the gene level. How to analyze genome sequences effectively is therefore an open problem. Information theory is the discipline that analyzes and interprets the measurement, transmission, exchange and storage of information, and it also serves as a research method in genome sequence analysis. In information theory, information entropy is a measure of the complexity of information. In this thesis we apply information entropy to two problems: intron prediction in DNA sequences and essential gene recognition.

Numerous algorithms and methods, including entropy-based quantitative methods, have been developed in recent decades to analyze the complexity of DNA sequences. Exons and introns are among the most notable components of DNA, and their identification and prediction remain an active research focus. In the present study, we designed an integrated entropy-based analysis approach that combines a modified topological entropy calculation, a genomic signal processing (GSP) method and singular value decomposition (SVD) to investigate exons and introns in DNA sequences. We optimized and implemented topological entropy and generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetitive sequences. Comparing the numerical entropy values of exons and introns, we observed that they are significantly different. After converting DNA data to numerical topological entropy values, we applied SVD to investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species were used for exon prediction. Our approach not only helps to explore the complexity of DNA sequences and their functional elements, but also provides an entropy-based GSP method for analyzing exon and intron regions. The approach is feasible across different species and can be extended to analyze other components in both coding and noncoding regions of DNA sequences.

Prediction of essential genes is one of the most challenging and intriguing problems in computational biology. In the eukaryotic genome, about a third of all genes are necessary for life. Predicting essential genes in bacteria and other prokaryotes helps to answer the question of which basic functions are necessary to support cellular life. In this work, we develop a feature-extraction method based on information entropy to analyze and predict essential genes. We optimize the calculation of the generalized topological entropy and generate 6 novel features. Using these 6 features together with 91 other commonly used information-theoretic features, we apply the XGBoost and Random Forest (RF) algorithms to classify essential and non-essential genes in 15 selected bacteria. The cross-validation results show that the proposed method can effectively identify essential genes and can also be used to predict other functional elements in DNA sequences.
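As a rough illustration of the entropy measures named in the abstract, the sketch below computes the Shannon entropy of a sequence's k-mer distribution and a basic topological entropy over the 4-letter DNA alphabet. The abstract does not specify the modified or generalized topological entropy used in the thesis, so this is not the author's method: it assumes the common finite-sequence definition (count the distinct subwords of length n in a prefix of length 4^n + n - 1, then take log base 4 divided by n). The function names `shannon_entropy` and `topological_entropy` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy, in bits, of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def topological_entropy(seq: str) -> float:
    """Basic topological entropy of a DNA sequence (assumed definition).

    Picks the largest subword length n with 4**n + n - 1 <= len(seq),
    keeps the first 4**n + n - 1 bases, and returns
    log_4(number of distinct n-mers) / n, a value between 0 and 1.
    """
    length = len(seq)
    n = 0
    while 4 ** (n + 1) + (n + 1) - 1 <= length:
        n += 1
    if n == 0:
        raise ValueError("sequence too short (need at least 4 bases)")
    prefix = seq[: 4 ** n + n - 1]
    distinct = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(distinct), 4) / n
```

Under these assumptions, a highly repetitive sequence scores near 0 and a sequence that uses every possible subword scores near 1, which is the kind of contrast the abstract reports between exon and intron regions. A windowed version of either function would turn a gene into a numerical signal of the kind that GSP and SVD steps could then process.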
Keywords/Search Tags: information entropy, generalized topological entropy, genomic signal processing, essential genes, classification model