Research On Non-coding Gene Function Annotation Methods Based On Data-driven

Posted on: 2016-12-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z X Ma
Full Text: PDF
GTID: 1228330467993952
Subject: Computer software and theory
Abstract/Summary:
In the first 15 years of the 21st century, the Human Genome Project, the Encyclopedia of DNA Elements (ENCODE), and a series of other international cooperative projects closely related to genetic information have been successfully carried out. The results of these projects show that the genetic code of life is made up of protein-coding genes and non-coding genes with complex regulatory functions. Non-coding genes play an important role in transcriptional regulation, epigenetic regulation, cell cycle regulation, cell differentiation, and many other life activities, and they are closely related to the occurrence and development of complex diseases. Protein-coding genes are essential raw materials for the survival of species, but non-coding genes guide the direction of development at a lower transcription level. The identification and functional annotation of non-coding genes is currently a popular research field in genetic information.

Massive bio-chip (microarray) data, scattered across public databases and fragmented literature, form a "knowledge base" for understanding genetic information. These data are usually not directly comparable because of differences in experimental background. Meanwhile, the majority of the data receive only simple statistical processing and are then set aside, for lack of reliable mathematical analysis tools. With the rapid development of information technology and advances in the basic theory and techniques of various disciplines, bioinformatics, which uses mathematical models as its theoretical basis and computational methods as its technical means, has gradually gained ground. It makes it possible to mine valuable but non-obvious information from biological data.

This paper focuses on methods for identifying non-coding genes and predicting their functions. It proposes a method for predicting the functions of non-coding genes across the genome, driven by different microarray data and aided by biological networks constructed by computational methods.
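As a rough illustration of what a computationally constructed network over microarray data can look like, the following Python sketch links coding and non-coding genes whose expression profiles are strongly correlated. It is not the method of this paper. The Pearson coefficient is used because the abstract reports it for coding and non-coding gene pairs; the gene names, the expression values, the 0.8 threshold, and the function names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        # The coefficient is undefined for a flat profile; this sketch
        # treats that case as "no association" (an assumption).
        return 0.0
    return cov / (sx * sy)


def coexpression_edges(coding, noncoding, threshold=0.8):
    """Link every coding gene to every non-coding gene with |r| >= threshold.

    `coding` and `noncoding` map a gene name to its expression profile
    measured over the same samples. The threshold is a hypothetical cut-off.
    """
    edges = []
    for c_name, c_profile in coding.items():
        for n_name, n_profile in noncoding.items():
            r = pearson(c_profile, n_profile)
            if abs(r) >= threshold:
                edges.append((c_name, n_name, r))
    return edges


# Hypothetical toy profiles over four samples.
coding = {"geneA": [1, 2, 3, 4], "geneB": [4, 3, 2, 1]}
noncoding = {"ncX": [2, 4, 6, 8], "ncY": [1, 0, 1, 0]}
for edge in coexpression_edges(coding, noncoding):
    print(edge)
```

Edges run only between the two kinds of node, so a non-coding gene can borrow functional hints from the annotated coding genes it is connected to.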
Specific tasks include the following three aspects:

(1) Research and analysis of the biological characteristics of non-coding genes: although a variety of evidence shows that the non-coding sequences of different genes have important biological functions, in most cases the function of a particular non-coding gene is not clear. This paper studies functional identification methods for non-coding genes from two perspectives, biological principles and computational prediction. It analyses non-coding genes and the secondary structure characteristics of long non-coding RNA. It finds that codon usage in non-coding genes is mostly biased and that their sequences are poorly conserved. The number of long stem-loop structures in the secondary structure of their transcripts is greater than that in coding sequences. These biological statistical features can be used to identify non-coding genes and their functions, and they are the foundation of the follow-up work in this paper.

(2) A genome-wide functional prediction method for non-coding genes: by calculating and analyzing the human genome microarray Affymetrix Human Genome U133A (GEO No. GPL96), this paper determined that this chip contains a large number of probe annotation errors. The 25,000 probes of HG-U133A target 14,500 human genes. We found that 41% of the probes match multiple gene sequences non-specifically, and 9% of the probes do not match any coding gene sequence. Based on the findings on non-coding gene characteristics, we proposed a method for constructing a two-tone co-expression network from coding and non-coding microarray data, which reflects the functional association between coding and non-coding genes. Taking the HG-U133A chip as an example, we used this method to re-annotate 1120 non-coding genes among 250,000 probes, with the CSF value set to less than 300. The mean Pearson correlation coefficient of coding and non-coding gene pairs is less than 2.20e-16.
After functional enrichment analysis, these 1120 non-coding genes were found to be involved in tissue and organ development, intracellular transport, and other metabolic processes.

(3) A genome-wide prediction method for non-coding housekeeping genes: housekeeping genes are constitutive genes that maintain the basic functions of the cell. They are usually expressed in all tissue types and cell stages. This feature allows housekeeping genes to be used as a reliable reference for chip normalization. To make chip data from different backgrounds comparable, this paper proposed a method based on Fourier analysis. It converts time-series data from gene microarrays into Fourier spectra. Using a supervised learning method, the support vector machine (SVM), it extracts the salient features of the Fourier spectra and identifies non-coding housekeeping genes. Using this method, we extracted 24 cycle-frequency characteristics from the human HeLa cell and time-series data sets GSE361 and GSE1133, which contain 115 sets of data. Using these features, we predicted 510 housekeeping genes, of which 93 are non-coding housekeeping genes. Comparative experiments show that this method completely covers the positive data from the three currently publicly reported data sets and has a low false positive rate.

Bioinformatics, which uses the computer as its main analytical tool, can provide valuable reference information and sound guidance for specific biological problems and experiments. It can also reduce the manpower and material resources required by large-scale biological experiments and accelerate the research process. While solving biological problems, it enriches algorithm research and expands its area of application, and is of important theoretical and practical value to the two disciplines. In the future, the scale of time-series bio-chip data will be larger. Obtaining more diverse data by this method is bound to yield more reliable biological evidence.
The network model and prediction algorithm proposed in this paper can not only better solve the problems of identifying and functionally annotating non-coding genes, but also serve as a reference for similar data analysis in other areas.
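The Fourier step described for housekeeping genes can be pictured with a minimal Python sketch, under stated assumptions. It turns one gene's time series into an amplitude spectrum and reports the dominant period; a gene expressed at a constant level yields an essentially empty spectrum. The function names and the sample data are hypothetical, and a direct DFT is used only to keep the sketch dependency-free. The SVM classification step and the 24 features used in the paper are not reproduced here.

```python
import cmath
from math import pi, sin


def amplitude_spectrum(series):
    """Amplitudes of the non-negative frequency bins of a real-valued series."""
    n = len(series)
    mean = sum(series) / n
    # Remove the constant offset so bin 0 does not dominate the spectrum.
    centred = [v - mean for v in series]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * pi * k * t / n) for t, v in enumerate(centred)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


def dominant_period(series):
    """Period, in sampling steps, of the strongest non-constant component.

    Needs at least two time points; for a flat series every bin is zero
    and the result carries no meaning.
    """
    spectrum = amplitude_spectrum(series)
    k = max(range(1, len(spectrum)), key=lambda i: spectrum[i])
    return len(series) / k


# Hypothetical profiles: one cycling gene and one constantly expressed gene.
cycling = [sin(2 * pi * t / 8) for t in range(16)]
constant = [5.0] * 16
print(dominant_period(cycling))
print(max(amplitude_spectrum(constant)))
```

A spectrum of this kind, computed per gene, is the sort of feature vector a supervised classifier could be trained on.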
Keywords/Search Tags: genes, non-coding RNA, function prediction, classification algorithms