Classification Method Of DNA Sequence Based On RBF Neural Network

Posted on: 2010-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: X N Sun
Full Text: PDF
GTID: 2178360272995985
Subject: Biomedical engineering
Abstract/Summary:
The base sequence of DNA records all the instructions for the growth and development of life; it is a book that humans have not yet fully understood. It is made up of just four kinds of bases: adenine (A), cytosine (C), guanine (G), and thymine (T). This long chain contains not only all the information needed to manufacture human proteins, but also the specific spatial and temporal pattern in which these proteins are assembled, giving four-dimensional control information for the organism. The human genetic code contains 3.2 billion characters, yet it is composed of only four bases and has neither a lexicon nor a syntax, so how to read it is a major problem.

In the early days, the primary means of gene identification were based on living cells or experimental biology. Now that researchers have obtained a tremendous amount of genomic information, slow experimental analysis can no longer meet the needs of gene identification. Computer-based gene identification algorithms have developed considerably and have become the major means of gene identification. The key research question is how to develop a rapid and efficient algorithm that improves the accuracy and performance of gene identification.

This work explores a DNA sequence classification method based on an RBF neural network. First, each DNA sequence is mapped to a meaningful feature vector; these vectors reflect the characteristics of DNA sequences from different angles and constitute the optimal feature set of the sequence. Second, the optimal feature set is fed into the network, which is trained and corrected. When training is complete, the optimal feature set of an unknown sequence is fed into the network, and the output is the class of the unknown sequence. To select the optimal feature set, this work considers both the percentage of bases and the arrangement of bases.

First: feature extraction based on base percentage. Within DNA sequences, many strings are repeated, and some strings occur a different number of times in different classes. That is, strings can be regarded as features that distinguish classes. In this work, single-base content and double-base content are selected as features of a DNA sequence.

Single-base content: the percentages of A, G, C, and T are used as characteristics of a sequence. If these percentages are denoted pa, pg, pc, and pt, the sequence can be characterized by the four-dimensional vector (pa, pg, pc, pt), where pa + pg + pc + pt = 1. After listing all single-base contents, some typical features are chosen as the first selection of the optimal feature set.

Double-base content: the percentages of double bases are used as characteristics of a sequence. The four bases A, G, C, and T, taken two at a time, form 16 combinations, so each DNA sequence can be represented by a 16-dimensional vector. Because losing or adding a base can affect the characteristics of a sequence, the double bases are obtained by shifting one base to the right at a time. After listing all double-base contents, some typical features are chosen as the second selection of the optimal feature set.

Second: feature extraction based on the arrangement of the sequence. This work explores a new feature extraction method: after the DNA sequence is expressed by the 4D method, the average of all samples and the slope of the fitted straight line in the XOS and ZOS coordinate planes are selected as new characteristics of the DNA sequence.
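The single-base and double-base contents described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: the function names, the A, G, C, T component order, and the choice to reject any character other than those four bases are assumptions made here. Double bases are taken as overlapping adjacent pairs, one possible reading of "shifting one base to the right at a time".

```python
from collections import Counter
from itertools import product

BASES = "AGCT"  # component order follows (pa, pg, pc, pt); an assumption


def _clean(seq):
    """Upper-case the sequence and reject characters other than A, G, C, T."""
    seq = seq.upper()
    if set(seq) - set(BASES):
        raise ValueError("sequence may only contain A, G, C, T")
    return seq


def single_base_content(seq):
    """Return [pa, pg, pc, pt], the fraction of each base in the sequence."""
    seq = _clean(seq)
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[b] / len(seq) for b in BASES]


def double_base_content(seq):
    """Return 16 fractions, one per ordered pair of adjacent bases."""
    seq = _clean(seq)
    # Slide a two-base window one position to the right at a time.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        raise ValueError("sequence needs at least two bases")
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs) for a, b in product(BASES, repeat=2)]


# Hypothetical example sequence: 4 + 16 = 20 candidate features.
features = single_base_content("AACGT") + double_base_content("AACGT")
```

Both vectors sum to 1 by construction, so in practice only a subset of the 20 components carries independent information, which is consistent with selecting "some typical features" rather than all of them.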
The average and slope obtained from the 4D representation constitute the third selection of the optimal feature set.

For a training sequence, the optimal feature set is thus obtained from single-base content, double-base content, and the 4D representation. The next step is to construct and train a Radial Basis Function (RBF) neural network on the optimal feature set and then obtain the classification result for an unknown sequence.

The RBF neural network was proposed by J. Moody and C. Darken. In general, it contains three layers: input, hidden, and output. Input-layer nodes only pass the input signal to the hidden layer, hidden-layer nodes are radial nonlinear functions, and the output-layer node is usually a simple linear function. The basis function of a hidden-layer node responds to the input signal locally: the closer the input signal is to the center of the basis function, the higher the node's output. This reflects the characteristics of the cerebral cortex, and the network is therefore said to have the capacity for local approximation.

In this work, the Gaussian function is selected as the basis function for the hidden layer of the RBF neural network, and newrbe as the design function. The number of input-layer neurons equals the number of input samples, and the number of output-layer neurons equals the number of desired outputs. Once the network is built, training is determined by the mathematical expectation and variance of the Gaussian function and by the weights and thresholds of the hidden-layer and output-layer neurons. Once these weights and thresholds are fixed, the network output is fixed as well, so training an RBF network is the process of correcting the layer weights and thresholds.

In this work, the network is built from known training sequences and is then used to determine which class an unknown sequence belongs to.
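For reference, the usual textbook form of a Gaussian RBF network is written out below. The notation is introduced here for illustration and is not taken from the thesis: c_j is the center (mathematical expectation) of hidden node j, σ_j² its variance, w_jk the weight from hidden node j to output k, b_k the output threshold, and M the number of hidden nodes.

```latex
\varphi_j(\mathbf{x}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{c}_j \rVert^{2}}{2\sigma_j^{2}}\right),
\qquad
y_k(\mathbf{x}) = \sum_{j=1}^{M} w_{jk}\,\varphi_j(\mathbf{x}) + b_k
```

Each φ_j reaches its maximum of 1 at x = c_j and decays as x moves away from the center, which is the local response described above; the output layer is linear in the hidden-node outputs.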
The concrete steps are as follows. First, build an RBF neural network according to the number of input and output samples. Second, feed the optimal feature set into the network and train and correct it. Third, when training is complete, feed the optimal feature set of the unknown sequence into the network; the output is then the class of the unknown sequence. Both artificial sequences and actual biological sequences were used in the experiments. Very good classification results were obtained in both cases, which shows that the proposed method is feasible and effective.
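The three steps above can be sketched in plain Python as an exact-design RBF network in the spirit of MATLAB's newrbe, which creates one hidden neuron per training vector and solves a linear system for the output weights. This is a simplified illustration under stated assumptions, not the thesis's code: it uses the standard Gaussian exp(-d²/(2·spread²)) rather than newrbe's own scaling, omits the output bias, and all function names, feature vectors, and class labels are hypothetical.

```python
import math


def gaussian(x, center, spread):
    """Gaussian basis function: 1 at the center, decaying with distance."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * spread ** 2))


def solve(a, b):
    """Solve a @ w = b (b may have several columns) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + list(rhs) for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a different spread")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [[v / m[i][i] for v in m[i][n:]] for i in range(n)]


def design_rbf(train_x, train_t, spread):
    """Steps 1 and 2: one hidden node per training vector; the output
    weights are chosen so the training targets are reproduced exactly."""
    phi = [[gaussian(x, c, spread) for c in train_x] for x in train_x]
    weights = solve(phi, train_t)  # weights[j][k]: hidden j -> output k
    return train_x, weights, spread


def predict(net, x):
    """Linear output layer applied to the hidden-node responses."""
    centers, weights, spread = net
    h = [gaussian(x, c, spread) for c in centers]
    n_out = len(weights[0])
    return [sum(hj * w[k] for hj, w in zip(h, weights)) for k in range(n_out)]


def classify(net, x):
    """Step 3: the class is the index of the largest network output."""
    out = predict(net, x)
    return max(range(len(out)), key=out.__getitem__)


# Hypothetical single-base feature vectors (pa, pg, pc, pt) for two classes,
# with one-hot targets; real use would pass the selected optimal feature set.
train_x = [[0.4, 0.2, 0.2, 0.2], [0.1, 0.4, 0.4, 0.1]]
train_t = [[1.0, 0.0], [0.0, 1.0]]
net = design_rbf(train_x, train_t, spread=0.2)
label = classify(net, [0.38, 0.22, 0.2, 0.2])  # an "unknown" sequence
```

Because the design is exact, the network reproduces the training targets; how well it generalizes depends mainly on the spread, which has to be tuned to the spacing of the feature vectors.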
Keywords/Search Tags:feature extraction, DNA sequence, RBF neural network, classification