Recognition Of Protein Coding Sequences Based On Graphical Representation

Posted on:2012-11-20

Degree:Master

Type:Thesis

Country:China

Candidate:J Yan

Full Text:PDF

GTID:2230330395485619

Subject:Computer Science and Technology

Abstract/Summary:

Genome sequences are rich in biological knowledge and biological principles. With the development of the Human Genome Project (HGP) and the rapidly increasing pace of genome-sequencing projects, biologists have obtained the genome sequences of hundreds of species. Recognition of protein coding genes is the first problem in genome analysis after sequencing. This thesis describes some new approaches for the recognition of protein coding sequences, especially short coding sequences, and analyzes them in terms of graphical features and classification algorithms.

Based on the base bias at the three codon positions and on the chemical properties of the bases, new graphical representations of gene sequences are introduced for recognizing short coding sequences in human genes. Nine effective features of the area matrix are extracted from the new curves, and a Support Vector Machine (SVM) is used to identify the short protein coding sequences in human genes. During identification, an incremental feature selection algorithm is used to add four statistical features that express more information and improve accuracy. Principal Component Analysis (PCA) is then applied to reduce the dimensionality. The experimental results show that the method uses fewer features (seven or four) and achieves better recognition results than other methods.

The traditional Support Vector Machine (SVM) is sensitive to isolated points and noisy data, and is computationally expensive. To address these weaknesses, the Least Squares Fuzzy Support Vector Machine (LS_FSVM) is applied instead of SVM to classify coding and non-coding sequences. A new method for calculating the sample membership in LS_FSVM is proposed, which takes the relationships among samples into account. Compared with SVM and the Least Squares Support Vector Machine (LS_SVM), this method obtains better recognition accuracy.
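To make the classification step more concrete, the following is a minimal Python sketch of a least squares fuzzy SVM applied to codon-position features. It is an illustration under stated assumptions, not the thesis's implementation: the 12 base-frequency features stand in for the area-matrix features of the graphical representation, the class-center distance membership stands in for the new membership calculation proposed in the thesis (whose details are not given in the abstract), and all function names and the parameters `gamma`, `sigma`, and `delta` are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def codon_position_features(seq):
    """Frequency of each base at each of the three codon positions (12 values).
    Assumption: a simple stand-in for the thesis's graphical features."""
    feats = np.zeros((3, 4))
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            feats[i % 3, BASES.index(base)] += 1
    totals = feats.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for very short input
    return (feats / totals).ravel()

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def center_distance_membership(X, y, delta=1e-3):
    """Assumed membership: samples far from their class mean get lower weight."""
    s = np.empty(len(y))
    for label in np.unique(y):
        idx = y == label
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        s[idx] = 1.0 - d / (d.max() + delta)
    return np.clip(s, delta, 1.0)

def fit_ls_fsvm(X, y, s, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM linear system with per-sample fuzzy weights s.
    Labels y must be +1 / -1. Returns the bias b and multipliers alpha."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    # The fuzzy weights scale the regularization term of each sample.
    A[1:, 1:] = omega + np.diag(1.0 / (gamma * s))
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def predict(X_train, y, alpha, b, X_new, sigma=1.0):
    """Sign of the kernel decision function for each row of X_new."""
    return np.sign(rbf_kernel(X_new, X_train, sigma) @ (alpha * y) + b)
```

In this sketch, training reduces to one linear solve instead of the quadratic program of a standard SVM, which is the usual motivation for the least squares formulation. A sample with a small membership value contributes less to the fit, which is how isolated points and noise are down-weighted.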

Keywords/Search Tags:

Graphical representation, Coding/non-coding region, Identification of short protein coding region, Gene identification, DNA, Least squares fuzzy support vector machine, Membership function, Support vector machine

Related items

1	Function Annotation Of Long Non-coding RNAs Based On Multi-omics Data
2	Identification Of Protein Coding Region Based On Artificial Neural Networks
3	The Research Of Prediction Method Of Protein Structural Class Based On Linear Predictive Coding Of PSI-BLAST Profiles
4	Identification Of Plant Long Non-coding RNA Based On Sequence Energy Score Difference Method And SVM Algorithm
5	Prediction Of Long Non-coding RNA Subcellular Localization
6	Coding And Non-coding Sequence Analyzing Based On Z-curve Theory
7	Evolutionary Analysis Of Bacteria And Virus And Identification Of Eukaryote Coding Region Based On New Algorithm NAAKV
8	Phylogenetic Analysis And Non-coding RNA Prediction Based On Information Theory
9	Evaluation Of Gene Structure Prediction Programs And Prediction Of Translation Initiation Sites
10	Construction Of SNP Map In Regulatory And Coding Region Of TLR4 Gene In Chinese