On The Application Of The Methods Of Poly(A) Site Identification For Model Plant Sequence

Posted on: 2008-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Zhang
Full Text: PDF
GTID: 2120360242978821
Subject: Systems Engineering
Abstract/Summary:
With the progress of genome sequencing projects and breakthroughs in the measurement of molecular structure, biologists have accumulated large quantities of data on genes. To make full use of this continuously growing data, they need new bioinformatics algorithms and tools to analyze and process it. Data mining therefore has great potential for predicting gene function and discovering new genes. This thesis studies sequence clustering through data mining and the identification of poly(A) site positions in sequences, a useful form of clustering that can serve as a first step in research on gene expression data.

This thesis presents a method for identifying poly(A) sites in the model plant Arabidopsis based on the Self-Organizing Map (SOM). The SOM is a widely used unsupervised neural network for fuzzy clustering that adjusts its initially undetermined weights through self-organizing training on a large amount of data. The SOM visualization makes it possible to judge directly whether a sequence contains a poly(A) site. Working with Arabidopsis, I first obtained the nucleotide distribution characteristics around Arabidopsis poly(A) sites using statistical methods combined with knowledge of the known cis-elements, and then translated the sequences into numbers. Second, I built a test model with the SOM, which produces a fuzzy clustering of the sequences through training and judges whether a sequence contains a site. In addition, I developed a method to find the exact position of the poly(A) site. Finally, I assessed the model on test data. The highlight of the thesis is predicting whether a sequence has one poly(A) site or several. The sensitivity (Sn) on the test set is 91.13%, which demonstrates that identifying poly(A) sites with an SOM is feasible and effective. I also applied the method to a set of thousands of sequences, with an accuracy of 63%.
In the future, the model can be used to predict the exact position of poly(A) sites, false sites can be excluded through biological experiments, and the model can be improved further, reducing the heavy workload in the biological laboratory. These results should be of great significance for further analysis of gene expression.
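
The pipeline described in the abstract (translate sequences into numbers, train an SOM, read off the cluster of each sequence, score with Sn) can be illustrated with a minimal sketch. This is not the thesis's implementation: the one-hot encoding, the 2x2 grid, the decay schedules, the names `encode`, `TinySOM`, and `sensitivity`, and the four toy windows are all assumptions chosen for illustration, and the toy windows are not real Arabidopsis data.

```python
import math
import random

# Assumed encoding: the abstract only says sequences are "translated into numbers".
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def encode(seq):
    """Flatten a nucleotide string into a one-hot numeric vector."""
    vec = []
    for base in seq.upper():
        vec.extend(ONE_HOT[base])
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinySOM:
    """A small rectangular self-organizing map (hypothetical helper)."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        self.rows, self.cols = rows, cols
        self.weights = {(r, c): [rng.random() for _ in range(dim)]
                        for r in range(rows) for c in range(cols)}

    def winner(self, vec):
        """Return the grid node whose weight vector is closest to vec."""
        return min(self.weights, key=lambda n: sq_dist(self.weights[n], vec))

    def train(self, data, epochs=50, lr0=0.5):
        radius0 = max(self.rows, self.cols) / 2
        for epoch in range(epochs):
            frac = epoch / epochs
            lr = lr0 * (1 - frac)                    # linearly decaying learning rate
            radius = max(radius0 * (1 - frac), 0.5)  # shrinking neighbourhood
            for vec in data:
                br, bc = self.winner(vec)
                for (r, c), w in self.weights.items():
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2 * radius ** 2))
                    # Pull each node towards the input, weighted by grid distance.
                    for i in range(len(w)):
                        w[i] += lr * h * (vec[i] - w[i])


def sensitivity(true_positives, false_negatives):
    """Sn = TP / (TP + FN): the fraction of true sites that are detected."""
    return true_positives / (true_positives + false_negatives)


# Toy windows, made up for illustration only.
windows = ["AATAAA", "AATAAT", "GCGCGC", "GCGGCC"]
data = [encode(w) for w in windows]
som = TinySOM(rows=2, cols=2, dim=len(data[0]))
som.train(data)
for w, v in zip(windows, data):
    print(w, "->", som.winner(v))
```

In a setting like the one the abstract describes, each map node would then be labelled by the known sequences that fall on it, and a new sequence would be judged by the label of its winning node. How the thesis actually labels nodes and locates the exact site position is not stated in the abstract.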
Keywords/Search Tags: Poly(A) site identification, Model Plant, Self-Organizing Map, fuzzy cluster