Prediction Of Non-coding RNA Genes Based On Sequence Features

Posted on: 2009-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: X K Ren
Full Text: PDF
GTID: 2120360242980421
Subject: Computer system architecture
Abstract/Summary:
In recent years, as RNA research has deepened, non-coding RNA (ncRNA) has become a major focus in the life sciences. Through a variety of gene expression regulation mechanisms, ncRNA plays a wide range of important roles in living organisms. However, our understanding of ncRNA remains limited: the number of ncRNA genes in the genome is still unclear, and the efficiency of experimental detection is restricted by the available techniques. Recognizing ncRNA sequences in the genome is therefore a problem that needs to be solved. Predicting ncRNA gene sequences with bioinformatics methods, which combine computational and biological knowledge, has become a popular line of current research and can contribute to a deeper understanding of the human genome. This thesis takes prokaryotic and eukaryotic genome sequences as samples, extracts sequence features from them, applies principal component analysis, and uses support vector machines to predict ncRNA genes, providing a new computational approach to ncRNA gene prediction.
First, this thesis introduces the concept of bioinformatics, followed by non-coding RNA (ncRNA) and its significance, which together lay the foundation for the rest of the work. Non-coding RNAs are functional RNA molecules that do not code for proteins. More than ten kinds of ncRNA have been found to date, such as cRNA, mRNA-like RNA, guide RNA, tmRNA, telomerase RNA, signal recognition particle RNA, snoRNA and microRNA. ncRNAs play an important role in many life processes, including transcriptional regulation, chromosome replication, RNA processing and modification, protein synthesis, protein transport, and regulation of gene expression, so a comprehensive and detailed understanding of ncRNA is of great significance.
Second, the current state of ncRNA prediction is reviewed.
ncRNA prediction emerged as the volume of genomic data grew dramatically. Identifying ncRNA through chemical experiments consumes a great deal of manpower and material resources. It is therefore necessary to combine computer technology with biological knowledge in order to find, locate and identify ncRNA genes in the genome quickly. Computational prediction can serve as an initial screen that produces a candidate set for further study and greatly accelerates ncRNA research, so it is valuable and necessary work. This thesis then reviews several existing ncRNA gene prediction algorithms and software tools. Compared with protein-coding gene prediction algorithms, they are significantly less reliable and efficient, their results are not ideal, and they are usually specialized rather than general. Each algorithm has its own advantages and disadvantages. As scientific techniques develop and biological knowledge accumulates, our understanding of ncRNA gene structure and function will deepen and new regularities of ncRNA genes will be discovered. Finding and developing an efficient, general-purpose ncRNA gene prediction algorithm or software tool is therefore a practical goal.
Third, this thesis introduces the concept and principles of principal component analysis (PCA). PCA is a technique for data compression, feature extraction and multivariate statistical analysis that effectively removes correlation among variables. Here it is applied to statistical features of nucleotide sequences, such as single-nucleotide ratios. 22 initial features are collected from the sequences; PCA removes the redundant ones while preserving the vast majority of the information they carry. The result is a data set with lower complexity and less noise than the original.
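As a rough illustration of the PCA step described above, the following Python sketch centers a feature matrix, extracts principal components from its covariance matrix, and keeps just enough of them to retain a chosen share of the variance. It is not the MATLAB implementation used in the thesis: the function name `pca`, the 95% retained-variance threshold and the power-iteration solver are all assumptions made for this example.

```python
import math
import random


def pca(rows, retain=0.95, iters=1000, seed=0):
    """Return (scores, components, explained_ratios) for a list of feature rows.

    Keeps the fewest principal components whose cumulative explained
    variance reaches `retain` (0.95 is an assumed, illustrative threshold).
    Needs at least two rows.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))  # total variance = trace
    rng = random.Random(seed)
    components, variances = [], []
    while len(components) < d and sum(variances) < retain * total:
        # Power iteration: repeated multiplication by cov converges to the
        # eigenvector with the largest remaining eigenvalue.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(v[i] * cv[i] for i in range(d))  # eigenvalue estimate
        if lam <= 1e-12 * total:
            break  # only numerical noise is left
        components.append(v)
        variances.append(lam)
        # Deflation: remove the component just found from the covariance.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]
    # Project each centered sample onto the kept components.
    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centered]
    ratios = [var / total for var in variances]
    return scores, components, ratios
```

Calling `pca(feature_rows)` on a samples-by-features list returns the reduced rows in `scores`; the length of each reduced row depends on how correlated the original features are.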
This thesis uses MATLAB, whose strong matrix computation capabilities suit the large volume of data and matrices involved, mainly to carry out the principal component analysis.
After the features are reduced by PCA, the data sets are converted into the input file format required by the support vector machine and then passed to it to predict ncRNA genes. The support vector machine (SVM) is a machine learning method based on statistical learning theory. It shows many unique advantages on pattern recognition problems involving small samples, nonlinearity and high dimensionality, and has achieved good results in pattern recognition, function approximation and probability density estimation. By minimizing classification error on the training samples under the structural risk minimization principle, SVMs improve the generalization ability of the classifier. This thesis introduces the concept and basic principles of support vector machines and the use of an SVM implementation, LIBSVM. Suitable parameter values are obtained by tuning LIBSVM's parameters on the training set. Once the parameters are fixed, the model is run on the test set to evaluate ncRNA prediction.
In conclusion, the main work is as follows. First, ncRNA sequences and protein-coding sequences of prokaryotes and eukaryotes are collected from online databases as positive and negative samples for training and prediction. Sequence features are then extracted from these samples and compiled into a matrix.
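The feature-extraction step can be sketched as follows. The abstract does not list its 22 features, so this example uses a stand-in set of 20 composition features (4 single-nucleotide ratios plus 16 dinucleotide ratios); the feature choice and the names `composition_features` and `feature_matrix` are assumptions for illustration only.

```python
from itertools import product

ALPHABET = "ACGT"
DINUCS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA, AC, ..., TT


def composition_features(seq):
    """Return 4 single-nucleotide ratios followed by 16 dinucleotide ratios.

    RNA input is accepted (U is treated as T); ambiguous symbols such as N
    are ignored rather than counted.
    """
    seq = seq.upper().replace("U", "T")
    mono = {b: 0 for b in ALPHABET}
    di = {p: 0 for p in DINUCS}
    for base in seq:
        if base in mono:
            mono[base] += 1
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in di:
            di[pair] += 1
    # Guard against empty or fully ambiguous sequences.
    n_mono = sum(mono.values()) or 1
    n_di = sum(di.values()) or 1
    return ([mono[b] / n_mono for b in ALPHABET]
            + [di[p] / n_di for p in DINUCS])


def feature_matrix(seqs):
    """Stack one feature row per sequence (samples x features)."""
    return [composition_features(s) for s in seqs]
```

Each row of `feature_matrix(seqs)` describes one sequence, so positive (ncRNA) and negative (protein-coding) samples can be stacked into a single matrix with a parallel list of labels.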
Second, PCA is implemented and applied to these sequence features; while retaining most of the sequence information, it reduces the number of features and the size of the matrix, producing the training and test sets. Third, a support vector machine is trained on the training set, suitable SVM parameters are chosen, a prediction model is generated, and the model is used to predict the test set. Finally, the results are summarized and their effectiveness is analyzed.
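The training and test sets mentioned above have to be written in LIBSVM's sparse text format before its tools can read them: one sample per line, a class label followed by `index:value` pairs with 1-based, ascending indices, where zero-valued features may be omitted. A minimal sketch of that conversion is shown here; the helper names `to_libsvm_line` and `libsvm_lines`, and the use of +1 for ncRNA and -1 for protein-coding samples, are assumptions for this example.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' (LIBSVM sparse format)."""
    parts = [f"{label:+d}"]
    # Indices start at 1; zero-valued features are skipped.
    parts += [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join(parts)


def libsvm_lines(labels, rows):
    """Format a whole data set, one line per (label, feature row) pair."""
    return [to_libsvm_line(y, x) for y, x in zip(labels, rows)]
```

Writing `"\n".join(libsvm_lines(labels, rows))` to a file gives input in the shape that LIBSVM's `svm-train` and `svm-predict` command-line tools read.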
Keywords/Search Tags: Prediction