
Identification of Long Non-coding RNA and mRNA Based on Maximum Entropy and k-mer

Posted on: 2016-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: M Wei
Full Text: PDF
GTID: 2310330488457216
Subject: Engineering
Abstract/Summary:
With the rapid development of high-throughput sequencing technology, a large number of transcripts have been discovered in many species, including human and mouse. Long non-coding RNAs are abundant among these transcripts, accounting for about 4-9% of RNAs, whereas mRNAs account for only about 1-2%. Some long non-coding RNAs have been confirmed to participate in many important biological processes, such as cell differentiation, immune response, signaling pathways and metabolic regulation pathways. In addition, many studies have indicated a close relationship between long non-coding RNAs and human disease. The function of long non-coding RNAs and their correlation with human disease have therefore become an active research topic. However, many long non-coding RNAs have still not been identified, so distinguishing long non-coding RNAs from mRNAs is an urgent task.

Two difficulties arise in constructing a model to identify long non-coding RNAs and mRNAs. First, the number of transcripts produced by high-throughput sequencing is huge, and there is no complete genome annotation, especially for long non-coding RNAs. Second, high-throughput sequencing inevitably produces some sequencing errors.

To overcome these difficulties, this thesis proposes an identification algorithm based on maximum entropy and k-mers. k-mer features are extracted from the samples, a subset of them is selected by an algorithm based on the maximum entropy principle, and the identification model for long non-coding RNAs and mRNAs is built with the LIBSVM tool. The model is validated on the training data with 5-fold cross-validation, and the recognition accuracy reaches 94.96%. Cross-species experiments show that the algorithm has a certain adaptability. The algorithm is also tested on simulated indel sequences and on real sequences, and both its identification performance and its robustness are better than those of other recognition algorithms.
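The k-mer feature extraction step mentioned in the abstract can be illustrated with a minimal sketch. The function name `kmer_frequencies`, the default `k=3`, and the choice to normalize counts to relative frequencies are assumptions made for illustration; the abstract does not state the thesis's value of k or its normalization. The maximum-entropy feature selection and the SVM training are not shown, since the abstract does not describe them in enough detail to reproduce.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return relative k-mer frequencies of seq as a fixed-length list.

    The list has len(alphabet) ** k entries, ordered lexicographically
    by k-mer (AAA, AAC, AAG, ... for k=3). This is a hypothetical
    helper, not the thesis's own implementation.
    """
    # Treat RNA and DNA notation alike: map U to T and ignore case.
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

Each transcript then maps to a vector of fixed length (64 entries for k=3), regardless of the transcript's own length, which is the form a classifier such as an SVM expects as input.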
Keywords/Search Tags: high-throughput sequencing, maximum entropy, long non-coding RNA, mRNA, k-mer, robustness