Research And Implementation Of CRISPR-Cas System Prediction Method

Posted on: 2024-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: J B Wang
GTID: 2530306917956779
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated (Cas) proteins together form the CRISPR-Cas system, an acquired immune system present in prokaryotes (approximately 90% of archaeal genomes and 48% of bacterial genomes). The immune process comprises three stages: adaptation, expression, and interference, and it can specifically cleave invading foreign viruses or plasmids. The CRISPR-Cas system consists mainly of three parts: the CRISPR array, the leader sequence, and the Cas genes. The CRISPR array consists of direct repeats (DRs) and spacers. The leader sequence contains promoters and drives transcription of the CRISPR array into RNA, participating in the immune regulation of the CRISPR-Cas system. The Cas genes are transcribed and translated into Cas proteins, which participate in the entire immune process. In addition, the CRISPR-Cas system, as a genome editing tool, has become an important part of the new generation of genetic engineering and has been widely used in agriculture, food, medicine, and other fields. A large number of CRISPR-Cas systems remain undiscovered, which to some extent hinders systematic research on the evolution of the immune system in prokaryotes. Exploring CRISPR-Cas systems at scale through traditional experimental methods is very difficult. It is therefore necessary to first use bioinformatics methods to scan the whole genome sequences of prokaryotes, predict potential CRISPR-Cas systems, and then validate them experimentally. This thesis conducts detailed theoretical research and system implementation on the prediction of two important components of the CRISPR-Cas system (the CRISPR array and the Cas genes). The main contributions are as follows:

(1) A new CRISPR array prediction method, named CRISPR-F, is proposed. The method identifies target sequences by locally aligning a search sequence against the sequence in the search range. To make the alignment more biologically meaningful while keeping the repeats highly conserved, at most two base sites are allowed to mutate during local pairwise sequence alignment. Because mutations are permitted, mismatching bases at the left end of an aligned target sequence are, ideally, regarded as part of the spacer. These mismatching bases are therefore removed, and matching bases are added at the right end, up to the maximum DR length. After multiple sequence alignment across the different sequence groups yields the core sequence group, base sites that occur with a frequency of at least 75% are added on both sides, and both sides are checked for degenerate repeats. Finally, after the constraint conditions are checked, the final CRISPR array is obtained. The direction of the CRISPR array is predicted from the AT content of its two flanking regions. Compared with CRT, which treats the entire sequence purely as a character string, CRISPR-F predicts repeats that are more biologically meaningful, because it fully accounts for substitutions, insertions, and deletions in repeat mutations.

(2) A novel Cas gene prediction method, named Cas-F, is proposed. The method is an improved Stacking model fusion approach. First, features are extracted from protein sequences from three perspectives: Pseudo-Amino Acid Composition, Composition of K-Spaced Amino Acid Pairs, and Position-Specific Scoring Matrix, and three SVM base learners are built, one per feature set. Each SVM model is trained with ten-fold cross-validation. The prediction results of the three SVM base learners are then combined into a new feature vector, on which a new SVM model is trained. The output of this final SVM model determines whether a protein sequence is a Cas protein sequence. Once the model identifies a protein sequence as a Cas protein sequence, a sequence similarity search against protein profiles is performed with the HMMER program to determine the Cas type. Experiments show that the cross-validation accuracies of the pseudo-amino acid composition model, the k-spaced amino acid pair composition model, the position-specific scoring matrix model, and the fused SVM model are 69.42%, 82.43%, 80.09%, and 83.78%, respectively.

(3) A CRISPR-Cas system prediction and visualization platform, named CRISPRCasFV, is constructed. Built on CrisprVi, the platform adds the CRISPR-F and Cas-F functions and provides visualization for these two new modules. Test results indicate that users only need to upload the whole genome sequence file to be analyzed (FASTA format by default), and CRISPRCasFV displays the prediction results of CRISPR-F and Cas-F in detail in their respective visualization interfaces. The CRISPR-F visualization interface displays the start and end sites of repeats and spacers, the sequences themselves, their lengths, and the array direction. The Cas-F visualization interface displays the Cas type, direction, start site, end site, DNA sequence, and protein sequence.

In summary, based on a systematic study of existing CRISPR-Cas system prediction methods, this thesis proposes and implements prediction methods for CRISPR arrays and Cas genes, and integrates them into the CRISPR-Cas system prediction and visualization platform CRISPRCasFV to support further research on CRISPR-Cas systems.
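To make two of the CRISPR array steps in contribution (1) concrete, the following is a minimal Python sketch and not the CRISPR-F implementation. It assumes a simplified comparison that counts substitutions only (no insertions or deletions), and it assumes that the flank with the higher AT content is the leader side, which the abstract does not state. The function names, the `flank_len` parameter, and the "forward"/"reverse" labels are hypothetical.

```python
from typing import List, Tuple

MAX_MISMATCHES = 2  # the abstract allows at most two mutated base sites


def mismatches(a: str, b: str) -> int:
    """Count substituted positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def find_repeat_hits(genome: str, seed_start: int, repeat_len: int,
                     search_from: int, search_to: int) -> List[Tuple[int, int]]:
    """Return (position, mismatch count) for windows that resemble the seed.

    Simplification (assumption): only substitutions are scored, whereas a
    real local alignment would also handle insertions and deletions.
    """
    seed = genome[seed_start:seed_start + repeat_len]
    last = min(search_to, len(genome) - repeat_len)
    hits = []
    for pos in range(search_from, last + 1):
        d = mismatches(seed, genome[pos:pos + repeat_len])
        if d <= MAX_MISMATCHES:
            hits.append((pos, d))
    return hits


def at_content(seq: str) -> float:
    """Fraction of A/T bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq.upper() if base in "AT") / len(seq)


def guess_orientation(genome: str, array_start: int, array_end: int,
                      flank_len: int = 50) -> str:
    """Compare the AT content of the two flanks of a predicted array.

    Assumption: the AT-richer flank is taken to be the leader side.
    """
    left = genome[max(0, array_start - flank_len):array_start]
    right = genome[array_end:array_end + flank_len]
    return "forward" if at_content(left) >= at_content(right) else "reverse"


# Toy sequence: AT-rich flank, repeat, spacer, repeat with one substitution,
# GC-rich flank.
genome = "AAATTTATAT" + "GTTTCAGACC" + "ACGTGCACGG" + "GTTTCAGTCC" + "CGCGGCGCGC"
hits = find_repeat_hits(genome, seed_start=10, repeat_len=10,
                        search_from=20, search_to=40)
orientation = guess_orientation(genome, array_start=10, array_end=40)
```

In this toy input the second repeat copy starts at position 30 and differs from the seed at one site, so it falls within the two-mismatch tolerance.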
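One of the three feature sets in contribution (2), Composition of K-Spaced Amino Acid Pairs, can be sketched as follows. This is an illustrative Python sketch of the standard definition, not the Cas-F code: for each gap k, it counts residue pairs separated by k positions and normalizes by the number of windows. The function name `cksaap` and the default `max_gap` are assumptions, since the abstract does not give the gap range used.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order shared by every gap k.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(seq: str, max_gap: int = 2) -> List[float]:
    """Return 400 * (max_gap + 1) pair frequencies for a protein sequence.

    For each gap k in 0..max_gap, the pair (seq[i], seq[i + k + 1]) is
    counted for every valid i, then divided by the number of such windows.
    Pairs containing non-standard residues are skipped.
    """
    seq = seq.upper()
    features: List[float] = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(seq) - k - 1
        for i in range(max(n_windows, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        denom = n_windows if n_windows > 0 else 1  # avoid division by zero
        features.extend(counts[p] / denom for p in PAIRS)
    return features


vec = cksaap("ACAC", max_gap=1)
```

Each such vector would then feed one base learner; how Cas-F combines the base-learner outputs beyond what the abstract describes is not shown here.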
Keywords/Search Tags: CRISPR-Cas system, CRISPR prediction, Cas prediction, visualization