
Predicting Anti-CRISPR Proteins By Machine Learning

Posted on: 2022-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: M L Liu
GTID: 2480306764969229
Subject: Automation Technology
Abstract/Summary:
CRISPR-Cas, as a gene-editing tool, has been widely studied in recent years. Anti-CRISPR proteins are widely present in bacteria, archaea, and viruses. They inactivate the CRISPR-Cas defense system during the interference phase, so they are a potential tool for regulating gene editing, and in-depth study of their properties and functions is of great significance for implementing gene editing. However, research on anti-CRISPR proteins remains limited. Very few anti-CRISPR proteins are known, which is one of the factors limiting further research, and finding new ones by experimental methods is time-consuming. Using machine learning to build a powerful prediction model can help address this problem. To build a high-quality dataset, raw sequences were extracted from anti-CRISPRdb and from a unified resource for tracking anti-CRISPR names. CD-HIT was then applied to remove redundant samples, and down-sampling was used to obtain a balanced benchmark dataset. Six different kinds of features were each used to encode the sequences, and the features were further selected by analysis of variance (ANOVA) and incremental feature selection. For each feature type, a separate support vector machine model was constructed on its optimal feature subset. To improve accuracy, the six models were combined in different ways and the best combination was selected. The accuracy and AUC of the final model on the test dataset are 88.1% and 0.952, respectively. To date, this is the best prediction model. The model was used to analyze 11 new anti-CRISPR proteins, and 10 of them were correctly identified, indicating that the model has strong generalization ability. To discover new anti-CRISPR proteins, all virus proteins from GenBank were scored by the model. Users can obtain potential anti-CRISPR proteins according to different thresholds: the higher the threshold, the fewer proteins are returned and the more reliable the predictions; conversely, the more proteins returned, the lower the credibility. Finally, an online website was established, which includes the dataset constructed in this thesis and the anti-CRISPR proteins predicted from viruses. It also provides a prediction service that takes a protein sequence and a position-specific scoring matrix (PSSM) as input. The address of the website is http://lin-group.cn/server/AcrPred.
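The feature-selection step described in the abstract (ANOVA ranking followed by incremental feature selection) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names are hypothetical, the toy data is invented for illustration, and a leave-one-out nearest-centroid classifier stands in for the support vector machine so that the example runs with the Python standard library alone.

```python
from statistics import mean


def anova_f_scores(X, y):
    """One-way ANOVA F-statistic for each feature column of X given labels y."""
    n, classes = len(X), sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        grand = mean(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            m = mean(vals)
            between += len(vals) * (m - grand) ** 2   # between-class sum of squares
            within += sum((v - m) ** 2 for v in vals)  # within-class sum of squares
        if within == 0:
            # Degenerate column: no within-class variance.
            scores.append(float("inf") if between > 0 else 0.0)
        else:
            scores.append((between / (len(classes) - 1)) / (within / (n - len(classes))))
    return scores


def centroid_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: this is only a stand-in for the SVM used in the thesis.
    Every class must have at least two samples.
    """
    classes, correct = sorted(set(y)), 0
    for i in range(len(X)):
        centroids = {}
        for c in classes:
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == c]
            centroids[c] = [mean(r[j] for r in rows) for j in features]
        point = [X[i][j] for j in features]
        pred = min(classes, key=lambda c: sum((p - q) ** 2
                                              for p, q in zip(point, centroids[c])))
        correct += pred == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, evaluate=centroid_accuracy):
    """Add features in descending F-score order; keep the best-scoring prefix."""
    scores = anova_f_scores(X, y)
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranking) + 1):
        acc = evaluate(X, y, ranking[:k])
        if acc > best_acc:  # strict: ties keep the smaller subset
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc


# Toy data (invented): feature 0 separates the classes, feature 1 does not.
X_toy = [[0.0, 1.0], [0.2, 0.0], [0.1, 1.0], [0.3, 0.0],
         [2.0, 1.0], [2.2, 0.0], [2.1, 1.0], [2.3, 0.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
selected, accuracy = incremental_feature_selection(X_toy, y_toy)
print("selected feature indices:", selected, "accuracy:", accuracy)
```

In a setup like the one the abstract describes, the `evaluate` argument would presumably wrap cross-validated SVM accuracy instead, and the procedure would be run once per feature encoding.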
Keywords/Search Tags: Anti-CRISPR Protein, Sequence Prediction, Feature Encoding, Machine Learning, Website Service