
Predicting Carbon-Fixing Proteins In Algae With Integrative Sequence-based Features By Machine Learning Methods

Posted on: 2022-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: G Zhang
Full Text: PDF
GTID: 2480306311991429
Subject: Biomedical engineering
Abstract/Summary:
Proteins are the main carriers of life activities, and protein sequences determine the functions and properties of organisms. Predicting protein function can reveal essential phenomena of life and physiology, so the sequence-based exploration of protein function is ongoing. The advent of the post-genome era has led to an explosion in the number of protein sequences. However, traditional experiments are time-consuming and expensive, and they cannot meet the demand for functional determination of such a large number of proteins. With the development of computer technology, computational modeling based on data mining and machine learning provides an effective alternative for studying the function of biological sequences.

Photosynthetic carbon fixation by algae plays an important role in the ocean carbon cycle. By absorbing, converting and utilizing carbon dioxide and other greenhouse gases, it can help slow the global warming trend. Algal carbon fixation is conducive to the balance between economy and environment and is in line with the strategy of sustainable development. Accurately predicting algal carbon-fixing proteins is therefore of great significance for studying the mechanism of algal carbon fixation at the molecular level. In this dissertation, a method is proposed for predicting algal carbon-fixing proteins based on machine learning and integrative features. The results show that this method has high accuracy.

Protein sequence data covering five phyla of algae were collected from the UniProt database. Because the positive and negative samples were imbalanced, the data set was resampled. When machine learning methods are used to predict the structure and function of biological sequences, extracting effective features is a key step. Therefore, four types of protein sequence-based features are used in this dissertation: functional groups, Shannon entropy, auto-cross covariance and K-mers. Together, these features cover sequence composition, the physical and chemical properties of amino acids, and both local and global sequence information. When each feature type was trained and tested on its own, the auto-cross covariance features performed better than the other three. However, no single feature type extracts comprehensive information from the sequence data, and integrating all the features improved the prediction accuracy for algal carbon-fixing proteins. High-dimensional features, in turn, make modeling and computation more complicated and lead to the curse of dimensionality. Feature scoring and selection were therefore used to reduce the dimensionality, which achieved better prediction results. The integrated feature set has 439 dimensions in total, of which 44 were retained after feature selection.

Six machine learning methods are used in the dissertation: the K-nearest neighbor (KNN) algorithm, the Naive Bayes (NB) algorithm, the neural network (NN) algorithm, the random forest (RF) algorithm, the support vector machine (SVM) model and the XGBoost model. A variety of indicators were used to evaluate the results, and all six classifiers achieved high performance on the data set. To assess the experiments, statistical analysis and feature significance analysis were performed. In addition, multiple sequence alignments of algal carbon-fixing proteins were performed, and motifs closely related to the carbon-fixing function were extracted.

The results show that this method can effectively extract features and predict algal carbon-fixing proteins. It can lay a theoretical foundation for protein engineering and genetic engineering of algal carbon fixation, and it provides a new way to study the problem at the molecular level using advanced information technology. It may also help to alleviate the negative impact of climate warming and to promote the harmonious development of nature and economy.
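
As an illustration of two of the four feature types named in the abstract, the following is a minimal sketch of how Shannon entropy and K-mer features could be computed from a single protein sequence. The abstract does not give the exact definitions used in the dissertation, so the function names, the use of residue composition for the entropy, and the default k = 2 are assumptions made here for illustration only.

```python
import math
from collections import Counter
from typing import Dict


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence.

    Assumption: entropy is taken over single-residue frequencies; the
    dissertation may define it differently.
    """
    n = len(seq)
    counts = Counter(seq)
    # Sum of -p * log2(p) over every residue type that occurs in seq.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def kmer_frequencies(seq: str, k: int = 2) -> Dict[str, float]:
    """Relative frequency of each overlapping k-mer observed in seq.

    Assumption: overlapping windows with step 1, normalised by the
    number of windows; k = 2 is an arbitrary illustrative default.
    """
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}


if __name__ == "__main__":
    example = "MKVLAAGIVGLK"  # hypothetical sequence, not from the data set
    print(shannon_entropy(example))
    print(kmer_frequencies(example, k=2))
```

In a full pipeline, the K-mer dictionary would typically be expanded into a fixed-length vector over all possible k-mers so that every sequence yields features of the same dimensionality before integration and feature selection.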
Keywords/Search Tags:Algal Carbon-Fixing Proteins, Machine Learning, Integrative Features, Feature Selection, Sequence Motifs