
Research On Essential Protein Recognition Based On Random Forest Algorithm

Posted on: 2020-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
Full Text: PDF
GTID: 2370330599462961
Subject: Agricultural informatization
Abstract/Summary:
Identifying the proteins that are essential to living organisms is important both for understanding how organisms evolve and for medicine. There are currently two ways to determine the importance of a protein. The first relies on biochemical methods, but identification through biological experiments has drawbacks: it is time-consuming, costly, and cannot cope with large volumes of data. The second uses computers as tools to analyze organisms and interprets the results with biologically relevant knowledge. Most computational methods identify important proteins by extracting topological metrics from the protein-protein interaction (PPI) network. However, because some of the underlying experimental data are incomplete and the protein network itself is complex, no single centrality metric has been found that accurately distinguishes key (essential) proteins from non-key proteins. Current research also indicates that the difference between key and non-key proteins cannot be captured by a single feature and should instead be determined by a combination of factors, so single centrality metrics often fail to identify key proteins effectively.

It is therefore necessary to integrate multiple topological centrality metrics, move beyond the traditional approach of ranking proteins by a score and selecting the top candidates, and establish a machine learning model for protein classification and recognition. Random forest is an ensemble algorithm: it combines multiple individual classifiers, that is, it aggregates the classifications of many decision trees into a single overall classifier. Previous studies classified proteins using a single feature, whereas a random forest aggregates multiple classifiers and therefore offers a clear advantage in classification performance. For this reason, this paper chooses the random forest machine learning method to identify protein importance. It analyzes the structure of the protein network, integrates multiple topological centrality measures, and builds a random forest model to study the identification of key proteins.

Budding yeast proteins were selected as the research object. The specific work includes cleaning the collected data, constructing a PPI network, selecting six centrality metrics for feature extraction, building a random forest model for identifying key proteins, and evaluating the experimental results with statistical indicators. The results show that the algorithm identifies key proteins accurately and quickly, eliminates interference factors such as false positives and redundancy, and has higher recognition ability than other algorithms. In summary, the paper proposes fusing multiple centrality metrics and shows that a protein importance prediction model built with the random forest algorithm can identify key proteins more effectively.
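The workflow described in the abstract (clean the interaction data, build the PPI network, extract centrality features, classify with a random forest) can be sketched roughly as follows. This is a minimal, standard-library illustration and not the thesis's implementation: the abstract does not name its six centrality metrics, so degree, closeness, and clustering coefficient are stand-ins chosen here, and the interaction list, protein names, labels, and all function names are hypothetical toy values.

```python
import random
from collections import Counter, deque

def build_graph(edges):
    """Undirected adjacency sets; self-interactions are dropped as a cleaning step."""
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def degree_centrality(adj, v):
    return len(adj[v]) / (len(adj) - 1)

def closeness_centrality(adj, v):
    dist, queue = {v: 0}, deque([v])
    while queue:  # breadth-first search for shortest path lengths
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def clustering_coefficient(adj, v):
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

def features(adj, v):
    """Feature vector of stand-in centralities (assumed, not the thesis's six)."""
    return [degree_centrality(adj, v), closeness_centrality(adj, v),
            clustering_coefficient(adj, v)]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, n_feats, rng, depth=0, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf node
    best = None
    for f in rng.sample(range(len(rows[0])), n_feats):  # random feature subset
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    grow = lambda idx: build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                  n_feats, rng, depth + 1, max_depth)
    return (f, t, grow(left), grow(right))

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are (feature, threshold, left, right)
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_feats = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 n_feats, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Toy interaction list and labels: hypothetical, for illustration only.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"),
         ("P4", "P5"), ("P5", "P6"), ("P1", "P1")]
adj = build_graph(edges)
proteins = sorted(adj)
X = [features(adj, p) for p in proteins]
y = [1, 1, 1, 0, 0, 0]  # 1 = key protein, 0 = non-key (made-up labels)
forest = train_forest(X, y)
predictions = {p: predict_forest(forest, x) for p, x in zip(proteins, X)}
```

In practice one would more likely use an established graph library and an off-the-shelf random forest, train on known essentiality annotations, and evaluate on held-out proteins with statistical indicators, as the abstract describes. The hand-written forest above only makes the bootstrap sampling, random feature selection, and majority vote explicit.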
Keywords/Search Tags: Protein interaction network, Key protein, Machine learning, Random forest