Protein-protein Interaction Identification Based On Weak Supervision

Posted on: 2019-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Peng
Full Text: PDF
GTID: 2370330596950393
Subject: Software engineering
Abstract/Summary:
Information on protein-protein interaction (PPI) is of great significance for biological and pharmaceutical research. Most of the PPI information discovered so far in biomedicine resides in the biological literature, where it is preserved as unstructured text. Biologists have tried to identify PPI information manually and store it in relational databases, but as the literature has proliferated, manual identification can no longer keep up with actual needs. Automatically identifying PPI information in the literature has therefore become a new research task. At present, PPI recognition relies mainly on supervised machine learning, which depends on large, high-quality text collections annotated with PPI information, and constructing such collections requires a great deal of manpower and time. To avoid these problems, this paper proposes a PPI identification method based on weak supervision, which needs only a small amount of annotated information. The work covers the following three aspects.

Firstly, this paper puts forward a weakly supervised PPI identification method that uses sentences as clues. The method works on single sentences: it clusters the contexts in which protein relationships are described, extracts patterns that describe interaction relationships, and uses those patterns to decide whether an interaction holds. The experimental results show that this weakly supervised method obtains good identification results.

Secondly, building on the weakly supervised identification method, this paper selects feature words that describe the interaction relationship. Two feature selection methods are adopted: one based on word vectors, and one based on word vectors combined with high-frequency words. Experiments with these methods show that the feature words obtained by word-vector-based feature selection are the most helpful for identifying PPI information. The F-score of the best identification result was 2.2% higher than without feature selection.

Thirdly, on top of the sentence-level clues (i.e., the weakly supervised PPI identification described above), this paper introduces signature-level clues and obtains a combined model for identifying PPI information. The experimental results show that, once signature similarity is added to the weakly supervised PPI identification method, the F-scores of the identification results are higher and the results are more stable under the same score threshold for protein pairs.
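
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of sentence-level, pattern-based identification with a score threshold on protein pairs. Everything here is an assumption made for illustration: the function names, the placeholder protein names (`PROT_A` and so on), and the simplification that "patterns" are single words recurring between two protein mentions in a small annotated seed set. The clustering step, the word-vector feature selection, and the signature-level clues are not modeled.

```python
from collections import Counter, defaultdict


def context_between(tokens, a, b):
    """Lowercased tokens strictly between the first mentions of proteins a and b."""
    i, j = tokens.index(a), tokens.index(b)
    lo, hi = sorted((i, j))
    return [t.lower() for t in tokens[lo + 1:hi]]


def learn_patterns(seed_sentences, min_count=2):
    """Hypothetical pattern extraction: keep words that recur between
    interacting protein pairs in a small annotated seed set."""
    counts = Counter()
    for tokens, a, b in seed_sentences:
        counts.update(set(context_between(tokens, a, b)))
    return {word for word, c in counts.items() if c >= min_count}


def score_pairs(sentences, patterns):
    """Score each protein pair by the fraction of its sentences matching a pattern."""
    hits, total = defaultdict(int), defaultdict(int)
    for tokens, a, b in sentences:
        pair = tuple(sorted((a, b)))
        total[pair] += 1
        if patterns & set(context_between(tokens, a, b)):
            hits[pair] += 1
    return {pair: hits[pair] / total[pair] for pair in total}


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over sets of pairs."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data with placeholder protein names (not real annotations).
seed = [
    ("PROT_A binds PROT_B".split(), "PROT_A", "PROT_B"),
    ("PROT_C binds PROT_D".split(), "PROT_C", "PROT_D"),
]
unlabeled = [
    ("PROT_E binds PROT_F".split(), "PROT_E", "PROT_F"),
    ("PROT_E and PROT_F".split(), "PROT_E", "PROT_F"),
    ("PROT_G or PROT_H".split(), "PROT_G", "PROT_H"),
]

patterns = learn_patterns(seed)
scores = score_pairs(unlabeled, patterns)
threshold = 0.5  # assumed score threshold for accepting a protein pair
predicted = {pair for pair, s in scores.items() if s >= threshold}
```

In this sketch, raising `threshold` trades recall for precision, which is why the abstract's comparison of F-scores "under the same score threshold" matters when two models are compared.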
Keywords/Search Tags: protein-protein interaction, weak supervision, clustering algorithm, pattern extraction, feature selection, text similarity