
Design and Implementation of a Protein-Protein Interaction Extraction System Based on MapReduce

Posted on: 2017-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: F Gao
GTID: 2180330485980594
Subject: Agricultural informatization
Abstract/Summary:
With the rapid development of biomedical technology, the volume of biomedical literature is also growing rapidly. Extracting information from this literature has become an active research topic in text mining. Proteins play an important role in life activities, and text mining can quickly extract protein-protein interaction information from the biomedical literature to support biomedical experts. In recent years, the development of big data technology has provided new methods and ideas for biomedical information extraction. This study uses a feature-vector approach to extract protein-protein interaction information. The process is as follows.

First, a protein-protein interaction extraction system is constructed. Machine learning methods are used to preprocess the training data; the preprocessing mainly consists of tokenization, part-of-speech tagging, shallow syntactic analysis, etc. Verb, lexical, chunk, and phrase features, among others, are extracted from the preprocessed data and combined into a feature vector. LIBSVM is used to test the extraction system, and the experimental results show that the system performs well.

Second, the extraction process on MapReduce is divided into a Map stage and a Reduce stage. The Map stage performs protein named entity recognition, text preprocessing, feature extraction, and feature vector construction on the test corpus. Its output key is a protein relationship instance, and its output value is the feature vector corresponding to that instance. The Reduce stage converts the values received from the Map stage into a format that LIBSVM can read, loads the classification model, and uses the model to classify each instance. Its output key is the key it received as input, and its output value is null.
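The Map and Reduce stages described above can be sketched roughly as follows. This is a minimal single-process illustration in Python, not the thesis implementation: the protein dictionary, the verb list, the toy features, and the function names `map_stage`, `to_libsvm_line`, `toy_classifier`, and `reduce_stage` are all assumptions made for the example. The only detail taken from an external source is the sparse LIBSVM text format, `label index:value ...`, with 1-based indices.

```python
from itertools import combinations

# Hypothetical stand-ins: the abstract does not name the entity recognizer,
# the exact features, or the trained model, so these are placeholders.
KNOWN_PROTEINS = {"ProtA", "ProtB", "ProtC"}
INTERACTION_VERBS = ["binds", "interacts", "activates", "inhibits"]


def map_stage(sentence):
    """Emit (relationship instance, feature vector) pairs for one sentence."""
    tokens = sentence.replace(".", "").split()
    proteins = [t for t in tokens if t in KNOWN_PROTEINS]
    for a, b in combinations(proteins, 2):
        # Toy features: one indicator per interaction verb, plus token distance.
        features = [1.0 if v in tokens else 0.0 for v in INTERACTION_VERBS]
        features.append(float(abs(tokens.index(a) - tokens.index(b))))
        yield (a, b), features


def to_libsvm_line(features, label=0):
    """Format a dense vector as a sparse LIBSVM line: 'label index:value ...'."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0.0]
    return " ".join([str(label)] + pairs)


def toy_classifier(libsvm_line):
    """Placeholder for the loaded model: positive if any verb feature is set."""
    indices = {int(p.split(":")[0]) for p in libsvm_line.split()[1:]}
    return any(i <= len(INTERACTION_VERBS) for i in indices)


def reduce_stage(key, values):
    """Emit (key, None) once if any instance for this key is classified positive."""
    for features in values:
        if toy_classifier(to_libsvm_line(features)):
            yield key, None
            break


if __name__ == "__main__":
    for sentence in ["ProtA binds ProtB.", "ProtC and ProtB were measured."]:
        for key, vector in map_stage(sentence):
            print(key, to_libsvm_line(vector), list(reduce_stage(key, [vector])))
```

In a real Hadoop job the framework would group the Map output by key before calling the reducer, and the placeholder classifier would be replaced by a prediction from the trained LIBSVM model.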
The experimental results show that, for large volumes of biomedical text, extracting protein-protein interaction information on MapReduce saves considerable processing time compared with extraction on a single machine.

Lastly, the extracted protein-protein interaction information is used to construct the system. The system is built with the Struts2 framework, JSP technology, and the JUNG network visualization toolkit. It implements protein-protein interaction information retrieval, text processing, protein-protein interaction network visualization, etc. With this system, protein-protein interaction information can be retrieved quickly and displayed more intuitively as a network diagram.
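As a rough illustration of the retrieval and network view described above, extracted interaction pairs can be held in an undirected adjacency map and queried by protein name. The sketch below is in Python with only the standard library; it does not use JUNG or Struts2, and the function names `build_network` and `partners` and the protein names are hypothetical placeholders.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected adjacency map from extracted interaction pairs."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return network


def partners(network, protein):
    """Return the sorted interaction partners of one protein."""
    return sorted(network.get(protein, set()))


if __name__ == "__main__":
    # Placeholder pairs standing in for the classifier's positive instances.
    extracted = [("ProtA", "ProtB"), ("ProtB", "ProtC")]
    net = build_network(extracted)
    print(partners(net, "ProtB"))
```

A visualization layer would draw each key of the map as a node and each stored pair as an edge.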
Keywords/Search Tags: protein-protein interaction, information extraction, Hadoop, MapReduce