
Biomedical Entity Relation Extraction Based On Semi-supervised Learning And Deep Learning

Posted on: 2017-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Feng
Full Text: PDF
GTID: 2348330488959716
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid growth of biomedical literature, information extraction (IE) from that literature has been studied extensively. Most work on information extraction from biomedical literature to date concerns relation extraction. In the biomedical domain, relation extraction mainly involves recognizing the names of biomedical entities (proteins, drugs, diseases, genes, etc.) and extracting the semantic relations between them. This thesis studies disease-symptom, symptom-therapeutic substance, and protein-protein relations, and proposes semi-supervised learning and deep learning methods to address, respectively, the lack of labeled data and the need for manual feature construction in entity relation extraction.

To address the lack of labeled data when extracting disease-symptom and symptom-therapeutic substance relations, this thesis proposes using two semi-supervised learning algorithms, Co-Training and Tri-Training, to build the disease-symptom model and the symptom-therapeutic substance model. During training, the feature kernel, graph kernel, and tree kernel serve as the input views of the Co-Training and Tri-Training methods. In the Tri-Training method, ensemble learning is used to integrate several classifiers. Experimental results show that both Co-Training and Tri-Training can exploit unlabeled data together with a few labeled examples to improve classification performance, and that Tri-Training outperforms Co-Training in the experiments.

Using semi-supervised learning methods to extract disease-symptom and symptom-therapeutic substance relations requires a large number of manually constructed features, and the quality of these features directly affects the experimental results. Moreover, constructing a large number of features is time-consuming and laborious.
To solve this problem, this thesis uses a convolutional neural network for disease-symptom and symptom-therapeutic substance relation extraction. This method automatically learns features from the corpus and acquires a feature hierarchy, which reduces the cost of manual feature construction. In addition, the Tri-Training method is used to expand the corpus. Experimental results show that the convolutional neural network achieves better results than Tri-Training.

Relation extraction based on semi-supervised learning has two problems. On the one hand, semi-supervised learning selects the unlabeled samples that the classifiers label consistently, which may lose some information. On the other hand, unlabeled samples may be labeled incorrectly when they are added to the training set. To solve these two problems, this thesis proposes an improved Tri-Training method for protein-protein interaction extraction (PPIE). This method selects the unlabeled samples that the three classifiers label inconsistently and uses active learning to label them. Experimental results show that, compared with other methods, this method achieves better performance, with an F-score of 68.80% on the AIMED corpus.
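
The sample-selection step of the improved Tri-Training method can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `split_by_agreement` and `label_with_oracle`, the modelling of classifiers as plain callables, and the `oracle` callable standing in for the active-learning annotator are all hypothetical. The kernels, the training of the three classifiers, and any criterion for ranking disagreements are not shown.

```python
from typing import Callable, Hashable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Label = Hashable
# Assumption: a trained classifier is modelled as a callable sample -> label.
Classifier = Callable[[Sample], Label]


def split_by_agreement(
    classifiers: Sequence[Classifier],
    unlabeled: Sequence[Sample],
) -> Tuple[List[Tuple[Sample, Label]], List[Sample]]:
    """Partition unlabeled samples by whether all classifiers agree.

    Returns (consistent, inconsistent):
      consistent   - (sample, label) pairs on which every classifier agrees,
                     i.e. the samples a plain agreement-based scheme would add.
      inconsistent - samples on which the classifiers disagree, i.e. the
                     candidates the improved method sends to active learning.
    """
    consistent: List[Tuple[Sample, Label]] = []
    inconsistent: List[Sample] = []
    for sample in unlabeled:
        votes = [clf(sample) for clf in classifiers]
        if len(set(votes)) == 1:
            consistent.append((sample, votes[0]))
        else:
            inconsistent.append(sample)
    return consistent, inconsistent


def label_with_oracle(
    samples: Sequence[Sample],
    oracle: Callable[[Sample], Label],
) -> List[Tuple[Sample, Label]]:
    """Label the disagreed-upon samples with an oracle.

    Assumption: `oracle` stands in for the active-learning annotator
    (for example, a human expert); it is hypothetical here.
    """
    return [(sample, oracle(sample)) for sample in samples]
```

In a training round built on this sketch, the pairs returned by `label_with_oracle` would be appended to the labeled set before the three classifiers are retrained. Querying an oracle only for the disagreed-upon samples targets both problems named above: those samples are no longer discarded, and they no longer enter the training set with an unverified label.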
Keywords/Search Tags: Information Extraction, Semi-supervised Learning, Unlabeled data, Convolutional Neural Network, Active Learning