Research On Automated Biomedical Relation Extraction From Bio-literature

Posted on: 2013-11-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H T Zhang
Full Text: PDF
GTID: 1228330392458290
Subject: Computer Science and Technology
Abstract/Summary:
The important relations between biomedical entities have attracted significant attention from biologists. However, an enormous number of biomedical relations are buried in the millions of biomedical research articles published over the years, and the number keeps growing. Rediscovering these relations from bio-literature automatically is a challenging bioinformatics task. In this thesis, we focus on some key issues in the biomedical relation extraction task, including the construction of the feature vector, the class imbalance problem in datasets, small labeled datasets, and the re-use of labeled datasets from different domains. The major contributions of this thesis are as follows:

(1) For the construction of the feature vector, we propose a compact feature vector. Its two major advantages are its rich features and its compact feature representations. Specifically, it integrates keyword features, part-of-speech features, syntactic features and lexical pattern features, which together capture much of the important information in complex biomedical text. It also represents these features compactly, which alleviates the sparse feature problem caused by integrating rich features.

(2) For the class imbalance problem in datasets, we propose two methods based on under-sampling: the adaptation based under-sampling method and the dynamic co-training based random under-sampling method. The former adaptively adjusts the classifier during the under-sampling process, while the latter performs under-sampling on an expanded sample space by integrating under-sampling and over-sampling. Both methods reduce the risk of losing critical samples when under-sampling is used to address the class imbalance problem.

(3) For small labeled datasets, we propose a unified active learning framework designed for the biomedical relation extraction task. In addition to the common data selection module, it integrates a diverse data selection module, an active feature acquisition module and an informative feature selection module. The experimental results show that the proposed framework effectively reduces the reliance on the size of labeled datasets.

(4) For the re-use of labeled datasets from different domains, we build a re-use framework based on transfer learning. Specifically, the framework includes an instance based transfer learning re-use method, a feature group based transfer learning re-use method, and a re-use method that integrates active learning and transfer learning. The first two methods accomplish the re-use of cross-domain labeled datasets by transfer learning at instance granularity and feature granularity, respectively, while the last method combines the advantages of active learning and transfer learning, in order to shed some light on more practical tasks.
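
As a point of reference for contribution (2), the plain random under-sampling baseline that both proposed methods build on can be sketched as follows. This is a minimal illustration only, not the thesis's adaptation based or co-training based method; the function name `random_undersample`, the toy sentences and the label coding (1 = relation, 0 = no relation) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Balance classes by randomly discarding samples from the larger classes.

    Hypothetical helper for illustration: every class is cut down to the
    size of the smallest class, which is exactly where critical samples
    can be lost.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original ordering of the retained samples

    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (invented): 2 positive and 6 negative candidate sentences.
sentences = [f"sentence {i}" for i in range(8)]
labels = [1, 0, 0, 0, 1, 0, 0, 0]

balanced_x, balanced_y = random_undersample(sentences, labels)
```

Because the discarded negatives are chosen uniformly at random, informative samples near the decision boundary may be dropped; this is the risk that the two proposed methods aim to reduce.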
Keywords/Search Tags: biomedical relation extraction, compact feature vector, under-sampling, active learning, transfer learning