Research On Associated Issues In Biomedical Text Mining Based On Discriminative Models

Posted on: 2009-09-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C J Sun
GTID: 1118360278461934
Subject: Computer application technology
Abstract/Summary:
With advances in computing technology and biotechnology, the volume of biomedical literature is growing at an unprecedented rate. This literature contains the latest research progress and a wealth of biomedical knowledge, both of which are vital to biomedical researchers. However, with tens of millions of publications, tracking and collating the necessary knowledge and information is becoming more and more difficult. Text mining technology can address this problem and make the use of biomedical literature more efficient, so research on text mining for biomedical literature has practical value.

Discriminative models are a class of machine learning models that use features directly to predict the probability of target variables. This thesis uses the conditional random fields model and the maximum entropy model. Compared with generative models, discriminative models do not require the assumption that features are independent, which is consistent with the requirements of many text mining tasks, so they are more likely to achieve good results.

This thesis studies how to use discriminative models to solve biomedical text mining problems. Concretely, we study three tasks: biomedical named entity recognition, biomedical named entity normalization, and biomedical semantic relation extraction. Of the three tasks, the second extends the first toward semantic processing, and the first and second are the basis for the third. The major contents of this thesis comprise the following four parts.

The goal of biomedical named entity recognition is to identify instances of named entities of specified categories in documents. It is a necessary step for deep text mining. After investigating the characteristics and difficulties of biomedical named entity recognition and analyzing the advantages and disadvantages of current methods, we propose using a conditional random fields model with rich feature sets to identify biomedical named entities. The feature sets include literal features, context features, and syntactic features. Among these, shallow syntactic features are introduced for the first time into a conditional random fields model that performs boundary detection and semantic labeling simultaneously, which effectively improves the model's performance.

Supervised machine learning methods need large annotated corpora. Un-annotated data is currently easy to obtain in the biomedical domain because of the huge amount of electronic literature, but corpus annotation remains expensive. To deal with the lack of a large-scale annotated biomedical named entity corpus, this thesis proposes a maximum entropy based co-training method. The method takes advantage of un-annotated data to improve the performance of classifiers trained on a small annotated corpus. An active learning strategy is also integrated to further improve the results of co-training. Experiments demonstrate the effectiveness of the proposed method.

The flexible nomenclature of biomedical named entities results in severe semantic ambiguity, which is an obstacle to deep biomedical text mining. Biomedical named entity normalization is an effective way to resolve this problem. Its goal is to correctly associate the named entities in documents with standard identifiers. In this thesis, a multi-level disambiguation framework is proposed to accomplish the normalization task. The framework includes three strategies aimed at the different ambiguity situations that arise during normalization: dictionary based named entity detection, machine learning based candidate selection, and knowledge based disambiguation. Experimental results on the test data of the BioCreAtIvE 2006 gene name normalization task show that the proposed framework can effectively resolve all kinds of ambiguity encountered during normalization.

Biomedical semantic relation extraction is one of the main research topics in biomedical text mining. It is an important means of extracting biomedical knowledge from biomedical literature. In practice, there are two kinds of relation definition: general and concrete. The general and concrete definitions are treated as binary classification and multi-way classification problems respectively, and the maximum entropy model is proposed to solve both. For a general relation definition, the Protein-Protein Interaction (PPI) relation, we propose a two-phase PPI relation extraction method based on the maximum entropy model. For a concrete relation definition, the multi-class PPI relation, we propose a method that uses the maximum entropy model with word features. On a 10-class PPI relation test set, the method achieves 73.4% accuracy. The same method is also applied to disease-treatment relation extraction and obtains good results. In addition, we show that discriminative models are more suitable than generative models for biomedical semantic relation extraction, both in theory and in practice.
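
As an illustration of the multi-way classification setting described above, the following is a minimal sketch of a maximum entropy (multinomial logistic) classifier over binary word features, written in plain Python. It is not the thesis's implementation: the toy sentences, the relation labels, the function names (featurize, predict_proba, train, classify) and the training settings (batch gradient ascent with a small L2 penalty) are all assumptions made for this example.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (sentence, relation label). Not from the thesis.
TOY = [
    ("protein A binds protein B", "bind"),
    ("protein C binds to protein D", "bind"),
    ("protein A phosphorylates protein B", "phosphorylate"),
    ("kinase E phosphorylates protein F", "phosphorylate"),
    ("protein G inhibits protein H", "inhibit"),
    ("drug I inhibits kinase J", "inhibit"),
]

def featurize(sentence):
    """Word features: one binary feature per lowercased token."""
    return set(sentence.lower().split())

def predict_proba(weights, labels, feats):
    """Maximum entropy form: p(y | x) = exp(sum_f w[f, y]) / Z(x)."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, epochs=200, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-regularized conditional log-likelihood."""
    labels = sorted({y for _, y in data})
    weights = defaultdict(float)
    examples = [(featurize(s), y) for s, y in data]
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in examples:
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Observed feature count minus expected count under the model.
                delta = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    grad[(f, y)] += delta
        for key, g in grad.items():
            weights[key] += lr * (g / len(examples) - l2 * weights[key])
    return weights, labels

def classify(weights, labels, sentence):
    """Return the most probable relation label for a sentence."""
    probs = predict_proba(weights, labels, featurize(sentence))
    return max(probs, key=probs.get)

weights, labels = train(TOY)
```

Calling classify(weights, labels, "protein X binds protein Y") returns the label whose weighted word features score highest. Because the model conditions directly on the observed words, overlapping or correlated features can be added without an independence assumption, which is the property of discriminative models that the abstract emphasizes.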
Keywords/Search Tags: biomedical text mining, named entity recognition, relation extraction, discriminative model, semi-supervised learning