
A statistical approach for information extraction of biological relationships

Posted on: 2012-04-26
Degree: Ph.D
Type: Dissertation
University: The Florida State University
Candidate: Bell, Lindsey R
Full Text: PDF
GTID: 1458390008992796
Subject: Biology
Abstract/Summary:
Vast amounts of biomedical information are stored in scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but this is time- and resource-intensive. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR). Next, important concepts and terms are flagged using entity recognition (ER). Then, relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature.

Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets, where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework, we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian Networks, Support Vector Machines, and a mixture of logistic models defined by interaction word.

The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross-validation and cross-corpus validation, the latter to replicate an application scenario. The three classifiers differ from one another, and we find that the performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
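
To make the triplet framing concrete, the following is a minimal Python sketch and not the dissertation's implementation. It assumes that candidate triplets are formed by pairing every two distinct entity mentions in a sentence with every interaction word in that sentence, and that the ensemble combines classifier decisions by simple majority vote; the abstract does not state either detail. The names `Triplet`, `candidate_triplets`, and `majority_vote` are hypothetical, and the classifiers are represented only as arbitrary callables standing in for the Bayesian Network, Support Vector Machine, and logistic mixture models.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence, Set


@dataclass(frozen=True)
class Triplet:
    """Two biological entities and one interaction word (hypothetical container)."""
    entity_a: str
    interaction: str
    entity_b: str


def candidate_triplets(tokens: Sequence[str],
                       entities: Set[str],
                       interaction_words: Set[str]) -> List[Triplet]:
    """Build candidate triplets from one tokenized sentence.

    Assumption: every pair of distinct entity mentions is combined with
    every interaction word found in the same sentence.
    """
    # dict.fromkeys removes repeated mentions while keeping sentence order.
    found_entities = list(dict.fromkeys(t for t in tokens if t in entities))
    found_words = list(dict.fromkeys(t for t in tokens if t in interaction_words))
    return [Triplet(a, w, b)
            for a, b in combinations(found_entities, 2)
            for w in found_words]


def majority_vote(classifiers: Sequence[Callable[[Triplet], bool]],
                  triplet: Triplet) -> bool:
    """Accept a triplet when more than half of the classifiers accept it.

    Assumption: majority voting is used here only as the simplest way to
    combine classifiers; the dissertation may use a different rule.
    """
    votes = [bool(clf(triplet)) for clf in classifiers]
    return 2 * sum(votes) > len(votes)
```

In use, each callable would wrap a trained model that maps features of a candidate triplet to an accept or reject decision, so that no single classifier has to be chosen in advance for a given corpus.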
Keywords/Search Tags:Information, Biological, Classifiers, Literature, Approach, Terms