A statistical approach for information extraction of biological relationships

Posted on:2012-04-26

Degree:Ph.D

Type:Dissertation

University:The Florida State University

Candidate:Bell, Lindsey R

GTID:1458390008992796

Subject:Biology

Abstract/Summary:

Vast amounts of biomedical information are stored in the scientific literature and are easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but doing so consumes considerable time and resources. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); and then relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature.

Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets, where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework, we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian Networks, Support Vector Machines, and a mixture of logistic models defined by interaction word.

The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross-validation and cross-corpus validation to replicate an application scenario. The three classifiers are distinct, and we find that the performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
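
The triplet structure and the idea of combining several classifiers can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not state how the ensemble combines its members, so simple majority voting is used as a stand-in, and the names `Triplet`, `majority_vote`, and the three toy rule functions are hypothetical placeholders rather than the Bayesian Network, Support Vector Machine, or logistic mixture models of the dissertation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Triplet:
    """Two biological entities joined by one interaction word (hypothetical layout)."""
    entity_a: str
    interaction: str
    entity_b: str


# A classifier maps a candidate triplet to True (real relationship) or False.
Classifier = Callable[[Triplet], bool]


def majority_vote(classifiers: Sequence[Classifier], triplet: Triplet) -> bool:
    """Accept the triplet when more than half of the classifiers accept it.

    Majority voting is an assumed combination rule, not the author's method.
    """
    votes = [clf(triplet) for clf in classifiers]
    return sum(votes) * 2 > len(votes)


# Toy stand-ins for trained models (assumptions, for illustration only).
def knows_interaction_word(t: Triplet) -> bool:
    return t.interaction in {"binds", "inhibits", "activates"}


def entities_differ(t: Triplet) -> bool:
    return t.entity_a.lower() != t.entity_b.lower()


def always_reject(t: Triplet) -> bool:
    return False


ENSEMBLE = [knows_interaction_word, entities_differ, always_reject]

if __name__ == "__main__":
    candidate = Triplet("p53", "binds", "MDM2")
    print(candidate, majority_vote(ENSEMBLE, candidate))
```

In this sketch, no single rule decides the outcome on its own, which mirrors the motivation stated above: when individual classifiers perform differently across corpora, combining them avoids committing to one.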

Keywords/Search Tags:

Information, Biological, Classifiers, Literature, Approach, Terms

Related items

1	Extended SBN Retrieval Model Based On Ontology Terms Relationship
2	Study On The Network Intrusion Detection Approach Based On Multiple Classifiers Combination
3	The Development And Research Of Electric Power Information Monitoring Management System
4	Research And Application Of MeSH-based Literature Mining Method For Exploring Associations Between Genes And Clinical Terms Of Colorectal Cancer
5	A machine learning approach to automate classification of literature in a SAM research database
6	Cluster-based Query Expansion Using Language Modeling for Biomedical Literature Retrieval
7	Bank Repayment Prediction-System On Deep Learning Techniques
8	Literature Information And Non-documentary Information Cross-referencing To Research
9	On Suppressing Cross-terms In WVD Via Thresholding Superimposition Of Multiple Spectrograms
10	The impact of MeSH (medical subject headings) terms on information seeking effectiveness