Biomedical information extraction: Mining disease associated genes from literature

Posted on:2015-06-22

Degree:Ph.D.

Type:Thesis

University:Drexel University

Candidate:Huang, Zhong

GTID:2478390020950746

Subject:Information Science

Abstract/Summary:

Discovering disease-associated genes is a critical step toward realizing personalized medicine. However, empirical and clinical validation of disease-associated genes is time-consuming and expensive. In silico discovery of disease-associated genes from the literature is therefore becoming an essential first step in biomarker discovery, supporting hypothesis formulation and decision making. The completion of the Human Genome Project and the advent of high-throughput technologies have produced a tremendous amount of data, resulting in exponential growth of the biomedical knowledge deposited in literature databases. The sheer quantity of unexplored information causes information overload for biomedical researchers and poses a major challenge for informatics researchers seeking to address users' information extraction needs. This thesis focuses on mining disease-associated genes from the PubMed literature database using information extraction (IE) methods based on machine learning and graph theory. Mining disease-associated genes is not trivial and requires a pipeline of information extraction steps and methods. Beginning with named entity recognition (NER), the author introduced semantic concept type into the feature space for conditional random fields (CRF) learning and demonstrated the effectiveness of the concept feature for disease NER. The effects of domain-specific part-of-speech (POS) tagging, domain-specific dictionaries, and the named entity encoding scheme on NER performance were also explored. Experimental results show that combining a knowledge base with the concept feature space significantly improves overall disease NER performance. The thesis also shows that shallow linguistic features of global and local word sequence context can be used with a string kernel based support vector machine (SVM) for efficient disease-gene relation extraction.
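As an illustration of the NER step described above, the following minimal Python sketch shows how a semantic concept type can be added to per-token features alongside a B/I/O entity encoding, which is the kind of input a conditional random fields tagger consumes. It is a sketch under stated assumptions, not the author's implementation: the lexicon `CONCEPT_LEXICON`, the feature names, and the helper functions `bio_encode` and `token_features` are all hypothetical, and the abstract does not say which knowledge base or encoding scheme variants were used.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping lowercase tokens to a semantic concept type.
# The abstract does not name the knowledge base; these entries are illustrative.
CONCEPT_LEXICON: Dict[str, str] = {
    "breast": "BodyPart",
    "cancer": "Disease",
    "brca1": "Gene",
}


def bio_encode(tokens: List[str], spans: List[Tuple[int, int]], label: str) -> List[str]:
    """Tag tokens with B-/I-/O labels; each span is a [start, end) token range."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build a feature dict for token i: surface form, local context, concept type."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": str(word.istitle()),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # The concept feature: the semantic type looked up in the lexicon.
        "concept": CONCEPT_LEXICON.get(word.lower(), "NONE"),
    }


tokens = ["BRCA1", "mutations", "increase", "breast", "cancer", "risk"]
tags = bio_encode(tokens, [(3, 5)], "DISEASE")  # "breast cancer" is the disease mention
features = [token_features(tokens, i) for i in range(len(tokens))]
```

The `features` and `tags` lists are parallel sequences, one entry per token, which is the shape a sequence labeler would typically be trained on.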
Lastly, a disease-associated gene network was constructed from a concept co-occurrence matrix computed over a disease-focused document collection and subjected to systematic topology analysis. The gene network was then merged with a seed-gene expanded network to form a heterogeneous disease-gene network. The author identified and prioritized disease-associated genes using graph centrality measures. This novel approach provides a new means of extracting disease-associated genes from large corpora.
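The network step can be sketched in a few lines of Python: count how often pairs of gene concepts co-occur in the same document, treat each co-occurring pair as an edge, and rank genes by a centrality score. This is only a rough illustration under assumptions. The abstract does not say which centrality measures were used, so degree centrality stands in here; the toy documents, the `min_count` threshold, and the function names `cooccurrence` and `degree_centrality` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set


def cooccurrence(docs: List[Set[str]]) -> Counter:
    """Count, for each unordered gene pair, the documents mentioning both genes."""
    counts: Counter = Counter()
    for genes in docs:
        for a, b in combinations(sorted(genes), 2):
            counts[(a, b)] += 1
    return counts


def degree_centrality(counts: Counter, min_count: int = 1) -> Dict[str, float]:
    """Degree centrality on the graph whose edges are pairs seen >= min_count times."""
    neighbors: Dict[str, Set[str]] = {}
    for (a, b), n in counts.items():
        if n >= min_count:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    # Normalize by the maximum possible degree (number of nodes minus one).
    denom = max(len(neighbors) - 1, 1)
    return {gene: len(nb) / denom for gene, nb in neighbors.items()}


# Toy document collection: each set holds the gene concepts found in one document.
docs = [
    {"BRCA1", "TP53"},
    {"BRCA1", "BRCA2"},
    {"BRCA1", "TP53", "ATM"},
    {"BRCA2", "ATM"},
]
counts = cooccurrence(docs)
scores = degree_centrality(counts)
# Rank genes by descending centrality, breaking ties alphabetically.
ranked = sorted(scores, key=lambda g: (-scores[g], g))
```

Raising `min_count` drops weakly supported edges, which is one simple way to trade recall for precision before ranking.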

Keywords/Search Tags:

Disease associated, Extraction, Literature, Biomedical, NER

Related items

1	Research On Disease Name Recognition And Disease Normalization In Biomedical Literature
2	Chemical-Disease Relation Recognition Based On Biomedical Literature Mining
3	Research On Mutation-disease Relation Extraction From Biomedical Literature
4	Design And Implementation Of Biomedical Literature Analysis System
5	The Study Of Text-Mining Based Biomedical Entity Relation Extraction
6	Using automated extraction from the medical record to access biomedical literature
7	Research On The Association Between Disease And Drug Based On Scientific Literature Mining
8	Research On Entity Relation Extraction From Biomedical Text
9	Research On Automated Biomedical Relation Extraction From Bio-literature
10	Exploring machine learning and text mining in information extraction using gene expression profiles and biomedical literature