Information Extraction For Evidence Based Medicine Using Natural Language Processing

Posted on: 2012-06-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y X Lu
Full Text: PDF
GTID: 1114330371465442
Subject: Medical informatics
Abstract/Summary:
Background: Epidemiology is the study of factors affecting disease in populations. Much of the information from epidemiologic studies resides in the biomedical literature, but not in a computable format. Traditionally, evidence-based medicine relies on manual reading of the epidemiologic literature to extract information. This process is time-consuming and costly, especially given the exponential growth in the number of epidemiologic articles. To develop automated methods for extracting information and building knowledge bases for evidence-based medicine, we investigated rule-based and machine learning (ML) approaches to extracting exposure and outcome terms from biomedical literature.

Methods: We developed two automated systems to extract noun phrases describing epidemiologic exposures and outcomes from biomedical literature. In our initial study, we developed a system called DEEL (Detection of Epidemiologic Exposures from Literature), which consists of a natural language processing (NLP) engine and a rule-based classifier, to automatically extract exposure-related terms from epidemiologic articles. We then developed a second system, which also consists of two components: 1) an NLP engine that identifies noun phrases and collects their contextual information; 2) an ML-based classifier that assigns each noun phrase to a category (exposure, outcome, or non-related term) using the information extracted by the NLP engine. Four ML algorithms (Naive Bayes, Decision Tree, SVM, and Logistic Regression) were applied and compared over different feature sets, such as neighborhood words and the semantic types of words.

Results: The evaluation of DEEL on titles annotated by an epidemiologist showed a highest F-measure of 64.6% (precision 61.0%, recall 68.8%) using inexact matching, which indicates the feasibility of automated methods for mining epidemiologic literature. We further analyzed terms related to epidemiologic exposures; the results suggested that although UMLS would provide reasonable coverage, more appropriate semantic classifications of epidemiologic exposures would be required. To evaluate the performance of the ML-based classifier, we manually constructed an annotated dataset consisting of 1,600 titles of articles from the American Journal of Epidemiology (AJE). The system achieved a highest F-measure of 82.0% (precision 83.0%, recall 81.1%) for extracting exposure terms, and 70.0% (precision 75.5%, recall 65.2%) for extracting outcome terms.

Conclusions: In this study, we developed a rule-based classifier and an ML-based classifier to automatically identify terms related to epidemiologic exposures and outcomes in biomedical literature with reasonable performance. The ML methods outperformed prior work that used a rule-based system. These methods may be helpful for developing automated knowledge extraction systems for epidemiologic studies.
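As a reading aid for the reported scores, the following minimal Python sketch recomputes the F-measure from each precision and recall pair given in the abstract. It assumes the abstract reports the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_measure` and the dictionary labels are illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-measure %) as stated in the abstract.
# The labels are illustrative, not names used by the author.
reported = {
    "DEEL, exposure terms (rule-based)": (61.0, 68.8, 64.6),
    "ML classifier, exposure terms": (83.0, 81.1, 82.0),
    "ML classifier, outcome terms": (75.5, 65.2, 70.0),
}

for label, (p, r, f_reported) in reported.items():
    print(f"{label}: recomputed F = {f_measure(p, r):.2f}%, reported F = {f_reported}%")
```

Under this assumption the two ML figures are reproduced to one decimal place. The DEEL value recomputes to just under 64.7%, slightly above the reported 64.6%; a gap of that size is consistent with precision and recall having been rounded before publication.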
Keywords/Search Tags: Natural Language Processing (NLP), Information Extraction (IE), Named Entity Recognition (NER), Evidence Based Medicine (EBM), Epidemiology, Machine Learning (ML), Data Mining, Biomedical Literature, Epidemiologic exposure, Epidemiologic outcome