
Automated classification of the narrative of medical reports using natural language processing

Posted on: 2012-03-15
Degree: Ph.D
Type: Dissertation
University: State University of New York at Albany
Candidate: Goldstein, Ira
GTID: 1458390008492653
Subject: Information Technology
Abstract/Summary:
In this dissertation we present three topics critical to the document-level classification of the narrative in medical reports: the use of preferred terminology in the presence of synonymous terms, the suboptimal performance of classification systems when the classes are not uniformly distributed, and the problems caused by a scarcity of labeled data when the classes in the data sets are imbalanced.

The literature contains conflicting reports on whether applying preferred terminology improves system performance when synonymous terms are present. Our study shows that adding preferred terms to the text of the medical reports helps to improve true positives for a hand-crafted rule-based system, but that the addition did not consistently improve performance for the two machine learning systems. We show that differences in the data, task, and approach can account for the variations in these results as well as for the conflicting reports in the literature.

An imbalance of classes in a data set can cause suboptimal classification performance in systems that rely on statistics to represent attributes of the data. To address this problem, we developed specializing, a panel of one-versus-all classifiers that are activated in a strict order, and applied it to an imbalanced data set. We show that specializing performs significantly better than voting and stacking panels of classifiers when used for multi-class classification on our data.

Machine learning systems must be trained on labeled data, which is expensive to develop and may not always be readily available. We combine the semi-supervised approach of co-training with specializing in order to address the scarcity of labeled examples when the classes in the data sets are imbalanced. We show that by combining co-training and specializing, we are able to consistently improve recall on the less well-represented classes, even when trained on a small number of labeled samples.
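
The abstract describes specializing only as a panel of one-versus-all classifiers activated in a strict order. A minimal sketch of that general idea in Python might look like the following. It is an illustration under stated assumptions, not the dissertation's implementation: the rarest-class-first ordering, the fallback to the majority class, the toy keyword classifiers, the class names, and all function names (`order_by_rarity`, `cascade_predict`, `keyword_classifier`) are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# A binary one-versus-all classifier: True means "this text belongs to my class".
BinaryClassifier = Callable[[str], bool]


def order_by_rarity(labels: Sequence[str]) -> List[str]:
    """Return class names from least to most frequent (ties broken by name).

    Assumption: the abstract does not say how the strict order is chosen;
    rarest-first is used here purely for illustration.
    """
    counts = Counter(labels)
    return sorted(counts, key=lambda c: (counts[c], c))


def cascade_predict(text: str,
                    ordered_classes: Sequence[str],
                    panel: Dict[str, BinaryClassifier],
                    default: str) -> str:
    """Activate the classifiers one at a time in the given order.

    The first classifier that accepts the text decides its label; if none
    accepts it, the default label is returned.
    """
    for cls in ordered_classes:
        if cls in panel and panel[cls](text):
            return cls
    return default


def keyword_classifier(keywords: Sequence[str]) -> BinaryClassifier:
    """Toy stand-in for a trained one-versus-all model (hypothetical)."""
    lowered = [k.lower() for k in keywords]
    return lambda text: any(k in text.lower() for k in lowered)


# Hypothetical, imbalanced training labels.
train_labels = ["normal"] * 6 + ["fracture"] * 3 + ["pneumothorax"]

panel = {
    "pneumothorax": keyword_classifier(["pneumothorax"]),
    "fracture": keyword_classifier(["fracture"]),
}

order = order_by_rarity(train_labels)
majority = order[-1]  # most frequent class serves as the fallback

report = "Small apical pneumothorax with an associated rib fracture."
print(cascade_predict(report, order, panel, majority))
```

Because the classifiers run in a fixed order, a report that matches more than one class is claimed by whichever classifier comes first. In this sketch that lets a poorly represented class take precedence over better-represented ones.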
Keywords/Search Tags: Medical reports, Classification, Classes, Data, Improve, Labeled, Presented