Automated classification of the narrative of medical reports using natural language processing

Posted on:2012-03-15

Degree:Ph.D

Type:Dissertation

University:State University of New York at Albany

Candidate:Goldstein, Ira

GTID:1458390008492653

Subject:Information Technology

Abstract/Summary:

In this dissertation we present three topics critical to the document-level classification of the narrative in medical reports: the use of preferred terminology in the presence of synonymous terms, the suboptimal performance of classification systems when presented with a non-uniform distribution of classes, and the problems associated with a scarcity of labeled data when the classes in the data sets are imbalanced.

The literature contains conflicting reports regarding the value of applying preferred terminology to improve system performance when presented with synonymous terms. Our study shows that adding preferred terms to the text of the medical reports helps to improve true positives for a hand-crafted rule-based system, but does not consistently improve performance for the two machine learning systems. We show that differences in the data, task, and approach can account for the variations in these results as well as for the conflicting reports in the literature.

The imbalance of classes in data sets can cause suboptimal classification performance by systems based on an exploration of statistics for representing attributes of data. To address this problem, we developed specializing, a panel of one-versus-all classifiers that are activated in a strict order, and applied it to an imbalanced data set. We show that specializing performs significantly better than voting and stacking panels of classifiers when used for multi-class classification on our data.

Machine learning systems need labeled data in order to be trained, and such data is expensive to develop and may not always be readily available. We combine the semi-supervised approach of co-training with specializing in order to address the issues associated with a scarcity of labeled examples when presented with an imbalance of classes in the data sets. We show that by combining co-training and specializing, we are able to consistently improve recall on the less well-represented classes, even when trained on a small number of labeled samples.
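
As a rough illustration of a panel of one-versus-all classifiers activated in a strict order, the following Python sketch consults one binary detector per class in a fixed sequence and assigns the label of the first detector that fires. The abstract does not state how the order is chosen, which base classifiers are used, or what happens when no detector fires, so the rarest-first ordering, the toy keyword detector, the fallback to the most frequent class, and the example reports and labels are all assumptions made for illustration, not details taken from the dissertation.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Tokens = Sequence[str]
BinaryClassifier = Callable[[Tokens], bool]


def train_keyword_detector(docs: List[Tokens], labels: List[str],
                           target: str) -> BinaryClassifier:
    """Toy one-versus-all detector (hypothetical stand-in for a real model).

    It fires when a document contains a token that was seen only in
    `target` documents during training.
    """
    positive, negative = set(), set()
    for tokens, label in zip(docs, labels):
        (positive if label == target else negative).update(tokens)
    cues = positive - negative
    return lambda tokens: any(t in cues for t in tokens)


def train_cascade(docs: List[Tokens], labels: List[str]
                  ) -> Tuple[List[Tuple[str, BinaryClassifier]], str]:
    """Build an ordered panel of one-versus-all detectors.

    Assumption: classes are ordered rarest first (ties broken by name),
    and the most frequent class is used as the fallback label.
    """
    counts = Counter(labels)
    order = sorted(counts, key=lambda c: (counts[c], c))
    default = order[-1]
    panel = [(c, train_keyword_detector(docs, labels, c)) for c in order[:-1]]
    return panel, default


def classify(panel: List[Tuple[str, BinaryClassifier]], default: str,
             tokens: Tokens) -> str:
    """Consult the detectors in strict order; the first one to fire wins."""
    for label, detector in panel:
        if detector(tokens):
            return label
    return default


# Hypothetical, imbalanced toy data: three "normal" reports, one of each other class.
docs = [
    "no acute disease".split(),
    "lungs clear no effusion".split(),
    "heart size normal".split(),
    "small pleural effusion noted".split(),
    "left rib fracture".split(),
]
labels = ["normal", "normal", "normal", "effusion", "fracture"]

panel, default = train_cascade(docs, labels)
print(classify(panel, default, "pleural thickening".split()))
print(classify(panel, default, "rib fracture seen".split()))
print(classify(panel, default, "lungs clear".split()))
```

Because the detectors are consulted in sequence, a report that would trigger several of them receives the label of the earliest one in the order. That is the property distinguishing this arrangement from a voting panel, in which all classifiers are consulted and their outputs combined.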

Keywords/Search Tags:

Medical reports, Classification, Classes, Data, Improve, Labeled, Presented

Related items

1	Adaptive classification of scarcely labeled and evolving data streams
2	Using unlabeled data to improve text classification
3	Text Classification Based On Improved Labeled-LDA
4	A Trusted-item-based Interactive Method To Improve The Quality Of Labeled Data And Its Application
5	Classification And Recognition Of Image Based On Local Features And Weakly Labeled Data
6	Design And Implementation Of Finance News Classification System Based On Labeled-LDA
7	Research On Image Classification And Video Tracking With Weakly Labeled Data
8	Multi-valued And Multi-labeled Data Classification
9	Research On Data Stream Classification Algorithm With Limited Amount Of Labeled Data
10	Strategy And Improvement In Reports Of Medical Disputes In The Event