Linguistic validation of automatic subtopic segmentation

Posted on: 2005-08-08
Degree: Ph.D
Type: Dissertation
University: Boston University
Candidate: Saidi, Aisha F
Full Text: PDF
GTID: 1458390008490433
Subject: Language
Abstract/Summary:
This study evaluates a technique for automatically segmenting medical history paragraphs by subtopic, with the view that subtopic language models could then be created to improve speech recognition of the medical history sections of medical reports. The technique uses a Hidden Markov Model segmenting tool to mark the boundaries of hypothesized subtopic segments within each text. Since the tool is built on the assumption that the input texts have a similar topic structure, it can be used to segment medical histories, which generally have a three-part structure.

The data for this study is a set of medical histories extracted from 2,700 orthopedic medical reports. The study is carried out in four broad steps. First, a group of linguists independently mark the subtopic structure of a test set of medical histories; histories upon which there is significant agreement become the standard by which the automatic segmenter is evaluated. Next, the automatic segmenter is trained on a large set of histories. Then, using the statistical information built from the training data, the automatic segmenter marks subtopic segments in the test data. Finally, the automatic segmentation of the test data is graded against the evaluation standard developed by the expert subjects.

Two types of subtopic segmentation are explored in this work. The first type, linear subtopic segmentation, assumes that each of the three subtopics in a medical history is a continuous chunk of text within the paragraph, uninterrupted by other subtopics. Despite the relatively homogeneous structure of medical histories, this model is found to be linguistically unrealistic, and the performance of the automatic segmenter is poor compared to the evaluation standard. The second type, non-linear subtopic segmentation, allows each sentence to be assigned to a subtopic regardless of order. Because of the variability of the data, the tool is unable to successfully distinguish three subtopics in the histories. However, the automatic segmentation of two non-linear subtopics for each medical history is successful, with a high rate of accuracy compared to the human standard.
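
The difference between linear and non-linear segmentation can be illustrated with a small sketch. This is not the segmenting tool used in the dissertation; it is a generic Viterbi decoder over sentences, assuming hypothetical per-sentence emission probabilities for three subtopics. The linear variant uses a left-to-right transition matrix (a subtopic can only continue or hand over to the next one), while the non-linear variant allows any subtopic to follow any other. All names and numbers below are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")

def log(p):
    # Log-probability, with log(0) mapped to negative infinity.
    return math.log(p) if p > 0 else NEG_INF

def viterbi(emission_logp, trans_logp, start_logp):
    """Most likely subtopic sequence for a list of sentences.

    emission_logp[t][s] is the log-probability of sentence t under subtopic s.
    """
    n_states = len(start_logp)
    score = [start_logp[s] + emission_logp[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(emission_logp)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + trans_logp[p][s])
            new_score.append(score[best_prev] + trans_logp[best_prev][s]
                             + emission_logp[t][s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def linear_transitions(n, stay=0.7):
    # Left-to-right: stay in the current subtopic or advance to the next one.
    rows = []
    for i in range(n):
        row = [0.0] * n
        if i == n - 1:
            row[i] = 1.0
        else:
            row[i], row[i + 1] = stay, 1.0 - stay
        rows.append([log(p) for p in row])
    return rows

def nonlinear_transitions(n):
    # Any subtopic may follow any other, with equal probability.
    return [[log(1.0 / n)] * n for _ in range(n)]

def agreement(hypothesis, reference):
    # Fraction of sentences given the same label as the reference labelling.
    return sum(h == r for h, r in zip(hypothesis, reference)) / len(reference)

# Hypothetical emission probabilities for five sentences and three subtopics.
# The fourth sentence returns to subtopic 0 after a sentence about subtopic 1.
emissions = [[log(p) for p in row] for row in [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]]

linear_path = viterbi(emissions, linear_transitions(3),
                      [log(1.0), log(0.0), log(0.0)])
nonlinear_path = viterbi(emissions, nonlinear_transitions(3),
                         [log(1.0 / 3)] * 3)
print("linear:    ", linear_path)
print("non-linear:", nonlinear_path)
```

In this toy input the non-linear decoder can label the fourth sentence with the first subtopic again, whereas the linear decoder must produce a non-decreasing sequence of labels. The `agreement` helper is one simple way to score a hypothesis against a reference labelling; the dissertation's actual grading procedure is not specified in the abstract.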
Keywords/Search Tags:Subtopic, Automatic, Medical, Standard