Design And Implementation Of A Structured Processing System For Pathological Text Data

Posted on: 2016-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: S Liang
Full Text: PDF
GTID: 2298330452966419
Subject: Software engineering
Abstract/Summary:
In every large hospital, clinical services produce large numbers of unstructured clinical documents. The pathology report is an important kind of unstructured clinical document. Its content is mainly free text written by pathologists in natural language, recording the patient's basic information, the naked-eye description of the specimen, the findings under the microscope, the diagnostic results, and other information. The text of a pathology report is very important to doctors diagnosing disease and is one of the important bases for clinical diagnosis and treatment.

The traditional way of processing a pathology report generally relies on an experienced attending physician working through its text manually. In essence, the physician applies clinical reasoning to structure the text, manually extracting the specimens mentioned in the report and the value of each index. However, manual processing is time-consuming and its accuracy is difficult to guarantee. Therefore, using rule-based extraction, topic models, and statistical analysis, this thesis designs and implements a structured processing system for pathological text data that supports the automatic extraction of specimens and their index values.

First, the structure of the text data in pathology reports is analyzed, the concept of a hierarchical structure of pathological text is introduced, and the process of structuring a pathology report is described. On this basis, the thesis designs the overall framework of the structured processing system and introduces the main functions and tasks of its three modules: the pathological data preprocessing module, the pathological specimen template extraction module, and the pathological text instant structuring module.

Then, to solve the problem of extracting specimen description templates, a template extraction algorithm based on the latent Dirichlet allocation (LDA) topic model is proposed. The algorithm is realized in two steps. The first step preprocesses the historical pathology reports: each report is cut by rules into its different specimens, the name of each specimen is extracted, and the text descriptions of the same specimen are placed in the same collection. The second step extracts the template: an LDA model is built on the collection for each specimen, and the probability distribution of words in the collection is obtained by fast Gibbs sampling. This distribution is combined with IDF weights to compute a "text value" for each word, and the Top-N words with the highest text value are taken as candidate keys for that specimen. Finally, a frequency-based reverse text filtering algorithm optimizes the extracted specimen template. Because the historical pathological data has accumulated over a long period and is very large, the specimen template extraction is also implemented with the MapReduce parallel computing framework and deployed on the open-source distributed platform Hadoop.

Finally, to validate the proposed algorithm, the system is tested on a real data set. The results show that the system correctly extracts more than 87.5% of the index names and index values, and that the structured pathology reports meet the expected requirements. In addition, the system can continuously and reasonably optimize the templates on the basis of user feedback. The structured processing system for pathological text data can not only assist doctors in diagnosing disease but also provide data support for future pathological analysis of disease.
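
The two-step template extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several parts are assumptions: the splitting rule (`SPECIMEN_RULE`, one specimen per line written as "name: description"), the function names, and the scoring formula (word probability multiplied by a smoothed IDF computed across specimen collections) are all hypothetical. In particular, the word distribution here is a plain relative frequency standing in for the LDA estimate obtained by Gibbs sampling, and the reverse text filtering step is omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical splitting rule: one specimen per line, "<name>: <description>".
SPECIMEN_RULE = re.compile(r"^(?P<name>[^:\n]+):[ \t]*(?P<desc>.+)$", re.MULTILINE)


def group_by_specimen(reports):
    """Step 1: cut each report into specimens and pool descriptions by specimen name."""
    groups = defaultdict(list)
    for report in reports:
        for match in SPECIMEN_RULE.finditer(report):
            groups[match.group("name").strip()].append(match.group("desc").strip())
    return dict(groups)


def top_n_keys(groups, tokenize, n=3):
    """Step 2 (simplified): rank the words of each specimen collection by "text value".

    Assumption: text value = p(word | collection) * IDF(word), where p is a
    relative frequency (a stand-in for the LDA word distribution) and IDF is
    computed over specimen collections, with +1 smoothing.
    """
    tokens_by_name = {
        name: [tok for desc in descs for tok in tokenize(desc)]
        for name, descs in groups.items()
    }
    n_collections = len(tokens_by_name)

    # Document frequency: in how many specimen collections each word occurs.
    doc_freq = Counter()
    for tokens in tokens_by_name.values():
        doc_freq.update(set(tokens))

    keys = {}
    for name, tokens in tokens_by_name.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        scores = {
            word: (count / total) * (math.log(n_collections / doc_freq[word]) + 1.0)
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keys[name] = ranked[:n]
    return keys


# Toy English data; real reports would need a Chinese word segmenter as `tokenize`.
reports = [
    "stomach: mucosa size 3cm\ncolon: polyp size 1cm",
    "stomach: mucosa ulcer\ncolon: polyp",
]
groups = group_by_specimen(reports)
candidate_keys = top_n_keys(groups, str.split, n=2)
```

Passing the tokenizer as a parameter keeps the sketch independent of any particular segmentation library. Because each specimen collection is scored independently once the document frequencies are known, the per-collection loop is the part that would map naturally onto a MapReduce job.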
Keywords/Search Tags: Pathology Data, Chinese Character Segmentation, Structured Text, LDA, Hadoop