Automatic Approaches To Develop Large-scale TCM Electronic Medical Record Corpus For Named Entity Recognition Tasks

Posted on: 2016-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Z Feng
Full Text: PDF
GTID: 2298330470455542
Subject: Computer Science and Technology
Abstract/Summary:
Traditional Chinese Medicine (TCM) is a clinical discipline grounded in practice and observation. Medical records document the entire course of a patient's treatment. These records help ensure and improve the quality of medical care. They are an important resource for strengthening practical teaching and improving research capacity, and they also preserve valuable experience. TCM clinical records have therefore become an important resource of interest to both medicine and informatics.

At present, text mining and natural language processing applications, algorithms, and corpora for English-language medical literature are relatively mature. Research on clinical medical records, however, is still at the frontier, and text mining of Chinese clinical medical records in China is at an initial stage. Developing large-scale domain corpora is the basis for high-quality research. Because TCM lacks a large-scale corpus, building a TCM clinical corpus and working out methods for its construction is important and urgently needed.

The application goal of this thesis is named entity extraction from clinical medical records. The work has two parts: finding several named entity recognition (NER) methods, suited to the characteristics of TCM clinical records, that assist batch annotation of clinical records and create a large-scale clinical corpus ahead of manual checking; and developing a tagging system for TCM clinical medical records. The specific tasks are as follows:

(1) To address the problem of developing a large-scale, NER-oriented TCM corpus, we implemented three automatic NER methods, based respectively on structured clinical records, Conditional Random Fields (CRFs), and Bootstrapping. We also make a first attempt at a hybrid NER method built on Bootstrapping. We ran experiments on a training set of 2,500 samples. The F1 score is 76.46% for the structured-clinical-records method, 53.8% for Bootstrapping, 98% for CRFs, and 87% for the Bootstrapping-based hybrid method. The experimental results and analysis show that these methods not only achieve the goal of batch-annotating free-text clinical records (using the chief complaint as the demonstration), but also provide basic methods for constructing a large-scale TCM corpus.

(2) We developed an initial NER system for batch annotation of TCM clinical records. It implements the three NER methods above and supports batch import of clinical records, batch annotation, manual review, and other major features. It also exports the annotated corpus in a standardized format (XML that conforms to the industry specification). We have imported 32,411 clinical visits, for a total of 351,963 clinical medical records, and have annotated and preliminarily reviewed 3,550 TCM records. The result is an initial corpus whose content includes diagnoses, text medical records, medical records, and other basic information.

(3) We studied how the annotation performance of machine learning methods for named entity extraction, such as CRFs, relates to the structural similarity of the sample sets, measured by the character-level edit distance between clinical records. Experimental results show that, in the open test, CRF performance (expressed as the F1 value) falls as the minimum edit distance between a test sample and the training samples grows: the larger the average minimum edit distance between the test set and the training set, the lower the performance of the CRFs. The worst F1 value is close to 68%, reached when the minimum edit distance is 0.9. This indicates that building a sample set that is representative of the field is one of the key issues in improving the performance of automatic labeling.
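
The similarity measure in task (3) can be illustrated with a minimal Python sketch. The abstract does not say how the edit distance was normalized. A reported value of 0.9 suggests a scale from 0 to 1, so dividing by the length of the longer string is an assumption made here. The function names and the sample strings are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed scaling)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def min_distance_to_training(sample: str, training: Sequence[str]) -> float:
    """Distance from one test sample to its nearest training sample."""
    return min(normalized_distance(sample, t) for t in training)


def average_min_distance(test: Sequence[str], training: Sequence[str]) -> float:
    """Average, over the test set, of each sample's minimum distance."""
    return sum(min_distance_to_training(s, training) for s in test) / len(test)


if __name__ == "__main__":
    # Made-up chief complaints, for illustration only.
    training = ["咳嗽三天", "头痛一周"]  # "cough for three days", "headache for a week"
    test = ["咳嗽五天", "头痛一周"]      # "cough for five days", "headache for a week"
    print(average_min_distance(test, training))  # 0.125
```

Under this reading, a test set whose average minimum distance approaches 1 shares almost no surface structure with the training set, which is the regime where the abstract reports the lowest CRF F1 values.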
Keywords/Search Tags: Chinese medicine clinical text, named entity recognition methods, annotation system, Levenshtein distance, corpus