
Research On Part-of-speech Tagging For Chinese Electronic Medical Records

Posted on: 2015-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: F F Zhao
Full Text: PDF
GTID: 2298330422490922
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the big data era, "Smart Healthcare" has become a development trend in the global health care industry. As carriers of medical informatization, electronic medical records (EMR) contain a large amount of medical and health knowledge. This knowledge can support medical diagnosis, personal health management, medical coordination, and other fields. Mining knowledge from EMR depends on natural language processing (NLP) and information extraction technology. Part-of-speech (POS) tagging of Chinese electronic medical records (CEMR) is a foundation of NLP and supports follow-up work on parsing and information extraction.

Research on Chinese word segmentation and POS tagging for CEMR remains largely unexplored because annotated CEMR corpora are lacking. Unlike general-domain text, CEMR contain many professional terms, acronyms, and recurring language patterns. Therefore, a POS tagging model trained on general-domain data cannot be applied directly to CEMR.

To better study CEMR POS tagging, this thesis constructs a CEMR corpus annotated with word segmentation and POS tags. We propose a scheme covering data preprocessing through corpus annotation that yields higher annotation consistency and can guide the construction of larger, higher-quality CEMR corpora. Furthermore, the statistical lexical differences among CEMR, an open-domain corpus, and English electronic health records are quantified, and a systematic error analysis is performed on a POS tagging model trained on the open-domain corpus. This work lays a foundation for NLP research on CEMR.

Based on the corpus analysis, we propose, for the first time, a POS tagging model suited to CEMR. It has two stages. First, the raw sentence is tagged preliminarily with a character-based joint word segmentation and POS tagging model, which avoids error propagation and uses POS information to improve segmentation. Second, to exploit the language patterns that CEMR contain, the preliminary output is revised with rules learned by transformation-based error-driven learning, which improves POS tagging accuracy. For the cross-domain annotation problem, POS tagging is improved by adjusting the weights of features that appear only in CEMR. Our system achieves F1-scores of 94.75% and 93.82% on the test set of the manually annotated CEMR corpus.
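The two-stage idea can be illustrated with a minimal sketch. It is not the thesis implementation: the `B-`/`I-` character tag scheme, the tag names `NN` and `VV`, the `Rule` type, and the single example rule are all assumptions made for illustration. The sketch covers only decoding a character-level joint tag sequence into (word, POS) pairs and applying already-learned transformation rules; learning the rules by comparing against gold annotations is omitted.

```python
from typing import List, NamedTuple, Tuple

TaggedWord = Tuple[str, str]  # (word, POS tag)


def decode_joint_tags(chars: List[str], tags: List[str]) -> List[TaggedWord]:
    """Merge character-level joint tags (hypothetical 'B-NN' / 'I-NN' scheme)
    into (word, POS) pairs. 'B' starts a word, 'I' continues the current one."""
    words: List[TaggedWord] = []
    for ch, tag in zip(chars, tags):
        boundary, pos = tag.split("-", 1)
        # Start a new word on 'B', or when an 'I' tag cannot continue the previous word.
        if boundary == "B" or not words or words[-1][1] != pos:
            words.append((ch, pos))
        else:
            prev_word, prev_pos = words[-1]
            words[-1] = (prev_word + ch, prev_pos)
    return words


class Rule(NamedTuple):
    """Hypothetical transformation rule: change from_tag to to_tag
    when the previous word equals prev_word."""
    from_tag: str
    to_tag: str
    prev_word: str


def apply_rules(tagged: List[TaggedWord], rules: List[Rule]) -> List[TaggedWord]:
    """Apply the rules in order, each over the whole sentence."""
    out = list(tagged)
    for rule in rules:
        for i in range(1, len(out)):
            word, pos = out[i]
            if pos == rule.from_tag and out[i - 1][0] == rule.prev_word:
                out[i] = (word, rule.to_tag)
    return out


# Stage 1 output (assumed): character tags for an example sentence.
chars = list("患者无发热")
tags = ["B-NN", "I-NN", "B-VV", "B-NN", "I-NN"]
preliminary = decode_joint_tags(chars, tags)

# Stage 2: revise with one illustrative rule.
revised = apply_rules(preliminary, [Rule("NN", "VV", "无")])
print(preliminary)
print(revised)
```

Because segmentation and POS are predicted together per character, a segmentation error is not passed to a separate tagger; the rule stage then corrects systematic mistakes tied to recurring CEMR patterns.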
Keywords/Search Tags: EMR, corpus construction, POS tagging, joint model, cross domain annotation