Research On Lexical And Syntactic Analysis For Chinese Electronic Medical Record

Posted on: 2018-11-16   Degree: Doctor   Type: Dissertation
Country: China   Candidate: Z P Jiang   Full Text: PDF
GTID: 1368330566998858   Subject: Computer application technology
Abstract/Summary:
With the arrival of medical big data, knowledge mining from electronic medical records (EMR) has attracted growing attention. EMR data is semi-structured. Its structured content is convenient for automatic extraction and analysis by computer. Its unstructured content is far larger in volume and contains rich medical knowledge and patient health information, but it is much harder to process automatically, which makes it the major barrier to knowledge acquisition from EMR. Knowledge acquisition from EMR is generally divided into two stages: language analysis and information extraction. Lexical analysis and syntactic analysis are the main language analysis methods and provide the necessary conditions for information extraction. This thesis focuses on lexical and syntactic analysis models tailored to the sublanguage characteristics of Chinese electronic medical records (CEMR). The specific research tasks are part-of-speech (POS) tagging, chunking, and parsing. POS tagging is a basic technology in natural language processing (NLP), and most work on chunking and parsing builds on automatic POS tagging results. Chunking and parsing impose structure on natural language text and can effectively improve information extraction from EMR, especially relation extraction between named entities. The thesis consists of the following four parts.

First, a hierarchical parsing model based on multi-layer collaborative error correction is proposed to improve lexical and syntactic analysis models in the general domain. Hierarchical parsing is an efficient complete parsing method, but because it chunks layer by layer, errors accumulate seriously. A simple and feasible error prediction and collaborative correction algorithm is proposed: it carries the predicted errors of the current layer into the next layer and combines the prediction scores of the two layers to correct the errors collaboratively. Experimental results show that hierarchical parsing with error correction maintains parsing speed while achieving almost the same precision as mainstream Chinese parsers.

Second, a lexically and syntactically annotated corpus is built from CEMR. We first propose a scheme covering data preprocessing through corpus annotation, and summarize a series of sublanguage characteristics of CEMR that lay the foundation for the subsequent work on lexical and syntactic analysis models. In preprocessing, to better represent out-of-vocabulary (OOV) tokens in EMR and to capture the dependency between privacy categories, a character-level long short-term memory (LSTM) network and a word-level LSTM are applied in sequence, and a transition matrix is added to model this dependency. Experimental results show that the improved LSTM identifies privacy information more effectively.

Third, POS tagging and chunking models are built according to the sublanguage characteristics of CEMR. For POS tagging, a character-based joint model of word segmentation and POS tagging is introduced to CEMR for the first time, and a transformation-based error-driven learning method revises the joint results in post-processing. For chunking, to handle the differences between sections of a CEMR, we propose a chunking model based on word clustering features, the structured support vector machine (SSVM), and a group learning framework; to handle the differences between departments, we propose a cross-department chunking model based on an improved structural correspondence learning algorithm.

Fourth, a parsing model is built according to the sublanguage characteristics of CEMR. To exploit the strongly patterned nature of CEMR, reused patterns are first formalized as tree fragments, and a model integrating data-oriented parsing (DOP) and hierarchical parsing is proposed. In the tree fragment extraction stage, which is the basis of the model, we propose a more efficient standard tree fragment algorithm to obtain a standard tree fragment bank and a partial tree fragment bank. Based on the two extracted banks, a strategy that matches words and POS tags synchronously and a maximal combination algorithm for tree fragments are proposed to improve DOP and to alleviate the noise caused by invalid tree fragments.

In summary, sublanguage characteristics are the main manifestation of the differences between CEMR and other text from general and limited domains. Through corpus construction and statistical analysis, a series of sublanguage characteristics are identified and successfully applied to build lexical and syntactic analysis models. This research has achieved some preliminary results, which we hope will further motivate the development of NLP for CEMR.
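
To make the role of the transition matrix in the privacy-identification step more concrete, the following is a minimal Python sketch and not the thesis implementation. It assumes the transition matrix is used the way it commonly is in sequence labeling: per-token label scores (which an LSTM would produce) are combined with label-to-label transition scores, and the best label sequence is found with Viterbi decoding. The abstract does not specify this usage. The label set, the score values, and the function name viterbi_decode are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j at token t (stand-in for LSTM output).
    transitions[i][j] : score of moving from label i to label j.
    """
    if not emissions:
        return []
    n_labels = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers: List[List[int]] = []  # best previous label, per step and label
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical privacy labels and made-up scores (assumptions, not thesis data).
LABELS = ["O", "B-NAME", "I-NAME"]
FORBIDDEN = -1e4
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O: an I-NAME may not directly follow O
    [0.0, 0.0, 0.0],        # from B-NAME
    [0.0, 0.0, 0.0],        # from I-NAME
]
emissions = [
    [2.0, 0.5, 0.0],
    [0.2, 1.0, 1.2],
    [0.1, 0.0, 2.0],
]

# Picking the best label per token independently ignores label dependencies.
greedy = [LABELS[max(range(len(LABELS)), key=lambda j: row[j])]
          for row in emissions]
decoded = [LABELS[k] for k in viterbi_decode(emissions, transitions)]
print(greedy)   # ['O', 'I-NAME', 'I-NAME']
print(decoded)  # ['O', 'B-NAME', 'I-NAME']
```

In this toy example, choosing labels token by token yields an I-NAME directly after O, while decoding with the transition scores yields a well-formed sequence. This is the kind of dependency between categories that a transition matrix can encode.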
Keywords/Search Tags:CEMR, hierarchical parsing, data oriented parsing, chunking, POS tagging, structural correspondence learning, joint model