Recognition Of Named Entity In Electronic Medical Records Based On Cascaded Conditional Random Fields

Posted on: 2015-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Wang
GTID: 2268330428997796
Subject: Signal and Information Processing
Abstract/Summary:
Appropriate model selection and effective feature design have an important impact on the effectiveness of named entity recognition. Electronic medical records commonly include named entities with nested and complex structures. Because of the unique nature of this field, entity recognition models that work well in general domains are difficult to apply directly to electronic medical records. In addition, current features for named entity recognition are drawn from characters, parts of speech, and other basic low-level characteristics used in general models, which lack high-level features closer to human recognition, such as semantic features.
We first describe Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs), and through them introduce our basic model, Conditional Random Fields (CRFs), including its definition, parameter estimation, and application to sequence labeling. The linear-chain CRFs model inherits the advantage of MEMMs that arbitrary features can be added. In addition, because the CRFs model does not impose strict independence requirements on the observations, it overcomes the inadequacies of HMMs; and because its optimal solution is obtained globally, it effectively solves the label bias problem of MEMMs. Since it provides a labeling framework with flexible features and a globally optimal solution at the same time, we select the CRFs model as our basic model. To address the complex structure and widespread internal nesting of named entities in electronic medical records, we conducted a thorough study of the CRFs model and use cascaded CRFs to recognize two types of named entities in Chinese electronic medical records: disease names and clinical symptoms. The main contents are as follows:
(1) To establish the cascaded CRFs model framework, we divide the complex named entity recognition task for electronic medical records into two relatively simple, interrelated sub-layers.
First, we use the first-layer CRFs model to identify two categories of entities: body parts and basic disease names. The recognition result is then passed to the second-layer CRFs model to identify two types of complex entities: disease names and clinical symptoms. This process transfers information effectively between the layers and reduces the complexity of the task. Recognition is better than with a single-layer CRFs model.
(2) For the cascaded model framework, considering the structural nature of the named entities, we define two sets of custom features: entity features and fusion features. The entity features are the output of the first-layer CRFs model, and the fusion features are constructed by combining entity features with part-of-speech features. These custom features reflect the internal structure of the named entities at the semantic level, reduce information redundancy and the amount of computation, and help identify complex entities with nested structure. The resulting model recognizes entities better than a cascaded CRFs model that uses a general combination of features. The results also show that the proposed model can identify named entities that did not appear in the training samples, which offers the possibility of identifying new terms in the corpus.
(3) From 90 manually annotated electronic medical records (30 orthopedic, 60 cardiovascular and cerebrovascular), we randomly selected 20 orthopedic and 40 cardiovascular records for training and used the remaining 30 records for testing. By comparing experiments across feature parameters, we obtained the optimal settings: the context window length of the first-layer model is 3; the context window length of the second-layer model is 7; features are labeled at word granularity; and the boundary encoding format is BIOES.
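The feature design described in (2) and the settings in (3) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the English tokens, the `BODY` entity type, and the part-of-speech tags are hypothetical, and the context window is assumed to be symmetric around the current token. The sketch encodes first-layer entity spans as BIOES tags and then builds second-layer features in which each fusion feature joins an entity tag with a part-of-speech tag.

```python
from typing import Dict, List, Tuple


def to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode entity spans (start, end_exclusive, type) as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"          # single-token entity
        else:
            tags[start] = f"B-{etype}"          # beginning of entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inside of entity
            tags[end - 1] = f"E-{etype}"        # end of entity
    return tags


def window_features(
    words: List[str],
    pos_tags: List[str],
    entity_tags: List[str],
    i: int,
    window: int,
) -> Dict[str, str]:
    """Features for token i from a symmetric window of odd length `window`.

    Assumption: "window length 3" means offsets -1, 0, +1, and
    "window length 7" means offsets -3 .. +3.
    """
    half = window // 2
    feats: Dict[str, str] = {}
    for offset in range(-half, half + 1):
        j = i + offset
        if 0 <= j < len(words):
            feats[f"word[{offset}]"] = words[j]
            feats[f"pos[{offset}]"] = pos_tags[j]
            # Entity feature: the first-layer output for this token.
            feats[f"entity[{offset}]"] = entity_tags[j]
            # Fusion feature: entity feature combined with part of speech.
            feats[f"fusion[{offset}]"] = f"{entity_tags[j]}|{pos_tags[j]}"
    return feats


# Hypothetical example: the first layer found a body part at tokens 0..2.
words = ["left", "knee", "joint", "pain"]
pos_tags = ["JJ", "NN", "NN", "NN"]
entity_tags = to_bioes(len(words), [(0, 3, "BODY")])
second_layer_feats = window_features(words, pos_tags, entity_tags, 1, window=3)
```

Feature dictionaries of this shape, one per token, are the kind of input a second-layer sequence labeler would consume; the labeler itself is omitted here.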
Under the optimal parameters, the overall F-score reaches 97.64%, with a precision of 97.89% and a recall of 97.38%. The overall F-score is 9.5% higher than that of the common model using combined features and 5.6% higher than that of the single-layer CRFs model, which demonstrates the effectiveness of the cascaded CRFs model with the custom features for named entity recognition in electronic medical records.
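As a consistency check on the reported figures, the F-score can be recomputed from the other two rates. This sketch assumes that the 97.89% figure is precision and that the F-score is the balanced F1 (the harmonic mean of precision and recall); neither is stated explicitly in the abstract, and the function name is illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported rates, read as precision and recall (assumption).
f1 = f_score(0.9789, 0.9738)
print(f"F1 = {100 * f1:.2f}%")
```

Under these assumptions the result is about 97.63%, which agrees with the reported 97.64% to within the rounding of the two input rates.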
Keywords/Search Tags: Cascaded Conditional Random Fields, Conditional Random Fields, Electronic Medical Records, Recognition of Named Entity, Fusion Features