
A CRF-based Semi-supervised Chinese Clinical Text Word Segmentation Research And Application

Posted on: 2019-12-29  Degree: Master  Type: Thesis
Country: China  Candidate: G Q Xia  Full Text: PDF
GTID: 2404330590492301  Subject: Computer technology
Abstract/Summary:
Electronic medical records (EMRs) are an important part of health care big data. An EMR is recorded by medical staff and usually includes numerical data, medical images, and descriptive text. Analyzing and using EMRs is of great importance for improving the quality of medical services, and the analysis of descriptive text relies on Chinese word segmentation of the records. Unlike English, Chinese uses characters to represent linguistic units, and words are not separated by spaces. Different combinations of characters form different linguistic components, such as words, subjects, predicates, and adverbs. The set of most frequently used Chinese characters contains about 5,000 characters, and their combinations make up a rich vocabulary. When natural language processing algorithms are applied to descriptive text in Chinese EMRs, a basic prerequisite is to extract the corresponding word sequence from the character sequence before performing tasks such as part-of-speech tagging, semantic role extraction, and document classification. With the mass production of EMRs, how to process Chinese EMRs effectively has become an urgent problem.

As a basic task of Chinese natural language processing, Chinese word segmentation has been widely studied. Researchers have proposed a series of methods that achieve rather good performance on open datasets. Traditional segmentation algorithms rely on supervised learning, with training and test sets usually derived from corpora such as newswire. These corpora are relatively small, yet difficult to obtain. When applied to corpora from other domains, such as medicine, law, and finance, these supervised algorithms face the problem of domain adaptation. Annotating data in a specific domain requires domain expertise; a lack of professional knowledge lowers the quality of the annotated data and, in turn, the quality of the segmentation model. In addition, with the rapid development of the Chinese Internet, new words are constantly emerging, which challenges the ability of current segmentation algorithms to recognize new words.

In this project, we propose a semi-supervised, dictionary-based segmentation algorithm built on conditional random fields (CRF) to handle word segmentation for medical text. CRF-based segmentation treats Chinese word segmentation as a sequence labeling problem and predicts a label sequence from the input sequence. The standard algorithm requires that the observation sequence of each training sample correspond to exactly one label sequence. The proposed semi-supervised learning algorithm relaxes this requirement: it allows an observation sequence to correspond to multiple label sequences and learns feature weights from such weakly annotated data. To obtain weakly labeled data, we use forward and backward maximum matching against a lexicon. Specifically, we use the dictionary to obtain two segmentations of the same character sequence, taking their intersection as the fully labeled part and their difference as the weakly labeled part. The training corpus is thus obtained using a lexicon, and the final model parameters are learned by the semi-supervised CRF algorithm. Because it uses a lexicon, the learned model can readily handle domain migration and new word recognition. Experimental results show that, on Chinese medical texts, the semi-supervised word segmentation algorithm achieves an F-score of 93.38%.

Building on the semi-supervised CRF segmentation results, this project also presents a medical text classification algorithm based on Latent Dirichlet Allocation (LDA). Using the segmentation output of the semi-supervised CRF algorithm, LDA learns a topic vector for each document, and logistic regression (LR) is used to learn a classification model in that vector space. The experimental results show that, with a small amount of manual participation, the proposed classification algorithm achieves a precision of 81.1%.
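
The forward and backward maximum matching step mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names, the toy lexicon, the `max_len` parameter, and the choice to represent each word as a (start, end) character span are all assumptions made for illustration. The sketch segments one character sequence in both directions, keeps the spans both passes agree on as the fully labeled part, and keeps the spans they disagree on as the weakly labeled part.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one word


def forward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy left-to-right matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy right-to-left matching: take the longest lexicon word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


def to_spans(words: List[str]) -> Set[Span]:
    """Convert a word list into the set of character spans it covers."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def weak_labels(text: str, lexicon: Set[str], max_len: int) -> Tuple[List[Span], List[Span]]:
    """Return (agreed spans, ambiguous spans) from the two matching passes."""
    fwd = to_spans(forward_max_match(text, lexicon, max_len))
    bwd = to_spans(backward_max_match(text, lexicon, max_len))
    agreed = sorted(fwd & bwd)      # intersection: treated as fully labeled
    ambiguous = sorted(fwd ^ bwd)   # difference: treated as weakly labeled
    return agreed, ambiguous


# Hypothetical toy lexicon and sentence, chosen only to show a disagreement.
TOY_LEXICON = {"研究", "研究生", "生命", "起源"}
TOY_TEXT = "研究生命起源"
```

On the toy sentence, the forward pass yields 研究生 / 命 / 起源 and the backward pass yields 研究 / 生命 / 起源, so only the final word 起源 is agreed on and the first four characters remain ambiguous. In a semi-supervised CRF, such an ambiguous region would be left compatible with more than one label sequence during training rather than forced to a single one.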
Keywords/Search Tags: Natural Language Processing, Conditional Random Field, Chinese Word Segmentation, Semi-supervised Learning, Latent Dirichlet Allocation, Text Classification