Unstructured clinical natural language texts, such as electronic medical records (EMRs), medical encyclopedias, medical textbooks, and medical Q&A, record clinical diagnosis and treatment activities. Structuring clinical text with natural language processing (NLP) technology can unearth the intrinsic value of these data. In general-domain NLP, BERT is a state-of-the-art model, and ALBERT is a lightweight BERT whose parameter count is roughly one tenth of BERT's; both are widely used in NLP tasks such as named entity recognition and text classification. To improve the performance of BERT in clinical NLP, scholars have trained BioBERT and EhrBERT on biomedical literature and English EMR corpora, respectively. Research institutions in China have also released MC-BERT and PCL-BERT for Chinese clinical NLP. However, BERT-series models for Chinese clinical NLP still need further improvement, and lightweight BERT models for Chinese clinical NLP are lacking. The main contents of this study are as follows:

1. Constructing the Chinese Clinical Language Understanding Evaluation (CCLUE) benchmark. To facilitate comparison and analysis of model performance, four datasets were constructed for the named entity recognition and text classification tasks: Chinese electronic medical record named entity recognition (CEMRNER), Chinese medical text named entity recognition (CMTNER), Chinese clinical text classification (CCTC), and Chinese medical question-question matching (CMedQQ), with precision (P), recall (R), and F1 as evaluation metrics. Based on CCLUE, the general-domain Chinese NLP models BERT and ALBERT and the Chinese clinical NLP models MC-BERT and PCL-BERT were evaluated as baselines. The F1 values of BERT on the CEMRNER, CMTNER, CCTC, and CMedQQ datasets were 81.17%, 65.67%, 81.62%, and 87.77%, respectively, and those of ALBERT were 79.98%, 62.42%, 79.83%, and 86.81%, respectively.

2. Implementing the MedBERT and MedALBERT series of models for Chinese clinical NLP. A total of 650 million words of Chinese clinical natural language text were collected from the Internet. Based on the BERT and ALBERT architectures, four models (MedBERT, MedBERT-wwm, MedALBERT, and MedALBERT-wwm) were trained in 19 days. The F1 values of MedBERT-wwm on the CEMRNER, CMTNER, CCTC, and CMedQQ datasets were 82.60%, 67.11%, 81.72%, and 88.02%, respectively, improvements of 1.43, 1.44, 0.10, and 0.25 percentage points over BERT. The F1 values of MedALBERT-wwm on the four datasets were 81.28%, 64.12%, 80.46%, and 87.71%, respectively, which are 1.30, 1.70, 0.63, and 0.90 percentage points higher than those of ALBERT. Compared with MC-BERT, MedBERT-wwm improved by 1.67, 0.96, and 1.07 percentage points on the CEMRNER, CMTNER, and CCTC datasets, respectively, and compared with PCL-BERT it improved by 1.02, 0.09, and 1.45 percentage points, respectively. On the CMedQQ dataset, however, MedBERT-wwm is inferior to MC-BERT and PCL-BERT, whose F1 values are 89.04% and 88.81%, respectively.

3. Improving the performance of MedBERT-wwm by knowledge distillation. Using knowledge distillation, the MedBERT-kd model was distilled with MC-BERT as the teacher model, MedBERT-wwm as the student model, and CMedQQ as the training set. The F1 of MedBERT-kd on CMedQQ is 89.34%, which is 1.32, 0.30, and 0.53 percentage points higher than that of MedBERT-wwm, MC-BERT, and PCL-BERT, respectively. The performance of MedBERT-kd on the CEMRNER, CMTNER, and CCTC datasets is consistent with that of MedBERT-wwm.
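To make the distillation setup in point 3 concrete, the sketch below shows a standard knowledge-distillation objective in PyTorch. It is an illustration under stated assumptions, not the study's exact implementation: the temperature and loss weight are hypothetical hyperparameters, as the source specifies only the teacher (MC-BERT), the student (MedBERT-wwm), and the training set (CMedQQ).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard KD objective: hard-label cross-entropy plus soft-label KL.

    `temperature` and `alpha` are hypothetical values for illustration;
    the source does not report the hyperparameters used for MedBERT-kd.
    """
    # Hard loss: student predictions vs. gold CMedQQ labels (match / no match).
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: student mimics the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients are comparable to the hard loss
    return alpha * soft + (1.0 - alpha) * hard

# Example call with dummy logits for a batch of 4 question pairs, 2 classes.
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = distillation_loss(student, teacher, labels)
```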
In conclusion, this study constructed an algorithm evaluation benchmark for Chinese clinical NLP and implemented the MedBERT and MedALBERT series of models. The MedBERT series focuses on improving the performance of BERT, while the MedALBERT series focuses on lightweight deployment. In addition, based on CCLUE, the performance gains of the MedBERT series over the baseline models (BERT, MC-BERT, and PCL-BERT) and of the MedALBERT series over ALBERT were verified. Finally, the MedBERT and MedALBERT series of models are open-sourced (https://github.com/trueto/medbert) to promote the development of Chinese clinical NLP.
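For readers who wish to try the released checkpoints, the snippet below is a minimal usage sketch. It assumes the checkpoints follow the standard Hugging Face `transformers` format and have already been downloaded locally; the directory name is hypothetical, and the repository above documents the actual files and names.

```python
from transformers import AutoTokenizer, AutoModel

# Hypothetical local directory; obtain the actual checkpoint from
# https://github.com/trueto/medbert
MODEL_DIR = "./medbert-wwm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModel.from_pretrained(MODEL_DIR)

# Encode a short Chinese clinical sentence ("The patient reports a
# three-day headache.") and inspect the contextual embeddings.
inputs = tokenizer("患者主诉头痛三天。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```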