Background: Stroke has become the leading cause of death in China, increasing the burden on the national health care system. Over the past decade, advances in medical care have improved the prognosis of stroke, but its prevalence and incidence are still rising, imposing a heavy economic burden on society. In stroke patients, a hospital stay exceeding 6 to 8 days is generally defined as a prolonged length of stay (LOS), and a prolonged LOS is an independent factor for increased hospitalization costs. Although a longer stay allows patients to receive more treatment, it does not always lead to a good prognosis. Studying the factors that influence prolonged LOS and predicting it helps allocate medical resources rationally, makes bed use more flexible, and reduces management and medical care costs. These factors can also inform individualized diagnosis, treatment, and discharge planning, shortening in-hospital time and improving the satisfaction of patients and their families. Electronic medical records are a high-quality part of real-world big data and include structured, semi-structured, and unstructured data. Unstructured information accounts for more than 80% of the total, but its utilization rate is low because it cannot be used directly for traditional statistical analysis. Natural language processing (NLP) is now widely applied to extract information from unstructured electronic medical records. Using NLP to transform unstructured text into structured data reduces the time spent manually reading text and extracting data, improves the availability of unstructured data, and makes large-scale automatic text processing possible. Electronic medical records consist of different parts, and the content structure and data extraction method differ for each part. The discharge summary
mainly includes the patient’s diagnosis, symptoms and signs, treatment, etc., and extracting structured data from it mainly involves named entity recognition (NER), transfer learning, and text similarity matching. The admission record contains the patient’s past history, personal history, physical examination, and other information, and extracting information such as the history of tobacco and alcohol use mainly involves text classification.

Objective:
1. To develop NLP pipelines for two types of text in electronic medical records, the admission record and the discharge summary, and to convert unstructured data into structured data for analysis using NER, transfer learning, text similarity matching, text classification, and other techniques.
2. Based on the extracted structured data, which increases the amount of available information, to construct a model predicting whether LOS is longer than 7 days, providing richer information for clinical decision-making and resource allocation.

Method:
1. Research and application of NLP algorithms based on the discharge summary. For ischemic stroke, six models were built to identify five types of medical named entities: disease, drug, surgery, imaging examination, and symptom. Precision, recall, and F1 value were used as evaluation metrics, and the best model was used to extract entities and build a semi-structured database. To further extract structured data from the semi-structured database, we built three text similarity matching models. The evaluation metric was accuracy, and the best model was used to construct the covariate extractor. Finally, starting from the NER model for ischemic stroke, we explored the use of transfer learning to improve entity recognition for hemorrhagic stroke.
2. Research and application of NLP algorithms based on admission records. We constructed six models for text classification of smoking history and drinking history, with precision, recall, and F1 value as evaluation metrics. The best model was used to extract
patients’ smoking and drinking history and to construct structured data.
3. Construction of a structured database for stroke. Data taken directly from the first page of the medical record included gender, age, year of admission, admission condition grade, etc. Based on the text classification results, the model with the best overall performance was used to extract tobacco and alcohol history. For the discharge summary, after entity recognition, the covariate extractor based on the best model was used to extract data on diseases, drugs, surgery, imaging examinations, and symptoms, respectively.
4. Establishment of a prediction model for the length of hospital stay of stroke patients. Based on the structured database, logistic regression, K-nearest neighbors, naive Bayes, and the ensemble learning methods random forest, AdaBoost, and GBDT were used to predict whether LOS was longer than 7 days. Models were trained with 5-fold cross-validation on the training set, and their performance was then evaluated on the test set. Finally, these models were compared with a prediction model based only on data from the first page of the medical record.

Results:
1. Research results of NLP algorithms based on the discharge summary. In entity recognition for ischemic stroke, the ERNIE+IDCNN+CRF model had the best overall performance, with an F1 value of 90.27%. For disease entities, the Word2vec+BiLSTM+CRF model performed best, with an F1 value of 88.77%. For drug entities, the BERT+IDCNN+CRF model performed best, with an F1 value of 91.92%. For imaging examination entities, the BERT+IDCNN+CRF model performed best, with an F1 value of 89.82%. For surgery entities, the ERNIE+BiLSTM+CRF model performed best, with an F1 value of 91.23%. For symptom entities, the ERNIE+IDCNN+CRF model performed best, with an F1 value of 96.59%. In the comparison of text similarity matching models, the overall accuracy of ERNIE reached 99.11%, BERT
reached 97.64%, and ABCNN reached 93.89%. In addition, ERNIE was the best matching model for disease, imaging examination, surgery, and symptom entities. In entity recognition for hemorrhagic stroke, the overall F1 value of the transfer learning model was 86.62%, higher than the 75.01% obtained by applying the ischemic stroke model directly and the 85.46% obtained by training on the combined data of the two diseases.
2. Research results of NLP algorithms based on admission records. In the classification of smoking history, the BERT model performed best overall, with an F1 value of 99.25%. For the "non-smoking" category, the BERT model performed best, with an F1 value of 99.64%. For the "ex-smoking" category, both BERT and ERNIE reached an F1 value of 98.73%. For the remaining category, the BERT model again performed best, with an F1 value of 98.14%. For drinking history, BERT+TextRNN had the best overall performance, with an F1 value of 97.47%. By category, ERNIE+TextCNN was best for "no drinking" (99.47%), BERT+TextCNN for "abstinent drinking" (96.10%), and BERT+TextRNN for "alcohol drinking" (95.06%).
3. Results of the construction of a structured stroke database. Based on the above work, a structured database of ischemic stroke patients was constructed, covering 6,053 ischemic stroke patients admitted to the hospital from 2009 to 2019. The database draws on three sources with different data formats: the first page of the medical record, the admission record, and the discharge summary. The data extracted from the first page of the medical record included gender, age, admission year, and admission condition grade. For admission records, the models with the best overall text classification performance were adopted. The BERT model was used for the
extraction of smoking history and the BERT+TextRNN model for the extraction of drinking history, yielding structured data on patients’ tobacco and alcohol history. For the discharge summary, after entity recognition, the covariate extractor based on the ERNIE model was used to extract covariates related to diseases, drugs, surgery, imaging examinations, and symptoms, respectively.
4. Comparison of models for predicting length of hospital stay in stroke patients. Among the NLP-based LOS prediction models, the ensemble learning models achieved higher AUC values than the single classification models. ICD coding on the first page of the medical record yielded 15 covariates, whereas NLP yielded 43. The logistic regression model based on ICD coding from the first page of the medical record included 8 predictors, whereas the logistic regression model based on NLP included 16. The AUC values of the LOS prediction models built on NLP-extracted covariates were higher than those of the models built on the first page of the medical record only, and the differences were statistically significant.

Conclusion: For entity recognition in discharge summaries of ischemic stroke, ERNIE+IDCNN+CRF had the best overall performance. Among the text similarity matching models, ERNIE performed best. In entity recognition for hemorrhagic stroke, the transfer learning model outperformed both the directly applied ischemic stroke model and the model trained on the combined data of the two diseases. In text classification, the BERT model had the best overall performance for smoking history and BERT+TextRNN for drinking history. In LOS prediction, the ensemble learning models outperformed the single classification models. The prediction
performance of the LOS prediction model built on covariates extracted with NLP was significantly better than that of the model built only on the first page of the medical record, reflecting the effectiveness and practical value of covariate extraction with NLP.
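
The precision, recall, and F1 values used above to compare the entity recognition models can be illustrated with a minimal sketch. It assumes BIO-style tags and exact span matching, where a predicted entity counts as correct only if its type and boundaries both match the gold annotation. The abstract does not state which tagging scheme or matching rule was used, so these choices, the tag names (`DRUG`, `SYMPTOM`), and the function names are hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); an illustrative representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" acts as a sentinel so the last open span is closed.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact span matching."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the drug entity matches, the symptom entity is misplaced.
gold_tags = ["B-DRUG", "I-DRUG", "O", "B-SYMPTOM", "O"]
pred_tags = ["B-DRUG", "I-DRUG", "O", "O", "B-SYMPTOM"]
print(precision_recall_f1(gold_tags, pred_tags))
```

In this toy case one of two predicted entities and one of two gold entities match, so all three metrics equal 0.5. Per-entity scores such as those reported for drugs or symptoms would follow by filtering the spans by type before counting.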