In the current information age, vast amounts of text are recorded across many fields. In medicine in particular, large volumes of unstructured text data exist, such as electronic medical records, drug instructions, and disease records written by doctors. The knowledge and experience buried in this messy natural language cannot be mined and analyzed directly, so extracting data efficiently and accurately is essential for downstream mining and analysis tasks. Manual extraction is time-consuming, laborious, and costly, and different annotators' perceptions and standards lead to inconsistent results. Rule-based extraction generalizes poorly and lacks robustness, since the Chinese language contains a great deal of polysemy and ambiguity. Existing learning-based techniques require large amounts of annotated data and struggle to represent the many specialized, rarely used terms of the medical field. In view of these limitations and technical difficulties, this paper carries out the following research work.

First, mainstream text extraction algorithms are reviewed and studied, including the advantages and disadvantages of traditional dictionary-based extraction, statistical methods, and neural-network-based deep learning. Combining these with practical problems in the medical field, a named entity recognition model architecture is designed and optimized.

Second, given the specialized nature of the medical industry, a highly professional data set is needed. The public medical data set CCKS2017 alone cannot demonstrate effectiveness in real medical application environments, so drug instruction texts were automatically downloaded, then cleaned and preprocessed. After secondary development of the open-source labeling tool from Harbin Institute of Technology, a small amount of manual labeling was performed to obtain an unstructured medical-domain data set.

Third, exploratory data analysis (EDA) was performed on the data set. Based on the characteristics of the drug data, labels such as prepositions and segmentation words were added, and the problem of entities in long sentences was addressed through labeling techniques.

Finally, combining the realities of the medical field, a model structure for entity extraction in the medical industry is proposed. Using a transfer learning method, a BERT model pre-trained unsupervised on large-scale data is further pre-trained with a labeled lexicon of medical proper nouns to obtain higher-quality text representations. Because these embeddings are produced contextually, they can resolve the polysemy phenomenon in Chinese. The representations are then combined with a Bi-GRU deep learning model and verified on CCKS2017 and a small drug instruction data set, with five other deep learning models designed for comparison. Experimental analysis shows that the proposed model achieves the best results on the F1-score evaluation metric. In summary, for text extraction tasks in the medical field, this paper uses transfer learning to obtain higher-quality text representations through pre-training and builds a deep learning model on top of them, improving the accuracy of medical entity extraction.
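The labeling step described above (assigning per-character entity tags so that a sequence model like Bi-GRU can be trained) is commonly implemented with the standard BIO scheme for character-level Chinese NER. The sketch below illustrates that conversion; the sentence, entity spans, and label names are hypothetical examples, not taken from the thesis data set.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) entity spans to per-character BIO tags.

    `end` is exclusive; characters outside every span receive the tag "O".
    The first character of an entity gets "B-<label>", the rest "I-<label>".
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# Hypothetical drug-instruction sentence: "阿司匹林治疗头痛"
# ("Aspirin treats headache"), with one DRUG and one SYMPTOM span.
sentence = "阿司匹林治疗头痛"
spans = [(0, 4, "DRUG"), (6, 8, "SYMPTOM")]
for char, tag in zip(sentence, spans_to_bio(sentence, spans)):
    print(char, tag)
```

Each (character, tag) pair then serves as one training example position for the sequence tagger; this is a minimal sketch, and a real pipeline would also handle overlapping spans and sentence segmentation.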