| [Purpose] This study addresses three problems in extracting health questions posted by patients online: heavy reliance on manual annotation, high time cost, and low entity recognition accuracy. Combining transfer learning with deep learning, and drawing on richly annotated out-of-domain corpora and the reusable knowledge in pre-trained models, we build a cross-domain named entity recognition model (KNN-BERT-BiLSTM-CRF) for settings where labeled resources are scarce, so that named entity recognition can be performed with limited labeling in the target domain. We also build annotated corpora for liver cancer and lung cancer and verify the validity of the model through inter-domain transfer experiments. [Methods] We use a Python crawler to collect online consultation texts from patients on lung cancer and liver cancer topics in the cancer question-and-answer communities of XYWY.com and 120ask.com, and obtain the original corpus through text cleaning. The labeling rules follow the annotation framework proposed by foreign researchers for consumer health questions. Nine types of entity labels closely related to the clinical diagnosis of cancer were selected, and multiple rounds of manual labeling were used to build the annotated corpora for the two diseases. We use the jieba toolkit to segment the target domain corpus into words; use the Skip-Gram method of the Word2Vec model to obtain word vector representations of the text; use Doc2Vec to convert the text into sentence vectors; use the nearest neighbor method to select instance transfer samples; use the pre-trained BERT-base model to obtain vector representations of the words in the text; and use the BiLSTM-CRF model, which extracts features from context information and obtains the final prediction score through the transition matrix, to predict the recognition results. [Results] The KNN-BERT-BiLSTM-CRF model combined with the instance transfer method achieved a best F-score of 94.10% in the liver cancer named entity recognition experiment, an improvement of 9.74% over the traditional deep learning method BiLSTM-CRF, showing that the transfer learning method performs well on entity recognition for patient question texts with limited labels. [Conclusion] This study proposes a cross-domain transfer learning method for named entity recognition in patient consultation texts with limited annotation resources, based on the prior knowledge of a large pre-trained model and out-of-domain annotated corpora. The experimental results show that the method can effectively identify entities such as personal information, disease symptoms, diagnosis and treatment, and drug use in patients’ question texts using only a small amount of labeled data, while making full use of existing data resources. It also provides a reference for disease research and natural language processing research. |
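
The nearest neighbor instance selection step named in the Methods can be illustrated with a minimal sketch. The abstract does not state the distance measure, the value of k, or how neighbors are pooled, so the cosine similarity, the default `k=2`, the union of neighbors across target sentences, and the function names `cosine` and `select_transfer_instances` are all assumptions made for illustration. The toy 2-dimensional vectors stand in for Doc2Vec sentence vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_transfer_instances(source_vecs, target_vecs, k=2):
    """Hypothetical helper: pick source-domain sentences to transfer.

    For every target-domain sentence vector, rank all source-domain
    sentence vectors by cosine similarity and keep the top k. The result
    is the sorted union of the kept source indices.
    """
    selected = set()
    for t in target_vecs:
        ranked = sorted(
            range(len(source_vecs)),
            key=lambda i: cosine(source_vecs[i], t),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Toy sentence vectors (assumed values, not data from the study).
source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
target = [[1.0, 0.05]]

selected = select_transfer_instances(source, target, k=2)
print(selected)  # [0, 1]
```

Under these assumptions, the selected source-domain sentences would be added to the small target-domain training set before the BERT-BiLSTM-CRF tagger is trained.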