| Automatic recognition of phenotypes in biomedical literature has long been a key component of biomedical information retrieval systems. Previous work on this task typically uses dictionary-based matching, which achieves high precision but suffers from low recall because it cannot identify implicit synonyms. In recent years, named entity recognition based on deep learning has been widely used in the biomedical field; it can recognize more implicit synonyms through automatic feature learning. However, most deep learning methods require a large amount of manually annotated training data, which is costly to produce. Even when distant supervision is used to label data automatically and relieve the shortage of labeled data, the generated positive labels cannot cover all phenotypic variants, so the distantly supervised training set contains a large number of false negative labels, which degrades the classification performance of the deep learning model. Against this background, this thesis proposes a phenotype named entity recognition method based on distant supervision. The main research contents are as follows: (1) To address the problems that the HPO (Human Phenotype Ontology) dictionary cannot cover all phenotypic variants and that the distantly supervised training set contains a large number of false negative labels, a syntax-based phenotype data augmentation algorithm, PhenoDA, is proposed. A distantly supervised training dataset is constructed from HPO and PubMed Central biomedical literature abstracts, and the augmentation algorithm identifies additional phenotypic synonyms, which expands the positive labels and reduces the number of false negative labels in the training set. The effectiveness of the algorithm is verified by comparison experiments. (2) To address the problems that dictionary matching cannot recognize implicit concept synonyms in text and that deep learning models tend to overfit when recognizing synonyms, a phenotype named entity recognition model based on adversarial training, Pheno-Adv, is proposed. Adversarial samples are trained together with the original data; that is, the training loss is augmented with the loss on adversarial samples without modifying the structure of the original model, which has a regularizing effect and improves the generalization ability of the model. BioBERT is used for automatic feature learning to identify more implicit synonyms. The validity of Pheno-Adv is verified by ablation and comparison experiments. (3) Pheno-Reviewer, an automatic human phenotype entity recognition system for biomedical literature, is designed and implemented. Taking biomedical literature uploaded by users in PDF format as input, the system automatically recognizes human phenotypes in the text and links them to the corresponding HPO concepts. The recognized phenotype entities and their HPO IDs are then presented to the user as a CSV file that can be downloaded. |
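
The false negative problem described in item (1) can be illustrated with a minimal sketch of dictionary-based distant labeling. Everything here is an assumption made for illustration: the toy dictionary `TOY_HPO`, the function name `distant_label`, the `PHENO` tag name, and the whitespace tokenization are hypothetical and do not come from the thesis. The sketch performs longest-match lookup and emits BIO tags; it is not the PhenoDA algorithm, whose details are not given in this abstract.

```python
from typing import Dict, List

# Hypothetical toy dictionary: lowercase phenotype term -> HPO ID.
# A real setup would load every term and synonym from the HPO release.
TOY_HPO: Dict[str, str] = {
    "short stature": "HP:0004322",
    "seizure": "HP:0001250",
}


def distant_label(tokens: List[str], dictionary: Dict[str, str],
                  max_len: int = 5) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the dictionary."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B-PHENO"
            for j in range(i + 1, i + matched):
                tags[j] = "I-PHENO"
            i += matched
        else:
            i += 1
    return tags


tokens = "The patient had short stature and reduced height".split()

# "reduced height" is a variant missing from the dictionary, so it is
# tagged "O": a false negative label in the distant supervision data.
before = distant_label(tokens, TOY_HPO)

# Adding the variant by hand (a stand-in for what an augmentation step
# would discover) turns that false negative into a positive label.
augmented = dict(TOY_HPO)
augmented["reduced height"] = "HP:0004322"
after = distant_label(tokens, augmented)

print(list(zip(tokens, before)))
print(list(zip(tokens, after)))
```

In this sketch the only difference between the two runs is the dictionary, which is the point of expanding positive labels before training: the model no longer sees the unlisted variant as a negative example.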