| Information extraction from biomedical documents can increase the degree of automation in constructing domain knowledge bases, which in turn supports computer applications in document retrieval, diagnostic decision making, innovation examination, and predictive analysis in the biomedical domain. Biomedical patents contain innovative, informative content backed by sophisticated experimental verification, which gives them high academic and commercial value. Anti-tumor drugs are a field that has received much attention recently, and a large number of related patents are published every year. However, these patents are still analyzed mainly by hand, which is time-consuming and expensive. It is therefore important to study automated methods for extracting key information from anti-tumor drug patents. This thesis takes information extraction from Chinese anti-tumor drug patents as its goal. A preliminary analysis of the content of anti-tumor drug patents showed that the key information consists mainly of entities such as chemical compounds, disease names, and drug targets. The thesis therefore concentrates on recognizing these entities in anti-tumor drug patents. The main work and results are as follows: (1) An anti-tumor drug patent entity recognition (ERATDP) dataset is constructed. By surveying public datasets and analyzing and referring to existing annotation guidelines, a project-driven annotation guideline is developed. The dataset is then built to support training and evaluation of the recognition models. (2) Because the dataset covers a rich set of entity categories, a combined method for entity recognition is studied. Based on an analysis of the characteristics of the entities, a method that combines dictionary-based, pattern-based, and machine-learning-based approaches is designed. In addition, to reduce the overfitting caused by the small amount of labelled data, text augmentation methods are applied to further improve performance. (3) Based on the combined entity recognition model, a prototype system for information extraction from Chinese anti-tumor drug patents is designed and implemented. Experiments on the ERATDP dataset show that the combined method achieves better overall performance than widely used methods, and that the text augmentation strategy improves performance on the sparse classes. The work in this thesis meets the requirements of a real-world project and also serves as a reference for related work on entity recognition in specific domains. |
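
The abstract does not specify how the dictionary-based, pattern-based, and machine-learning-based recognizers are combined. As an illustration only, the following minimal Python sketch shows one plausible scheme: each recognizer proposes labelled spans, and overlapping spans are resolved by a fixed source priority. The lexicon, the regular expression, the `Span` type, the priority order, and the optional `model` callable are all hypothetical and are not taken from the thesis; English toy text is used for readability.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity category
    source: str  # recognizer that proposed the span


# Hypothetical toy resources; a real system would load curated lexicons.
DISEASE_DICT = {"lung cancer": "DISEASE", "leukemia": "DISEASE"}
TARGET_PATTERN = re.compile(r"\b(?:EGFR|VEGFR|HER2)\b")


def dictionary_spans(text: str) -> List[Span]:
    """Propose a span for every case-insensitive lexicon hit."""
    spans = []
    for term, label in DISEASE_DICT.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append(Span(m.start(), m.end(), label, "dict"))
    return spans


def pattern_spans(text: str) -> List[Span]:
    """Propose a span for every match of the hand-written pattern."""
    return [Span(m.start(), m.end(), "TARGET", "pattern")
            for m in TARGET_PATTERN.finditer(text)]


def combine(text: str,
            model: Optional[Callable[[str], List[Span]]] = None) -> List[Span]:
    """Merge candidates; on overlap keep the higher-priority source."""
    candidates = dictionary_spans(text) + pattern_spans(text)
    if model is not None:
        candidates += model(text)  # stand-in for a trained sequence tagger
    # Assumed priority: dictionary, then pattern, then model; longer spans first.
    priority = {"dict": 0, "pattern": 1, "model": 2}
    candidates.sort(key=lambda s: (priority.get(s.source, len(priority)),
                                   -(s.end - s.start), s.start))
    chosen: List[Span] = []
    for span in candidates:
        if all(span.end <= c.start or span.start >= c.end for c in chosen):
            chosen.append(span)
    return sorted(chosen, key=lambda s: s.start)


if __name__ == "__main__":
    sentence = "The compound inhibits EGFR and is used to treat lung cancer."
    for s in combine(sentence):
        print(sentence[s.start:s.end], s.label, s.source)
```

A priority-based merge is only one option; a system could instead feed dictionary and pattern hits to the learned model as features, or vote among the recognizers.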