Research Of Word Representations On Biomedical Named Entity Recognition

Posted on: 2016-07-06 | Degree: Master | Type: Thesis
Country: China | Candidate: H L He | Full Text: PDF
GTID: 2308330461978006 | Subject: Computer application technology
Abstract/Summary:
Biomedical named entity recognition aims to analyze biomedical resources effectively and identify named entities such as DNA, RNA, and proteins. It is a key step in biomedical information processing and a prerequisite for protein-protein interaction (PPI) extraction tasks. The main work of this paper is as follows.

(1) Current research on biomedical named entity recognition mainly adopts statistical machine learning methods and concentrates on the learning algorithm, with less attention paid to feature selection. Most features are formulated from domain knowledge and expert experience, which is a time-consuming and laborious process. Some of these features are redundant; combining them with other features increases computation time and space complexity and degrades classifier performance. This paper summarizes general linguistic features, orthographic features, morphological features, and combined features according to the characteristics of biomedical named entities, and tries two different feature selection methods to search for the optimal feature subset, which removes redundancy and improves system performance. On the basis of this feature selection, the sequential forward selection method is used to optimize the feature template of the conditional random field (CRF) model, which effectively improves the classification performance of the system.

(2) Current machine-learning-based entity recognition methods depend mainly on manually summarized features derived from domain knowledge and experience, and require repeated experiments to select appropriate features. Moreover, these features rarely exploit deep semantic information.
To investigate the effect of semantic information on named entity recognition, this paper attempts to obtain semantic information automatically from a large-scale unlabeled corpus, which can be downloaded from public databases such as PubMed, and derives three kinds of word representations: word embeddings, clusters based on word embeddings, and Brown clusters. The three word representations are used as features for the CRF and SVM models and are combined with the optimal feature subset for semi-supervised learning. Comparative experiments are conducted under the same conditions, namely the same word-embedding dimension and the same number of clusters. The experimental results show that the word representation approaches learn latent semantic information effectively and thus improve the performance of existing machine-learning-based entity recognition systems. On the public BioCreative II GM evaluation corpus, the system reaches a precision of 91.11%, a recall of 86.05%, and an F-score of 88.51% without a dictionary or any other external resources.
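The three reported figures can be cross-checked against each other. The sketch below assumes that the F-score in the abstract is the balanced F1 measure, that is, the harmonic mean of precision and recall; the abstract does not state the formula, so this is an assumption, and the helper name `f_score` is purely illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """Return the F-beta score for precision and recall given in the same units.

    With beta=1.0 this is the harmonic mean of precision and recall (F1).
    Assumption: the abstract's "F-score" refers to this balanced measure.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both inputs are zero, so the score is defined as zero.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall as reported for BioCreative II GM, in percent.
reported_precision = 91.11
reported_recall = 86.05

f1 = f_score(reported_precision, reported_recall)
print(round(f1, 2))  # 88.51
```

Under that assumption the computed value rounds to the reported 88.51%, so the three numbers are mutually consistent.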
Keywords/Search Tags: Biomedical Named Entity Recognition, Feature Selection, Semi-supervised Learning, Word Representation, Cluster