
Open-domain Named Entity Recognition And Hierarchical Category Acquisition

Posted on: 2015-05-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R J Fu
GTID: 1228330422492485
Subject: Computer Science and Technology
Abstract/Summary:
Named Entity Recognition (NER) aims to identify and classify the names of entities in text. The traditional NER task mainly recognizes name-like information units, including person, location and organization names. However, because its categories are limited, traditional NER cannot fully satisfy the requirements of other applications in the Natural Language Processing (NLP) field. Therefore, this thesis focuses on open-domain NER and hierarchical category acquisition to support NLP applications such as information extraction, information retrieval, open-domain question answering and machine translation.

Open-domain named entities (NEs) have two characteristics. On the one hand, they have more categories than traditional NEs, and the categories cannot be predefined. On the other hand, their categories are more fine-grained than those of traditional NEs and are organized as hierarchies. As a result, we cannot label training data, and one NE may belong to many categories of different granularity. Because of these challenges, traditional sequence labeling methods cannot be used for open-domain NER. Research on open-domain NER has two important parts. One is boundary identification of NEs in unstructured text, where the principal problem is how to construct and use training data. The other is semantic category extraction of NEs, which is a big challenge because the categories have different granularities and cannot be predefined. This thesis covers both issues and consists of four parts.

The first part is a general framework for automatically labeling large-scale NER training data from bilingual parallel corpora. A shortage of NER training data may cause NER models to overfit to a domain. However, manual annotation of NER training data is costly. To solve this problem, we project the NE labels from the English side to the Chinese side according to the word-level alignments in bilingual parallel corpora. Subsequently, we propose several strategies to select high-quality auto-labeled NER training data. Experimental results show that our approach can collect high-quality labeled data. The NER model trained on the auto-labeled data achieves results comparable to those of the model trained on manually annotated data. Moreover, combining the two kinds of data helps improve the precision and recall of Chinese NER.

The second part is a self-training approach to identify the boundaries of Chinese open-domain named entities in context. No training data exists for open-domain named entities, because their open categories make training data difficult to annotate. To address this shortage, we first generate a large-scale Chinese proper noun corpus based on parallel corpora. We also transform a Chinese dependency treebank into a noun compound training corpus. Subsequently, we propose a self-training-based approach to combine the two corpora and train a model to identify the boundaries of named entities. In addition, we propose new features for this task, namely verb-centered dependency relation features and NE formation features. Our experiments show that the proposed method can take full advantage of the two corpora and improve the performance of named entity boundary identification. They also show the effectiveness of the proposed features.

The third part is an approach to discover semantic categories for Chinese open-domain NEs by exploiting multiple information sources. Usually, the hypernyms of an NE can be used to represent its semantic categories. Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i.e., search engine results, encyclopedias, and the morphology of the entity name. First, we extract candidate hypernyms from these sources. Then, we apply a statistical ranking model to select the correct hypernyms. A set of novel features is proposed for the ranking model. Experimental results demonstrate that the evidence from different sources can corroborate and complement each other to improve both precision and recall. Our approach outperforms the state-of-the-art methods on a manually labeled test dataset.

The fourth part is a novel and effective method for constructing semantic hierarchies of NE categories based on word embeddings. One NE may have many hypernyms representing its semantic categories, and hypernym-hyponym relations usually exist among these hypernyms. We learn piecewise semantic projections from NEs to their hypernyms based on word embeddings. We then identify whether a candidate word pair has a hypernym-hyponym relation by using the projections. Our results outperform the state-of-the-art methods on a manually labeled test dataset. Moreover, combining our method with a previous method for extending manually built hierarchies further improves performance.

In conclusion, focusing on the challenges of open-domain NER and the characteristics of Chinese, this thesis conducts a thorough study of automatic labeling of NER training data, boundary identification, semantic category discovery and semantic hierarchy construction. This work has produced several preliminary achievements, which we hope can further the progress of high-level NLP applications such as information extraction, question answering and machine translation.
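As a rough illustration of the label projection step in the first part, the sketch below copies entity labels from English tokens to the Chinese tokens aligned with them. The function name `project_labels`, the flat label scheme (PER, LOC, ORG, O), the first-label-wins rule for conflicting alignments and the example sentence pair are all assumptions made for illustration. The abstract does not specify the thesis's actual projection rules or its strategies for selecting high-quality data.

```python
from typing import List, Tuple


def project_labels(
    src_labels: List[str],
    alignments: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project NE labels from source tokens to aligned target tokens.

    src_labels: one label per source (English) token, "O" for non-entities.
    alignments: (source_index, target_index) word-alignment pairs.
    tgt_len:    number of target (Chinese) tokens.
    """
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in alignments:
        label = src_labels[src_idx]
        # Assumption: keep the first entity label a target token receives
        # and ignore later conflicting alignments.
        if label != "O" and tgt_labels[tgt_idx] == "O":
            tgt_labels[tgt_idx] = label
    return tgt_labels


# Hypothetical sentence pair with a hand-written word alignment.
english = ["Obama", "visited", "Beijing"]
english_labels = ["PER", "O", "LOC"]
chinese = ["奥巴马", "访问", "了", "北京"]
alignment = [(0, 0), (1, 1), (2, 3)]

projected = project_labels(english_labels, alignment, len(chinese))
for token, label in zip(chinese, projected):
    print(token, label)
```

Unaligned target tokens (here 了) default to "O". In practice, noisy alignments make a filtering step necessary, which is the role of the data selection strategies mentioned above.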
Keywords/Search Tags: Open-domain Named Entity Recognition, Automatic Corpus Construction, Boundary Identification, Category Extraction, Category Hierarchy Construction