Research On The Construction Of A Wikipedia-based Chinese Named Entity Corpus

Posted on: 2017-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Xu
Full Text: PDF
GTID: 2348330488461987
Subject: Computer Science and Technology
Abstract/Summary:
As a subtask of Information Extraction (IE), named entity recognition (NER) is one of the most fundamental tasks in Natural Language Processing research. It plays an important role in machine translation, automatic question answering, entity relation extraction, and related tasks. Machine-learning-based NER methods require large-scale annotated corpora, which are labor-intensive to build and therefore limited in scale and domain coverage. To address this problem, this thesis automatically constructs a Chinese named entity corpus from the Chinese Wikipedia in three steps:
(1) Classifying entities in Chinese Wikipedia. Chinese Wikipedia contains over 860,000 entries so far, most of which are named entities. Wikipedia entities are classified with SVMs, using features extracted from Wikipedia infoboxes and categories together with additional Chinese-oriented extended and semantic features.
(2) Constructing a named entity corpus based on Wikipedia. We use the internal links within Wikipedia articles and the classification results from step (1) to automatically annotate named entities in text, and then build a large-scale named entity corpus through additional annotation and sentence selection. Finally, sampling statistics and a closed test are used to evaluate the quality of the automatically constructed corpus.
(3) Applying the automatically constructed corpus to NER. We compare the closed-test performance of the automatically annotated and manually annotated corpora. A hybrid-corpus test and a cross-domain test are then conducted to show the efficacy of the automatically annotated corpus.
The experimental results show that Wikipedia entries can be classified into entity types with high performance, and that a named entity corpus can be constructed on this basis. Although the corpus cannot rival manually annotated corpora, it is a useful supplement to them, and in the cross-domain test it outperforms certain corpora in particular domains. The NER corpus generated from Wikipedia therefore has considerable potential for further research and application.
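The link-based annotation in step (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the `ENTITY_TYPES` table, the `annotate` function, the PER/LOC/ORG label set, and the character-level BIO scheme are all assumptions made for illustration, and the additional annotation and sentence selection stages are omitted. The sketch assumes the classification step has already produced a mapping from article titles to entity types, and it tags the anchor text of each wiki link with the type of its target article.

```python
import re

# Hypothetical output of the entity-classification step: article title -> type.
ENTITY_TYPES = {"北京市": "LOC", "清华大学": "ORG", "鲁迅": "PER"}

# Matches [[Target]] and [[Target|anchor text]] wiki links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, entity_types):
    """Return a list of (character, tag) pairs with character-level BIO tags."""
    tagged = []
    pos = 0
    for m in LINK_RE.finditer(wikitext):
        # Plain text before the link is treated as outside any entity.
        tagged.extend((ch, "O") for ch in wikitext[pos:m.start()])
        target = m.group(1).strip()
        anchor = m.group(2) or m.group(1)
        etype = entity_types.get(target)
        for i, ch in enumerate(anchor):
            if etype is None:
                # Link target is not a classified entity: leave it untagged.
                tagged.append((ch, "O"))
            else:
                tagged.append((ch, ("B-" if i == 0 else "I-") + etype))
        pos = m.end()
    # Remaining text after the last link.
    tagged.extend((ch, "O") for ch in wikitext[pos:])
    return tagged


sentence = "[[鲁迅]]曾在[[北京市|北京]]任教。"
result = annotate(sentence, ENTITY_TYPES)
for ch, tag in result:
    print(ch, tag)
```

Only linked mentions are tagged in this sketch, so unlinked occurrences of the same entity stay labeled `O`; that gap is presumably part of what the additional annotation step in the abstract is meant to close.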
Keywords/Search Tags: Named Entity Recognition, Wikipedia, Corpora, Automatic Annotation