Research On The Construction Of A Wikipedia-based Chinese Named Entity Corpus

Posted on:2017-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:Z H Xu

GTID:2348330488461987

Subject:Computer Science and Technology

Abstract/Summary:

As a subtask of Information Extraction (IE), named entity recognition (NER) is one of the most fundamental tasks in Natural Language Processing research. It plays an important role in machine translation, automatic question answering, entity relation extraction, and related tasks. Machine-learning-based NER methods require large-scale annotated corpora, which are labor-intensive to build and are therefore limited in scale and domain coverage. To address this problem, this thesis automatically constructs a Chinese named entity corpus from the Chinese Wikipedia, with the following goals:

(1) Classifying entities in Chinese Wikipedia. Chinese Wikipedia currently contains over 860,000 entries, most of which are named entities. Wikipedia entities are classified with SVMs, using effective features extracted from Wikipedia infoboxes and categories together with additional Chinese-oriented extended and semantic features.

(2) Constructing a named entity corpus based on Wikipedia. We use the internal links within Wikipedia articles and the classification results from step (1) to automatically annotate named entities in text, and then build a large-scale named entity corpus through additional annotation and sentence selection. Finally, sampling statistics and a closed test are used to evaluate the quality of the automatically constructed corpus.

(3) Applying the automatically constructed corpus to NER. We compare the closed-test performance of the auto-annotated and manually annotated corpora. A hybrid-corpus test and a cross-domain test are then conducted to show the efficacy of the auto-annotated corpus.

The experimental results show that Wikipedia entries can be classified into entity types with high performance, and a named entity corpus is constructed on that basis. Although the corpus cannot rival manually annotated corpora, it is a useful supplement to them, and in the cross-domain test it outperforms certain corpora in particular domains. The NER corpus generated from Wikipedia therefore shows strong potential for further research and application.
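
The link-based annotation idea in goal (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `ENTITY_TYPES` table, the `annotate` function, the character-level BIO tag scheme, and the sample sentence are all assumptions made for illustration. The sketch assumes the classification step has already mapped article titles to entity types, and it tags only text that appears inside `[[Target]]` or `[[Target|anchor]]` wiki links.

```python
import re

# Hypothetical output of the entity-classification step:
# Wikipedia article title -> entity type. Made up for this example.
ENTITY_TYPES = {"北京": "LOC", "北京大学": "ORG", "鲁迅": "PER"}

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, entity_types):
    """Return a list of (character, tag) pairs with character-level BIO tags.

    Link anchors whose target has a known entity type are tagged
    B-TYPE / I-TYPE; everything else is tagged O.
    """
    tagged = []
    pos = 0
    for match in WIKI_LINK.finditer(wikitext):
        # Plain text before this link carries no annotation.
        tagged.extend((ch, "O") for ch in wikitext[pos:match.start()])
        target = match.group(1).strip()
        # Without a "|anchor" part, the displayed text is the target itself.
        anchor = match.group(2) or target
        etype = entity_types.get(target)
        for i, ch in enumerate(anchor):
            if etype is None:
                tagged.append((ch, "O"))
            else:
                tagged.append((ch, ("B-" if i == 0 else "I-") + etype))
        pos = match.end()
    # Plain text after the last link.
    tagged.extend((ch, "O") for ch in wikitext[pos:])
    return tagged


if __name__ == "__main__":
    sentence = "[[鲁迅]]曾在[[北京大学|北大]]任教。"
    for ch, tag in annotate(sentence, ENTITY_TYPES):
        print(ch, tag)
```

Mentions that are not linked stay tagged `O` in this sketch. The abstract's "additional annotation" and "sentence selection" steps presumably address that gap, but their details are not given here, so they are left out.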

Keywords/Search Tags:

Named Entity Recognition, Wikipedia, Corpora, Automatic Annotation

Related items

1	Automatic Approaches To Develop Large-scale TCM Electronic Medical Record Corpus For Named Entity Recognition Tasks
2	Semi-supervised Based Mobile Phone Named Entity Recognition
3	A Study On The Method Of Obtaining Equivalence Of Chinese And Cambodian Naming Entities
4	Named Entity Disambiguation Based On Wikipedia
5	Research And Implementation Of Named Entity Disambiguation Based On Wikipedia
6	Research On Named Entity Recognition Of Technology Project Notification
7	The Field Of Music, A Combination Of Rules And Statistical Named Entity Recognition
8	Mining Chinese-English Named Entity Pairs From Comparable Corpora
9	The Multi-strategic Research Of Chinese Weibo Entity And Wikipedia Entry Linking
10	Research On Named Entity Recognition And Disambiguation Based On Network Semantic Resource