The Study Of Graininess And Feature Selection For Chinese Named Entity Recognition

Posted on:2011-07-28

Degree:Master

Type:Thesis

Country:China

Candidate:Z X Liu

Full Text:PDF

GTID:2178330338479944

Subject:Computer Science and Technology

Abstract/Summary:

This work is supported by the Key Program of the National Natural Science Foundation of China (grant 60736014), the Key Project of the National High Technology Research and Development Program of China (grant 2006AA010108), and Microsoft Research Asia IFP (grant no. FY09-RES-THEME-158).

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing. Broadly, there are two frameworks for NER: rule-based methods and statistics-based methods. Different categories of named entity have different characteristics, so many highly targeted methods have been proposed, and these have achieved outstanding results. This paper attempts to answer two questions about Chinese named entity recognition:

1. Which kind of token should be taken as the graininess (the basic unit) in the NER task: characters or words?
2. For each category of entity, which features or feature combinations are effective?

First, we convert named entity recognition into a sequence labeling problem. In theory, any machine learning model that can be used for sequence labeling can be used for named entity recognition. In this paper, we choose a widely used model, conditional random fields (CRF). The CRF is a typical discriminative model and avoids making very strict independence assumptions on the observations. It models the conditional probability of the labels given the observations directly, without the assumptions required by generative models. Discriminative models can therefore integrate various features, such as linguistic features, which makes them more suitable for sequence labeling. In addition, we use not only local context features within a sentence, but also global knowledge features extracted from other occurrences of each word in the whole corpus.

In detail, this dissertation conducts the following research:

1. Graininess for named entity recognition. This paper tries two different graininess levels for Chinese named entity recognition: character-based and word-based. We design feature template experiments, choosing three characters and two words as the feature template ranges. The results show that person names and location names are recognized better with characters than with words, which suggests that complex linguistic resources are not needed to achieve good results. Organization names, however, are better suited to the word-based approach.
2. Knowledge dictionaries. Because the training corpus is limited in size, name dictionaries have been found to be very useful in the named entity recognition task. Examples include Person First Names, Person Last Names, Left Boundary Words, Right Boundary Words, Address Names, and Useful Name Class Suffixes. With these global features, the word-based experiment can address the data-sparseness problem.
3. Feature templates. The CRF is designed as a general-purpose tool, so the feature templates have to be specified in advance. In our training process, the features are derived from the training corpus and the template file. Based on the different graininess levels and knowledge dictionaries, we add different feature template files, such as the unigram feature templates, the word bigram feature templates, and the global knowledge bigram feature templates. These features increase the overall performance of the experiment, especially the precision of NER.
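The difference between character-based and word-based sequence labeling described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The BIO tagging scheme, the example sentence and its segmentation, the function names `to_bio` and `token_features`, the `<PAD>` marker, and the `last_names` dictionary are all assumptions introduced for illustration. The abstract does not state which tagging scheme or toolkit was used.

```python
from typing import Dict, List, Optional, Set, Tuple

# A chunk is (words, entity_type); entity_type is None outside any entity.
Chunk = Tuple[List[str], Optional[str]]


def to_bio(chunks: List[Chunk], by_char: bool) -> Tuple[List[str], List[str]]:
    """Flatten chunks into (tokens, BIO tags) at character or word graininess."""
    tokens: List[str] = []
    tags: List[str] = []
    for words, etype in chunks:
        # Character graininess splits every word into its characters.
        units = list("".join(words)) if by_char else list(words)
        for i, unit in enumerate(units):
            tokens.append(unit)
            if etype is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + etype)
    return tokens, tags


def token_features(
    tokens: List[str],
    i: int,
    window: int,
    dictionaries: Dict[str, Set[str]],
) -> Dict[str, object]:
    """Collect context-window features and dictionary-membership features."""
    feats: Dict[str, object] = {}
    # Local context: the tokens within `window` positions of token i.
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"tok[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    # Knowledge features: whether the current token appears in each dictionary.
    for name, entries in dictionaries.items():
        feats[f"in_{name}"] = tokens[i] in entries
    return feats


# Hypothetical example: "李明在北京大学工作" ("Li Ming works at Peking University").
chunks: List[Chunk] = [
    (["李明"], "PER"),
    (["在"], None),
    (["北京", "大学"], "ORG"),
    (["工作"], None),
]
char_tokens, char_tags = to_bio(chunks, by_char=True)
word_tokens, word_tags = to_bio(chunks, by_char=False)
print(list(zip(char_tokens, char_tags)))
print(list(zip(word_tokens, word_tags)))
print(token_features(char_tokens, 0, 1, {"last_names": {"李"}}))
```

The same annotated sentence yields a longer, finer-grained tag sequence at character level and a shorter one at word level. The `window` argument stands in for the feature template range, and each dictionary contributes one membership feature per token. In a CRF toolkit, feature dictionaries like these would be produced from a template file rather than written by hand.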

Keywords/Search Tags:

named entity recognition, feature selection, graininess, knowledge dictionaries

Related items

1	Knowledge Mining Based On Statistical Snowball Models
2	Recognition And Discovery Of Programming Design Network Resource Named Knowledge Entity
3	Study On Recognition Of Chinese Agricultural Named Entity With CRF
4	Research On Chinese Named Entity Recognition Based On Feature Enhancement
5	Chinese Named Entity Recognition Based On Conditional Random Fields
6	Research Of Word Representations On Biomedical Named Entity Recognition
7	Research On Named Entity Recognition And Entity Link Method For Short Text Questions
8	Named Entity Recognition Of Middle School Mathematics Knowledge Based On Deep Learning
9	Research On Biomedical Named Entity Recognition Based On Hybrid Model
10	Research On Chinese Named Entity Recognition With External Knowledge And Application In Medical Field