
The Study Of Graininess And Feature Selection For Chinese Named Entity Recognition

Posted on: 2011-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Liu
Full Text: PDF
GTID: 2178330338479944
Subject: Computer Science and Technology
Abstract/Summary:
This work is supported by the Key Program of the National Natural Science Foundation of China (grant 60736014), the Key Project of the National High Technology Research and Development Program of China (grant 2006AA010108), and Microsoft Research Asia IFP (grant no. FY09-RES-THEME-158).

Named Entity Recognition (NER) is a foundation of Natural Language Processing. Generally speaking, there are two kinds of frameworks for NER: rule-based methods and statistics-based methods. Different categories of named entity have different characteristics; therefore, many highly targeted methods have been proposed, and they have achieved outstanding results. This thesis attempts to answer two questions about Chinese named entity recognition:
1. Which kind of token should be taken as the graininess (granularity) in the NER task: characters or words?
2. For different entity categories, which features or feature combinations are effective?

First of all, we convert named entity recognition into a sequence labeling problem. In theory, every machine learning model that can be used for sequence labeling can be used for named entity recognition. In this thesis, we choose a widely used model, conditional random fields (CRF). A CRF is a typical discriminative model and avoids making very strict independence assumptions on the observations. Meanwhile, the conditional probability is modeled directly, without the assumptions of generative models. Thus discriminative models can integrate various features, such as linguistic features, so they are more suitable for sequence labeling. We use not only local context features within a sentence, but also global knowledge features extracted from other occurrences of each word in the whole corpus.

In detail, this dissertation conducts the following research:
1. Graininess for named entity recognition. We try two different graininesses for Chinese named entity recognition: character-based and word-based. We design feature template experiments, choosing three characters and two words as the range of the feature templates. The results show that person names and location names are recognized better on the basis of characters than on the basis of words, which suggests that complex linguistic resources are not needed to achieve good results. Organization names, however, are better suited to the word-based approach.
2. Knowledge dictionaries. Because the training corpus is limited in size, name dictionaries have been found to be very useful in the named entity recognition task, such as Person First Names, Person Last Names, Left Boundary Words, Right Boundary Words, Address Names and Useful Name Class Suffixes. With these global features, the word-based experiment can solve the data-sparseness problem.
3. Feature templates. The CRF is designed as a general-purpose tool, so the feature templates have to be specified in advance. In our training process, the features are derived from the training corpus and the template file. Based on the different graininesses and knowledge dictionaries, we add different feature template files, such as unigram feature templates, word bigram feature templates, and global knowledge bigram feature templates. These features increase the overall performance of the experiment, especially the precision of NER.
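To make the two graininess choices concrete, the following is a minimal Python sketch of how one sentence might be turned into a sequence labeling instance at the character level and at the word level, with context-window and dictionary features of the kind described above. It is an illustration only, not the thesis's implementation: the BIO tag scheme, the example sentence, the window width, the dictionary contents, and the names `bio_tags` and `window_features` are all assumptions introduced here.

```python
from typing import Dict, List, Set, Tuple

# (start, end_exclusive, entity_type), measured in tokens of the chosen graininess
Span = Tuple[int, int, str]


def bio_tags(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into one BIO label per token."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def window_features(tokens: List[str], i: int, width: int,
                    dictionaries: Dict[str, Set[str]]) -> Dict[str, object]:
    """Context-window features plus dictionary-membership features for token i."""
    feats: Dict[str, object] = {}
    for offset in range(-width, width + 1):
        j = i + offset
        # Positions outside the sentence are padded (an assumption of this sketch).
        feats[f"tok[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    for name, entries in dictionaries.items():
        feats[f"in_{name}"] = tokens[i] in entries
    return feats


# Hypothetical example sentence: "刘翔在上海" ("Liu Xiang is in Shanghai").
chars = ["刘", "翔", "在", "上", "海"]   # character graininess
words = ["刘翔", "在", "上海"]           # word graininess (needs a word segmenter)

char_tags = bio_tags(len(chars), [(0, 2, "PER"), (3, 5, "LOC")])
word_tags = bio_tags(len(words), [(0, 1, "PER"), (2, 3, "LOC")])

# Toy stand-in for a "Person Last Names" dictionary; its contents are made up.
feats = window_features(chars, 0, 1, {"last_name": {"刘"}})

print(char_tags)
print(word_tags)
print(feats)
```

In this sketch the same entities yield five labels at the character level and three at the word level; the word-level sequence is shorter but depends on the word segmentation being correct, while the character-level sequence needs no segmenter. The feature dictionaries produced by `window_features` are in the general shape that CRF toolkits accept as per-token observations.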
Keywords/Search Tags: named entity recognition, feature selection, graininess, knowledge dictionaries