Chinese Named Entity Recognition With A Hybrid-Statistical Model

Posted on: 2005-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Zhang
Full Text: PDF
GTID: 2168360155971763
Subject: Computer Science and Technology
Abstract/Summary:
Named Entity Recognition (NER) is the task of classifying certain proper nouns and significant phrases in a document into predefined categories. It was first introduced as a subtask of the Message Understanding Conference (MUC) in 1995 (MUC-6). Named entities were defined as entity names (personal names, location names and organization names), temporal expressions (date expressions and time expressions) and number expressions. NER has recently attracted increasing attention from Natural Language Processing (NLP) researchers. Moreover, it is a key technology in information retrieval, information extraction, digital libraries, question answering, etc. It is also a major obstacle in lexical analysis.

Up to now, NER research has been carried out on many languages. Great success has been achieved, especially for English and other Western languages. However, for Chinese and many other oriental languages, NER is still a work in progress. Apart from the limits of current technology, Chinese NER has difficulties of its own, all of which are obstacles.

There are two kinds of methods in NER: rule-based and statistical. Rule-based methods give more accurate results. However, they are neither robust nor portable, and they are bottlenecked by knowledge acquisition. Statistical methods, on the other hand, are faster and more efficient while consuming fewer resources, but their precision may not be as good as that of rule-based methods. Both kinds of methods have advantages as well as disadvantages.

After analyzing many existing NER methods, we design our NER method using a hybrid-statistical model. This model integrates two statistical models, the Hidden Markov Model (HMM) and the Maximum Entropy Model (ME), and applies linguistic knowledge. Classified by granularity, linguistic knowledge consists of two kinds of information: knowledge at the set level, and frequency information about the usage of characters and words. The former includes the Part-of-Speech dictionary, indicative words and so on.
Different sets have different effects on NER or Part-of-Speech tagging. The frequency information indicates how likely a word sequence is to be some kind of entity. The Maximum Entropy Model (MEM) is used to calculate the observation probabilities of potential entity names. The indicative words used to invoke MEM include surnames and the suffixes of location names and organization names. When a word sequence matches a rule for a specific kind of entity name, MEM is invoked. MEM is regarded as a sub-model of the HMM. The Viterbi algorithm is used to find the most likely Part-of-Speech sequence for the given word sequence.

Our NER method includes two parts. The first is the recognition of personal names, location names and organization names, which is the main work of this paper. The second is the recognition of temporal expressions and number expressions. It is implemented on the framework of the former. The experiments show that the hybrid-statistical model can achieve satisfactory results on Chinese NER. However, our method is still at the experimental stage. Much work remains to improve the system. In the future we will focus on the ME model, especially on feature selection and parameter training.
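
The decoding step mentioned in the abstract can be illustrated with a minimal sketch of Viterbi decoding over an HMM. This is not the thesis's implementation: the function name `viterbi`, the tag set (`PER`, `V`, `LOC`), the toy probabilities and the probability floor are all assumptions made for illustration. In the thesis the observation probabilities of candidate entity names come from the maximum entropy sub-model; here a plain lookup table (`emit_p`) stands in for it.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely state (tag) sequence for the observations."""

    def log(p):
        # Unseen events get a small floor probability (an assumption of this
        # sketch) so that the logarithm is always defined.
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model with hypothetical tags and probabilities (illustration only).
states = ["PER", "V", "LOC"]
start_p = {"PER": 0.5, "V": 0.1, "LOC": 0.4}
trans_p = {
    "PER": {"PER": 0.1, "V": 0.8, "LOC": 0.1},
    "V": {"PER": 0.3, "V": 0.1, "LOC": 0.6},
    "LOC": {"PER": 0.2, "V": 0.6, "LOC": 0.2},
}
emit_p = {
    "PER": {"Zhang": 0.9, "Beijing": 0.1},
    "V": {"visited": 1.0},
    "LOC": {"Beijing": 0.8, "Zhang": 0.2},
}

tags = viterbi(["Zhang", "visited", "Beijing"], states, start_p, trans_p, emit_p)
print(tags)
```

Working in log space avoids numerical underflow on long sequences. A hybrid system of the kind described would replace the `emit_p` lookup with a call to the sub-model whenever an indicative word triggers it.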
Keywords/Search Tags: Named Entity Recognition, Hidden Markov Model, Maximum Entropy Model