Chinese Lexical Analysis And Named Entity Identification Using Hierarchical Hidden Markov Model

Posted on: 2005-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: H K Yu
Full Text: PDF
GTID: 2168360125468063
Subject: Computer applications
Abstract/Summary:
This thesis presents an approach to Chinese lexical analysis using a hierarchical hidden Markov model (HHMM), which aims to incorporate Chinese word segmentation, part-of-speech (POS) tagging, disambiguation and named entity identification into one integrated theoretical framework. A class-based hidden Markov model (HMM) is applied to word segmentation; in this model, unknown words are treated in the same way as common words listed in the lexicon. Named entities are recognized reliably from role sequences, which are tagged with the Viterbi algorithm in a role HMM. For disambiguation, the author proposes an N-shortest-path strategy that, at an early stage, keeps the top N segmentation results as candidates and thereby covers more ambiguity. Various experiments show that each level of the HHMM contributes to Chinese lexical analysis. An HHMM-based system, ICTCLAS, was implemented. The system not only took the top rank in the official open evaluation held by the 973 Project in 2002, but also achieved two first ranks and one second rank in the first international word segmentation bakeoff held by SIGHAN (the ACL Special Interest Group on Chinese Language Processing) in 2003. These results indicate that ICTCLAS is one of the best Chinese lexical analyzers. In short, the HHMM is effective for Chinese lexical analysis.
Keywords/Search Tags: Chinese lexical analysis, word segmentation, POS tagging, named entity identification, hierarchical hidden Markov model, ICTCLAS
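
The N-shortest-path idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the ICTCLAS implementation: it assumes a plain unigram cost per word (for example, a negative log probability), whereas the thesis uses a class-based HMM. The function name `n_shortest_segmentations`, the `unknown_cost` fallback, and the toy lexicon with its costs are all hypothetical and chosen only for illustration.

```python
import heapq

def n_shortest_segmentations(sentence, lexicon, n=2, unknown_cost=10.0):
    """Return up to n lowest-cost segmentations as (cost, words) pairs.

    lexicon maps a word to a non-negative cost. Single characters missing
    from the lexicon get unknown_cost so that every position stays reachable.
    """
    length = len(sentence)
    # best[i] holds up to n (cost, words) candidates covering sentence[:i].
    best = [[] for _ in range(length + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, length + 1):
        candidates = []
        for start in range(end):
            word = sentence[start:end]
            if word in lexicon:
                cost = lexicon[word]
            elif end - start == 1:
                cost = unknown_cost  # hypothetical fallback for unseen characters
            else:
                continue  # multi-character string that is not a known word
            # Extend every candidate that ends where this word begins.
            for prev_cost, prev_words in best[start]:
                candidates.append((prev_cost + cost, prev_words + [word]))
        # Keep only the n cheapest partial segmentations at this position.
        best[end] = heapq.nsmallest(n, candidates, key=lambda c: c[0])
    return best[length]

# Toy lexicon with made-up costs (lower means more likely).
TOY_LEXICON = {
    "研究": 2.0, "研究生": 3.0, "生命": 2.5,
    "研": 5.0, "究": 5.0, "生": 4.0, "命": 4.0,
}

if __name__ == "__main__":
    for cost, words in n_shortest_segmentations("研究生命", TOY_LEXICON, n=2):
        print(cost, "/".join(words))
```

With these toy costs and n=2, both readings of the ambiguous string 研究生命 (研究/生命 and 研究生/命) survive as candidates, so a later stage such as POS tagging or named entity recognition could still choose between them. Keeping the n cheapest partial paths at each position is sufficient here because the word lattice is a directed acyclic graph.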