A Statistics-Based Language Model Approach To Chinese Word Segmentation

Posted on: 2007-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
Full Text: PDF
GTID: 2178360185468217
Subject: Signal and Information Processing
Abstract/Summary:
There are two fundamental issues in word-level Chinese language processing: Chinese word segmentation and Chinese named entity recognition (NER). Most current systems treat these as separate tasks and handle them with different components arranged in a cascade. However, we believe that the two problems are not separable in nature and are better solved simultaneously. In this paper, we present a unified approach to both.

Statistical language models (SLMs) have been successfully applied in many domains, such as speech recognition, information retrieval and spoken language understanding. In particular, trigram models have been demonstrated to be highly effective in these domains. In this paper, we extend word-based trigram modeling to Chinese word segmentation and Chinese named entity recognition by proposing a unified SLM approach.

This paper is intended to address two fundamental issues in Chinese natural language processing (NLP) with a unified approach: Chinese word segmentation and named entity recognition (NER). We present a method that uses a class-based language model (LM), in which the class definitions concentrate on six types: Chinese personal names and foreign personal names, Chinese location names and foreign location names, and Chinese organization names and foreign organization names. The model consists of two sub-models: (1) a set of entity models, each of which estimates the generative probability of a Chinese character string given a class; and (2) a contextual model, which estimates the generative probability of a class sequence. Our model thus provides a statistical...
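The two sub-models described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the class labels (`WORD`, `PER`, `LOC`), the probability tables, and the function name `segment` are all hypothetical, and the contextual model here is a class bigram rather than the trigram the abstract mentions, purely to keep the example short. The search picks the segmentation and class labeling that maximize the product of the contextual probability P(class | previous class) and the entity probability P(string | class).

```python
import math

# Hypothetical toy entity models: P(character string | class).
# The values are invented for illustration only.
ENTITY = {
    "WORD": {"我": 0.2, "在": 0.2, "工作": 0.2, "北": 0.05, "京": 0.05},
    "LOC": {"北京": 0.5},
    "PER": {"李明": 0.5},
}

# Hypothetical toy contextual model: P(class | previous class).
# A bigram is used here for brevity; the abstract describes trigrams.
CONTEXT = {
    ("<s>", "WORD"): 0.6, ("<s>", "PER"): 0.3, ("<s>", "LOC"): 0.1,
    ("WORD", "WORD"): 0.6, ("WORD", "PER"): 0.2, ("WORD", "LOC"): 0.2,
    ("PER", "WORD"): 0.9, ("PER", "PER"): 0.05, ("PER", "LOC"): 0.05,
    ("LOC", "WORD"): 0.9, ("LOC", "PER"): 0.05, ("LOC", "LOC"): 0.05,
}


def segment(text, max_len=4):
    """Jointly segment and label `text` with a Viterbi-style search.

    Returns a list of (string, class) pairs, or None if no full
    segmentation is possible under the toy models.
    """
    # best[i][c] = (log probability, start index, previous class) of the
    # best path whose last segment ends at position i and has class c.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<s>"] = (0.0, None, None)

    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            for cls, model in ENTITY.items():
                p_entity = model.get(piece)
                if p_entity is None:
                    continue  # this class cannot generate the string
                for prev_cls, (prev_score, _, _) in best[start].items():
                    p_ctx = CONTEXT.get((prev_cls, cls))
                    if p_ctx is None:
                        continue
                    score = prev_score + math.log(p_ctx) + math.log(p_entity)
                    if cls not in best[end] or score > best[end][cls][0]:
                        best[end][cls] = (score, start, prev_cls)

    if not best[len(text)]:
        return None

    # Trace back from the best final class to recover the segments.
    cls = max(best[-1], key=lambda c: best[-1][c][0])
    end = len(text)
    segments = []
    while end > 0:
        _, start, prev_cls = best[end][cls]
        segments.append((text[start:end], cls))
        end, cls = start, prev_cls
    return segments[::-1]
```

Because word boundaries and entity classes are chosen in the same search, a string such as 北京 is kept whole and labeled as a location only when that joint hypothesis scores higher than splitting it into single-character words, which is the sense in which segmentation and NER are solved simultaneously.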
Keywords/Search Tags: SLM, Chinese word segmentation, Chinese named entity recognition, Word-based trigram language model, Class-based language model