Experimental comparison of discriminative learning approaches for Chinese word segmentation

Posted on:2009-04-08

Degree:M.Sc

Type:Thesis

University:Simon Fraser University (Canada)

Candidate:Song, Dong

GTID:2448390002492758

Subject:Language

Abstract/Summary:

Natural language processing tasks assume that the input is tokenized into individual words. In languages like Chinese, however, such tokens are not available in the written form. This thesis explores the use of machine learning to segment Chinese sentences into word tokens. We conduct a detailed experimental comparison between various methods for word segmentation. We have built two Chinese word segmentation systems and evaluated them on standard data sets.

The state of the art in this area involves the use of character-level features where the best segmentation is found using conditional random fields (CRF). The first system we implemented uses a majority voting approach among different CRF models and dictionary-based matching, and it outperforms the individual methods. The second system uses novel global features for word segmentation. Feature weights are trained using the averaged perceptron algorithm. By adding global features, performance is significantly improved compared to character-level CRF models.

Keywords: word segmentation; machine learning; natural language processing.
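The majority voting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the B/I tag set (B begins a word, I continues it), the function names `majority_vote` and `tags_to_words`, the tie-breaking rule, and the three hypothetical system outputs are all assumptions made for illustration. Each system (for example a CRF model or a dictionary matcher) is assumed to emit one tag per character, and the vote is taken independently at each character position.

```python
from collections import Counter


def majority_vote(tag_sequences):
    """Per-character majority vote over several systems' tag sequences.

    Ties go to the earliest system in the list (an assumption of this
    sketch, not a rule taken from the thesis).
    """
    voted = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        best = max(counts.values())
        # Pick the first system's tag among those with the top count.
        voted.append(next(t for t in position_tags if counts[t] == best))
    return voted


def tags_to_words(chars, tags):
    """Turn per-character B/I tags into a list of words."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


# Hypothetical outputs from three systems for the same five characters.
chars = list("我喜欢北京")
system_outputs = [
    ["B", "B", "I", "B", "I"],  # 我 / 喜欢 / 北京
    ["B", "B", "B", "B", "I"],  # 我 / 喜 / 欢 / 北京
    ["B", "B", "I", "B", "B"],  # 我 / 喜欢 / 北 / 京
]

voted_tags = majority_vote(system_outputs)
words = tags_to_words(chars, voted_tags)
print(" / ".join(words))
```

In this toy case two of the three systems agree at every position, so the voted segmentation matches the first system even though each of the other two makes a different single error.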

Keywords/Search Tags:

Word, Chinese, CRF

Related items

1	Research For Chinese New Word Identification Based On Context-aware
2	Research And Implementation Of Chinese Word Segmentation Algorithm
3	Research Of Chinese Word Segmentation In BERSE
4	Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery
5	Research On Chinese Word Segmentation Method Based On Word Embedding
6	The Research On Chinese Word Segmentation System Based On SVM
7	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
8	Research Of Combined Chinese Word Segmentation Method
9	Research On Chinese Word Segmentation Algorithm Based On News Text
10	The Research Of Unknown Chinese Word Recognition And Its Application To Chinese Input Method