Excellent Cross-validation Based Model Selection Method For Chinese Word Segmentation System Design And Development

Posted on:2013-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:L X Shi

Full Text:PDF

GTID:2248330374456277

Subject:Control Engineering

Abstract/Summary:

Chinese word segmentation is the foundation of natural language processing and is widely applied in syntactic analysis, semantic analysis, information retrieval, text classification, machine translation, question answering, and other areas of natural language processing research. Because the performance of the word segmentation system directly affects the performance of the entire research project, the task has attracted many researchers.

Chinese word segmentation technology was first proposed in the 1980s. Segmentation methods include dictionary-based, understanding-based, and statistical word segmentation. Statistics-based techniques are currently the most mature and perform better, so they are widely used by researchers. However, many basic theoretical problems in statistics-based Chinese word segmentation models have not been fully raised or addressed. Most researchers one-sidedly pursue the highest F value, without considering the stability of the segmentation system or whether the performance differences between systems are significant.

In this paper, we use conditional random field models for Chinese word segmentation experiments. For tag set selection and feature template selection, we compare blocked 3×2 (three groups of two folds), 5-fold, and 10-fold cross-validation, based on four tag sets (BM, BMS, BMSE, BB2B3MSE) and six feature templates, to choose the best Chinese segmentation model.

In terms of the tag set, the experiments show that BB2B3MSE tagging achieves higher accuracy than the other three tag sets.
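The four tag sets named above can be illustrated with a small sketch that converts an already segmented sentence into per-character tags. The abstract does not define the tag sets, so the definitions below are assumptions based on common usage: B marks the first character of a word, B2 and B3 the second and third, M any remaining middle character, E the last character, and S a single-character word. The function names `tag_word` and `tag_sentence` are hypothetical.

```python
def tag_word(word, scheme):
    """Return one tag per character of `word` under the given tag set.

    The tag-set definitions here are assumptions, not taken from the thesis.
    """
    n = len(word)
    if n == 0:
        raise ValueError("word must not be empty")
    if scheme == "BM":
        # Two tags: word-initial character vs. every other character.
        return ["B"] + ["M"] * (n - 1)
    if scheme == "BMS":
        # Adds a dedicated tag for single-character words.
        return ["S"] if n == 1 else ["B"] + ["M"] * (n - 1)
    if scheme == "BMSE":
        # Adds a dedicated tag for the word-final character.
        return ["S"] if n == 1 else ["B"] + ["M"] * (n - 2) + ["E"]
    if scheme == "BB2B3MSE":
        if n == 1:
            return ["S"]
        # Characters strictly between the first and last: B2, B3, then M.
        interior = ["B2", "B3"][: n - 2] + ["M"] * max(0, n - 4)
        return ["B"] + interior + ["E"]
    raise ValueError(f"unknown tag set: {scheme}")


def tag_sentence(words, scheme):
    """Flatten a segmented sentence into (character, tag) pairs."""
    pairs = []
    for word in words:
        pairs.extend(zip(word, tag_word(word, scheme)))
    return pairs


if __name__ == "__main__":
    sentence = ["我", "喜欢", "自然语言处理"]
    for scheme in ("BM", "BMS", "BMSE", "BB2B3MSE"):
        print(scheme, [tag for _, tag in tag_sentence(sentence, scheme)])
```

Richer tag sets give the sequence model more positional information about long words, at the cost of more labels to learn; under these assumed definitions, only BB2B3MSE distinguishes the second and third characters of a word from later ones.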
Using BB2B3MSE, the highest F-value achieved by a word segmentation model is 95.58%. For the feature template, contrary to intuition, the combination of unary and binary features from the window [-1, 1] outperforms the other combinations of positional features.

The final results are obtained on the 500 million corpus of Shanxi University: precision, recall, and F1-measure reach 95.44%, 95.31%, and 95.37%, respectively. For public access, a web-based Chinese segmentation system and a corresponding web service interface are provided.
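The blocked 3×2 cross-validation compared in the abstract can be sketched as follows. The abstract does not describe the construction, so this is a minimal sketch under an assumption: the data is divided into four blocks, and each of the three ways of pairing the blocks into two halves gives one 2-fold group, for six train/test splits in total. The function name `blocked_3x2_splits` and its parameters are hypothetical.

```python
import random


def blocked_3x2_splits(n_items, seed=0):
    """Return six (train_indices, test_indices) pairs: 3 groups x 2 folds.

    Assumed construction: shuffle the item indices, cut them into four
    blocks, and use the three distinct pairings of blocks into two halves.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Four blocks whose sizes differ by at most one item.
    blocks = [indices[i::4] for i in range(4)]
    # The three ways to pair four blocks into two halves.
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    splits = []
    for left, right in pairings:
        half_a = sorted(blocks[left[0]] + blocks[left[1]])
        half_b = sorted(blocks[right[0]] + blocks[right[1]])
        # Each half serves once as training data and once as test data.
        splits.append((half_a, half_b))
        splits.append((half_b, half_a))
    return splits


if __name__ == "__main__":
    for train, test in blocked_3x2_splits(8):
        print("train:", train, "test:", test)
```

With this construction, training sets taken from different groups always share exactly one block (about a quarter of the data), whereas independently reshuffled 2-fold splits would overlap by a random amount. Each candidate model (one tag set with one feature template) would be trained and scored on all six splits, so that both the mean F value and its spread can be compared.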

Keywords/Search Tags:

Chinese Word Segmentation, Conditional Random Fields Model, Tag Set, Feature Template, Cross-Validation

Related items

1	Research Of Chinese Word Segmentation With Conditional Random Fields
2	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
3	Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model
4	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
5	The Research On Chinese Word Segmentation Based On Conditional Random Fields In Big Data Environment
6	Research Of Named Entity Recognition Based On Conditional Random Fields
7	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
8	The Research Of Chinese Word Segmentation Based On CRF
9	The Key Technology On Chinese Word Segmentation Based On Bi-LSTM-CRF Model
10	The Research On Character-word Based Joint Decoding For Chinese Word Segmentation