
Excellent Cross-validation Based Model Selection Method For Chinese Word Segmentation System Design And Development

Posted on: 2013-12-02 | Degree: Master | Type: Thesis
Country: China | Candidate: L X Shi | Full Text: PDF
GTID: 2248330374456277 | Subject: Control Engineering
Abstract/Summary:
Chinese word segmentation is the basis of natural language processing. It has been widely applied in syntactic analysis, semantic analysis, information retrieval, text classification, machine translation, question answering, and other areas of natural language processing research. Because the performance of the segmentation system directly affects the performance of the whole downstream system, the task has attracted many researchers.

Chinese word segmentation techniques were proposed as early as the 1980s. Segmentation methods include dictionary-based, understanding-based, and statistical segmentation. Statistical techniques are currently the most mature and perform best, so they are the most widely used by researchers. However, many basic theoretical problems in statistical Chinese word segmentation models have not yet been fully stated or addressed. Most researchers pursue only the highest F value, without considering the stability of the segmentation system or whether the performance differences between systems are statistically significant.

In this thesis, we use conditional random field (CRF) models for Chinese word segmentation experiments. For tag set selection and feature template selection, we compare the blocked 3×2 (three groups of two folds), 5-fold, and 10-fold cross-validation methods over four tag sets (BM, BMS, BMSE, BB2B3MSE) and six feature templates in order to choose the best Chinese word segmentation model.

For the tag set, the experiments show that BB2B3MSE tagging achieves higher accuracy than the other three tag sets; with BB2B3MSE, the highest F value of the resulting segmentation model is 95.58%. For the feature template, contrary to intuition, the combination of unigram and bigram features from the window [-1, 1] outperforms the other combinations of position features.

The final results are obtained on the 500-million corpus of Shanxi University: precision, recall, and F1 measure reach 95.44%, 95.31%, and 95.37%, respectively. For public access, a web-based Chinese word segmentation system and a corresponding web service interface are provided.
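To make the four tag sets and the [-1, 1] feature window concrete, the following Python sketch converts a segmented sentence into character-level tags and extracts window features for one character. It is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not define the tags, so the common reading is assumed (B = first character, B2 and B3 = second and third characters, M = other interior characters, E = last character, S = single-character word), the feature names and the `<PAD>` symbol are hypothetical, and the bigram features are assumed to be the two adjacent pairs inside the window.

```python
SCHEMES = ("BM", "BMS", "BMSE", "BB2B3MSE")


def tag_word(word, scheme):
    """Return one tag per character of `word` under the given tag set.

    Tag meanings are an assumption (the abstract does not define them).
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown tag set: {scheme}")
    n = len(word)
    if scheme == "BM":
        # Two tags: word-initial character vs. every other character.
        return ["B"] + ["M"] * (n - 1)
    if n == 1:
        # The remaining tag sets all mark single-character words with S.
        return ["S"]
    if scheme == "BMS":
        return ["B"] + ["M"] * (n - 1)
    if scheme == "BMSE":
        return ["B"] + ["M"] * (n - 2) + ["E"]
    # BB2B3MSE: distinguish the first three characters, then M, then E.
    head = ["B", "B2", "B3"][: n - 1]
    return head + ["M"] * (n - 1 - len(head)) + ["E"]


def tag_sentence(words, scheme):
    """Flatten a segmented sentence into (character, tag) pairs."""
    pairs = []
    for word in words:
        pairs.extend(zip(word, tag_word(word, scheme)))
    return pairs


def window_features(chars, i):
    """Unigram and adjacent-bigram features from the window [-1, 1] around i.

    Feature names and the padding symbol are hypothetical.
    """
    def at(j):
        return chars[j] if 0 <= j < len(chars) else "<PAD>"

    prev_c, cur_c, next_c = at(i - 1), at(i), at(i + 1)
    return {
        "C-1": prev_c,
        "C0": cur_c,
        "C+1": next_c,
        "C-1C0": prev_c + cur_c,
        "C0C+1": cur_c + next_c,
    }


if __name__ == "__main__":
    sentence = ["中文", "分词", "是", "基础"]
    for scheme in SCHEMES:
        print(scheme, tag_sentence(sentence, scheme))
    print(window_features(list("中文分词"), 0))
```

A CRF toolkit would then be trained on such (features, tag) sequences, and a predicted tag sequence is turned back into words by cutting before each B or S tag.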
Keywords/Search Tags: Chinese Word Segmentation, Conditional Random Fields Model, Tag Set, Feature Template, Cross-Validation