
Research And Implementation Of Chinese Word Segmentation Based On Character Tagging Method

Posted on: 2016-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y You
Full Text: PDF
GTID: 2308330473957107
Subject: Computer software and theory
Abstract/Summary:
Many Chinese language processing tasks operate on words, so word segmentation is the first step of such systems. In a Chinese sentence, whether a sequence of characters forms a word depends on the context. This differs from English and other languages that use the space character as a word delimiter. Because context is a relatively vague concept, word segmentation is a difficult task. As statistical machine learning methods have matured, their fields of application have expanded. Character-tagging-based word segmentation performs well on segmentation tasks: researchers borrowed the idea of part-of-speech tagging for word segmentation, which greatly improved the accuracy of Chinese segmentation.

This thesis discusses two models, the maximum entropy model and the linear-chain conditional random field model, and studies the details of their derivation. We then develop a word segmentation method based on each model. We propose improvements to the training and prediction processes, which include a better presentation of the formulas, a multi-threaded method, and a new prediction method. In addition, the thesis discusses how using more tags and more features affects segmentation accuracy.

The results show that the multi-threaded method can reduce training time. They also show that the proposed prediction method is superior to traditional methods in performance and slightly better in accuracy, which means that including more information about the problem in the model improves segmentation accuracy. Meanwhile, the method based on the linear-chain conditional random field model achieves higher accuracy on the sequence labeling problem, but its longer training time may limit its use in settings where the model requires regular updates. Finally, the results show that more expressive features help improve segmentation accuracy.
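To illustrate the character tagging idea described above, the following minimal sketch converts a segmented sentence into per-character tags and back. It assumes the common four-tag B/M/E/S scheme (begin, middle, end, single); the abstract does not state which tag sets the thesis uses, and the function names `words_to_tags` and `tags_to_words` are hypothetical. In a full system, a maximum entropy or conditional random field model would predict the tags, and the second function would recover the words.

```python
def words_to_tags(words):
    """Map a list of words to one tag per character (assumed B/M/E/S scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, anything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["我", "喜欢", "自然语言"]
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

Framing segmentation this way turns it into a sequence labeling problem over characters, which is why models originally used for part-of-speech tagging can be applied to it.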
Keywords/Search Tags: character tagging based word segmentation, maximum entropy, conditional random field, POS tagging, L-BFGS (Limited-memory BFGS)