Design And Implementation Of Efficient Chinese Word Segmentation And POS Tagging System Based On Perceptron Algorithm

Posted on: 2014-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Deng
Full Text: PDF
GTID: 2268330422451688
Subject: Computer Science and Technology
Abstract/Summary:
Word segmentation and part-of-speech (POS) tagging are fundamental research problems in natural language processing (NLP) and form the foundation of other NLP tasks, whose final results they can determine. An efficient word segmentation and POS tagging system therefore has both academic and practical value.

This work focuses on building an efficient word segmentation and POS tagging system with excellent performance. The major research topics are a method for Chinese word segmentation and POS tagging that combines a statistical model with a dictionary, the optimization of system efficiency, performance improvement, and incremental training based on the perceptron algorithm.

By combining a statistical model with a dictionary, we achieve comparable performance in word segmentation and POS tagging. By integrating dictionary information into the statistical model, we implement domain adaptation for Chinese word segmentation and improve the efficiency of POS tagging. We then implement a parallel training algorithm for the perceptron, which improves training efficiency significantly without a great loss of performance. We also reduce the memory requirement and speed up testing by compressing the model file, and we improve POS tagging performance by using large-scale unlabeled data. Building on the advantages of online algorithms, we propose three different incremental training methods based on the perceptron algorithm and confirm their validity through experiments. Finally, we analyze in depth the causes of failure in cross-domain Chinese word segmentation and apply a stacked learning framework to the task.

Experimental results show that our system achieves state-of-the-art performance in Chinese word segmentation and POS tagging, and that parallel training greatly improves training efficiency. The results also show that the proposed incremental methods are valid on same-domain datasets for Chinese word segmentation and POS tagging, and that the stacked learning framework is effective for cross-domain Chinese word segmentation.
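
To make the perceptron-based approach concrete, here is a minimal sketch of character-level segmentation with B/M/E/S tags (begin, middle, end, single) trained by perceptron updates. It is an illustration under assumptions, not the thesis system: the feature templates, the greedy left-to-right decoding, the class name `PerceptronSegmenter`, and the toy sentences are all hypothetical, and the abstract does not describe the actual features, decoder, or dictionary integration.

```python
TAGS = ("B", "M", "E", "S")  # begin, middle, end of a word; single-character word


def to_tags(words):
    """Convert a segmented sentence into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an inconsistent tag sequence
        words.append(current)
    return words


def features(chars, i, prev_tag):
    """Hypothetical feature templates: nearby characters plus the previous tag."""
    prev_ch = chars[i - 1] if i > 0 else "<s>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [
        "c0=" + chars[i],
        "c-1=" + prev_ch,
        "c+1=" + next_ch,
        "c-1c0=" + prev_ch + chars[i],
        "c0c+1=" + chars[i] + next_ch,
        "t-1=" + prev_tag,
    ]


class PerceptronSegmenter:
    """Greedy character tagger trained with plain (unaveraged) perceptron updates."""

    def __init__(self):
        self.weights = {}  # (feature, tag) -> weight

    def _best_tag(self, feats):
        def score(tag):
            return sum(self.weights.get((f, tag), 0.0) for f in feats)
        return max(TAGS, key=score)

    def train(self, sentences, epochs=5):
        """Online training; calling it again continues from the current weights."""
        for _ in range(epochs):
            for words in sentences:
                chars = list("".join(words))
                prev = "<s>"
                for i, gold in enumerate(to_tags(words)):
                    feats = features(chars, i, prev)
                    guess = self._best_tag(feats)
                    if guess != gold:
                        # Perceptron update: reward the gold tag, penalize the guess.
                        for f in feats:
                            self.weights[(f, gold)] = self.weights.get((f, gold), 0.0) + 1.0
                            self.weights[(f, guess)] = self.weights.get((f, guess), 0.0) - 1.0
                    prev = gold  # condition on the gold previous tag during training

    def segment(self, text):
        chars, tags, prev = list(text), [], "<s>"
        for i in range(len(chars)):
            prev = self._best_tag(features(chars, i, prev))
            tags.append(prev)
        return to_words(chars, tags)


model = PerceptronSegmenter()
model.train([["我", "爱", "北京"], ["北京", "很", "大"]])
model.train([["计算机", "很", "好"]])  # later data: training simply continues
print(model.segment("我爱北京"))
```

Because the perceptron updates its weights one example at a time, the second call to `train` continues from the existing model rather than starting over. This is only the simplest way to exploit the online property; the three incremental training methods proposed in the thesis are not specified in the abstract and may differ.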
Keywords/Search Tags: Word segmentation, Part-of-Speech tagging, Perceptron algorithm, Parallel training, Incremental training