Design and Implementation of an Incremental-styled Learning Chinese Word Segmentation System Based on the Perceptron Algorithm

Posted on: 2016-05-19  Degree: Master  Type: Thesis
Country: China  Candidate: B Han  Full Text: PDF
GTID: 2308330479990099  Subject: Computer Science and Technology
Abstract/Summary:
In this paper, we propose an incremental-styled learning scheme for perceptron-based Chinese word segmentation. Our method performs continued training on top of a fine-tuned source-domain model. This scheme allows a model to be delivered without its annotated data and without re-training on that data. Experimental results on CTB5.0 and Zhuxian show that the proposed scheme significantly improves adaptation performance on Chinese word segmentation and achieves performance comparable to the traditional method. Further experimental analysis shows that our method significantly reduces the size of the resulting model and obtains a segmentation model in less time.

Perceptron-based Chinese word segmentation models are usually very large. To address this, we implement a heuristic feature-filtering method that filters features by the number of times their parameters are updated during the training phase. This method effectively avoids the long-tail effect in NLP and helps select relevant features. Experimental results on perceptron-based Chinese word segmentation, part-of-speech tagging, and dependency parsing show that the proposed method compresses the model efficiently with little loss of accuracy.

We build an online customized Chinese word segmentation system based on the incremental-styled learning scheme and the model compression algorithm. Users can upload a target-domain dictionary and annotated data. The system then runs the incremental-styled learning algorithm in the background to train an incremental-styled target-domain model for the user, and provides a customized segmentation service. Thanks to the incremental-styled learning algorithm and the model compression algorithm, multiple incremental-styled models are independent of each other, and model size and training time remain well controlled.
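
The feature-filtering heuristic mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes that "the times of the parameter" means the number of perceptron updates a weight receives during training, and the function names, the toy feature strings, and the `min_updates` threshold are all hypothetical. It uses a plain (non-averaged) multiclass perceptron for brevity.

```python
from collections import defaultdict


def predict(weights, features, labels):
    """Return the label with the highest summed feature weight."""
    def score(label):
        return sum(weights.get((f, label), 0.0) for f in features)
    return max(labels, key=score)


def train_perceptron(samples, labels, epochs=5):
    """Train a multiclass perceptron and record per-weight update counts.

    samples: list of (feature_list, gold_label) pairs.
    Returns (weights, update_counts), both keyed by (feature, label).
    """
    weights = defaultdict(float)
    update_counts = defaultdict(int)
    for _ in range(epochs):
        for features, gold in samples:
            predicted = predict(weights, features, labels)
            if predicted == gold:
                continue
            # Standard perceptron update: reward gold, penalize prediction.
            for f in features:
                weights[(f, gold)] += 1.0
                weights[(f, predicted)] -= 1.0
                update_counts[(f, gold)] += 1
                update_counts[(f, predicted)] += 1
    return dict(weights), dict(update_counts)


def filter_by_update_count(weights, update_counts, min_updates):
    """Keep only weights updated at least `min_updates` times (assumed rule)."""
    return {
        key: value
        for key, value in weights.items()
        if update_counts.get(key, 0) >= min_updates
    }


if __name__ == "__main__":
    # Toy character-tagging data with hypothetical feature strings.
    toy_samples = [
        (["c=中", "p=<s>"], "B"),
        (["c=国", "p=中"], "E"),
        (["c=人", "p=国"], "S"),
    ]
    toy_labels = ["B", "E", "S"]
    w, counts = train_perceptron(toy_samples, toy_labels, epochs=3)
    compact = filter_by_update_count(w, counts, min_updates=2)
    print(len(w), "weights before filtering,", len(compact), "after")
```

On real data, rarely updated weights tend to belong to the long tail of infrequent features, so raising the threshold trades model size against accuracy; the right value would have to be tuned on held-out data.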
Keywords/Search Tags: Chinese word segmentation, Perceptron, Incremental-styled learning, Domain adaptation, Model compression, Customized word segmentation