Research On Boosting Chinese Word Segmentation Accuracy With Partially Annotated Data

Posted on: 2015-09-12  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Liu
GTID: 2298330422990914  Subject: Computer Science and Technology
Abstract/Summary:
Statistical methods have been the dominant approach to Chinese word segmentation in recent studies and have achieved high performance on the newswire domain, where annotated data are relatively rich. However, for domains outside newswire, where resources are relatively scarce, the performance of statistical Chinese word segmentation can drop severely. Manually annotating data for Chinese word segmentation is expensive, but various resources on the Internet, such as lexicons and Wikipedia, contain incomplete annotation. In this paper, to take advantage of these data and improve the performance of statistical Chinese word segmentation on non-newswire domains, we propose different methods to represent these data in a unified format: partial annotation. We model both the fully and partially annotated data simultaneously with three different statistical models: the character-based structured perceptron model, the word-based structured perceptron model, and the character-based conditional random fields model. We conduct experiments on the Penn Chinese Treebank (CTB) to compare these models. Experimental results show that all three models benefit from the partially annotated data and that the character-based conditional random fields model benefits most. We also test our model with partially annotated data on Internet novels and on the SIGHAN Bakeoff 2010 domain adaptation test data. In these experiments, our method achieved competitive results compared with previous studies.
Keywords/Search Tags: domain adaptation, partially annotated data, conditional random fields, structured perceptron
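The abstract does not spell out how partial annotation is encoded, so the following is only a minimal sketch of one common way to represent it, not necessarily the thesis's own scheme. It assumes the four-tag character scheme B/M/E/S (begin, middle, end, single) and gives each character a set of allowed tags. The function name `partial_annotation` and the half-open `(start, end)` span format are hypothetical, introduced purely for illustration.

```python
from typing import List, Set, Tuple

# Assumed tag scheme: B = word begin, M = middle, E = end, S = single-character word.
ALL_TAGS: Set[str] = {"B", "M", "E", "S"}


def partial_annotation(sentence: str,
                       known_words: List[Tuple[int, int]]) -> List[Set[str]]:
    """Return, for each character, the set of tags still allowed.

    known_words holds non-overlapping half-open spans (start, end) that are
    known to be words, e.g. from a lexicon match or a Wikipedia link.
    Characters outside these spans stay (almost) unconstrained.
    """
    allowed = [set(ALL_TAGS) for _ in sentence]
    if not allowed:
        return allowed

    # A sentence must start a word at its first character and end one at its last.
    allowed[0] &= {"B", "S"}
    allowed[-1] &= {"E", "S"}

    for start, end in known_words:
        if end - start == 1:
            allowed[start] = {"S"}
        else:
            allowed[start] = {"B"}
            for i in range(start + 1, end - 1):
                allowed[i] = {"M"}
            allowed[end - 1] = {"E"}
        # A known word also fixes a boundary on each side of it.
        if start > 0:
            allowed[start - 1] &= {"E", "S"}
        if end < len(sentence):
            allowed[end] &= {"B", "S"}
    return allowed


if __name__ == "__main__":
    # Only the span covering characters 2-3 is annotated as a word.
    for ch, tags in zip("我爱北京天安门", partial_annotation("我爱北京天安门", [(2, 4)])):
        print(ch, sorted(tags))
```

Under this representation a fully annotated sentence is simply the special case in which every set contains exactly one tag, which is what lets a single model consume both kinds of data.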