Research On Boosting Chinese Word Segmentation Accuracy With Partially Annotated Data

Posted on:2015-09-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Liu

GTID:2298330422990914

Subject:Computer Science and Technology

Abstract/Summary:

Statistical methods have been the dominant approach to Chinese word segmentation in recent studies and have achieved high performance on the newswire domain, where annotated data are relatively rich. However, for domains outside newswire, where resources are relatively scarce, the performance of statistical Chinese word segmentation can drop severely. Manually annotating data for Chinese word segmentation is expensive, but various resources on the Internet, such as lexicons and Wikipedia, contain incomplete annotation. In this paper, to take advantage of these data and improve the performance of statistical Chinese word segmentation on non-newswire domains, we propose different methods to represent these data in a unified format: partial annotation. We model both the fully and the partially annotated data simultaneously with three different statistical models: the character-based structured perceptron model, the word-based structured perceptron model, and the character-based conditional random fields model. We conduct experiments on the Penn Chinese Treebank (CTB) to compare these models. Experimental results show that all three models benefit from the partially annotated data and that the character-based conditional random fields model benefits most. We also test our model with partially annotated data on Internet novel data and on the SIGHAN Bakeoff 2010 domain adaptation test data. In these experiments, our method achieved results competitive with previous studies.
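
The abstract does not spell out how partial annotation is encoded, so the following is only a minimal sketch under stated assumptions: characters are tagged with the common B/M/E/S scheme, a fully annotated sentence fixes one tag per character, and a partially annotated sentence fixes tags only where a known word (for example, a lexicon entry) occurs and leaves every other character open. The function names are hypothetical and are not taken from the thesis. Counting the tag paths consistent with the constraints illustrates the set of segmentations that a model trained on partial annotation would have to treat as jointly acceptable.

```python
# Assumed tag scheme (not stated in the abstract):
# B = word begin, M = word middle, E = word end, S = single-character word.
TAGS = ("B", "M", "E", "S")

# Tags that may legally follow each tag under B/M/E/S.
VALID_NEXT = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}


def full_annotation(words):
    """Map a segmented sentence to one allowed-tag set per character."""
    constraints = []
    for word in words:
        if len(word) == 1:
            constraints.append({"S"})
        else:
            constraints.append({"B"})
            constraints.extend({"M"} for _ in word[1:-1])
            constraints.append({"E"})
    return constraints


def partial_annotation(sentence, known_words):
    """Fix tags only where a known word occurs; leave other characters open.

    Simplification: only the first occurrence of each word is marked, and
    overlapping matches are not resolved.
    """
    constraints = [set(TAGS) for _ in sentence]
    for word in known_words:
        start = sentence.find(word)
        if start >= 0:
            constraints[start:start + len(word)] = full_annotation([word])
    return constraints


def count_consistent_paths(constraints):
    """Count valid tag sequences that respect the per-character constraints."""
    if not constraints:
        return 0
    # A sentence must start with B or S.
    counts = {tag: 1 for tag in constraints[0] if tag in ("B", "S")}
    for allowed in constraints[1:]:
        step = {}
        for tag in allowed:
            total = sum(n for prev, n in counts.items() if tag in VALID_NEXT[prev])
            if total:
                step[tag] = total
        counts = step
    # A sentence must end with E or S.
    return sum(n for tag, n in counts.items() if tag in ("E", "S"))


if __name__ == "__main__":
    sentence = "我爱北京天安门"
    partial = partial_annotation(sentence, ["北京"])
    print(count_consistent_paths(partial))  # 8
    full = full_annotation(["我", "爱", "北京", "天安门"])
    print(count_consistent_paths(full))  # 1
```

In this encoding a fully annotated sentence is the special case in which every constraint set has exactly one tag, which is one way that both kinds of data could share a single format.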

Keywords/Search Tags:

Domain adaptation, partially annotated data, conditional random fields, structured perceptron

Related items

1	Research On The Named Entity Recognition In The Domain Of Lack Of Annotated Data
2	Research On Fast Exact Structured Learning
3	Research On Domain Adaptation Of Chinese Word Segmentation With Multi-source Features And Data
4	A Study On Semantic Tagging Of Chinese Product Query Based On Conditional Random Fields
5	SAR Image Change Detection Based On Conditional Random Fields
6	Research On Online Detection Method Of Reputation Fraud Campaign Based On Conditional Random Fields
7	Research Of Multiview Sequence Data Modeling Based On Conditional Random Fields
8	A Study On Chinese Personal Name Recognition Based On Conditional Random Fields
9	An Self-adaptive BLP Optimal Model Employing Conditional Random Fields
10	Domain Adaptation For Semantic Segmentation