
Research On Chinese Comma Classification

Posted on: 2014-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Xu
Full Text: PDF
GTID: 2248330398465511
Subject: Computer application technology
Abstract/Summary:
Punctuation has received increasing attention in natural language processing. The comma is the most frequently used punctuation mark in Chinese and in most other languages. Its usage is also the widest and most flexible, which makes its function difficult to use correctly or to interpret. This thesis studies the usages of the Chinese comma, focusing on the comma classification schemes used in Chinese sentence segmentation and discourse unit recognition. The main contents are as follows:

(1) Two classification schemes are derived from a statistical analysis of the Penn Chinese Treebank (CTB 6.0). The first treats the comma as a sentence boundary marker and divides commas into two major types: EOS (End Of a Sentence) and Non-EOS (Not the End Of a Sentence). The second treats the comma as a boundary between discourse units and as an anchor for the discourse relations between the units it separates, and divides commas into seven major types based on syntactic patterns.

(2) A framework for comma-based Chinese sentence segmentation is described in detail. A first-layer classifier labels each comma using a variety of effective syntactic features. Commas that the first-layer classifier labels with low confidence are then reclassified by a second-layer classifier using new features. Experimental results show that this hierarchical model outperforms the baseline.

(3) A comma-based model for recognizing and optimizing Chinese elementary discourse units is proposed. It first selects a set of effective syntactic features and builds a Maximum Entropy model and a Conditional Random Field model, each of which recognizes elementary discourse units. Then, to capture both local and global information, it combines the sequence model with the probability method to improve performance. Experimental results again show that the model outperforms the baseline.

In summary, this thesis proposes comma-based approaches to Chinese sentence segmentation and elementary discourse unit recognition. The experimental results support the validity of the method, which should benefit the development of natural language processing technologies based on discourse analysis.
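The two-layer scheme in (2) can be read as a confidence-gated cascade. The sketch below is only an illustration of that idea, not the thesis's implementation: the threshold value, the feature names (`followed_by_subject`, `inside_np_coordination`), and the two toy scoring functions are all hypothetical stand-ins for trained classifiers.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]
Scorer = Callable[[Features], float]  # returns an estimate of P(EOS) in [0, 1]


def cascade_classify(
    feats: Features,
    first_layer: Scorer,
    second_layer: Scorer,
    threshold: float = 0.8,  # assumed value; the abstract does not give one
) -> Tuple[str, int]:
    """Label a comma EOS / Non-EOS, deferring to the second layer when unsure.

    Returns the label and the number (1 or 2) of the layer that decided it.
    """
    p = first_layer(feats)
    confidence = max(p, 1.0 - p)  # confidence of the first layer's decision
    layer = 1
    if confidence < threshold:
        # Low-confidence comma: hand it to the second-layer classifier.
        p = second_layer(feats)
        layer = 2
    label = "EOS" if p >= 0.5 else "Non-EOS"
    return label, layer


# Hypothetical toy scorers standing in for trained models.
def toy_first_layer(feats: Features) -> float:
    return 0.9 if feats.get("followed_by_subject", 0.0) else 0.55


def toy_second_layer(feats: Features) -> float:
    return 0.2 if feats.get("inside_np_coordination", 0.0) else 0.7


if __name__ == "__main__":
    for feats in ({"followed_by_subject": 1.0}, {"inside_np_coordination": 1.0}):
        print(feats, cascade_classify(feats, toy_first_layer, toy_second_layer))
```

In this arrangement the second layer only sees the commas the first layer was unsure about, so it can rely on different features without affecting decisions the first layer already makes confidently.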
Keywords/Search Tags: Chinese Sentence Segmentation, Maximum Entropy Models, Parse Tree, Integer Linear Programming