Research On Classification Of Chinese Comma And Colon

Posted on: 2015-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J J Gu
Full Text: PDF
GTID: 2268330428998558
Subject: Computer application technology
Abstract/Summary:
Research on punctuation is one of the most fundamental tasks of discourse analysis. Effectively identifying the role of punctuation in sentences is very important for discourse analysis, so research on punctuation recognition is of great significance. In this thesis, we study the automatic classification of punctuation, covering the following three aspects.

First, we propose a novel method for Chinese comma classification based on word segmentation and part-of-speech tagging. The key work is selecting and extracting features. We use two classifiers, a maximum entropy model and a CRF model, to automatically identify seven categories of Chinese comma. Experimental results show that the CRF model classifies better than the maximum entropy model, and that the accuracy of both classifiers matches that of the method based on syntactic analysis. This demonstrates that the proposed method for Chinese comma classification, based on word segmentation and part-of-speech tagging, is feasible.

Second, we propose a new task: automatically annotating and identifying the Chinese colon. We collected a new corpus containing many Chinese colons. After statistical analysis of the role of the Chinese colon in sentences, we classify it into seven hierarchically organized categories. We annotate the colon classification labels on this corpus, which carries word segmentation and part-of-speech tagging information. We experiment with two approaches to classifying the Chinese colon automatically: the first is rule-based, and the second adopts a maximum entropy model. The maximum entropy approach achieves better results in the experiments.

Finally, we investigate a new method that improves the accuracy of automatic Chinese comma classification by adding the classification labels of other punctuation marks as new features. Through statistical analysis of CTB 6.0, we found that the Chinese colon and semicolon influence automatic Chinese comma classification. The experimental results demonstrate that the accuracy of Chinese comma classification improves after the classification labels of the Chinese colon or semicolon are added as new features, and that adding the labels of both punctuation marks improves the accuracy further.
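
As a rough illustration of the feature-based setup described in the abstract, the sketch below builds a feature dictionary for one comma from a word-segmented, POS-tagged sentence, and optionally adds the labels of neighboring punctuation marks as extra features. The abstract does not list the actual feature set, so the function name `comma_features`, the window size, the feature names, and the label `COLON_LABEL_A` are all assumptions made for this example. A dictionary of this shape could then be passed to a maximum entropy or CRF toolkit.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag) from segmentation + tagging


def comma_features(
    tokens: List[Token],
    idx: int,
    window: int = 2,
    other_punct_labels: Optional[Dict[int, str]] = None,
) -> Dict[str, str]:
    """Build features for the comma at position idx (hypothetical feature set).

    other_punct_labels maps token positions of other punctuation marks
    (e.g. colons, semicolons) to their classification labels.
    """
    feats: Dict[str, str] = {}

    # Words and POS tags in a fixed window around the comma.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = idx + offset
        if 0 <= j < len(tokens):
            word, pos = tokens[j]
        else:
            word, pos = "<PAD>", "<PAD>"  # position falls outside the sentence
        feats[f"w[{offset}]"] = word
        feats[f"pos[{offset}]"] = pos

    # Labels of the nearest classified punctuation on each side, if any.
    if other_punct_labels:
        left = [j for j in other_punct_labels if j < idx]
        right = [j for j in other_punct_labels if j > idx]
        if left:
            feats["punct_left"] = other_punct_labels[max(left)]
        if right:
            feats["punct_right"] = other_punct_labels[min(right)]

    return feats


# Usage: a toy sentence with CTB-style tags; the colon label is a placeholder.
sentence = [
    ("他", "PN"), ("说", "VV"), ("：", "PU"), ("天气", "NN"), ("很", "AD"),
    ("好", "VA"), ("，", "PU"), ("我们", "PN"), ("出发", "VV"), ("。", "PU"),
]
features = comma_features(sentence, 6, other_punct_labels={2: "COLON_LABEL_A"})
```

Keeping the punctuation-label features optional makes it straightforward to compare runs with and without them, which mirrors the comparison the abstract reports.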
Keywords/Search Tags: Discourse Analysis, Punctuation Identification, Maximum Entropy Model, CRF Model