Research On Parallel Corpora-based Unsupervised Part-of-speech Tagging For Chinese

Posted on: 2011-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Sun
Full Text: PDF
GTID: 2178360305976428
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and the explosive growth of online information, natural language processing (NLP) has drawn increasing attention in recent years because of its importance to information processing. As a foundational component of NLP, part-of-speech (POS) tagging is used in many NLP tasks, such as syntactic parsing, machine translation, and information extraction. The performance of POS tagging strongly influences the performance of these downstream tasks.

This thesis first explores unsupervised POS tagging for Chinese using a monolingual corpus. It proposes a new unsupervised approach to Chinese POS tagging based on conditional random fields (CRFs). First, the pre-segmented text is tagged with a dictionary. Then unknown words are held out and tagged with specially designed heuristic rules. Finally, the CRF model is trained recursively to optimize the tagging results. Because feature selection plays a critical role in POS tagging, the thesis focuses on how to generate features from contextual information. Experiments on the Chinese Treebank with four training sets of different sizes show that the approach improves POS tagging accuracy across all four training sets.

The experiments also reveal many cases in which a word's POS is hard to determine because the surrounding context provides insufficient information. To address this, the thesis proposes a novel method of parallel corpora-based unsupervised POS tagging for Chinese text. The method has three steps: 1) semi-automatically constructing a parallel corpus; 2) using GIZA++ to obtain alignments between Chinese and English words; 3) POS tagging the English text and incorporating features derived from the English POS results into the POS tagging model for Chinese. Experiments on four training sets of different sizes show that this method further improves the accuracy of POS tagging for Chinese, which suggests that the parallel corpora-based approach to unsupervised Chinese POS tagging is effective.
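
The third step, carrying English POS information over to Chinese tokens through word alignments, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the alignments have already been converted into (Chinese index, English index) pairs rather than read from GIZA++ output directly, and the function name `project_english_tags` and the feature names (`word`, `prev_word`, `next_word`, `en_pos`) are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def project_english_tags(
    zh_tokens: List[str],
    en_tags: List[str],
    alignment: List[Tuple[int, int]],
) -> List[Dict[str, str]]:
    """Build one feature dict per Chinese token.

    `alignment` holds (chinese_index, english_index) pairs; this pair
    format is an assumption, not the native GIZA++ output format.
    """
    # Collect the English POS tags aligned to each Chinese position.
    aligned: Dict[int, List[str]] = {}
    for zh_i, en_j in alignment:
        aligned.setdefault(zh_i, []).append(en_tags[en_j])

    features = []
    for i, token in enumerate(zh_tokens):
        feats = {
            # Contextual features from the Chinese side (hypothetical names).
            "word": token,
            "prev_word": zh_tokens[i - 1] if i > 0 else "<BOS>",
            "next_word": zh_tokens[i + 1] if i + 1 < len(zh_tokens) else "<EOS>",
        }
        # Cross-lingual feature: the aligned English tag(s), or a
        # placeholder when the token has no aligned English word.
        tags = aligned.get(i)
        feats["en_pos"] = "|".join(sorted(set(tags))) if tags else "<NONE>"
        features.append(feats)
    return features


# Toy sentence pair: "我 喜欢 古典 音乐" / "I like classical music".
zh_tokens = ["我", "喜欢", "古典", "音乐"]
en_tags = ["PRP", "VBP", "JJ", "NN"]
alignment = [(0, 0), (1, 1), (3, 3)]  # "古典" is left unaligned on purpose

features = project_english_tags(zh_tokens, en_tags, alignment)
```

Feature dicts of this shape could then be passed to a CRF toolkit as per-token observations. Tokens without an alignment keep only their contextual features plus a placeholder value, so the Chinese tagger can still fall back on monolingual context.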
Keywords/Search Tags: Natural language processing, Part-of-speech tagging, Parallel corpora, Conditional random fields, Unsupervised learning