Research On Parallel Corpora-based Unsupervised Part-of-speech Tagging For Chinese

Posted on:2011-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:J Sun

GTID:2178360305976428

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet and the explosive growth of online information, natural language processing (NLP) has drawn increasing attention in recent years because of its importance to information processing. As a foundational component of NLP, part-of-speech (POS) tagging is used in many NLP tasks, such as syntactic parsing, machine translation, and information extraction, and its performance strongly influences the performance of these downstream tasks.

This thesis first explores unsupervised POS tagging for Chinese using a monolingual corpus. It proposes a new unsupervised approach based on conditional random fields (CRFs). First, the pre-segmented text is tagged with a dictionary. Then unknown words are held out and tagged with specially designed heuristic rules. Finally, the CRF model is trained recursively to refine the tagging results. For feature selection, which plays a critical role in POS tagging, the thesis focuses on generating features from contextual information. Experiments on the Chinese Treebank with four training sets of different sizes show that the approach improves POS tagging accuracy on all four.

We also found many cases where the POS of a word is hard to determine because the surrounding context provides insufficient information. To address this, the thesis proposes a novel parallel corpora-based method for unsupervised POS tagging of Chinese text, in three steps: 1) semi-automatically constructing a parallel corpus; 2) using GIZA++ to obtain alignments between Chinese and English words; 3) POS tagging the English text and incorporating features derived from the English POS results into the Chinese POS tagging model. Experiments on four training sets of different sizes show that this method further improves the accuracy of Chinese POS tagging, which suggests that the parallel corpora-based approach to unsupervised POS tagging for Chinese is effective.
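
The idea of feeding English POS information into the Chinese tagger can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the word alignments have already been converted to space-separated `zh_index-en_index` pairs (GIZA++ itself writes a different native format), and the function names, feature names, and example sentence are all hypothetical. The resulting per-token feature dictionaries are the kind of input a CRF toolkit could consume.

```python
from typing import Dict, List, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse 'zh-en' index pairs such as '0-0 1-1 3-2' (assumed format)."""
    pairs = []
    for item in line.split():
        zh_i, en_j = item.split("-")
        pairs.append((int(zh_i), int(en_j)))
    return pairs


def project_tags(n_zh: int, en_tags: List[str],
                 alignment: List[Tuple[int, int]]) -> List[List[str]]:
    """Collect, for each Chinese token, the POS tags of its aligned English tokens."""
    projected: List[List[str]] = [[] for _ in range(n_zh)]
    for zh_i, en_j in alignment:
        # Skip links that point outside either sentence.
        if 0 <= zh_i < n_zh and 0 <= en_j < len(en_tags):
            projected[zh_i].append(en_tags[en_j])
    return projected


def token_features(zh_tokens: List[str], projected: List[List[str]],
                   i: int) -> Dict[str, str]:
    """Build contextual features plus one feature from the projected English POS."""
    return {
        "w": zh_tokens[i],
        "w-1": zh_tokens[i - 1] if i > 0 else "<BOS>",
        "w+1": zh_tokens[i + 1] if i + 1 < len(zh_tokens) else "<EOS>",
        # Unaligned Chinese tokens get the placeholder value "NONE".
        "en_pos": "|".join(sorted(set(projected[i]))) or "NONE",
    }


# Hypothetical sentence pair: "他 看 了 书" / "He read books".
zh_tokens = ["他", "看", "了", "书"]
en_tags = ["PRP", "VBD", "NNS"]  # tags for "He", "read", "books"
alignment = parse_alignment("0-0 1-1 3-2")

projected = project_tags(len(zh_tokens), en_tags, alignment)
features = [token_features(zh_tokens, projected, i) for i in range(len(zh_tokens))]
```

In this sketch the aspect particle 了 has no aligned English word, so its `en_pos` feature falls back to `NONE` and the tagger must rely on the contextual features alone for that token.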

Keywords/Search Tags:

Natural language processing, Part-of-speech tagging, Parallel corpora, Conditional random fields, Unsupervised learning

Related items

1	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
2	Research On The Learning Of Integrating Chinese Word Segmentation With Part-of-Speech Tagging And Domain Adaption Approach
3	Unsupervised And Low-Resource Part-of-Speech Tagging Based On CRF Auto Encoder
4	Research On Part-of-Speech Tagging Algorithms Of Mathematical Corpus Based On Deep Learning
5	Chinese Word Found Its Part Of Speech Tagging
6	Research On Morpheme Analysis Based On Conditional Random Fields In Chinese Natural Language Understanding
7	Research On Short Utterance Semantic Recognition Method Based On Cascaded Conditional Random Fields
8	A Study On Chinese Location Names Recognition Based On Conditional Random Fields
9	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
10	Research On Text Document Information Hiding