Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources

Posted on: 2012-05-31
Degree: M.Sc
Type: Thesis
University: York University (Canada)
Candidate: Sarrafzadeh, Bahareh
Full Text: PDF
GTID: 2468390011465475
Subject: Language
Abstract/Summary:
Word Sense Disambiguation (WSD), the task of computationally identifying the meaning of a word in context, has long been a central problem in computational linguistics. Statistical and supervised approaches require large amounts of labeled data for training. Unlike English, Persian has neither a semantically tagged corpus to support machine learning approaches nor any suitable parallel corpora.

The shortage of efficient, reliable linguistic resources and fundamental text processing modules for Persian has always been a challenge for researchers working on the language. The lack of sense-tagged corpora has been an obstacle to employing highly accurate supervised approaches to sense disambiguation. At the same time, no highly reliable WSD systems are available to automatically create large-scale, fully sense-tagged corpora for the language. This creates a circular dependency, which we break with a cross-lingual approach.

In this thesis, we propose a cross-lingual approach to tagging word senses in Persian texts. The approach makes use of English sense disambiguators, a bilingual corpus (either comparable or parallel), and a newly developed lexical ontology, FarsNet. It overcomes the lack of knowledge resources and NLP tools for Persian, and it can also be used to automatically create large sense-tagged corpora. We demonstrate the effectiveness of the proposed approach by comparing it to a monolingual sense disambiguation approach for Persian. The evaluation results indicate that the cross-lingual method outperforms the monolingual one and that its performance is comparable to that of the English sense tagger it relies on.

Sense-tagged corpora play a crucial role in Natural Language Processing, particularly in Word Sense Disambiguation and Natural Language Understanding. Since semantic annotation is usually performed by humans, such corpora are limited to a handful of tagged texts, and they are not available for many languages with scarce resources, including Persian.
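
The cross-lingual idea summarized above can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a word-aligned sentence pair, English sense tags already produced by some English sense tagger, and a lookup from English sense identifiers to Persian sense identifiers (standing in for the role a resource such as FarsNet might play). The function name, the sense identifiers, and the transliterated tokens are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def project_senses(
    en_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    fa_tokens: List[str],
    en_to_fa_senses: Dict[str, Set[str]],
    fa_lexicon: Dict[str, Set[str]],
) -> List[Optional[str]]:
    """Project English sense tags onto aligned Persian tokens.

    A Persian token receives a sense only when exactly one of its own
    candidate senses is linked to the sense of its aligned English word.
    """
    fa_senses: List[Optional[str]] = [None] * len(fa_tokens)
    for en_i, fa_i in alignment:
        en_sense = en_senses[en_i]
        if en_sense is None:
            continue  # the English word was left untagged
        linked = en_to_fa_senses.get(en_sense, set())
        own = fa_lexicon.get(fa_tokens[fa_i], set())
        candidates = linked & own
        if len(candidates) == 1:
            fa_senses[fa_i] = next(iter(candidates))
    return fa_senses


# Hypothetical example: the Persian word "shir" (transliterated) is
# ambiguous between "lion", "milk", and "tap".
en_senses = [None, "lion.n.01", "drink.v.01", "milk.n.01"]  # the lion drank milk
fa_tokens = ["shir", "shir", "nushid"]                      # lion milk drank
alignment = [(1, 0), (3, 1), (2, 2)]                        # (English idx, Persian idx)
en_to_fa_senses = {
    "lion.n.01": {"fa:shir_lion"},
    "milk.n.01": {"fa:shir_milk"},
    "drink.v.01": {"fa:nushidan"},
}
fa_lexicon = {
    "shir": {"fa:shir_lion", "fa:shir_milk", "fa:shir_tap"},
    "nushid": {"fa:nushidan"},
}

result = project_senses(en_senses, alignment, fa_tokens, en_to_fa_senses, fa_lexicon)
print(result)
```

In this toy example the two occurrences of "shir" receive different senses because each is aligned to a differently tagged English word. With a comparable rather than parallel corpus, word-level alignments would not be available directly, so this sketch covers only the parallel case.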
Keywords/Search Tags: Sense disambiguation, Language, Resources, Corpora, Persian, Cross-lingual