Rapid resource transfer for multilingual natural language processing

Posted on:2006-09-10

Degree:Ph.D

Type:Thesis

University:University of Maryland, College Park

Candidate:Kolak, Okan

GTID:2458390008973474

Subject:Computer Science

Abstract/Summary:

Until recently the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change to the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora, and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods.

This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually.

Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and using that OCR system for a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method.

Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora. We also show that a reasonable quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
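
As a rough illustration of what projecting dependency trees across a parallel corpus can look like, the Python sketch below copies head-dependent links from a source sentence to a target sentence through a word alignment. It is a simplified sketch and not the thesis's algorithm: it assumes a one-to-one alignment given as a dictionary, and the names `project_dependencies`, `source_heads`, `alignment`, and `ROOT` are hypothetical. Target words that receive no head are left as `None`, which is the kind of gap that language-specific post-processing would need to address.

```python
from typing import Dict, List, Optional

ROOT = -1  # hypothetical marker: the token's head is the sentence root


def project_dependencies(
    source_heads: List[int],
    alignment: Dict[int, int],
    target_length: int,
) -> List[Optional[int]]:
    """Copy source head links onto target tokens via a one-to-one alignment.

    source_heads[i] is the index of the head of source token i (or ROOT).
    alignment maps a source token index to its aligned target token index.
    Returns, for each target token, its projected head index, ROOT, or None
    when no link could be projected.
    """
    target_heads: List[Optional[int]] = [None] * target_length
    for dependent, head in enumerate(source_heads):
        if dependent not in alignment:
            continue  # unaligned source dependent: nothing to project
        target_dependent = alignment[dependent]
        if head == ROOT:
            target_heads[target_dependent] = ROOT
        elif head in alignment:
            target_heads[target_dependent] = alignment[head]
    return target_heads


# Source "the dog barks": "the" depends on "dog", "dog" on "barks" (the root).
source_heads = [1, 2, ROOT]
# Assumed alignment to a two-word target sentence that has no article.
alignment = {1: 0, 2: 1}
projected = project_dependencies(source_heads, alignment, target_length=2)
```

Restricting the sketch to one-to-one alignments keeps it short; real alignments also contain unaligned and many-to-many links, which need additional rules.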

Keywords/Search Tags:

Language, OCR system, NLP, Rapid

Related items

1	Rapid Modeling And Lightweight Design Based On Patran Command Language
2	Research On Software Of Making 2-D Rapid NC Program System Of Turbine Compressor
3	JTangForm System Rapid Development Electronic Forms Research And Realize
4	Fingerspelled word recognition and rapid serial visual processing in hearing adults: A study of novice and expert sign language interpreters
5	Electric Forklift Controller Design And Implementation Of The Rapid Development Platform
6	Research And Implementation Of "Cloud-Edge-Device" Integrated Rapid Development For IoT Applications
7	Study On Rapid Development Technology For Complex Shape Product Based On RE/RP System Integrating
8	Study On Application Of Optics Navigation In BRT (Bus Rapid Transit) System