
Crowdsourcing annotation for machine learning in natural language processing tasks

Posted on: 2013-05-13
Degree: Ph.D
Type: Thesis
University: The Johns Hopkins University
Candidate: Zaidan, Omar F
Full Text: PDF
GTID: 2458390008464062
Subject: Computer Science
Abstract/Summary:
Human annotators are critical for creating the datasets needed to train statistical learners, but annotation cost and limited access to qualified annotators form a data bottleneck. In recent years, researchers have investigated overcoming this obstacle with crowdsourcing: delegating a task to a large group of untrained individuals rather than to a select trained few.

This thesis is concerned with crowdsourcing annotation across a variety of natural language processing tasks. The tasks span a spectrum of annotation complexity, from simple labeling to translating entire sentences. The presented work involves new types of annotators, new types of tasks, new types of data, and new types of algorithms that can handle such data.

The first part of the thesis deals with two text classification tasks. The first is the identification of dialectal Arabic sentences. We use crowdsourcing to create a large annotated dataset of Arabic sentences, which is used to train and evaluate language models for each Arabic variety. We also introduce a new type of annotation called annotator rationales, which complement traditional class labels. We collect rationales for dialect identification and for a sentiment analysis task on movie reviews. In both tasks, adding rationales yields significant accuracy improvements.

In the second part, we examine how crowdsourcing can benefit machine translation (MT). We start with the evaluation of MT systems and show the potential of crowdsourcing to edit MT output. We also present a new MT evaluation metric, RYPT, that is based on human judgment and well suited to a crowdsourced setting. Finally, we demonstrate that crowdsourcing can be used to collect translations. We discuss a set of features that help distinguish well-formed translations from those that are not, and show that crowdsourcing yields high-quality translations at a fraction of the cost of hiring professionals.
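To make the rationale idea concrete, here is a minimal illustrative sketch, not the thesis's exact formulation: an annotator marks the substrings that support a label, and one common way to exploit such rationales is to pair each document with a "contrast" copy whose rationale words are masked out, so that a classifier trained to prefer the original over the masked copy is pushed to rely on the highlighted evidence. The helper names (`tokenize`, `make_contrast_example`) and the example texts are hypothetical.

```python
def tokenize(text):
    # Simplistic whitespace tokenizer; real systems would do more.
    return text.lower().split()

def make_contrast_example(doc, rationales):
    """Return a copy of `doc` with all rationale tokens removed.

    The (doc, contrast) pair can then be fed to a learner with the
    constraint that the original should receive a more confident
    score than the masked copy.
    """
    masked = set()
    for r in rationales:
        masked.update(tokenize(r))
    return " ".join(t for t in tokenize(doc) if t not in masked)

# Hypothetical movie-review sentence with annotator-marked rationales.
doc = "The acting was brilliant and the plot deeply moving"
rationales = ["brilliant", "deeply moving"]
contrast = make_contrast_example(doc, rationales)
# contrast == "the acting was and the plot"
```

Because the contrast copy lacks exactly the words the annotator highlighted, the margin between the two examples isolates the rationale's contribution, which is one intuition for why rationales can improve accuracy over labels alone.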
Keywords/Search Tags: Crowdsourcing, Annotation, Tasks, New types, Language