
Maximizing resources for corpus-based natural language processing

Posted on: 2002-07-27  Degree: Ph.D  Type: Thesis
University: The Johns Hopkins University  Candidate: Ngai, Grace  Full Text: PDF
GTID: 2468390011494726  Subject: Computer Science
Abstract/Summary:
With the popularization of the Internet and computers, there has been a push towards making computers more user-friendly. Since speech and language have been the main tools of human communication since the dawn of time, this has resulted in much interest in speech and natural language processing (NLP) research in recent years.

Much of NLP research can be roughly divided into two approaches: the linguistic approach and the machine learning approach. Both approaches have their merits, but both often require large amounts of expensive human labor. The linguistic approach requires human linguistic experts to develop a rule set capturing phenomena in language, and these rules have to be rewritten for each change in language or domain. The machine learning approach, while more flexible, often requires large amounts of annotated training data, which is both expensive and time-consuming to develop. In the present booming economy, human labor, especially well-trained human labor, is extremely expensive and difficult to obtain for long periods of time.

Even though the cost of system building is clearly a problem, there has been little research in this area. This thesis aims to correct this with a thorough investigation into the costs of system building. First, a rule-based system is developed in which human subjects write rules to bracket base noun phrase chunks in a corpus. In situations where only a little training data is available, the human-written rules outperform a state-of-the-art machine learning system. However, a major caveat of this conclusion is that the time the humans needed to learn the task and develop the rules could instead have been spent annotating training data for the machine learning system.
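To make the rule-writing task concrete: a typical hand-written bracketing rule marks a base noun phrase as an optional determiner, followed by any number of adjectives, followed by one or more nouns. The sketch below is an illustrative simplification, not the thesis's actual rule set or tagset; the pattern and the toy sentence are assumptions for demonstration only.

```python
def bracket(tagged):
    """Bracket base NPs matching DT? JJ* NN+ in a list of (word, tag) pairs."""
    out, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":   # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":  # one or more nouns
            k += 1
        if k > j:  # at least one noun matched: emit a bracketed base NP
            out.append("[" + " ".join(w for w, _ in tagged[i:k]) + "]")
            i = k
        else:      # no NP starts here: pass the token through unchanged
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

# Toy example sentence with simplified Penn-style tags.
sent = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VB"),
        ("over", "IN"), ("a", "DT"), ("dog", "NN")]
print(bracket(sent))  # → [the quick fox] jumps over [a dog]
```

A real rule set consists of many such patterns, refined iteratively against a development corpus, which is exactly the human effort whose cost the thesis measures.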
A simulation of active learning is then performed, showing that when perfectly consistent labeled data is available, active learning can reduce the amount of training data a system requires by 50%.

Armed with this knowledge, a full investigation into active learning under real-time human supervision is undertaken. For the task of base noun phrase chunking, and for around a day's effort (6–8 hours), real-time annotation under active learning produces training data from which a machine learning system can train, with results that outperform those of human rule writing. An experiment with a purely rationalist approach, in which trained linguists generate examples for the machine learning system to train upon, also fails to outperform the real-time active learning system.

To show the extensibility of the results, the experiments are repeated on base noun phrase chunking in two additional low-density languages, and also on the task of prepositional phrase attachment. Even though these results are not as conclusive as those for base noun phrase chunking, due to differences in task and difficulty, the conclusion to be drawn from this thesis is that in situations where not much data is available (which essentially rules out high-density languages such as English, French and Chinese, amongst others), active-learning-based annotation is a much better method of building an NLP system than human rule development.
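The annotation strategy described above can be sketched as pool-based active learning with uncertainty sampling: the learner repeatedly asks the human to label the examples it is currently least sure about. The model, features, and uncertainty measure below are toy placeholders (a nearest-centroid classifier over 1-D features), not the thesis's chunker; they are assumptions chosen only to make the loop runnable.

```python
def train(labeled):
    # Toy "model": per-class mean of feature values (nearest-centroid).
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def uncertainty(model, x):
    # Margin between the two nearest class centroids: a small margin
    # means the model is unsure, so the example is worth annotating.
    d = sorted(abs(x - c) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else 0.0

def active_learning(pool, oracle, seed, rounds, batch):
    # `oracle` stands in for the real-time human annotator.
    labeled = [(x, oracle(x)) for x in seed]
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(rounds):
        model = train(labeled)
        # Pick the examples the current model is least certain about.
        unlabeled.sort(key=lambda x: uncertainty(model, x))
        batch_xs, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(x, oracle(x)) for x in batch_xs]  # human labels them
    return train(labeled), labeled

# Toy usage: the oracle labels by sign; the pool straddles the boundary,
# so the loop spends its annotation budget near the decision boundary.
pool = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
model, labeled = active_learning(pool, oracle=lambda x: x > 0,
                                 seed=[-3.0, 3.0], rounds=2, batch=2)
```

The design point the thesis exploits is that each human-labeled batch is chosen to be maximally informative, which is why the same annotation budget (a day's effort) goes further than either exhaustive annotation or rule writing.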
Keywords/Search Tags:Base, System, Human, Language, NLP, Training data, Active learning