Machine learning for information extraction in informal domains

Posted on: 2000-03-07
Degree: Ph.D.
Type: Thesis
University: Carnegie Mellon University
Candidate: Freitag, Dayne Brian
Full Text: PDF
GTID: 2468390014461265
Subject: Artificial Intelligence
Abstract/Summary:
Information extraction, the problem of generating structured summaries of human-oriented text documents, has been studied for over a decade now, but the primary emphasis has been on document collections characterized by well-formed prose (e.g., newswire articles). Solutions have often involved the hand-tuning of general natural language processing systems to a particular domain. However, such solutions may be difficult to apply to "informal" domains, domains based on genres characterized by syntactically unparsable text and frequent out-of-lexicon terms. With the growth of the Internet, such genres, which include email messages, newsgroup posts, and Web pages, are particularly abundant, and there is no lack of potential information extraction applications. Examples include a program to extract names from personal home pages, or a system that monitors newsgroups where computers are offered for sale in search of one that matches a user's specifications.

This thesis asks whether it is possible to design general-purpose machine learning algorithms for such domains. Rather than spend weeks or months manually adapting an information extraction system to a new domain, we would like a system we can train on some sample documents and expect to do a reasonable job of extracting information from new ones. This thesis poses the following questions: What sorts of machine learning algorithms are suitable for this problem? What kinds of information might a learner exploit in an informal domain? Is there a way to combine heterogeneous learners for improved performance?

This thesis presents four learners representative of a diverse set of machine learning paradigms: a rote learner (Rote), a statistical term-space learner based on the Naive Bayes algorithm (BayesIDF), a hybrid of BayesIDF and the grammatical inference algorithm Alergia (BayesGI), and a relational learner (SRV). It describes experiments testing these learners on three different document collections: electronic seminar announcements, newswire articles describing corporate acquisitions, and the home pages of courses and research projects at four large computer science departments. Finally, it describes a modular multistrategy approach which arbitrates among the individual learners, using regression to re-rank learners' predictions and achieve performance superior to that of the best individual learner on a problem.
Keywords/Search Tags: Information extraction, Machine learning, Problem, Learner, Informal, Domain