Machine learning for information extraction in informal domains

Posted on:2000-03-07

Degree:Ph.D

Type:Thesis

University:Carnegie Mellon University

Candidate:Freitag, Dayne Brian

GTID:2468390014461265

Subject:Artificial Intelligence

Abstract/Summary:

Information extraction, the problem of generating structured summaries of human-oriented text documents, has been studied for over a decade now, but the primary emphasis has been on document collections characterized by well-formed prose (e.g., newswire articles). Solutions have often involved the hand-tuning of general natural language processing systems to a particular domain. However, such solutions may be difficult to apply to "informal" domains, domains based on genres characterized by syntactically unparsable text and frequent out-of-lexicon terms. With the growth of the Internet, such genres, which include email messages, newsgroup posts, and Web pages, are particularly abundant, and there is no lack of potential information extraction applications. Examples include a program to extract names from personal home pages, or a system that monitors newsgroups where computers are offered for sale in search of one that matches a user's specifications.

This thesis asks whether it is possible to design general-purpose machine learning algorithms for such domains. Rather than spend weeks or months manually adapting an information extraction system to a new domain, we would like a system we can train on some sample documents and expect to do a reasonable job of extracting information from new ones. This thesis poses the following questions: What sorts of machine learning algorithms are suitable for this problem? What kinds of information might a learner exploit in an informal domain? Is there a way to combine heterogeneous learners for improved performance?

This thesis presents four learners representative of a diverse set of machine learning paradigms: a rote learner (Rote), a statistical term-space learner based on the Naive Bayes algorithm (BayesIDF), a hybrid of BayesIDF and the grammatical inference algorithm Alergia (BayesGI), and a relational learner (SRV). It describes experiments testing these learners on three different document collections: electronic seminar announcements, newswire articles describing corporate acquisitions, and the home pages of courses and research projects at four large computer science departments. Finally, it describes a modular multistrategy approach which arbitrates among the individual learners, using regression to re-rank learners' predictions and achieve performance superior to that of the best individual learner on a problem.

Keywords/Search Tags:

Information extraction, Machine learning, Problem, Learner, Informal, Domain
