An investigation of several document classification algorithms leading to the design of an autonomous software agent for locating specific, relevant information on the World Wide Web

Posted on: 2002-12-29
Degree: Ph.D
Type: Thesis
University: California Institute of Technology
Candidate: Lindal, John
Full Text: PDF
GTID: 2468390011498871
Subject: Computer Science
Abstract/Summary:
The goal of the research described in this thesis was to design an autonomous software agent that can locate specific, relevant information on the World Wide Web. The first chapter provides the motivation behind this project and a brief overview of the challenges associated with it. The next chapter presents the analysis that led to the development of a new, improved version of the computer program called ITRule. The improvements consist of a new algorithm for classifying documents that outperforms the previous one; significantly enhanced support for data exploration, i.e., the process of extracting information from raw data; and a new algorithm for quantizing numeric variables so they can be used by ITRule. The third part of this thesis compares the performance of three versions of ITRule, two versions of the Naive Bayes classifier, several neural networks, the decision tree algorithm called CART, and a linear support vector machine, in order to determine which is best suited for selecting relevant web pages. An analysis of the test results shows that a new ITRule classification algorithm, based on cross-validation combined with the J-measure, performs best. The fourth and final part of the thesis describes how some of these results were used in the design of a user-friendly, autonomous software agent called Poirot that can help World Wide Web users stay up to date on new developments in topics of interest.
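
The abstract does not spell out how the J-measure enters the new classification algorithm, so the following is only a minimal sketch of the standard J-measure for a single rule "if Y = y then X = x", not the thesis's method. The function name `j_measure` and its parameter names are assumptions made for illustration, and the cross-validation step is not shown.

```python
import math


def j_measure(p_y: float, p_x: float, p_x_given_y: float) -> float:
    """Standard J-measure, in bits, of the rule 'if Y = y then X = x'.

    p_y          -- probability that the rule's condition Y = y holds
    p_x          -- prior probability of the conclusion X = x
    p_x_given_y  -- probability of X = x given that Y = y

    Names are hypothetical; this is not taken from ITRule's source code.
    """
    if not 0.0 < p_x < 1.0:
        raise ValueError("prior p_x must lie strictly between 0 and 1")

    def term(posterior: float, prior: float) -> float:
        # By convention 0 * log(0) is treated as 0.
        if posterior == 0.0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    # Relative entropy between posterior and prior over {x, not x},
    # weighted by how often the rule fires.
    return p_y * (term(p_x_given_y, p_x) + term(1.0 - p_x_given_y, 1.0 - p_x))


# A rule that fires half the time and turns a 50/50 prior into certainty.
print(j_measure(p_y=0.5, p_x=0.5, p_x_given_y=1.0))  # 0.5
```

A rule whose posterior equals the prior scores zero, which is why the measure can be used to rank candidate rules by how much information they carry about the class.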
Keywords/Search Tags: Autonomous software agent, World Wide Web, Algorithm, New, Relevant, Information