Information retrieval using statistical classification

Posted on: 1996-02-10
Degree: Ph.D.
Type: Thesis
University: Stanford University
Candidate: Hull, David A
Full Text: PDF
GTID: 2468390014485860
Subject: Statistics
Abstract/Summary:
In the classical information retrieval (IR) problem, the system must find all documents in a collection that are related to a topic defined by a user's query. A common approach to the IR problem is to represent documents and the query as vectors of term frequencies and to rank the documents in the collection according to their inner product similarity with the query. When a sample of evaluated documents is available in addition to the query (a setting often called routing), the problem can be attacked using techniques based on statistical classification. For statistical classification to be a feasible approach, the system must produce a relatively small set of high-quality feature variables. It turns out that individual words, due to their quantity and ambiguity, are not optimal features. Previous work has focused on a technique known as Latent Semantic Indexing (LSI), which applies the singular value decomposition to a term-document matrix and represents terms and documents by linear combinations of orthogonal indexing variables.

The research presented in this thesis accomplishes the following goals. It provides a thorough discussion of evaluation in information retrieval experiments. It introduces the concept of a local LSI decomposition: LSI is applied separately to a set of documents in the local region surrounding each query, creating query-specific feature variables and making the LSI technique feasible for very large document collections. It applies the classification technique known as Discriminant Analysis to the routing problem and presents experimental results on two text collections. It demonstrates that using a local LSI decomposition improves retrieval performance and represents documents using a relatively small number of feature variables. It finds that Discriminant Analysis sometimes leads to additional performance gains but that more research is needed to determine the optimal size and shape of the local region.
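
The inner product ranking and the local LSI idea described in the abstract can be illustrated with a short sketch. This is not the thesis's implementation. It assumes raw term-frequency vectors, assumes the "local region" is simply the top-ranked documents under the inner product score, and uses the hypothetical names `rank_by_inner_product`, `local_lsi`, `region_size`, and `k`. The abstract itself notes that the best size and shape of the local region remain open questions.

```python
import numpy as np


def rank_by_inner_product(doc_term, query):
    """Return document indices sorted by decreasing inner product with the query.

    doc_term: (n_docs, n_terms) array of term frequencies.
    query:    (n_terms,) array of query term frequencies.
    """
    scores = doc_term @ query
    # A stable sort keeps the original order among documents with equal scores.
    return np.argsort(-scores, kind="stable")


def local_lsi(doc_term, query, region_size, k):
    """Sketch of a query-specific (local) LSI decomposition.

    Assumption: the local region is the `region_size` top-ranked documents.
    Requires k <= min(n_terms, region_size).
    Returns (basis, doc_features, query_features).
    """
    order = rank_by_inner_product(doc_term, query)
    local_docs = doc_term[order[:region_size]]

    # SVD of the local term-document matrix (terms as rows, documents as columns).
    u, _singular_values, _vt = np.linalg.svd(local_docs.T, full_matrices=False)
    basis = u[:, :k]  # k orthonormal directions in term space

    # Project every document and the query onto the k local LSI factors.
    doc_features = doc_term @ basis
    query_features = query @ basis
    return basis, doc_features, query_features


# Toy collection: 5 documents over a 4-term vocabulary (made-up counts).
docs = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
])
query = np.array([1.0, 1.0, 0.0, 0.0])

basis, doc_features, query_features = local_lsi(docs, query, region_size=3, k=2)
print(doc_features.shape)
```

The resulting `doc_features` give each document a small number of query-specific feature variables, which is the kind of compact representation the abstract says a classifier such as Discriminant Analysis needs. Because the SVD is computed only on the local region, its cost depends on `region_size` rather than on the size of the full collection.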
Keywords/Search Tags:Information retrieval, Documents, Using, LSI, Statistical, Classification, Problem, Local