Information organization algorithms and applications

Posted on:2002-05-15

Degree:Ph.D

Type:Dissertation

University:Dartmouth College

Candidate:Pelekhov, Ekaterina

Full Text:PDF

GTID:1468390011496454

Subject:Computer Science

Abstract/Summary:

We consider a problem of information overload: the conflict between vast amounts of data available in electronic form and inadequate tools for finding relevant information. To alleviate this problem we look at the information organization: organizing electronic documents as a hierarchy of topics and subtopics.; We developed a clustering algorithm, called star algorithm, for organizing both static and dynamic document collections. This algorithm computes clusters induced by natural topic structure of the space, and does not impose the constraint to use a fixed number of clusters. As a result, the algorithm guarantees a lower bound on the topic similarity, defined in terms of the standard vector space model, between the documents in each cluster. The star clustering algorithm is based on covering graphs with dense subgraphs that are star-shaped. We developed efficient variants of the star algorithms for dynamic and hierarchical organization. We demonstrate the effectiveness and the efficiency of the star algorithm against two popular clustering algorithms, the single-link algorithm and the average-link algorithm.; We use this imposed document organization as a front-end to a search engine, organizing documents retrieved in response to a query, and allowing user to quickly browse and identify relevant and irrelevant categories. We developed a method for visualizing the topic structure of the retrieved document set.; In dynamic collections, such as news wires, we use information organization to identify and track interesting stories. We developed an algorithm for document filtering based on document membership in the star clustering. We evaluate this algorithm against filtering around a topic centroid, and filtering based on a single cluster produced by the star, the single-link and the average-link clustering algorithms. We introduce the persistent query system, an interactive document filtering system which organizes documents retrieved in response to a query, and allows the user to formulate and maintain a persistent query as a subset of relevant clusters. The new documents added to the, relevant clusters are pushed to the user. We implement the persistent queries in the mobile-agent document retrieval environment, the Serval system.

Keywords/Search Tags:

Algorithm, Information, Document, Clusters, /italic

Related items

1	Text document topical recursive clustering and automatic labeling of a hierarchy of document clusters
2	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
3	The Research Of Multi-clusters Ib Algorithm For Imbalanced Data Set
4	Research And Implementation Of Chinese Cross-Document Coreference Resolution
5	Research On Authenticated Encryption Technology For Important Information Document System
6	Research On Determining The Number Of Clusters Based On Information Entropy
7	Application Research On Image Processing In Industrial And Commercial Enterprise Document Management Information System
8	An Extended Research On Information Retrieval Model Based On Document Relation
9	Research On Clustering Algorithm For Web Document By Incorporating Distribution Information
10	The Design And Implementation Of A Science And Technology Musemum Document Information Management System Based On WEB