Incorporating background knowledge in document clustering

Posted on:2011-11-18

Degree:Ph.D

Type:Thesis

University:Michigan State University

Candidate:Fodeh, Samah Jamal

GTID:2448390002956841

Subject:Computer Science

Abstract/Summary:

The explosive growth of unstructured text data in the digital age has spurred strong interest in robust and scalable document clustering techniques that can automatically partition and summarize large collections of documents. Because document clustering is an unsupervised learning task, the quality of the partitions may be suboptimal owing to the lack of guidance about which documents belong in the same cluster. Augmenting the clustering algorithm with side information may lead to better clusters. To this end, this thesis focuses on the use of background knowledge from an ontology such as WordNet to enhance the performance of document clustering algorithms. Numerous challenges must be overcome for such an approach to succeed. Most notably, the original words in the documents must be mapped effectively to their corresponding concepts in the ontology. The concept mapping strategy matters because it may increase the dimensionality of the data or introduce erroneous concepts, both of which degrade the quality of the final partitions. The choice of ontology must also be considered, since each ontology has its own structure, coverage, and content. Despite these challenges, a considerable amount of research on ontology-driven clustering has been done over the past decade, yet the results have not been conclusive: some studies concluded that an ontology helps improve clustering performance, while others found little benefit. This thesis investigates the factors that affect the performance of such clustering algorithms, including the choice of ontology, the concept mapping approach, and the benchmark datasets and baseline algorithms used for evaluation.
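To make the concept mapping trade-off concrete, the following is a minimal sketch and not the thesis's method. It assumes a hypothetical toy ontology (`TOY_ONTOLOGY`) standing in for WordNet, and two illustrative strategies, "add" and "replace", whose names are invented here. It shows how adding concepts can enlarge the feature space and how mapping an ambiguous word without disambiguation can introduce an erroneous concept.

```python
from collections import Counter

# Hypothetical toy ontology (word -> concept labels); a stand-in for WordNet synsets.
TOY_ONTOLOGY = {
    "car": ["motor_vehicle"],
    "automobile": ["motor_vehicle"],
    "bank": ["financial_institution", "river_side"],  # ambiguous word
}


def map_to_concepts(tokens, ontology, strategy="add"):
    """Build a bag of features for one document.

    strategy="add":     keep each original word and also add its concepts.
    strategy="replace": substitute each mapped word with its concepts;
                        unmapped words are kept as they are.
    """
    features = Counter()
    for tok in tokens:
        concepts = ontology.get(tok, [])
        if strategy == "add" or not concepts:
            features[tok] += 1
        for concept in concepts:
            features[concept] += 1
    return features


doc = ["car", "automobile", "bank", "loan"]
added = map_to_concepts(doc, TOY_ONTOLOGY, "add")
replaced = map_to_concepts(doc, TOY_ONTOLOGY, "replace")

print(len(set(doc)), len(added), len(replaced))  # distinct features per representation
print(sorted(replaced))
```

In this toy case the "add" strategy grows the feature space beyond the original vocabulary, while "replace" merges the two synonyms into one concept but also admits `river_side` for a document that is about finance.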
The contributions of this thesis are as follows: First, a noun-based approach is proposed as a simple but more stringent baseline for clustering. Second, a novel unsupervised information gain approach is developed for extracting a core subset of semantic features from an ontology that can be effectively used for clustering. Third, a hybrid ontology-driven ensemble clustering method is proposed that combines the clusters of nouns and clusters of concepts extracted from an ontology. Finally, an approach for extracting concepts from Wikipedia is proposed and compared against existing works. These concepts are then used in conjunction with the concepts (synsets) from WordNet to study the effect of applying multiple ontologies on document clustering.
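The abstract does not say how the hybrid ensemble combines the two clusterings. As an illustration only, the sketch below assumes a co-association (evidence accumulation) consensus, a common ensemble technique that may differ from the one used in the thesis. The label lists `noun_labels` and `concept_labels` are made-up inputs representing a noun-based partition and a concept-based partition of the same five documents.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    weight = 1.0 / len(partitions)
    co = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += weight
    return co


def consensus_clusters(co, threshold=0.5):
    """Link documents whose co-association exceeds the threshold, then
    return connected components as consensus cluster labels (union-find)."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical partitions of the same five documents.
noun_labels = [0, 0, 0, 1, 1]
concept_labels = [0, 0, 1, 1, 1]

co = co_association([noun_labels, concept_labels])
consensus = consensus_clusters(co)
print(consensus)
```

With only two partitions and a threshold of 0.5, two documents are linked only when both partitions agree, so a document on which the noun view and the concept view disagree ends up in its own consensus cluster.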

Keywords/Search Tags:

Clustering, Concepts, Ontology

Related items

1	Incorporating background knowledge in document clustering
2	The Research Of Ontology Engineering Method And Application Based On Role Concepts
3	The Research Of Ontology Mapping System Based On Concepts Similarity
4	An approach to formalizing ontology driven semantic integration: Concepts, dimensions and framework
5	Research On Domain Ontology Concepts And Relations Learning Algorithm
6	The Research On Acquisition Method Of Relationships Between Subject Concepts In Ontology Construction
7	Research And Application Of Ontology Semiautomatic Construction In Patent Information Retrieval System
8	Research On Literature Retrieval Based On Concepts Similarity
9	Improvement Of Chein Algorithm For Building Concept Lattice
10	Domain Ontology-based Information Retrieval Research