Incorporating background knowledge in document clustering

Posted on: 2011-11-18
Degree: Ph.D
Type: Thesis
University: Michigan State University
Candidate: Fodeh, Samah Jamal
Full Text: PDF
GTID: 2448390002956841
Subject: Computer Science
Abstract/Summary:
The explosive growth of unstructured text data in the present digital age has triggered strong interest in the development of robust and scalable document clustering techniques that can automatically partition and summarize large collections of documents. Because document clustering is an unsupervised learning task, the quality of the partitions may be suboptimal due to the lack of guidance about which documents belong together in the same cluster. Augmenting the clustering algorithm with additional side information may lead to better clusters. Towards this end, this thesis focuses on the use of background knowledge from an ontology such as WordNet to enhance the performance of document clustering algorithms. Numerous challenges must be overcome for such an approach to be successful. Most notably, the original words in the documents must be mapped effectively to their corresponding concepts in the ontology. The strategy used for concept mapping is important because it may increase the dimensionality of the data or introduce erroneous concepts, both of which have an adverse effect on the quality of the final partitions. The choice of ontology is another factor that should be taken into consideration, since each ontology has its own structure, coverage, and content. Despite these challenges, a considerable amount of research has been done over the past decade on ontology-driven clustering. Yet the results of previous studies have not been conclusive: some concluded that an ontology helps improve clustering performance, while others found that it offers little benefit. This thesis investigates the various factors that affect the performance of such clustering algorithms, including the choice of ontology, the concept mapping approach, and the benchmark datasets and baseline algorithms used for evaluation.
The contributions of this thesis are as follows. First, a noun-based approach is proposed as a simple but more stringent baseline for clustering. Second, a novel unsupervised information gain approach is developed for extracting a core subset of semantic features from an ontology that can be used effectively for clustering. Third, a hybrid ontology-driven ensemble clustering method is proposed that combines the clusters of nouns and the clusters of concepts extracted from an ontology. Finally, an approach for extracting concepts from Wikipedia is proposed and compared against existing work. These concepts are then used in conjunction with the concepts (synsets) from WordNet to study the effect of applying multiple ontologies to document clustering.
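
The abstract does not say how the hybrid ensemble combines noun clusters with concept clusters. As a rough illustration only, the sketch below uses a generic co-association (pairwise agreement) scheme, which is one common way to merge two clusterings and is not necessarily the method developed in the thesis. The function names, the threshold, and the toy label vectors are all hypothetical.

```python
from itertools import combinations


def co_association(labelings, n_docs):
    """Fraction of labelings in which each pair of documents shares a cluster."""
    scores = {}
    for i, j in combinations(range(n_docs), 2):
        agree = sum(1 for labels in labelings if labels[i] == labels[j])
        scores[(i, j)] = agree / len(labelings)
    return scores


def consensus_clusters(labelings, n_docs, threshold=0.5):
    """Merge documents whose co-association exceeds the threshold (union-find)."""
    parent = list(range(n_docs))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in co_association(labelings, n_docs).items():
        if score > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    roots = {}
    return [roots.setdefault(find(d), len(roots)) for d in range(n_docs)]


# Hypothetical cluster labels for five documents (assumed, not from the thesis).
noun_labels = [0, 0, 1, 1, 2]      # clustering on noun features
concept_labels = [0, 0, 0, 1, 1]   # clustering on ontology concepts
consensus = consensus_clusters([noun_labels, concept_labels], n_docs=5)
```

With only two input clusterings and a threshold of 0.5, two documents end up together only when both clusterings agree on them, so the consensus here is deliberately conservative. A real ensemble would likely use more base clusterings or a weighted combination.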
Keywords/Search Tags:Clustering, Concepts, Ontology, Used