Semantic frameworks for document and ontology clustering

Posted on:2011-10-26

Degree:Ph.D.

Type:Dissertation

University:University of Missouri - Kansas City

Candidate:Tong, Tuanjie

GTID:1448390002455571

Subject:Computer Science

Abstract/Summary:

The Internet has made it possible, in principle, for scientists to quickly find research papers of interest. In practice, the overwhelming volume of publications makes this a time-consuming task, so it is important to develop efficient ways to identify related publications. Clustering, a technique used in many fields, is one way to facilitate this. Ontologies can also help in finding related entities, including research publications. However, the development of new clustering methods has focused mainly on the algorithm per se, with less emphasis on feature selection and similarity measures, both of which can significantly affect the accuracy and runtime of clustering. Also, to fully realize the high-resolution searches that ontologies can make possible, an important first step is to find automatic ways to cluster related ontologies.

The major contribution of this dissertation is an innovative semantic framework for document clustering, called Citonomy, a dynamic approach that (1) exploits the citation semantics of scientific documents, (2) deals with evolving datasets of documents, and (3) addresses the interplay between algorithms, feature selection, and similarity measures in an integrated manner. This improves accuracy and runtime performance over existing clustering algorithms. As the first step in Citonomy, we propose a new approach to extract and build a model for citation semantics. Both subjective and objective evaluations demonstrate the effectiveness of this model in extracting citation semantics.

For the clustering stage, the Citonomy framework offers two approaches: (1) CS-VS: Combining Citation Semantics and VSM (Vector Space Model) Measures and (2) CS2CS: From Citation Semantics to Cluster Semantics. CS2CS is a document clustering algorithm with a 3-level feature selection process. It improves on CS-VS in several respects: (i) eliminating the need for a training step, (ii) introducing an advanced feature selection mechanism, and (iii) supporting dynamic and adaptive clustering of new datasets. Compared to traditional document clustering, CS-VS and CS2CS significantly improve the accuracy of clustering, by 5-15% on average in terms of the F-Measure. CS2CS is a linear clustering algorithm that is faster than the common document clustering algorithms K-Means and K-Medoids. In addition, it overcomes a major drawback of the K-Means and K-Medoids algorithms, the need to fix the number of clusters in advance: in CS2CS the number of clusters is determined dynamically by splitting and merging clusters. Fuzzy clustering with this approach has also been investigated.

The related problem of ontology clustering is also addressed in this dissertation. A second semantic framework, InterOBO, has been designed for ontology clustering, and a prototype has been developed to demonstrate its potential use. The Open Biomedical Ontologies (OBOs) are used as a case study to illustrate the clustering technique used to identify common concepts and links. Detailed experimental results on different data sets are given to show the merits of the proposed clustering algorithms.
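
The abstract reports its accuracy gains in terms of the F-Measure but does not say which variant is used. As a rough illustration only, the sketch below computes one common clustering F-measure: for each true class, take the best F1 score over all clusters, then average those scores weighted by class size. The function name `clustering_f_measure` and the toy labels are assumptions made for this example and are not taken from the dissertation.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted clustering F-measure (one common variant; an assumption here).

    For each true class, take the best F1 over all clusters, then average
    those best scores weighted by the size of each class.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    n = len(true_labels)
    if n == 0:
        return 0.0

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap.get((cls, clu), 0)
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    # Hypothetical toy data: four documents, two topics.
    topics = ["bio", "bio", "cs", "cs"]
    print(clustering_f_measure(topics, [0, 0, 1, 1]))  # clusters match topics
    print(clustering_f_measure(topics, [0, 0, 0, 0]))  # everything in one cluster
```

Under this definition a clustering that reproduces the true classes scores 1.0, while lumping all documents into a single cluster is penalized through low precision.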

Keywords/Search Tags:

Clustering, Document, Framework, Citation semantics, Ontology, CS2CS

Related items

1	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
2	The Research On Chinese Document Clustering Technology Based On Ontology
3	Research On Citation Sentiment Analysis Based On Semantics In Citation Context And Its Application
4	The Research Of Enterprise Document Retrieval Model Based On Ontology
5	Research Of Single-document Summarization Based On Semantics
6	Research And Application Of Document Clustering Based On Ontology
7	Citation Behavior Analysis Of Chinese Documents
8	A comparative study on ontology generation and text clustering using VSM, LSI, and document ontology models
9	Research And Application On Ontology-based 3D Model Management Framework
10	Study On The Theory And Practice Of Ontology And Ontology-based Agricultural Document Retrieval System--Floricultural Ontology Modeling