Concept, topic, and pattern discovery using clustering

Posted on: 2006-03-12
Degree: Ph.D.
Type: Dissertation
University: University of Southern California
Candidate: Chung, Seokkyung
Full Text: PDF
GTID: 1458390008960609
Subject: Computer Science
Abstract/Summary:
In this dissertation, we present a mining framework that uses clustering to extract useful patterns, concepts, and topics from multi-dimensional datasets. In general, there are two kinds of datasets: incremental data and static data. Incremental data is data into which items are inserted over time, whereas static data receives no such insertions. Because the appropriate data mining algorithm depends on the nature of the data, this dissertation is composed of two parts: incremental clustering for incremental data, and batch clustering for static data. For incremental data, we target news streams; for static data, we target gene expression data.

In the first part, we propose a mining framework that supports the identification of useful patterns based on incremental data clustering. Given the popularity of Web news services, we focus on mining news streams. A key challenge in news repository management is the high rate of document insertion. To address this problem, we present an incremental hierarchical document clustering algorithm that uses a neighborhood search. The novelty of the proposed algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally. In addition, we propose a topic ontology learning framework that utilizes the obtained document hierarchy. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters and that the topic ontology provides interpretations of news topics at different levels of abstraction.

In the second part, we focus on mining a yeast cell cycle dataset. In molecular biology, a set of co-expressed genes tends to share a common biological function. Thus, it is essential to develop an effective clustering algorithm to identify sets of co-expressed genes. Toward this end, we propose genome-wide expression clustering based on a density-based approach. By addressing the strengths and limitations of previous density-based clustering approaches, we present a novel density clustering algorithm that utilizes a neighborhood defined by k-nearest mutual neighbors. Experimental results indicate that the proposed method successfully identifies co-expressed and biologically meaningful gene clusters.
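
To make the notion of a neighborhood defined by k-nearest mutual neighbors concrete, the following is a minimal sketch and not the dissertation's algorithm. It assumes Euclidean distance and treats two points as mutual neighbors when each appears in the other's k-nearest list. The function names (`k_nearest`, `mutual_knn`, `connected_groups`) and the grouping of points by connected components are illustrative assumptions; the abstract does not specify the distance measure or how clusters are formed from the neighborhoods.

```python
import math


def k_nearest(points, k):
    """For each point index, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance (index breaks ties).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn(points, k):
    """Keep j as a neighbor of i only if i is also among j's k nearest points."""
    knn = k_nearest(points, k)
    return [{j for j in knn[i] if i in knn[j]} for i in range(len(points))]


def connected_groups(adjacency):
    """Group indices into connected components of the mutual-neighbor graph.

    This grouping step is an assumption made for illustration only.
    """
    seen = set()
    groups = []
    for start in range(len(adjacency)):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        group = []
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups


# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
mutual = mutual_knn(points, k=2)
groups = connected_groups(mutual)
```

In this toy example the isolated point (5, 5) has nearest neighbors of its own, but none of them lists it in return, so its mutual neighborhood is empty and it forms a group by itself. This asymmetry is what distinguishes a mutual-neighbor neighborhood from a plain k-nearest-neighbor one.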
Keywords/Search Tags:Clustering, Data, Topic, Using, Mining