Developing semantic digital libraries using data mining techniques

Posted on: 2006-01-29
Degree: Ph.D.
Type: Dissertation
University: University of Florida
Candidate: Kim, Hyunki
Full Text: PDF
GTID: 1458390005498802
Subject: Computer Science
Abstract/Summary:
We define semantic digital libraries as digital libraries that can discover hidden, useful information in large amounts of stored data using data mining techniques such as clustering, classification, association rule mining, and visualization. To build a semantic digital library, we first propose an integrated digital library system that provides multiple viewpoints of harvested metadata collections by combining search and data mining technologies. This system provides three value-added services: (1) the cross-archive search service provides a term view of harvested metadata, (2) the concept browsing service provides a subject view of harvested metadata, and (3) the collection summary service provides a collection view of each metadata collection. We also propose a text data mining method that uses a hierarchical self-organizing map algorithm to build concept hierarchies from Dublin Core metadata.

We then present a new classification method, called Associative Naive Bayes (ANB), to associate MEDLINE citations with Gene Ontology (GO) terms. We define the concept of class-support to find frequent itemsets and the concept of class-all-confidence to find interesting itemsets. In the training phase, ANB finds frequent and interesting itemsets and estimates the class prior probabilities and the probabilities of itemsets for all classes. Once the frequent and interesting itemsets have been discovered, new unlabeled examples are classified by incrementally choosing the most interesting itemset. Empirical test results on three MEDLINE datasets show that ANB is superior to both the naive Bayesian classifier and Large Bayes. The results also show that ANB is more scalable than Support Vector Machines.

Finally, we present a text mining method that uses both text categorization and text clustering to build concept hierarchies for MEDLINE citations. The approach we propose is a three-step data mining process for organizing the MEDLINE database: (1) categorization according to Medical Subject Headings (MeSH) terms, MeSH major topics, and the co-occurrence of MeSH descriptors; (2) clustering using the results of MeSH term categorization; and (3) visualization of categories and hierarchical clusters. The automatically generated hierarchies may be used to support users' browsing and help them identify good starting points for searching.
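
As a rough illustration of step (1) of the three-step process above, the following minimal Python sketch groups citations by MeSH descriptor and counts descriptor co-occurrences. It is not the dissertation's implementation: the function names, the record layout, and the toy citation IDs (`c1`, `c2`, `c3`) are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy records (hypothetical, illustrative only): citation ID -> MeSH descriptors.
citations = {
    "c1": ["Neoplasms", "Genes", "Mutation"],
    "c2": ["Neoplasms", "Mutation"],
    "c3": ["Genes", "Proteins"],
}

def categorize_by_term(records):
    """Group citation IDs under each MeSH descriptor they carry."""
    categories = defaultdict(list)
    for cid, terms in records.items():
        for term in set(terms):
            categories[term].append(cid)
    return {term: sorted(ids) for term, ids in categories.items()}

def descriptor_cooccurrence(records):
    """Count how many citations carry each unordered pair of descriptors."""
    pairs = Counter()
    for terms in records.values():
        # Sort so each pair has one canonical (alphabetical) ordering.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs

by_term = categorize_by_term(citations)
cooc = descriptor_cooccurrence(citations)
```

Under these assumptions, `by_term` could serve as the input to a clustering step, and the pair counts in `cooc` could suggest which descriptors belong near each other in a hierarchy.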
Keywords/Search Tags: Digital libraries, Semantic digital, Data mining, Using, MEDLINE, ANB