Font Size: a A A

Towards efficient data analysis and management of semi-structured data

Posted on:2011-07-20Degree:Ph.DType:Dissertation
University:The Ohio State UniversityCandidate:Tatikonda, ShirishFull Text:PDF
GTID:1448390002451159Subject:Computer Science
Abstract/Summary:
Over the last decade, there has been an enormous growth in both the amount and the complexity of online content that is collected and processed by humans and machines. Such a growth has spurred interest in flexible and fluid (semi-structured) data models that do not constrain the data to follow a fixed schema. Many applications ranging from bioinformatics to XML repositories, from software engineering to computational linguistics, are now generating and processing large amounts of semi-structured data. For these applications to reach their full potential, we need to build an effective set of tools to index, process, manage, and analyze such data. This dissertation focuses on a specific class of semi-structured data that is denoted using hierarchical tree objects. We specifically address the following questions pertaining to mining and managing tree-structured data: How can we provide quick access mechanisms to large semi-structured data stores? How can we discover hidden structural patterns from such data collections? How can we devise strategies to realize performance that is commensurate with modern computer architectures?;In the context of managing tree-structured data, first, we develop an indexing mechanism that extracts discriminant features from the database and indexes them using a simple tunable inverted structure. Such an index is complemented with an efficient holistic query processing technique that retrieves the matches by operating entirely on space-efficient sequential representation of trees. Second, we propose a framework that enables the development of application-specific hash functions that convert variable-sized graph and tree structured data into fixed-sized hash values. We demonstrate the usability of this framework by developing a hash-based distributed data placement service for semi-structured data. We argue that this service is capable of supporting large scale data management and data mining algorithms.;In the context of mining tree databases, first, we explore the role of succinct sequential data structures for efficiently discovering frequent tree patterns. Second, we propose a memory-conscious design" wherein the algorithms trade memory for redundant computations to improve the memory system performance. Third, we consider the case of deploying data mining workloads on modern multicore systems. Here, we demonstrate that the bandwidth to main memory becomes a precious shared commodity as one increases the number of cores present in the system. We present mechanisms to alleviate the bandwidth pressure and show their effectiveness. Fourth, we explore an adaptive task and data parallel algorithm design that facilitates effective parallelization in the presence of data and workload skew. This algorithm is integrated into a general purpose scheduling service that supports the development of adaptive and moldable algorithms for database and mining tasks. Finally, we develop a hash-based distributed data placement service that can support the development of large scale distributed data mining and data management applications.
Keywords/Search Tags:Data, Management, Service, Large
Related items