
High-performance on-line analytical processing and data mining on parallel computers

Posted on: 2000-04-06
Degree: Ph.D.
Type: Dissertation
University: Northwestern University
Candidate: Goil, Sanjay
Full Text: PDF
GTID: 1468390014964063
Subject: Computer Science
Abstract/Summary:
Decision support systems are important in leveraging the information present in large-scale data repositories in many scientific and business applications. Data analysis and data mining on these warehouses pose new challenges for traditional database systems. On-Line Analytical Processing (OLAP) and data mining operations require summary information on these data sets. Query processing for these applications requires different views of the data for analysis and effective decision making. The multidimensional data model is a natural and intuitive approach for such applications. Data mining techniques can be applied in conjunction with OLAP for an integrated solution. As data warehouses grow, parallel processing techniques need to be applied to enable the use of larger data sets and reduce the time for analysis, thereby enabling evaluation of many more options for decision making.

In this dissertation we focus on parallel processing techniques for scalable OLAP and data mining. A scalable parallel multidimensional infrastructure for OLAP, integrated with data mining techniques such as association rules and classification, is designed and implemented. Multidimensional OLAP systems store data in a multidimensional structure on which analytical operations are performed. For large data sets and a large number of dimensions, multidimensional arrays are impractical, and efficient sparse data structures and techniques are required instead. We introduce a bit-encoded sparse structure (BESS) for storage compression which allows aggregate operations on the compressed data. Pre-computed aggregate calculations in a data cube can provide efficient query processing for OLAP applications and data mining. We address the issues involved in the parallel construction and maintenance of partial and full data cubes, and in answering OLAP queries and performing data mining tasks using them.

In particular, issues relating to the handling of large data sets, a large number of dimensions, sparse data structures, and parallelism are investigated. Algorithms are presented for our techniques, which are currently implemented on the IBM SP2 parallel machine and can be ported to another parallel platform with minimal effort. Results show that our algorithms for OLAP and data mining on parallel systems scale to a large number of processors, a large number of dimensions, and large data sets, providing a high-performance platform for such applications.
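The abstract does not give the layout of BESS, so the following is only a rough sketch of the general idea it names: storing just the non-empty cells of a sparse multidimensional data set under bit-packed dimension indices, and aggregating directly on the packed keys. The function names, the bit layout, and the sample data are all assumptions made for illustration; this is not the dissertation's implementation, and it is sequential rather than parallel.

```python
from collections import defaultdict


def bit_widths(dim_sizes):
    """Bits needed to hold any index of each dimension (at least 1)."""
    return [max(1, (size - 1).bit_length()) for size in dim_sizes]


def encode(indices, widths):
    """Pack one index per dimension into a single integer key."""
    key = 0
    for idx, width in zip(indices, widths):
        key = (key << width) | idx
    return key


def decode(key, widths):
    """Unpack a key back into a tuple of per-dimension indices."""
    indices = []
    for width in reversed(widths):
        indices.append(key & ((1 << width) - 1))
        key >>= width
    return tuple(reversed(indices))


def keep_mask(widths, keep_dims):
    """Mask that keeps the bits of keep_dims and zeroes all others."""
    mask = 0
    for dim, width in enumerate(widths):
        mask <<= width
        if dim in keep_dims:
            mask |= (1 << width) - 1
    return mask


def aggregate(cells, widths, keep_dims):
    """Sum values over every dimension not in keep_dims.

    Works on the packed keys directly: masking a key zeroes the
    aggregated-away dimensions, so cells that agree on keep_dims
    collapse onto the same key.
    """
    mask = keep_mask(widths, keep_dims)
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[key & mask] += value
    return dict(totals)


# Hypothetical 4 x 3 x 2 data set with only three non-empty cells.
widths = bit_widths((4, 3, 2))
facts = {(0, 1, 0): 10, (0, 1, 1): 5, (3, 2, 0): 7}
cells = {encode(idx, widths): value for idx, value in facts.items()}

# Roll up to dimension 0 only (sum over dimensions 1 and 2).
rolled = aggregate(cells, widths, keep_dims={0})
by_dim0 = {decode(key, widths): value for key, value in rolled.items()}
```

In this sketch a roll-up never unpacks the keys; it applies one mask and one addition per stored cell, which is the sense in which aggregation runs on the compressed data. Each group-by of a data cube would correspond to a different choice of `keep_dims`.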
Keywords/Search Tags: Data mining, On-line analytical processing, Parallel, Large data sets, Applications, Systems, Sparse data structures