A super-max data mining benchmark by vertically structuring data

Posted on: 2006-11-05
Degree: Ph.D.
Type: Dissertation
University: North Dakota State University
Candidate: Serazi, Md. Masum H
Full Text: PDF
GTID: 1458390008976389
Subject: Computer Science
Abstract/Summary:
Large data collections are emerging as important resources in an increasing number of disciplines. The volume of interesting data is already measured in terabytes and will soon reach petabytes. As data sizes have grown, the data mining community has introduced many algorithms to discover hidden knowledge in these repositories. However, traditional data mining algorithms cannot meet the demand for computationally intensive results from large datasets. The difficulty of processing datasets with many rows is often called the "curse of cardinality." The Predicate-tree (P-tree) is a vertically compressed, lossless, data-mining-ready data structure that was developed by Perrizo to address the "curse of cardinality." In this dissertation, we develop an offset-based, single-level implementation of P-tree technology. We also propose an implementation of the P-tree operation algorithms that addresses the "curse of cardinality." To demonstrate that the proposed implementation scales to very high cardinality datasets, we have developed a benchmark suite. The benchmark has three main goals: to compare several versions of run-length compression techniques, to compare commonly used atomic data mining operations between the horizontal approach and the vertical approach, and to assess the performance of vertical data mining algorithms. The benchmark is implemented within the evaluation environment (test bed) called DataMIME(TM). An Application Programming Interface (API) for the P-tree structure and operations is developed to achieve these three goals. The work in this dissertation also provides an easy-to-use, flexible software environment in which data mining algorithm developers can implement new algorithms for vertically structured data and test their performance extensively on a distributed or non-distributed setup.
Finally, the results of extensive experiments carried out on the proposed benchmark, with datasets as large as one quadrillion (10^15) records, are given as evidence that P-trees, a vertical data structure, are appropriate for solving the "curse of cardinality" for very large, highly compressible datasets.
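To make the idea of vertically structuring data concrete, the following is a minimal illustrative sketch in Python. It is not the dissertation's offset-based, single-level P-tree implementation or its API, neither of which is described in this abstract. It assumes a single integer column of a fixed bit width, stores each bit position as its own bitmap (a Python integer), answers an equality predicate by ANDing bitmaps, and run-length encodes one bitmap. The function names `vertical_slices`, `count_equal`, and `run_lengths` are hypothetical.

```python
from typing import List, Tuple


def vertical_slices(values: List[int], bit_width: int) -> List[int]:
    """Split a column into one bitmap per bit position, most significant first.

    Bit r of slices[j] is 1 when row r has bit (bit_width - 1 - j) set.
    """
    slices = [0] * bit_width
    for row, value in enumerate(values):
        for j in range(bit_width):
            if (value >> (bit_width - 1 - j)) & 1:
                slices[j] |= 1 << row
    return slices


def count_equal(slices: List[int], n_rows: int, target: int) -> int:
    """Count rows equal to target using only bitwise operations on the slices."""
    bit_width = len(slices)
    all_rows = (1 << n_rows) - 1          # bitmap with every row selected
    result = all_rows
    for j, bitmap in enumerate(slices):
        wanted = (target >> (bit_width - 1 - j)) & 1
        # Use the slice itself for a 1 bit, its complement for a 0 bit.
        result &= bitmap if wanted else (all_rows & ~bitmap)
    return bin(result).count("1")


def run_lengths(bitmap: int, n_rows: int) -> List[Tuple[int, int]]:
    """Run-length encode a bitmap as (bit, run length) pairs, row 0 first."""
    runs: List[List[int]] = []
    for row in range(n_rows):
        bit = (bitmap >> row) & 1
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(bit, length) for bit, length in runs]


column = [5, 3, 5, 7, 0, 5]               # a tiny 3-bit column, one value per row
slices = vertical_slices(column, bit_width=3)
matches = count_equal(slices, n_rows=len(column), target=5)
top_slice_runs = run_lengths(slices[0], n_rows=len(column))
```

In this sketch the horizontal approach would scan every row and compare values, while the vertical approach touches one bitmap per bit position regardless of how the rows are laid out. Long runs of identical bits are what make such bitmaps compress well, which is the property the abstract refers to as "highly compressible."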
Keywords/Search Tags: Data, Vertical, Benchmark, Large, Cardinality, P-tree, Curse