A super-max data mining benchmark by vertically structuring data

Posted on:2006-11-05

Degree:Ph.D

Type:Dissertation

University:North Dakota State University

Candidate:Serazi, Md. Masum H

GTID:1458390008976389

Subject:Computer Science

Abstract/Summary:

Large data collections are emerging as important resources in an increasing number of disciplines. The volume of interesting data is already measured in terabytes and will soon reach petabytes. As data sizes have grown, the data mining community has introduced various traditional data mining algorithms to discover hidden knowledge in these data repositories. However, traditional data mining algorithms do not satisfy the demand for computationally intensive results from large datasets. The difficulty in processing datasets with many rows is often called the "curse of cardinality." The Predicate-tree (P-tree) is a vertically compressed, lossless, data-mining-ready data structure that was developed by Perrizo to address the "curse of cardinality." In this dissertation, we develop an offset-based, single-level implementation of P-tree technology. We also propose an implementation of the P-tree operation algorithms that solves the problem of the "curse of cardinality." To demonstrate that the proposed implementation scales to very high cardinality datasets, we have developed a benchmark suite. The three main goals of this benchmark are to compare several versions of run-length compression techniques, to compare commonly used atomic data mining operations between the horizontal approach and the vertical approach, and to assess the performance of vertical data mining algorithms. The benchmark is implemented within the evaluation environment (test bed) called DataMIME(TM). An Application Programming Interface (API) for the P-tree structure and operations is developed to achieve these three goals. The work in this dissertation also provides an easy-to-use and flexible software environment in which data mining algorithm developers can implement new algorithms for vertically structured data and test their performance extensively in a distributed or non-distributed setup.
Finally, the results of extensive experiments carried out on the proposed benchmark with datasets as large as one quadrillion (10^15) records are given as evidence that P-trees, a vertical data structure, are an appropriate way to solve the "curse of cardinality" for very large, highly compressible datasets.
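
To make the contrast between the horizontal and vertical approaches concrete, the following is a minimal Python sketch of one atomic operation, counting the rows equal to a given value. It is not the dissertation's offset-based P-tree implementation or its API: it uses uncompressed bit slices stored in Python integers, and the names to_bit_slices, count_equal_vertical, and count_equal_horizontal are hypothetical, chosen only for illustration.

```python
from typing import List


def to_bit_slices(values: List[int], width: int) -> List[int]:
    """Split a column of unsigned ints into `width` vertical bit slices.

    Each slice is a Python int used as a bit vector: bit `row` of slice `j`
    is set when bit `j` of values[row] is set. (Illustrative only; a real
    P-tree would also compress each slice.)
    """
    slices = [0] * width
    for row, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices


def count_equal_vertical(slices: List[int], n_rows: int, target: int) -> int:
    """Count rows equal to `target` using only AND and complement on slices."""
    if target >> len(slices):
        return 0  # target needs more bits than the column stores
    all_rows = (1 << n_rows) - 1  # bit vector with one set bit per row
    mask = all_rows
    for j, s in enumerate(slices):
        if (target >> j) & 1:
            mask &= s              # rows whose bit j is 1
        else:
            mask &= ~s & all_rows  # rows whose bit j is 0
    return bin(mask).count("1")    # population count of matching rows


def count_equal_horizontal(values: List[int], target: int) -> int:
    """Row-by-row scan, the horizontal baseline."""
    return sum(1 for v in values if v == target)


column = [5, 3, 5, 7, 0, 5]
slices = to_bit_slices(column, width=3)
vertical = count_equal_vertical(slices, len(column), 5)
horizontal = count_equal_horizontal(column, 5)
```

Both functions return the same count. The vertical version touches one bit vector per attribute bit rather than one record per row, which is the property a compressed vertical structure can exploit on highly compressible data.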

Keywords/Search Tags:

Data, Vertical, Benchmark, Large, Cardinality, P-tree, Curse

Related items

1	Bitmap Index As Effective Indexing For Low Cardinality Columns In Data Warehouse
2	Design And Implementation Of TPC-E Benchmark Testing System
3	An Algorithm Based On Virtual Vector For Measuring Host Cardinality Distribution
4	Benchmark Study For Non-relational Database And Achieve
5	Storage Optimization And Tree Vertical Merging Algorithm Of Tai Tree Editing Distance Algorithm
6	The Establishment And Application Of IPTAS Benchmark TCP Flow Data Sets
7	Similarity Search On Large-scale High-dimensional Data
8	Design And Implementation Of Data Generator For Big Data Benchmark
9	One To One Marketing Optimization Algorithm The Benchmark Verification Method Study
10	Bottleneck Detection And Performance Prediction For Large-scale Complex Systems