Algorithms and disk arrays for computing on massive data sets

Posted on:2002-12-13

Degree:Ph.D

Type:Dissertation

University:Princeton University

Candidate:Gum, Benjamin Brandon

Full Text:PDF

GTID:1468390011490257

Subject:Computer Science

Abstract/Summary:

PDF Full Text Request

In this work we investigate three approaches to computing on massive data sets. In the first approach, we give a sampling algorithm to estimate the maximum of a large data set. It gives estimates strictly better than the largest sample for an infinite family of data sets. In addition, the algorithm overshoots the true maximum of the worst case data set with probability at most 1/e + O(1/k), where the sample is of size k, which is much smaller than the size of the data set. Our proof is the result of a new extremal graph coloring theorem.; In the second approach, we consider algorithms which process queries in batches for greater efficiency. We give a batched algorithm for high dimensional hamming nearest neighbors using fast matrix multiplication. In addition, using two techniques, query data structures and presampling, we create batched algorithms for string matching, 1D nearest neighbor, 2D 2-approximate nearest neighbor, 2D point location, selection, inhull and halfplane intersection. These algorithms answer b queries to an unsorted data set of size n in O(nlogb) time using O(b2) space and O( n/B) I/O's to the data set, where B is the block size.; In the third approach, we consider the problem of designing disk arrays that reduce both read and write latency. We explore the disk array techniques of striping, mirroring, rotational replication, and eager writing. We give analytical models for three disk array combinations that reduce the seek and rotational delays of reads in a balanced fashion. Then we model the dramatic reduction in write latency of eager writing. Finally, we combine the ideas of mirroring, striping and eager writing. We discuss how to analytically model the latency of reads and writes, and how to maintain persistence.; Finally we combine algorithms and disk arrays for computing on massive data and discuss the resulting gains in performance. We show in general that D disks can provide D times the throughput for both random sampling and reading the entire dataset. By combining batched algorithms with these disk arrays we show how an even greater performance improvement can be achieved.

Keywords/Search Tags:

Data, Disk arrays, Algorithms, Computing, /italic

PDF Full Text Request

Related items

1	Research On Erasure-coding Technologies For Disk Arrays
2	A Research Of RAID1Technology Based On SSD Cache
3	Research On Data Layout Technologies For Disk Arrays
4	Trading capacity for performance in disk arrays
5	Improved Performability of Disk Arrays by the Use of Fuzzy Control
6	Research On High Available Technologies For Disk Arrays
7	Research On Key Techniques Of Applying MEMS-based Storage To Computer Systems
8	Analysis And Research Of Disk Failures In Data Centers
9	Research On Achieving Consistent Low Latency For Commodity SSD Arrays
10	Research On Costas Arrays Exhaustive Search Algorithms