Font Size: a A A

Algorithms and disk arrays for computing on massive data sets

Posted on:2002-12-13Degree:Ph.DType:Dissertation
University:Princeton UniversityCandidate:Gum, Benjamin BrandonFull Text:PDF
GTID:1468390011490257Subject:Computer Science
Abstract/Summary:PDF Full Text Request
In this work we investigate three approaches to computing on massive data sets. In the first approach, we give a sampling algorithm to estimate the maximum of a large data set. It gives estimates strictly better than the largest sample for an infinite family of data sets. In addition, the algorithm overshoots the true maximum of the worst case data set with probability at most 1/e + O(1/k), where the sample is of size k, which is much smaller than the size of the data set. Our proof is the result of a new extremal graph coloring theorem.; In the second approach, we consider algorithms which process queries in batches for greater efficiency. We give a batched algorithm for high dimensional hamming nearest neighbors using fast matrix multiplication. In addition, using two techniques, query data structures and presampling, we create batched algorithms for string matching, 1D nearest neighbor, 2D 2-approximate nearest neighbor, 2D point location, selection, inhull and halfplane intersection. These algorithms answer b queries to an unsorted data set of size n in O(nlogb) time using O(b2) space and O( n/B) I/O's to the data set, where B is the block size.; In the third approach, we consider the problem of designing disk arrays that reduce both read and write latency. We explore the disk array techniques of striping, mirroring, rotational replication, and eager writing. We give analytical models for three disk array combinations that reduce the seek and rotational delays of reads in a balanced fashion. Then we model the dramatic reduction in write latency of eager writing. Finally, we combine the ideas of mirroring, striping and eager writing. We discuss how to analytically model the latency of reads and writes, and how to maintain persistence.; Finally we combine algorithms and disk arrays for computing on massive data and discuss the resulting gains in performance. We show in general that D disks can provide D times the throughput for both random sampling and reading the entire dataset. By combining batched algorithms with these disk arrays we show how an even greater performance improvement can be achieved.
Keywords/Search Tags:Data, Disk arrays, Algorithms, Computing, /italic
PDF Full Text Request
Related items