High performance data mining techniques for large databases

Posted on: 2006-01-19
Degree: Ph.D.
Type: Dissertation
University: Northwestern University
Candidate: Liu, Ying
Full Text: PDF
GTID: 1458390008471905
Subject: Computer Science
Abstract/Summary:
Data mining techniques are becoming prominent in various domains. Owing to recent advances in computer hardware, networks and data warehouses, very large data sets are now available. High performance parallel and distributed data mining techniques are therefore in strong demand.

In this dissertation, we focus on high performance computing techniques for various data mining applications. In the scientific domain, we propose a parallel clustering algorithm, HOP, which partitions the data set into a balanced K-dimensional tree and minimizes inter-processor communication. An on-line data mining framework is proposed to integrate the parallel data mining techniques into scientific simulations, so that entire simulation cycles can execute automatically without human intervention or data input/output. In the business domain, we propose a scalable utility mining algorithm that discovers high utility itemsets, that is, itemsets that drive a large portion of the overall utility. A distributed traffic stream mining system is also proposed: the central server discovers or updates the important patterns from huge amounts of historical stream data, while every sensor monitors and predicts its incoming data stream in a distributed fashion. This system is scalable, and its response time and communication cost are low. To help hardware and software designers build systems better tailored to data-intensive applications, we establish a benchmarking suite, MineBench, which covers eight representative data mining algorithms as well as their parallel implementations. We characterize the computation kernels and memory usage hierarchy of MineBench programs on a real shared-memory parallel machine. The algorithms in this benchmark are all implemented and evaluated using synthetic and real data sets. Results show that our algorithms on parallel systems scale to large data sets and a large number of processors.
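To make the notion of a "high utility itemset" concrete, the following is a minimal brute-force sketch in Python. It is not the scalable algorithm proposed in the dissertation. It assumes the common formulation in which each transaction records a quantity per item, each item has a unit profit, the utility of an itemset is summed over the transactions that contain all of its items, and an itemset counts as high utility when it reaches a chosen share of the total utility. The names `itemset_utility`, `high_utility_itemsets`, `unit_profit` and `min_share`, and the toy data, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit of the itemset's items over every
    transaction that contains all of those items (assumed definition)."""
    total = 0
    for tx in transactions:
        if all(item in tx for item in itemset):
            total += sum(tx[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_share):
    """Exhaustively enumerate itemsets and keep those whose utility is at
    least min_share of the total utility. Exponential in the number of
    items; for illustration only."""
    total_utility = sum(
        qty * unit_profit[item] for tx in transactions for item, qty in tx.items()
    )
    threshold = min_share * total_utility
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility >= threshold:
                result[itemset] = utility
    return result


# Hypothetical toy data: item -> quantity purchased in each transaction.
transactions = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 1, "C": 3},
]
unit_profit = {"A": 5, "B": 2, "C": 1}

result = high_utility_itemsets(transactions, unit_profit, min_share=0.3)
print(result)
```

In this toy data the single item B falls below the threshold while the pair (A, B) exceeds it. Utility is therefore not downward closed the way support is in frequent itemset mining, which is one reason exhaustive enumeration like this does not scale and specialized pruning is needed.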
Keywords/Search Tags: Data mining, Large data, High performance, Parallel