High performance data mining techniques for large databases

Posted on:2006-01-19

Degree:Ph.D

Type:Dissertation

University:Northwestern University

Candidate:Liu, Ying

GTID:1458390008471905

Subject:Computer Science

Abstract/Summary:

Data mining techniques are becoming prominent in various domains. Thanks to recent advances in computer hardware, networks, and data warehouses, very large data sets are now available, and high performance parallel and distributed data mining techniques are therefore in strong demand. In this dissertation, we focus on high performance computing techniques for various data mining applications. In the scientific domain, we propose a parallel clustering algorithm, HOP, which partitions the data set into a balanced k-dimensional (k-d) tree and minimizes inter-processor communication. We propose an on-line data mining framework that integrates the parallel data mining techniques into scientific simulations, so that entire simulation cycles can execute automatically without human intervention or data input/output. In the business domain, we propose a scalable utility mining algorithm that discovers high utility itemsets, that is, itemsets that drive a large portion of the overall utility. We also propose a distributed traffic stream mining system in which a central server discovers or updates the important patterns from huge amounts of historical stream data, while every sensor monitors and predicts its incoming data stream in a distributed fashion. This system is scalable, and its response time and communication cost are low. To help hardware and software designers build systems better tailored to data-intensive applications, we establish a benchmarking suite, MineBench, which covers eight representative data mining algorithms as well as their parallel implementations. We characterize the computation kernels and memory hierarchy usage of the MineBench programs on a real shared-memory parallel machine. All algorithms in this benchmark are implemented and evaluated using synthetic and real data sets. Results show that our algorithms on parallel systems scale to large data sets and to a large number of processors.
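
The notion of a high utility itemset mentioned in the abstract can be illustrated with a small sketch. This is not the dissertation's scalable algorithm. It is a brute-force enumeration under a commonly used definition, assumed here: the utility of an itemset is the sum of quantity times unit profit over all transactions that contain every item of the set, and an itemset counts as high utility when that sum reaches a chosen share of the overall utility. The toy transactions, the profit table, the function names, and the `min_share` parameter are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 1, "C": 3},
    {"A": 1, "B": 1, "C": 1},
]
# Hypothetical per-unit profit of each item.
unit_profit = {"A": 5, "B": 2, "C": 1}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over transactions containing every item."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_share):
    """Brute-force search (exponential in the number of items).

    Returns the overall utility and a dict of every itemset whose
    utility is at least min_share * overall utility.
    """
    overall = sum(q * unit_profit[i] for t in transactions for i, q in t.items())
    threshold = min_share * overall
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = itemset_utility(itemset, transactions, unit_profit)
            if u >= threshold:
                result[itemset] = u
    return overall, result


overall, hui = high_utility_itemsets(transactions, unit_profit, min_share=0.4)
print(overall)  # 33
print(hui)      # {('A',): 20, ('A', 'B'): 16, ('A', 'C'): 17}
```

In this toy data the itemset {A, B} has utility 16 while its subset {B} has only 8, so utility is not downward closed the way support is. Support-style pruning therefore cannot be applied directly, which is one reason utility mining needs its own scalable algorithms.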

Keywords/Search Tags:

Data mining, Large data, High performance, Parallel

Related items

1	High-performance on-line analytical processing and data mining on parallel computers
2	High-performance visualization of large-scale time-varying volume data
3	Application And Research On Association Rule Mining Algorithm In Large Data Sets
4	Developing efficient algorithms for data mining large scale high dimensional data
5	Design And Research On A Parallel Performance Data Collection, Representation And Analysis Framework For The SMP-Cluster Architecture
6	Parallel Data Mining Theory Research And Application
7	Research On Key Technologies Of Parallel Optimization For Multi-computing Platforms For Large-scale Applications
8	Large-scale Databases Association Rule Mining Algorithm
9	Research And Implementation Of Parallel RTI Based On High Performance Computing Environment
10	Study On Parallel For Association Rules Mining