The Analysis On Data Mining Algorithms And The Research On Their Parallel Formulations

Posted on:2005-03-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C D She

Full Text:PDF

GTID:1118360125463952

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

Data mining (DM) is to extract knowledge from huge datasets, the purpose of which is to find the useful patterns hidden behind the data. After giving the brief introduction of the task and method of DM, we point out that discovering associations and mining of complicated data is becoming the focus of DM field. However, most available datasets are becoming enormous in size. Also, their high dimensionality motivates the need for efficient and scalable parallel algorithms. The design of such algorithms meeting above requirements is challenging.There are two classes of association rules mining algorithms. One is based on the Apriori algorithm, which generates large candidates when counting the frequent item sets. Several parallel and sequential algorithms have been proposed in this paper to solve the problem. IDD and HDD algorithms are efficient and scalable parallel methods applied in the discovery of association rules in the field of data mining. However, they become less effective due to the imbalance caused by distributing the candidates among the processors. Therefore, IDD and HDD are improved by means of introducing the approximate algorithms to solve the problem of load balance effectively. There are two approximate algorithms, one is called the online algorithm, and the other is named the offline algorithm. After that, we give the proof of their performance ratio; the complexity analysis of the improved IDD algorithm is also given. The other class of algorithms finds the associations without candidacy. And based on the efficient FP-growth algorithm, its implementation method of constructing the frequent pattern tree and mining frequent item sets is given for the shared memory parallel formulation. However, it results the imbalance due to the distributing the work among the processors. Therefore, a dynamic mechanism is proposed to solve the problem successfully.Discovery of sequential patterns is becoming increasingly useful and essential in DM fields. Many important knowledge discovery tasks in genomic require the analysis of DNA and protein sequences. The most time-consuming operation in discovering such patterns is the computation of the occurrence frequency of all sub-sequences (called sequential patterns) in the sequence database. In particular, there are three main classes of sequential pattern discovery algorithms. In general, projection-based frequent pattern discovery algorithms have been shown to substantially outperform the others. They still require a substantial amount of time. Therefore, this paper presents the data parallel formulation (DPF) and task parallel formulation (TPF) of tree-projection based algorithm. Moreover, the complexity analysis of the algorithm is given. The theoretical deduction shows that the DPF has certain scalabilty, and the TPF is more scalable. Our experiment shows that these two algorithms are capable of achieving good speedups and the task parallel formulation is more efficient. Aiming to the two points of data compressing and characteristic distilling during the pretreatment of image data process, this paper introduces a non-numeric parallel algorithm of continous hopfield neural network used to clustering of image data mining. The realization of data clustering is also the process of vector quantization of image compress. And the codebook of vector quantization can also be regard as the characteristic distilling for the image. Our work emphases on the pretreatment before the image data mining is to be processed, which predigest the presentation of the image and provide support to the next process for multi media data mining.

Keywords/Search Tags:

data mining, parallel algorithms, association rules, sequential patterns, image data mining

PDF Full Text Request

Related items

1	The Technics Of Mining Positive And Negative Association Rules Based On Sequential Patterns
2	Fast algorithms for mining association rules and sequential patterns
3	Research Of Mining Browsing Patterns Based On Association Rules Of Stream Data
4	Web Mining And Its Applications
5	Discovery On Association Rules And Sequential Patterns In Alarm Data Of Public Security
6	Practice Of CRM System Based On Data Mining
7	Research On Correlative Algorithms Of Web Usage Mining
8	Research On The Algorithms Of Association Rules And Hyperclique Mining
9	Research Of Pruning Redundant Association Rules Based On Sequential Patterns
10	Web Log Mining System Based On Association Rules And Sequential Patterns