Font Size: a A A

The Analysis On Data Mining Algorithms And The Research On Their Parallel Formulations

Posted on:2005-03-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:C D SheFull Text:PDF
GTID:1118360125463952Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data mining (DM) is to extract knowledge from huge datasets, the purpose of which is to find the useful patterns hidden behind the data. After giving the brief introduction of the task and method of DM, we point out that discovering associations and mining of complicated data is becoming the focus of DM field. However, most available datasets are becoming enormous in size. Also, their high dimensionality motivates the need for efficient and scalable parallel algorithms. The design of such algorithms meeting above requirements is challenging.There are two classes of association rules mining algorithms. One is based on the Apriori algorithm, which generates large candidates when counting the frequent item sets. Several parallel and sequential algorithms have been proposed in this paper to solve the problem. IDD and HDD algorithms are efficient and scalable parallel methods applied in the discovery of association rules in the field of data mining. However, they become less effective due to the imbalance caused by distributing the candidates among the processors. Therefore, IDD and HDD are improved by means of introducing the approximate algorithms to solve the problem of load balance effectively. There are two approximate algorithms, one is called the online algorithm, and the other is named the offline algorithm. After that, we give the proof of their performance ratio; the complexity analysis of the improved IDD algorithm is also given. The other class of algorithms finds the associations without candidacy. And based on the efficient FP-growth algorithm, its implementation method of constructing the frequent pattern tree and mining frequent item sets is given for the shared memory parallel formulation. However, it results the imbalance due to the distributing the work among the processors. Therefore, a dynamic mechanism is proposed to solve the problem successfully.Discovery of sequential patterns is becoming increasingly useful and essential in DM fields. Many important knowledge discovery tasks in genomic require the analysis of DNA and protein sequences. The most time-consuming operation in discovering such patterns is the computation of the occurrence frequency of all sub-sequences (called sequential patterns) in the sequence database. In particular, there are three main classes of sequential pattern discovery algorithms. In general, projection-based frequent pattern discovery algorithms have been shown to substantially outperform the others. They still require a substantial amount of time. Therefore, this paper presents the data parallel formulation (DPF) and task parallel formulation (TPF) of tree-projection based algorithm. Moreover, the complexity analysis of the algorithm is given. The theoretical deduction shows that the DPF has certain scalabilty, and the TPF is more scalable. Our experiment shows that these two algorithms are capable of achieving good speedups and the task parallel formulation is more efficient. Aiming to the two points of data compressing and characteristic distilling during the pretreatment of image data process, this paper introduces a non-numeric parallel algorithm of continous hopfield neural network used to clustering of image data mining. The realization of data clustering is also the process of vector quantization of image compress. And the codebook of vector quantization can also be regard as the characteristic distilling for the image. Our work emphases on the pretreatment before the image data mining is to be processed, which predigest the presentation of the image and provide support to the next process for multi media data mining.
Keywords/Search Tags:data mining, parallel algorithms, association rules, sequential patterns, image data mining
PDF Full Text Request
Related items