
Clustering Transactional Data

Posted on: 2009-09-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Yan    Full Text: PDF
GTID: 1488302750450654    Subject: Computer application technology
Abstract/Summary:
Clustering is a process that partitions a set of physical or abstract objects into a set of disjoint clusters such that the objects within a cluster are close to each other and the objects in different clusters are dissimilar from each other. As an important tool of data mining, clustering methods have been studied extensively in recent years and applied in many application areas.

Nowadays the volume of data that people own grows larger and larger, and, even worse, the types and structures of the data become more complex and varied. Existing clustering algorithms therefore face two problems in real applications: 1) the performance of the algorithms degrades dramatically and, in the worst cases, they cannot carry out the analysis at all for lack of scalability; 2) many clustering algorithms are confined to theoretical analysis, while the features of real data and the differences among applications are seldom considered, so the practicability of these algorithms is poor.

Transactional data is a special kind of categorical data with large volume and high dimensionality. Typical examples of transactional data are market basket data, web usage data, customer profiles, patient symptom records, and image features. Research on scalable clustering algorithms for transactional data is therefore both challenging and meaningful. The main research topics and contributions of this thesis are as follows:

(1) A scalable clustering framework for large-scale transactional data is proposed, i.e. SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation). SCALE has four notable features. First, the set-based similarity measures Coverage Density and Weighted Coverage Density are defined according to the characteristics of transactional data (a rough sketch of these measures is given after the contribution summary). Second, a scalable clustering algorithm based on Weighted Coverage Density is designed and implemented for transactional data. Third, clustering structure detection efficiently reduces the search over the parameter space of the clustering algorithm. Finally, domain-specific measures are used to help users select the optimal clustering result. The experimental results show that the scalable clustering algorithm powered by the SCALE framework can efficiently generate high-quality clustering results.

(2) The problem of detecting the clustering structure of transactional datasets is studied. Since BKPlot, a generic method for detecting the clustering structure of categorical data, has two weaknesses when dealing with transactional data, the DMDI method is proposed specifically for finding the clustering structure of transactional data. Based on the concept of Transactional-cluster-modes Dissimilarity, an agglomerative hierarchical clustering algorithm, i.e. the ACTD algorithm, is designed and implemented. The pair-cluster merging index values generated during the ACTD clustering procedure are used to find candidate optimal cluster numbers (see the second sketch after this summary). The experimental results show that the new method can often identify the clustering structure of transactional data effectively.

(3) The stability problem of transactional clustering results is studied. The clustering procedures of traditional partition-based algorithms are often trapped in locally optimal results, whereas the SOM neural network yields stable clustering results but only for numerical data. A GHSOM-based clustering method for transactional data is therefore proposed, i.e. the GHSOM-CD method, which introduces the concept of Coverage Density into the GHSOM learning algorithm. The new method improves the way the neuron weight values are updated and the stopping criterion for network training. The experimental results show that the GHSOM-CD method produces correct and meaningful transactional clustering results.
(4) The frequent itemset compression problem is studied. To address the problem that frequent itemset mining often generates a very large collection of frequent itemsets, a dynamic clustering method, i.e. the EESC method, is proposed to compress frequent itemsets approximately. The EESC method is based on two intra-cluster similarity measures for frequent itemsets: expression similarity and support similarity (illustrated in the last sketch below). The experimental results show that the approximate frequent itemset compression method is feasible and that the compression quality is good.
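
To make the set-based measures named in contribution (1) concrete, the following Python sketch computes a coverage-density-style score for a cluster of transactions. The formulas below, grid occupancy for Coverage Density and an item-frequency-weighted variant for Weighted Coverage Density, are illustrative assumptions and are not taken verbatim from the thesis.

    from collections import Counter

    def coverage_density(cluster):
        # `cluster` is a list of transactions, each a set of items.
        # Assumed form: the filled fraction of the cluster's item-by-transaction grid.
        counts = Counter(item for t in cluster for item in t)
        if not cluster or not counts:
            return 0.0
        return sum(counts.values()) / (len(counts) * len(cluster))

    def weighted_coverage_density(cluster):
        # Assumed variant that gives the cluster's frequent items a larger weight.
        counts = Counter(item for t in cluster for item in t)
        if not counts:
            return 0.0
        occurrences = sum(counts.values())
        return sum(c * c for c in counts.values()) / (occurrences * len(counts))

A set-based clustering pass could then, for example, assign each transaction to the cluster whose weighted coverage density increases the most; that allocation rule is likewise an assumption made here for illustration, not a statement of the thesis's algorithm.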
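
Contribution (2) reads candidate cluster numbers off the sequence of pair-cluster merging index values produced by agglomerative clustering. The sketch below shows the general mechanics under assumed choices: each cluster is summarised by the relative frequencies of its items, the dissimilarity between two clusters is an L1 distance between those summaries, and every merge records its cost. None of these concrete choices are claimed to match the thesis's Transactional-cluster-modes Dissimilarity or the ACTD algorithm.

    from collections import Counter

    def cluster_mode(cluster):
        # Summarise a cluster by the relative frequency of each item (assumed "mode" form).
        counts = Counter(item for t in cluster for item in t)
        return {item: c / len(cluster) for item, c in counts.items()}

    def mode_dissimilarity(a, b):
        # Hypothetical dissimilarity: L1 distance between two item-frequency summaries.
        return sum(abs(a.get(i, 0.0) - b.get(i, 0.0)) for i in set(a) | set(b))

    def agglomerative_merge_indices(transactions):
        # Merge clusters bottom-up and record the dissimilarity paid at each merge.
        # Naive O(n^3) loop, kept simple for clarity rather than efficiency.
        clusters = [[t] for t in transactions]
        history = []                      # (clusters before merge, merge index) pairs
        while len(clusters) > 1:
            modes = [cluster_mode(c) for c in clusters]
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = mode_dissimilarity(modes[i], modes[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            history.append((len(clusters), d))
            clusters[i].extend(clusters[j])
            del clusters[j]
        return history

Candidate optimal cluster numbers would then be suggested wherever consecutive merge indices jump sharply.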
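
Finally, contribution (4) compresses a large collection of frequent itemsets by grouping itemsets that are close both in expression and in support. As a rough illustration only, the sketch below uses Jaccard similarity of the item sets for expression similarity, the relative gap between supports for support similarity, and a greedy one-pass grouping; the hypothetical thresholds expr_threshold and supp_threshold are assumptions, and this is not the EESC method itself.

    def expression_similarity(x, y):
        # Assumed form of expression similarity: Jaccard similarity of two itemsets.
        return len(x & y) / len(x | y)

    def support_similarity(sx, sy):
        # Assumed form of support similarity: closeness of two supports, in [0, 1].
        return 1.0 - abs(sx - sy) / max(sx, sy)

    def compress_itemsets(itemsets, expr_threshold=0.6, supp_threshold=0.9):
        # Greedy one-pass grouping of (itemset, support) pairs; each itemset joins
        # the first group whose representative is close enough in both measures,
        # otherwise it starts a new group.  Only representatives are kept.
        groups = []                       # (representative itemset, support, members)
        for itemset, support in itemsets:
            for rep, rep_supp, members in groups:
                if (expression_similarity(itemset, rep) >= expr_threshold
                        and support_similarity(support, rep_supp) >= supp_threshold):
                    members.append((itemset, support))
                    break
            else:
                groups.append((itemset, support, [(itemset, support)]))
        return [(rep, supp) for rep, supp, _ in groups]

    # Tiny demo: the first two itemsets collapse into one representative.
    freq = [(frozenset("ab"), 0.40), (frozenset("abc"), 0.38), (frozenset("xy"), 0.10)]
    print(compress_itemsets(freq))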
Keywords/Search Tags: Transactional data, Coverage Density, SCALE Framework, Clustering Structure, Frequent Itemsets