
Clustering Transactional Data

Posted on: 2009-09-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Yan    Full Text: PDF
GTID: 1488302750450654    Subject: Computer application technology
Abstract/Summary:
Clustering is a process that partitions a set of physical or abstract objects into a set of disjoint clusters such that the objects within a cluster are close to each other and the objects in different clusters are dissimilar from each other. As an important tool of data mining, clustering methods have been studied extensively in recent years and applied in many application areas.

Nowadays the volume of data that people own grows larger and larger, and, even worse, the types and structures of the data become more complex and varied. Existing clustering algorithms therefore face two problems in real applications: 1) the performance of the algorithms degrades dramatically and, in the worst cases, they cannot carry out the analysis at all for lack of scalability; 2) many clustering algorithms are confined to theoretical analysis, while the features of real data and the differences among applications are seldom considered, so the practicability of these algorithms is poor.

Transactional data is a special kind of categorical data with large volume and high dimensionality. Typical examples of transactional data are market basket data, web usage data, customer profiles, patient symptom records, and image features. Research on scalable clustering algorithms for transactional data is therefore both challenging and meaningful. The main research topics and contributions of this thesis are as follows:

(1) A scalable clustering framework for large-scale transactional data is proposed, i.e. SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation). SCALE has four notable features. First, the set-based similarity measures Coverage Density and Weighted Coverage Density are defined according to the characteristics of transactional data (a rough sketch of these measures is given after the contribution summary). Second, a scalable clustering algorithm based on Weighted Coverage Density is designed and implemented for transactional data. Third, clustering structure detection efficiently reduces the search over the parameter space of the clustering algorithm. Finally, domain-specific measures are used to help users select the optimal clustering result. The experimental results show that the scalable clustering algorithm powered by the SCALE framework can efficiently generate high-quality clustering results.

(2) The problem of detecting the clustering structure of transactional datasets is studied. Since BKPlot, a generic method for detecting the clustering structure of categorical data, has two weaknesses when dealing with transactional data, the DMDI method is proposed specifically for finding the clustering structure of transactional data. Based on the concept of Transactional-cluster-modes Dissimilarity, an agglomerative hierarchical clustering algorithm, i.e. the ACTD algorithm, is designed and implemented. The pair-cluster merging index values generated during the ACTD clustering procedure are used to find candidate optimal cluster numbers (see the second sketch after this summary). The experimental results show that the new method can often identify the clustering structure of transactional data effectively.

(3) The stability problem of transactional clustering results is studied. The clustering procedures of traditional partition-based algorithms are often trapped in locally optimal results, whereas the SOM neural network yields stable clustering results but only for numerical data. A GHSOM-based clustering method for transactional data is therefore proposed, i.e. the GHSOM-CD method, which introduces the concept of Coverage Density into the GHSOM learning algorithm. The new method improves the way the neuron weight values are updated and the stopping criterion for network training. The experimental results show that the GHSOM-CD method produces correct and meaningful transactional clustering results.
(4) The frequent itemset compression problem is studied. To address the problem that frequent itemset mining often generates a very large collection of frequent itemsets, a dynamic clustering method, i.e. the EESC method, is proposed to compress frequent itemsets approximately. The EESC method is based on two intra-cluster similarity measures for frequent itemsets: expression similarity and support similarity (illustrated in the last sketch below). The experimental results show that the approximate frequent itemset compression method is feasible and that the compression quality is good.
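
To make the set-based measures named in contribution (1) concrete, the following Python sketch computes a coverage-density-style score for a cluster of transactions. The formulas below, grid occupancy for Coverage Density and an item-frequency-weighted variant for Weighted Coverage Density, are illustrative assumptions and are not taken verbatim from the thesis.

    from collections import Counter

    def coverage_density(cluster):
        # `cluster` is a list of transactions, each a set of items.
        # Assumed form: the filled fraction of the cluster's item-by-transaction grid.
        counts = Counter(item for t in cluster for item in t)
        if not cluster or not counts:
            return 0.0
        return sum(counts.values()) / (len(counts) * len(cluster))

    def weighted_coverage_density(cluster):
        # Assumed variant that gives the cluster's frequent items a larger weight.
        counts = Counter(item for t in cluster for item in t)
        if not counts:
            return 0.0
        occurrences = sum(counts.values())
        return sum(c * c for c in counts.values()) / (occurrences * len(counts))

A set-based clustering pass could then, for example, assign each transaction to the cluster whose weighted coverage density increases the most; that allocation rule is likewise an assumption made here for illustration, not a statement of the thesis's algorithm.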
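
Contribution (2) reads candidate cluster numbers off the sequence of pair-cluster merging index values produced by agglomerative clustering. The sketch below shows the general mechanics under assumed choices: each cluster is summarised by the relative frequencies of its items, the dissimilarity between two clusters is an L1 distance between those summaries, and every merge records its cost. None of these concrete choices are claimed to match the thesis's Transactional-cluster-modes Dissimilarity or the ACTD algorithm.

    from collections import Counter

    def cluster_mode(cluster):
        # Summarise a cluster by the relative frequency of each item (assumed "mode" form).
        counts = Counter(item for t in cluster for item in t)
        return {item: c / len(cluster) for item, c in counts.items()}

    def mode_dissimilarity(a, b):
        # Hypothetical dissimilarity: L1 distance between two item-frequency summaries.
        return sum(abs(a.get(i, 0.0) - b.get(i, 0.0)) for i in set(a) | set(b))

    def agglomerative_merge_indices(transactions):
        # Merge clusters bottom-up and record the dissimilarity paid at each merge.
        # Naive O(n^3) loop, kept simple for clarity rather than efficiency.
        clusters = [[t] for t in transactions]
        history = []                      # (clusters before merge, merge index) pairs
        while len(clusters) > 1:
            modes = [cluster_mode(c) for c in clusters]
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = mode_dissimilarity(modes[i], modes[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            history.append((len(clusters), d))
            clusters[i].extend(clusters[j])
            del clusters[j]
        return history

Candidate optimal cluster numbers would then be suggested wherever consecutive merge indices jump sharply.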
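
Finally, contribution (4) compresses a large collection of frequent itemsets by grouping itemsets that are close both in expression and in support. As a rough illustration only, the sketch below uses Jaccard similarity of the item sets for expression similarity, the relative gap between supports for support similarity, and a greedy one-pass grouping; the hypothetical thresholds expr_threshold and supp_threshold are assumptions, and this is not the EESC method itself.

    def expression_similarity(x, y):
        # Assumed form of expression similarity: Jaccard similarity of two itemsets.
        return len(x & y) / len(x | y)

    def support_similarity(sx, sy):
        # Assumed form of support similarity: closeness of two supports, in [0, 1].
        return 1.0 - abs(sx - sy) / max(sx, sy)

    def compress_itemsets(itemsets, expr_threshold=0.6, supp_threshold=0.9):
        # Greedy one-pass grouping of (itemset, support) pairs; each itemset joins
        # the first group whose representative is close enough in both measures,
        # otherwise it starts a new group.  Only representatives are kept.
        groups = []                       # (representative itemset, support, members)
        for itemset, support in itemsets:
            for rep, rep_supp, members in groups:
                if (expression_similarity(itemset, rep) >= expr_threshold
                        and support_similarity(support, rep_supp) >= supp_threshold):
                    members.append((itemset, support))
                    break
            else:
                groups.append((itemset, support, [(itemset, support)]))
        return [(rep, supp) for rep, supp, _ in groups]

    # Tiny demo: the first two itemsets collapse into one representative.
    freq = [(frozenset("ab"), 0.40), (frozenset("abc"), 0.38), (frozenset("xy"), 0.10)]
    print(compress_itemsets(freq))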
Keywords/Search Tags: Transactional data, Coverage Density, SCALE Framework, Clustering Structure, Frequent Itemsets