
Study On Some Issues Of Data Clustering In Data Mining

Posted on: 2006-06-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zhao
Full Text: PDF
GTID: 1118360182460110
Subject: Circuits and Systems
Abstract/Summary:
Data mining, which analyzes and processes large volumes of data and helps people effectively obtain useful and conclusive information or knowledge, is becoming one of the most advanced and active research topics in the field of information decision-making. Databases, machine learning, and statistics are the three pillars of the development of data mining technology. Derived from statistics, clustering analysis is one of the main tools of data mining. Data clustering has been studied extensively in past decades, and a large body of theories and methods has been developed. Nevertheless, many problems in clustering remain open. In particular, as data mining technology is used in various industries and the data become more and more complicated, many new challenges arise in research on data clustering. Present data clustering technologies need to be improved, and novel theories and methods need to be put forward for new applications.

This thesis mainly focuses on clustering validity, the initialization of clustering algorithms, categorical data clustering, and clustering algorithms for high-dimensional data. The main contents of the dissertation are outlined as follows:

In chapter 1, data mining technology and the characteristics of clustering as used in data mining are briefly introduced. The state of research on clustering validity, the initialization of iterative refinement clustering algorithms, categorical data clustering, and clustering algorithms for high-dimensional data is reviewed. Finally, the main achievements and the organization of the thesis are presented.

In chapter 2, clustering analysis in data mining is introduced, covering the structures and types of data, the clustering criteria, and the classification of clustering algorithms. The main clustering algorithms used in data mining are introduced in detail. Finally, the methods for evaluating clustering results are explained.

In chapter 3, clustering validity functions are studied. The partition coefficient and the partition entropy of fuzzy clustering, and the clustering validity functions based on the geometrical structure of the data set, are surveyed. Two novel clustering validity functions are proposed: one is based on the compactness and the separation of the fuzzy c-partition, and the other is based on the combination of Hubert's Γ statistic with the separation. In addition, the defects of the present methods for evaluating clustering results are pointed out, and the view is put forward that clustering accuracy reflects the efficiency of the clustering algorithm. Finally, several partition similarity measures are used as the clustering accuracy.

In chapter 4, several initialization methods are studied, including random sampling, distance optimization, and density estimation. A novel initialization method based on distance optimization is proposed. Compared with the present initialization methods, the new method does not need a threshold, is independent of the order of the data set, and is insensitive to outliers and noise. In addition, the k-harmonic means (KHM) algorithm is investigated, and the fuzzy k-harmonic means (FKHM) algorithm is given. Experiments indicate that the FKHM algorithm not only inherits KHM's insensitivity to initialization but also improves the quality of the clustering results. Finally, a unified expression for the iterative update of the centers is described, and the conditional probability expressions of the centers and the data weight functions for FKHM are derived.

In chapter 5, the k-modes algorithm, the k-prototypes algorithm, and the fuzzy k-modes algorithm are investigated, with emphasis on the performance of k-prototypes. The novel categorical data clustering algorithm proposed in this thesis takes into account the different contribution of each attribute to the clustering, and weights each attribute in the clustering procedure. With a newly defined fitness function, an evolutionary strategy is used to optimize the weighting matrix. The final weights reflect how important each attribute is for the clustering. The algorithm is effective in discovering the influence of each attribute on the clustering and thereby leads to an improved clustering result. Moreover, the optimized weights can be used to select attributes or to reduce the dimensionality of the data. Based on the approximate k-median algorithm, an approximate k-median clustering algorithm for categorical data is developed. The algorithm replaces the modes in the k-modes algorithm with the approximate medians of the data set, and optimizes the cluster centers with the approximate k-median algorithm. Each cluster center is an actual sample of the data set, which prevents empty clusters.

In chapter 6, Hsim(), a similarity measure function for high-dimensional data, is surveyed. The function not only avoids the non-contrasting behavior of distances that the Lk-norm exhibits in high-dimensional space, but also applies to both binary and numerical data. A fuzzy k-median clustering algorithm based on Hsim() is proposed. The algorithm uses Hsim() as the similarity measure of the data, and uses the fuzzy k-median algorithm to optimize the cluster centers. The experiments indicate that the algorithm is effective.

In chapter 7, the thesis is summarized.
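The partition coefficient and partition entropy mentioned for chapter 3 can be illustrated with a minimal sketch. It assumes the standard textbook definitions, PC = (1/n) Σ u_ik² and PE = -(1/n) Σ u_ik ln u_ik, over a membership matrix with one row per data point; the thesis may use a different normalization or logarithm base. The function names and the sample matrices are hypothetical and not taken from the thesis.

```python
import math


def partition_coefficient(u):
    """Mean over data points of the sum of squared memberships.

    `u` is an n x c list of lists; each row holds one point's fuzzy
    memberships and is assumed to sum to 1.
    """
    n = len(u)
    return sum(m * m for row in u for m in row) / n


def partition_entropy(u):
    """Mean over data points of -sum(u_ik * ln(u_ik)).

    Zero memberships are skipped, using the convention 0 * ln(0) = 0.
    """
    n = len(u)
    return -sum(m * math.log(m) for row in u for m in row if m > 0) / n


# Hypothetical membership matrices for two points and two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0]]    # every point fully in one cluster
uniform = [[0.5, 0.5], [0.5, 0.5]]  # maximally fuzzy partition

print(partition_coefficient(crisp), partition_entropy(crisp))
print(partition_coefficient(uniform), partition_entropy(uniform))
```

Under these definitions a crisp partition gives the largest coefficient (1) and the smallest entropy (0), while a uniform partition over c clusters gives 1/c and ln c. This is why both indices are usually read as measures of fuzziness rather than of geometric structure.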
Keywords/Search Tags: Data Mining, Fuzzy Clustering, Clustering Validity, Clustering Initialization, Categorical Attribute, High-Dimensional Data