Font Size: a A A

Support Vector Clustering For Multi-Relational Data: Design And Implementation

Posted on:2007-03-15Degree:MasterType:Thesis
Country:ChinaCandidate:P LingFull Text:PDF
GTID:2178360182495999Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
This thesis focuses on the clustering problem of multi-relational data.Based on the study of the development of Multi-Relational Data Mining(MRDM) and its algorithm architecture, a novel Two-Phase Support VectorClustering (TPSVC) is proposed in this thesis with support vector technique.A Kernel function tailored to the multi-relational data environment isadopted into TPSVC, with intention to grasp the affinity among structureddata.The ever-going development of the society gives rise to the higher de-mand on data expressions from a number of practical problems. It promotesdata expression from onefold form to multifold forms, that is, the descrip-tion of data is conveyed by many connected tables, or by a relational data-base. This extends traditional data mining to further multi-relational datamining, with requirement to do analysis by integrating multi informationsources. A straightforward solution to MRDM is to aggregate all tables intoa comprehensive one, and extracts knowledge from this single table withhelp of traditional mining algorithms. This, however, would cause the infea-sibility of the storage and computation. What's more important, the originalstructured information and relational information are lost in the processing.Therefore, it is natural for MRDM to ask to learn from multi tables directly.MRDM is the combination of machine learning realm and relationaldatabase technique. In details, in order to mine from tables simultaneously,we must consider to combine the concrete mining methods encoded withmulti-relational environment and the qualified models designed formulti-relational data. It demands that the designed expression language ofdata model should carry inner features of data, and reveal well the relationalinformation among data structures.Although the name of MRDM was proposed on KDD conference, 2002,it has been in growth for more than ten years. At the beginning, it achievedimprovements as the byproduct of ILP (Inductive Logic Programming,ILP).In the following a few years, MRDM moved on with the development ofILP technique, and the ILP algorithms in the main research content ofMRDM. During this period, the guidance idea of MRDM is to upgrade theapproaches of single-table version to multi-table version. Some effectivealgorithms were put forward. For example, WARMR algorithm is the newversion of Apriori algorithm;SCART is the upgrading version of CART al-gorithm. In the next period, Relational Learning Approaches make somedomination in the MRDM algorithm architecture. This approach is based onrelational learning methods of graph. In the latest years, more advancedtechniques are introduced in MRDM. Some representatives are Kerneltrick,support vector technique, spectrum technique, and the data modelsthat integrate logical mechanism and probability mechanism, like BLP andPRM.This thesis is motivated by Kernel trick and support vector technique toaddress clustering issue. Support vectors are employed as the description ofcluster contours. The key idea to formulate contours is to find a minimumhyper sphere that includes all data in the feature space. Based on the con-tours obtained, a support vector bi-classification procedure is iterativelyperformed between the set of inner-cluster points and the set of points thatdescribe contours. Actually, the second set consists of non-bounded supportvectors (nbSVs). The new separating planes are expected to get closer tocluster centers. When separating planes reach some depth, stop iterationsand integrate the nbSVs that are generated by the last bi-classification.These nbSVs are viewed as carriers of the center information. The resultedgroups of nbSVs can be regarded as the cluster centers. The final cluster as-signment is done by computing the affinity between points and cluster cen-ters. This assignment approach is with ease to implement, without sufferingthe pricy cost spent in traditional support vector clustering.The Kernel function designed for multi-relational data is adopted inTPSVC to help the algorithm work in MR environment. Based on the asso-ciation frame, the designed Kernel draws local affinities from the involvedexpanding tables, that is, those tables that are connected with the main tabledirectly or indirectly. Then these local affinities are fused into one globalaffinity according to the convolution idea and summing idea. In the formu-lation of local affinity, different association cases generated by the foreignkey and referenced key are taken into consideration, so as to guarantee thatthe effective feature extraction is conducted well in each information source.The designed Kernel asks to retrieve information from involved tables re-spectively and simultaneously, which meets the demand of MRDM.When it comes to the setting of scale parameter of elementary GaussianKernel function, this thesis brings a self tuning strategy. The tuning strategyaims to learn distribution information from data neighborhood, and uses thiscontext information as the reference of the affinity scale. The distributioninformation of data neighborhood is revealed by distance information. Thesize of neighborhood is described by the number of points that areclose-related with the given point. For each point, we find the individualdensity factor from its neighborhood. When measuring the affinity betweentwo objects, their respective density factors will be combined and act as thereference. The old scale parameter is with confused geometrical meaning,with much difficulty of controlling it suitably. While the new neighborhoodsize parameter is of clear and simple geometrical meaning, with more easeto set manually. And a heuristic is also presented following the tuning ap-proach. This approach equips TPSVC with the ability to adapt to datasets ofdiverse densities, which promotes the generalization of TPSVC. Further-more, this approach eliminates the pricy cost devoted to searching parame-ter space that is necessary for the traditional optimization algorithm.To access the performance of the designed Kernel, we introduce it andthe existing Kernel functions that appeared in literatures into the classicalSVM procedure, with aim to classify the real relational datasets. The resultsshow, the designed Kernel produces higher accuracy than the existing ones.It justifies that the designed Kernel can take an all-around view to treat dataobjects by integrating their features from diverse respects. It is because thatthe designed Kernel formulates the affinity between objects by drawingtheir distance in feature descriptions. Then, implement TPSVC in the sin-gle-table data environment to cluster single-table data, and it is found thatthe performance of TPSVC reaches or gets close to the state-of-the-art re-sults. This also demonstrates the fine clustering property of TPSVC algo-rithm. Finally, encoded with the designed Kernel, TPSVC is conducted onthe real relational datasets. The resulting conclusions obtained by TPSVCprovide a comprehensive analysis to the relational objects. After fusing theinformation of multi descriptions, the algorithm has the ability to present apractical interpretation or comments on relational objects.
Keywords/Search Tags:Multi-Relational
PDF Full Text Request
Related items