
The Methods Of Multi-database Clustering

Posted on: 2015-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Cuan
Full Text: PDF
GTID: 2268330431458479
Subject: Computer application technology
Abstract/Summary:
With the development of network and database technology, large institutions, especially transnational corporations, have accumulated many large transaction databases, which together form a multi-database. How to acquire useful knowledge from multi-databases efficiently is a new challenge in data mining, and multi-database mining has therefore become an important research subject. To cope with the large volume of data, an effective approach is to classify the databases first and then mine each class. As with conventional techniques, the methods for grouping databases include classification and clustering. This thesis studies multi-database clustering.

Clustering is an important machine learning technique that has been applied successfully in many areas, such as text and web page classification. It groups data automatically without prior knowledge, which makes subsequent, more specific data mining easier. Because the structure of transaction data differs markedly from that of text and web pages, traditional clustering techniques cannot be transferred directly to multi-database classification. How to cluster multi-databases has therefore attracted the interest of scholars, and many studies have been conducted. However, as application demands grow, the limitations of existing multi-database clustering techniques have become more apparent.

In multi-database clustering, the key components are the similarity measure and the clustering algorithm. The way similarity between databases is measured directly affects the clustering results, and the clustering algorithm determines the performance of the method. Currently, common similarity measures are based on a similarity coefficient computed over the frequent itemsets of the transactions, and the clustering methods are mainly partitioning and hierarchical algorithms.
Partitioning methods can produce better clustering results, but their time complexity is higher. Hierarchical methods produce results quickly, but they may miss better classifications.

In this thesis, building on the theory of cluster analysis, we examine the limitations of multi-database clustering, discuss similarity measures and clustering algorithms, and carry out experiments to demonstrate the effectiveness of the proposed methods. The main contributions are as follows:

(1) A discussion of database similarity measures and of evaluation criteria for clustering results. At present, similarity measures for multi-databases are mainly based on a similarity coefficient over the frequent itemsets of the transactions, which mainly captures how alike the local databases are. We design a new similarity measure that emphasizes the differences between databases. The evaluation of clustering results determines the practical value of a clustering method. Building on existing research, we propose a new criterion for evaluating clustering results that jointly considers the inner distance, the outer distance, and the number of classes.

(2) A link-based hierarchical clustering method for multi-databases. The relationships between data objects include adjacency, linkage, and independence. Traditional clustering methods assign classes by comparing the similarity between objects, but they can be disturbed by abnormal data. To avoid this interference, the ROCK algorithm uses the links between objects to perform classification. Following this idea, we redefine the links between local databases and propose a link-based hierarchical clustering approach.
Our method effectively eliminates the influence of abnormal data and obtains satisfactory clustering results in a short time.

(3) Mean-based clustering methods for multi-databases. Based on the K-Means and FCM algorithms, we design the K-Mean and Probability-Mean clustering methods for multi-databases, respectively. The K-Mean method computes the mean distance between a database and a class and obtains the final clustering by reassigning databases iteratively. The Probability-Mean method optimizes the classes by adjusting the values of the membership matrix and then derives the final clustering. Experiments show that both methods are effective.

Multi-database clustering is an important technique in multi-database mining. It classifies large transaction databases effectively and makes deeper data mining more convenient. In this thesis, we discuss the problem of multi-database clustering and propose effective clustering methods. The theoretical analysis and experimental results show that the proposed approach is valuable.
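
As a rough illustration of the two building blocks the abstract names, the sketch below computes a similarity coefficient between databases from their frequent itemsets and then counts ROCK-style links (common neighbors) between databases. It is not the thesis's method: the Jaccard-style coefficient, the function names, the `min_support`, `max_size` and `theta` parameters, and the choice to exclude a database from its own neighbor set are all assumptions made for the example, and the thesis's redefined links are not reproduced here.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force frequent itemsets (hypothetical helper, small data only).

    transactions: list of sets of items.
    Returns the set of frozensets whose relative support >= min_support.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = set()
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            c = frozenset(candidate)
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                result.add(c)
    return result


def similarity(fis_a, fis_b):
    """Assumed Jaccard-style coefficient: shared / all frequent itemsets."""
    union = fis_a | fis_b
    if not union:
        return 0.0
    return len(fis_a & fis_b) / len(union)


def link_counts(sim, theta):
    """ROCK-style links: number of common neighbors for each pair.

    Two databases are neighbors when their similarity is >= theta.
    A database is not counted as its own neighbor (a simplification).
    """
    n = len(sim)
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= theta}
        for i in range(n)
    ]
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


# Toy local databases (made-up data for illustration).
db1 = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
db2 = [{"a", "b"}, {"a", "b"}, {"b", "c"}]
db3 = [{"x", "y"}, {"x", "y"}, {"x"}]

fis = [frequent_itemsets(db, min_support=0.5) for db in (db1, db2, db3)]
sim_matrix = [[similarity(a, b) for b in fis] for a in fis]
links = link_counts(sim_matrix, theta=0.5)
```

A link count rests on shared neighbors rather than on one pairwise similarity value, which is why a link-based method is less sensitive to a single abnormal database. With only three databases, as in the toy data, the counts are not very informative; the effect shows up once each database has several neighbors.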
Keywords/Search Tags: Multi-database Classification, Cluster Analysis, Mean Clustering, Probability Mean Clustering