| With the rapid development of information technology, clustering has become a very active research direction in data mining and has been widely applied in image processing, information retrieval, meteorology, finance and other fields. Boundary points lie at the edges of clusters, and whether they are assigned to the correct cluster directly affects clustering precision. Boundary points also exhibit the characteristics of more than one cluster. In recent years, cluster boundary detection has become an active research direction within clustering. In practice, mixed attribute data comes from a wider range of sources than purely numerical or purely categorical attribute data, yet cluster boundary detection on mixed attribute data remains unexplored. To meet the need for extracting cluster boundaries from mixed attribute data, this thesis carries out the related research and application work. Firstly, to solve the problem of cluster boundary detection on mixed attribute data, a cluster boundary detection algorithm named BERGE (Cluster boundary detection technology for mixed attribute data set) is proposed. The algorithm is built on an effective measurement method for mixed attribute data. It first calculates the distances and memberships from data objects to the cluster centroids on mixed attribute data. Then, according to these distances and memberships, a boundary factor is defined to obtain the candidate boundary set of the data set. Finally, based on the idea of evidence accumulation, the cluster boundary points are extracted from the candidate boundary set. Experimental results on UCI data sets and real data sets show that the BERGE algorithm can effectively obtain the cluster boundary of mixed attribute data, numerical attribute data and categorical attribute data.
The algorithm has high detection precision and suppresses noise to some extent. Secondly, to solve the problem of extracting the boundary of one or several specified clusters on mixed attribute data, a cluster boundary detection algorithm based on shadowed sets, named CHASM (A cluster boundary detection algorithm based on shadowed set), is proposed in this thesis. The algorithm uses shadowed sets to measure fuzziness. According to the structure of a cluster, a new optimization objective function is defined that partitions the mixed attribute data, for any cluster, into three sets: core, exclusion and shadow. Then, according to the differences in how much the three sets contribute to the cluster centroids, the distances and memberships from data objects to the cluster centroids are calculated to update the centroid information of the clusters. Once the algorithm converges, it extracts the shadow set of each cluster as the boundary set of the whole data set. The algorithm can effectively extract the cluster boundary set of mixed attribute data, numerical attribute data and categorical attribute data, and can also obtain the boundary set of a specified cluster of the data set. Finally, to meet the need for extracting cluster boundaries from medical mixed attribute data, a medical data clustering analysis platform, named MDAP (Medical data analysis platform), is presented in this thesis. The software follows an object-oriented design and is divided into 9 modules (central control module, data type conversion module, data format conversion module, data input and output module, data display module, data preprocessing module, clustering analysis module, cluster boundary detection module, parameter setting module).
The software implements 5 classical clustering methods and 11 cluster boundary detection algorithms, and provides data preprocessing, clustering analysis and cluster boundary detection for mixed attribute data, numerical attribute data and categorical attribute data. It is built with an incremental development model and uses the factory pattern in its design. These choices greatly improve the flexibility and extensibility of the software and make it easy to add algorithms or modules in the future. |
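
To illustrate why a factory pattern makes it easy to add algorithms later, here is a minimal Python sketch. It is not MDAP's code: the abstract does not state the implementation language or class structure, so every name below (`BoundaryDetector`, `register`, `create_detector`, `PlaceholderDetector`) is a hypothetical choice, and the placeholder algorithm is deliberately trivial rather than an implementation of BERGE or CHASM.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence, Type


class BoundaryDetector(ABC):
    """Hypothetical common interface for boundary detection algorithms."""

    @abstractmethod
    def detect(self, data: Sequence[Sequence[object]]) -> List[int]:
        """Return the indices of the rows judged to be boundary points."""


# Maps a lower-cased algorithm name to the class that implements it.
_REGISTRY: Dict[str, Type[BoundaryDetector]] = {}


def register(name: str) -> Callable[[Type[BoundaryDetector]], Type[BoundaryDetector]]:
    """Class decorator that makes an algorithm available to the factory."""
    def decorator(cls: Type[BoundaryDetector]) -> Type[BoundaryDetector]:
        _REGISTRY[name.lower()] = cls
        return cls
    return decorator


def create_detector(name: str, **params: object) -> BoundaryDetector:
    """Factory function: build a detector from its registered name."""
    try:
        cls = _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"unknown algorithm: {name!r}") from None
    return cls(**params)


@register("placeholder")
class PlaceholderDetector(BoundaryDetector):
    """Stand-in algorithm (assumption): flags every `step`-th row."""

    def __init__(self, step: int = 2) -> None:
        self.step = step

    def detect(self, data: Sequence[Sequence[object]]) -> List[int]:
        return list(range(0, len(data), self.step))


# Callers depend only on the factory and the interface, not on concrete classes.
detector = create_detector("placeholder", step=2)
print(detector.detect([[1.0, "a"], [2.0, "b"], [3.0, "a"]]))  # prints [0, 2]
```

In a design like this, adding a new algorithm means writing one new class and registering it; the code that selects and runs algorithms does not change.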