| In recent years, big data analysis technology has been widely applied. Clustering divides samples into clusters according to their similarity, which helps reveal hidden relationships between samples. In the medical field, clustering can mine latent information and provide decision support for medical researchers. This thesis studies algorithms for extracting disease risk factors based on clustering. The specific work is as follows. First, a K-means clustering method based on an improved Canopy algorithm is constructed to extract risk factors. Feature selection is used to filter the features, and the improved Canopy algorithm then determines the number of clusters and the initial center points. K-means is then used to mine the internal relations among the feature variables, and the risk factors are extracted by calculating a correlation index. The algorithm performs well in predicting the number of clusters and on other clustering evaluation indicators. Second, clustering methods with fixed weights cannot describe the geometry of the data, so the idea of dynamically adjusting weights is applied to the K-means algorithm. The algorithm constructs its initial weights with SVM-RFE, and the initial center points are closer to the distribution of the data, so clustering converges more quickly and needs fewer iterations. Risk factors are selected according to the weights of the variables. Experiments show that the algorithm achieves better clustering time efficiency and verify the effectiveness of selecting key features based on feature weights. Third, medical data are intricate and hard to analyze. This thesis combines Gaussian mixture clustering with a hard clustering method, integrating the strengths of both, and applies the combination to medical data; the initial parameters of the EM algorithm are also improved. A “weighted hierarchical coefficient” is proposed to calculate the importance of each feature node using decision trees, and the internal tendency of subsets of the dataset is studied by boosting. |
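
The first contribution, Canopy seeding followed by K-means, can be illustrated with a minimal sketch. It uses the classic single-threshold Canopy pass, not the improved variant developed in the thesis, and the function names (`canopy_centers`, `kmeans`), the threshold `t2`, and the toy data are all hypothetical choices made for illustration. The number of canopies found gives the cluster count, and the canopy centers serve as the initial K-means centers.

```python
import math


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Classic Canopy pass (assumption: NOT the thesis's improved variant).

    Each remaining point in turn becomes a canopy center; every point
    within the tight threshold t2 of it is removed from the candidates.
    The loose threshold T1 is omitted because only the centers are needed.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans(points, init_centers, max_iter=100):
    """Plain Lloyd-style K-means started from the given initial centers."""
    centers = [list(c) for c in init_centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # assignments have stabilised
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy data (hypothetical): two well-separated groups of three points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = canopy_centers(data, t2=3.0)   # cluster count = len(seeds)
centers, labels = kmeans(data, seeds)
```

In this sketch the result depends on the choice of `t2` and on the order of the points; removing that sensitivity is presumably what an improved Canopy step would address, but the details of the thesis's method are not given here.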