| Real-world applications usually involve large amounts of data of complex types, among which symbolic data is typical. Over the past 30 years, research in data mining and machine learning has made great progress in many fields, but most of it has focused on the analysis of numerical data. How to analyze symbolic data effectively and put the results to practical use is one of the most important issues in data mining and machine learning. In these fields, the main approach to symbolic data is to adapt existing algorithms that have been applied successfully to numerical data. Because the values of symbolic data usually have no numerical or ordinal meaning, measurement tools that are generally applicable to numerical data, such as distance, inner product, mean, and centroid, cannot be applied directly to symbolic data analysis. Consequently, machine learning and data mining algorithms widely used for numerical data, such as K-Means, DBSCAN, KNN, and SVM, cannot be applied directly to symbolic data. This makes the analysis and mining of symbolic data more difficult and challenging than that of numerical data. This dissertation applies the kernel smoothing method and the Mercer kernel method to symbolic data analysis, establishes a kernel estimation model for symbolic data, proposes a space transformation model for the self-expression of kernel data, solves basic problems such as similarity/dissimilarity measurement and inner product calculation for symbolic data, and gives new solutions for clustering analysis, classification analysis, and rare category mining of symbolic data. The main research work of this dissertation is as follows. First, for the problem of estimating the probability distribution of symbolic data, it establishes a kernel estimation model of symbolic data based on the kernel smoothing method, proves the condition for consistent kernel probability estimation, proposes an optimal estimation method for the kernel bandwidth, and 
provides a model basis for the subsequent research. Second, to address similarity measurement and inner product calculation for symbolic data, a kernel data self-expression space transformation model (KDTM) for symbolic data is proposed. It provides a general method for calculating the similarity, distance, and inner product of symbolic data, and its theoretical properties are analyzed in depth. These methods are general and effectively solve the basic problem of similarity/dissimilarity measurement in symbolic data analysis. Third, a nonlinear classification method for symbolic data is studied: the new inner product and distance calculations are used to solve the problem of computing Mercer kernels for symbolic data, and the SVM-S algorithm for nonlinear classification of symbolic data is proposed. Tests on multiple data sets show that the SVM-S algorithm achieves good classification performance. Fourth, clustering of symbolic data is studied. Using the kernel learning model for symbolic data mining, a cluster-center representation based on Bayesian probability is defined, which solves the problem that the center of a symbolic data cluster cannot be represented by a mean. A soft subspace clustering algorithm, KCC, is designed for symbolic data, and a new cluster validity index is proposed to evaluate clustering quality and to determine the number of clusters in a data set. Extensive tests show that the KCC algorithm has good clustering performance and running time. Fifth, rare category mining of symbolic data is studied. The kernel learning method is applied to this problem, and a distance metric method based on symbol frequency difference (FDDM) is proposed for symbolic data. A recognition algorithm, RCDCS, for rare categories of symbolic data, based on data density and a criterion of difference in data distribution between clusters, is 
proposed. Tests on various data sets show that the RCDCS algorithm performs well. The dissertation concludes with a summary of the research work and a discussion of future research prospects. |
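
The abstract does not give the form of the kernel estimation model, so the following is only an illustrative sketch of kernel smoothing for one symbolic attribute. It assumes the standard Aitchison-Aitken kernel, which may differ from the dissertation's own model; the function name `aitchison_aitken_estimate`, the fixed `bandwidth` value, and the toy data are hypothetical, and the bandwidth here is chosen by hand rather than by the optimal estimation method the abstract mentions.

```python
from collections import Counter


def aitchison_aitken_estimate(values, categories, bandwidth):
    """Kernel-smoothed probability of each category of one symbolic attribute.

    Assumption: Aitchison-Aitken kernel, where a sample contributes
    1 - bandwidth to its own category and bandwidth / (c - 1) to each
    of the other c - 1 categories.
    """
    c = len(categories)
    n = len(values)
    if c < 2 or n == 0:
        raise ValueError("need at least two categories and one sample")
    if not 0.0 <= bandwidth <= (c - 1) / c:
        raise ValueError("bandwidth must lie in [0, (c - 1) / c]")

    counts = Counter(values)
    probs = {}
    for cat in categories:
        freq = counts.get(cat, 0) / n  # plain relative frequency
        # Average of the kernel over all samples, written in closed form.
        probs[cat] = (1.0 - bandwidth) * freq + bandwidth / (c - 1) * (1.0 - freq)
    return probs


# Toy data (hypothetical): "yellow" is a valid symbol that was never observed.
colors = ["red", "red", "blue", "red", "green", "blue"]
est = aitchison_aitken_estimate(
    colors, ["red", "blue", "green", "yellow"], bandwidth=0.2
)
```

With `bandwidth=0` the estimate reduces to the raw relative frequencies, and with the maximum bandwidth `(c - 1) / c` it becomes uniform over the categories. Intermediate values give unseen symbols a small nonzero probability while the estimates still sum to one.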