
Research On Clustering And Classification Based On Support Vector Technique

Posted on: 2011-10-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Ling
Full Text: PDF
GTID: 1118360305453677
Subject: Computer application technology
Abstract/Summary:
Learning is indispensable to human life and development. Through learning, people gain individual experience and adapt to a rapidly changing environment. As human culture has developed, the size and richness of available data have grown dramatically, so the speed and span of human learning can no longer meet society's requirements; learning has therefore shifted from a purely human mental activity to computer programs. Machine learning is defined as the process of simulating human learning with a computer. Despite the amazing development since the computer appeared, progress in simulating the intelligent activities of the human brain has been slow. Human learning is grounded in statistics, but machine learning methods built on classical statistical theory run into problems such as small samples, nonlinearity, over-fitting, the curse of dimensionality, and local minima. Researchers therefore enriched and improved the theory and created statistical learning theory, on which many machine learning methods have since been designed. The Support Vector Machine (SVM) is the outstanding method of this family, with high generalization ability. With its solid theoretical foundation and excellent empirical behavior, SVM has become a hot research topic, a strong tool in the machine learning field, and a vital component of many applications.

Starting from the theoretical study and algorithmic development of support vector techniques, this thesis carries out systematic research on unsupervised learning, supervised learning, the geometric properties of support vectors in feature space, and multi-relational data mining. Several algorithms are proposed: reduced support vector clustering, agglomerative clustering based on support vector techniques, a twin-classification framework, and a novel multi-classification scheme based on three-division classifiers. The main contributions are as follows.

1) Research on reduced support vector clustering.

To address the high cost and poor labeling performance of traditional Support Vector Clustering (SVC), the thesis presents Reduced Support Vector Clustering (RSVC), which attacks both bottlenecks simultaneously. RSVC consists of two crucial components: a reduction strategy and a new labeling approach. The reduction strategy, based on the Schrödinger equation, selects the data that contribute most to model formulation and collects them into a reduced subset; the objective function is refined accordingly so that it cooperates well with the reduction strategy. The new labeling approach exploits the geometric properties of support vectors in feature space and detects clusters in a simple and effective way. These geometric properties are verified theoretically, and the quality of the reduced subset is also analyzed theoretically. Experiments demonstrate that, compared with the original optimization process, the reduction strategy and the modified objective produce the model without obvious loss of quality, and that the new labeling approach achieves higher performance and efficiency. The RSVC algorithm exhibits better clustering ability and higher efficiency than its peers. The study of the geometry of support vectors in feature space in this topic also broadens research on support vector techniques.
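To make the setting concrete, here is a minimal sketch of the classical SVC-style workflow that RSVC improves upon: a kernel boundary is fitted on a reduced subset, and two points are put in the same cluster when every sampled point on the segment between them stays inside the boundary. Using scikit-learn's OneClassSVM as the boundary model and a plain random subsample in place of the Schrödinger-equation-based selection are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch of SVC-style clustering on a reduced subset (illustrative only).
import numpy as np
from sklearn.svm import OneClassSVM
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def svc_style_clustering(X, reduction_ratio=0.3, nu=0.2, gamma=1.0, n_segment_samples=10):
    """Fit a kernel boundary on a reduced subset, then label clusters by checking
    whether the segment between two points stays inside the boundary."""
    rng = np.random.default_rng(0)
    n = len(X)
    # Stand-in for the reduction strategy: keep a random fraction of the data.
    keep = rng.choice(n, size=max(2, int(reduction_ratio * n)), replace=False)
    boundary = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X[keep])

    inside = boundary.decision_function(X) >= 0  # points inside the learned support region
    rows, cols = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if not (inside[i] and inside[j]):
                continue
            ts = np.linspace(0.0, 1.0, n_segment_samples)[1:-1]
            segment = X[i] + ts[:, None] * (X[j] - X[i])  # points along x_i -> x_j
            if np.all(boundary.decision_function(segment) >= 0):
                rows.append(i)
                cols.append(j)
    adjacency = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(adjacency, directed=False)
    return labels
```

The segment test above is the classical O(n^2) labeling step whose cost RSVC avoids; in the thesis, the reduced subset is chosen by the Schrödinger-equation-based criterion and cluster labels are derived from the geometric properties of the support vectors instead.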
2) Research on agglomerative clustering based on support vector techniques.

This topic combines support vector techniques with the agglomerative clustering process and designs a novel clustering algorithm with several new features. First, the initial spherical grids to be merged are generated by a support vector technique, which shortens the merging process and increases clustering efficiency. Second, a method for selecting the clustering result is given, also based on support vector techniques; it provides a controlling mechanism for the clustering operations and is useful in concrete applications. Third, a grid distance is used in the merging operations; this distance is defined through random projection and removes the influence of the data distribution. Finally, to handle large datasets, a low-cost rectangular grid and a corresponding result-selection method are designed, which make the algorithm more adaptable. Experiments test all components of the algorithm: the spherical grid has good quality, and agglomerative clustering starting from spherical initial grids performs well in both accuracy and efficiency; the rectangular grid has high efficiency and moderate quality, and agglomerative clustering starting from rectangular initial grids performs moderately.

Support vector methods are model-based techniques with a rich theoretical foundation, while agglomerative clustering methods rely on merging clusters. Integrating the two ideas brings generalization ability and controllability to the clustering process and forms a meaningful research direction.

3) Research on the SVM interface and the twin-classification framework.

The underlying meaning of the SVM interface is discussed systematically in terms of the interface model and its mathematical and geometric meaning. In detail, the following work is done. First, based on the information carried by the interface function, a confidence function is defined to score the quality of an SVM decision. Second, based on the derivative of the interface function, a new metric is defined that is friendly to the classification task. Third, the twin-classification idea and concrete algorithms are presented. The idea pairs an offline global classifier with an online local classifier: the confidence value of the global classifier decides whether a local classifier is constructed. The aim is to increase classification accuracy at only a small increase in cost.

Two twin-classification algorithms take SVM as the global classifier and employ different local classifiers. The first uses a nearest-neighbor classifier as the local classifier, working in the space induced by the new metric. The second uses a fuzzy classifier as the local classifier, working in the space induced by a shortest-path-based metric; this metric uses more information about the data distribution in order to handle datasets with special distribution shapes. Experiments show that the classification-friendly metric achieves the expected goal, that the shortest-path-based metric avoids the influence of the distribution shape, and that both help to improve classification accuracy. Experiments also suggest that the twin-classification framework does better than the usual single classifier and that the extra cost caused by the local classifier is modest; the cost increase can even be controlled by restricting the conditions under which the local classifier is invoked. The twin-classification framework can therefore be applied in broad areas.

Finally, the twin-classification idea is extended to a more general concept, the twin-learning idea, which asks a global learning machine and a local learning machine to cooperate with each other.
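As an illustration of the twin-classification framework, the sketch below pairs a global SVM with a confidence-gated local nearest-neighbor classifier. The confidence used here (the magnitude of the SVM decision value), the threshold, and the plain Euclidean neighborhood are illustrative assumptions standing in for the interface-based confidence function and the classification-friendly metric defined in the thesis.

```python
# Hedged sketch of twin classification: global SVM plus a confidence-gated local k-NN.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

class TwinClassifierSketch:
    def __init__(self, conf_threshold=0.5, n_local=25, k=5):
        self.conf_threshold = conf_threshold  # below this, delegate to the local classifier
        self.n_local = n_local                # size of the local neighborhood
        self.k = k
        self.global_clf = SVC(kernel="rbf", gamma="scale")

    def fit(self, X, y):
        self.X_, self.y_ = np.asarray(X), np.asarray(y)
        self.global_clf.fit(self.X_, self.y_)  # offline global classifier
        return self

    def predict_one(self, x):
        x = np.asarray(x).reshape(1, -1)
        # Stand-in confidence: distance of x from the SVM decision surface.
        confidence = np.abs(self.global_clf.decision_function(x)).min()
        if confidence >= self.conf_threshold:
            return self.global_clf.predict(x)[0]
        # Low confidence: build an online local classifier on the nearest training points.
        distances = np.linalg.norm(self.X_ - x, axis=1)
        idx = np.argsort(distances)[: self.n_local]
        local = KNeighborsClassifier(n_neighbors=min(self.k, len(idx)))
        local.fit(self.X_[idx], self.y_[idx])
        return local.predict(x)[0]
```

Calling `fit` once and then `predict_one` per test point mirrors the offline/online split described above; how often the confidence falls below the threshold is exactly what controls the extra cost of the local classifier.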
4) Research on support vector techniques and a multi-classification algorithm based on three-division classifiers.

Theoretically, the coincidence of the basic models of SVM, SVC, and Support Vector Regression (SVR) is interpreted in terms of the meaning of the model function, its geometric meaning, and the model derivation: all three share the same spirit of finding the important data as support vectors and constructing the model from them. In algorithm design, the three support vector methods are combined into a new multi-classification algorithm. The algorithm employs a basic classifier, generated from SVM and SVR, that divides the data into three regions. The basic classifiers are trained on qualified data representatives found by a modified SVC process and are organized in a tree-shaped structure that includes a radius rule, a method for handling rejection cases, and a method for ordering the classes. Testing corresponds to a path from the root to a leaf. Experiments show that the components of the algorithm work well and that its behavior is superior to its counterparts.

5) Research on multi-relational data mining based on support vector techniques.

In this part, the structural characteristics of multi-relational data are analyzed first, and then an idea for defining kernel functions is proposed: decompose the global affinity into local affinities, each coming from a relevant extended table. Based on this idea, a concrete kernel function and a kernel-function framework are defined. For the framework, parameterization approaches are given for the semi-supervised scenario, which has affinity information, and for the supervised scenario, which has label information. The parameterization approach for the supervised scenario employs the risk-score mechanism of a classifier; here SVM is used as the classifier, and a compromise risk-score function is proposed specially for it. On this basis, an iterative algorithm is given that carries out the parameterization and the model development simultaneously. Finally, the compromise risk-score mechanism is extended to general learners.

Experiments on real datasets demonstrate that the proposed kernel function has a high ability to measure affinity and that the parameterization approaches work well and help to produce good learning results. The compromise risk-score mechanism also suggests a broader insight: combining a theoretical risk score with an empirical risk score yields a more instructive and more rewarding risk-score mechanism, which is expected to contribute to model development and parameterization. This direction deserves further effort.
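The kernel-definition idea above, decomposing the global affinity into per-table local affinities, can be sketched as a weighted sum of local kernels. The table split, the RBF local kernels, and the fixed weights below are illustrative assumptions standing in for the concrete kernel and the parameterization procedures developed in the thesis.

```python
# Hedged sketch of a multi-relational kernel built from per-table local affinities.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Local RBF affinity between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def multi_relational_kernel(blocks_x, blocks_z, weights, gamma=1.0):
    """Global affinity K(x, z) = sum_t w_t * K_t(x_t, z_t), where x_t is the
    feature block of object x drawn from (an extension of) table t."""
    K = np.zeros((len(blocks_x[0]), len(blocks_z[0])))
    for w, Xt, Zt in zip(weights, blocks_x, blocks_z):
        K += w * rbf_kernel(Xt, Zt, gamma)
    return K

# Illustrative usage: two "tables" describing the same five target objects.
rng = np.random.default_rng(0)
table1 = rng.normal(size=(5, 3))  # block from the first extended table
table2 = rng.normal(size=(5, 2))  # block from the second extended table
K = multi_relational_kernel([table1, table2], [table1, table2], weights=[0.7, 0.3])
```

A nonnegative weighted sum of valid kernels is itself a valid kernel, so the weights play the role of the parameters that the thesis fits from affinity information in the semi-supervised scenario or from the SVM-based compromise risk score in the supervised scenario.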
Machine learning methods based on support vector techniques are strong tools. This thesis devotes considerable effort to theoretical research and algorithm design in this area. Although the work is only a drop of water in a broad sea, it reveals the promising ability of support vector techniques, and it is well expected that support vector techniques will play an ever more significant role in the coming era.

Keywords/Search Tags: Support vectors, Reduction algorithm, Twin-classification idea, Classification-friendly metric, Compromise risk score