
Research On Clustering And Text Categorization Based On Support Vector Machine

Posted on: 2013-03-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Ping    Full Text: PDF
GTID: 1228330374499631    Subject: Information security
Abstract/Summary:
With the advent of the big-data age, in addition to the regular content resources with usability and validity generated by diverse applications, a large volume of malicious messages and behaviors are simultaneously distributed on the Internet; these may interfere with regular services, violate privacy, mislead people and even harm social stability. Furthermore, all of these data are flowing and expanding dramatically. From the perspective of data management, it is essential to organize, analyze, retrieve and protect useful data and sensitive information quickly and efficiently for customers from different industries and fields. From the perspective of content security, sensitive information is expected to be found and protected, while malicious messages and behaviors are to be classified, filtered and analyzed in order to trace attackers, protect victims, and enable intelligent defense systems to process data, learn knowledge and update their models. Among machine learning methods, cluster analysis (unsupervised learning) and classification (supervised learning) can both be employed for detecting, tracing, organizing and analyzing available information and behavior patterns, so they are regarded as effective and crucial techniques for maximizing the efficiency of information security and protection.

As an important machine learning method based on statistical learning theory, the support vector machine (SVM) not only learns effectively from small-scale samples, but also resolves such practical problems as nonlinearity, high dimensionality and local minima.
Mathematically closed separating hypersurfaces can be generated by the SVM for clustering data in an unsupervised way, while unclosed separating hyperplanes are usually constructed for supervised data classification, especially for text data with high-dimensional, sparse vectors and strongly correlated features. The SVM therefore has excellent qualities for efficient data analysis centered on data management and information security. However, since the data sets to be clustered are typically large-scale, high-dimensional and noisy, with many categories and irregular distributions, the traditional SVM-based clustering method (i.e., support vector clustering) still suffers from drawbacks such as slow training, parametric sensitivity, and the lack of suitable cluster prototypes for improving both efficiency and accuracy. Unfortunately, text, a major form of information on the Internet, usually exhibits exactly these characteristics, which lower its separability and degrade the performance of text categorization systems, for instance by slowing down both training and classification and by reducing the accuracy and inductive value of the collected support vectors.

To overcome these issues, the main contributions of this dissertation can be summarized as follows:

(1) Considering that the support vector clustering algorithm inherits the characteristics of both boundary-based clustering and prototype-based clustering, this thesis identifies the crucial factors that affect the algorithm's performance and, after a detailed analysis of parameter setting, the resolution of the dual problem, and the cluster labeling strategy, suggests practicable directions for improvement.
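For context, the support function at the heart of support vector clustering measures the feature-space distance of a point from the centre of the minimal enclosing hypersphere. A minimal sketch, assuming a Gaussian kernel and dual coefficients `beta` that have already been obtained (solving the dual itself is omitted here):

```python
import numpy as np

def rbf(a, b, q):
    # Gaussian kernel K(a, b) = exp(-q * ||a - b||^2)
    return np.exp(-q * np.sum((a - b) ** 2, axis=-1))

def radius2(x, X, beta, q):
    # Squared feature-space distance of x from the hypersphere centre:
    # R^2(x) = K(x, x) - 2 * sum_j beta_j K(x_j, x)
    #          + sum_{i,j} beta_i beta_j K(x_i, x_j)
    # K(x, x) = 1 for the Gaussian kernel.
    KXX = rbf(X[:, None, :], X[None, :, :], q)   # full n-by-n kernel matrix
    return 1.0 - 2.0 * (beta @ rbf(X, x, q)) + beta @ KXX @ beta
```

Points whose squared radius stays below that of the support vectors lie inside a cluster contour; the drawbacks noted above (kernel-matrix memory, slow training) stem directly from the n-by-n matrix `KXX`.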
We then characterize the relationship between the kernel width q and the decomposition and combination pattern of clusters, and propose a novel method that extracts an appropriate value of q by binary search until an expected result is reached.

(2) As an important boundary-based clustering algorithm, support vector clustering has the major advantage of handling arbitrary cluster shapes effectively. However, this advantage also makes the algorithm sensitive to cluster profiles, especially for data interfered with by sparsely distributed noise points. Because they cannot distinguish noise points from outliers, traditional support vector clustering algorithms suffer from low SVM training efficiency when estimating the support function, and explore the data distribution ineffectively when noise points are admitted into the dual problem. To address this, the thesis gives an innovative definition of noise points in terms of their distributional characteristics and their subordination relations to neighboring clusters. Inspired by this definition, we develop a noise elimination algorithm that removes meaningless noise points before the dual problem is solved, improving the data's separability without destroying cluster profiles. With its help, the efficiency of support vector clustering improves, since many useless nonlinear mapping operations are avoided and the memory required by the kernel matrix is significantly reduced.

(3) One principal route to better efficiency is to find a suitable kind of cluster prototype. Traditional support vector clustering algorithms either choose a division of the support vectors as cluster prototypes or prefer a framework of one cluster with one prototype.
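The binary search over the kernel width q proposed in (1) can be sketched as follows. `cluster_count` is a hypothetical stand-in for running support vector clustering and labeling at a given q; the sketch assumes that the number of clusters grows monotonically with q for the Gaussian kernel, which is the relationship the thesis exploits:

```python
def search_kernel_width(cluster_count, target, q_lo=1e-3, q_hi=1e3, iters=50):
    # Bisect the kernel-width interval until clustering at the midpoint
    # yields the expected number of clusters.
    for _ in range(iters):
        q = (q_lo + q_hi) / 2.0
        k = cluster_count(q)
        if k == target:
            return q
        if k < target:
            q_lo = q   # too few clusters: increase the kernel width q
        else:
            q_hi = q   # too many clusters: shrink q
    return (q_lo + q_hi) / 2.0
```

Each probe costs one full clustering run, so the logarithmic number of probes is what makes the search practical.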
However, the former can hardly deal with large-scale, high-dimensional data sets because of its low efficiency, whereas the prototypes found by the latter usually fail to reach the expected inductive value for clusters with irregular profiles or imbalanced distributions; more importantly, the latter frequently gains efficiency at the cost of accuracy. To overcome these problems, the thesis introduces a double-centroids support vector clustering (DBC) algorithm, which employs a framework of one cluster with multiple prototypes, each prototype represented by a shape centroid and a density centroid. In principle, the DBC algorithm is a compromise between the two traditional approaches, but it adaptively distributes a number of prototype pairs within each irregular cluster. Extensive numerical experiments show that the DBC algorithm not only inherits the ability to identify clusters with irregular profiles, but also reflects the degree of imbalance within a cluster and improves both the efficiency and the accuracy of cluster labeling. Owing to the outstanding inductive value carried by the double centroids, it is also suitable for analyzing large-scale data sets.

(4) Cluster labeling algorithms are closely related to the way cluster prototypes are found or generated. The literature shows that most current support vector clustering algorithms share the same disadvantage when finding connected components: they employ a large number of redundant point pairs drawn from the cluster prototypes, along with redundant sample points on the line segment connecting each pair. As a result, they frequently fail to improve accuracy while their efficiency drops. We therefore present a convex decomposition based cluster labeling (CDCL) algorithm.
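As a hedged illustration of the double-centroids idea in (3): pairing a shape centroid with a density centroid can be read as combining a plain geometric mean with a density-weighted mean. This is our simplified reading, not the dissertation's exact construction, and `bandwidth` is an assumed smoothing parameter:

```python
import numpy as np

def double_centroids(points, bandwidth=1.0):
    # Shape centroid: the plain mean, describing the sub-cluster's geometry.
    shape_c = points.mean(axis=0)
    # Density centroid: points are re-weighted by a Gaussian kernel density
    # estimate, pulling the centroid toward the densest region.
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    weights = np.exp(-d2 / (2.0 * bandwidth ** 2)).sum(axis=1)
    density_c = (weights[:, None] * points).sum(axis=0) / weights.sum()
    return shape_c, density_c
```

For a symmetric, uniformly dense cloud the two centroids coincide; their divergence reflects the degree of imbalanced distribution within the cluster, which is the property the DBC algorithm exploits for labeling.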
Although the proposed CDCL algorithm derives from the framework of one cluster with multiple prototypes, it uses as prototypes a number of convex hulls of different shapes and sizes, decomposed adaptively from each cluster, instead of single data points. The thesis also investigates in detail the crucial factor, defined as the quasi-support vector, which usually affects the judgement of connectivity between two neighboring convex hulls. Based on this finding, an alternative strategy for finding connected components is proposed: sampling the line segments that cross the dense region of quasi-support vectors. This sampling strategy avoids a large number of redundant sampled point pairs. In coordination with the convex decomposition model, a new nonlinear sample sequence is proposed for avoiding redundant sample points on each line segment, i.e., for reducing the actual average sample rate. Compared with traditional algorithms, numerical experimental results confirm that the proposed CDCL algorithm not only improves both efficiency and cluster quality significantly, but is also insensitive to its parameters.

(5) Generally, when constructing either the minimal enclosing hypersphere or the support function for support vector clustering, interior points, outliers and noise points are not actually required; worse, such points raise the memory cost and reduce training efficiency. Based on this observation, the thesis proposes a fast algorithm of support vector clustering (FASVC). The algorithm first selects cluster boundaries in the input space to construct the hypersphere, extract the support vectors and complete the construction of the support function.
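The connectivity test that CDCL accelerates can be sketched as follows. `inside(x)` is a stand-in for the usual criterion R(x) ≤ R(support vector), and the arcsin-based schedule is only one plausible example of a nonlinear sample sequence that concentrates samples near the middle of the segment, not necessarily the sequence defined in the thesis:

```python
import numpy as np

def connected(a, b, inside, n=9):
    # Two prototypes a and b are judged connected if every sampled point on
    # the segment between them stays inside the learned cluster contour.
    # Nonlinear schedule: t-values are denser near the midpoint, where a
    # segment joining two different clusters is most likely to leave the
    # contour, so fewer total samples are needed for the same reliability.
    t = 0.5 + np.arcsin(np.linspace(-1.0, 1.0, n)) / np.pi
    return all(inside(a + ti * (b - a)) for ti in t)
```

Labeling then reduces to computing connected components of the graph whose edges are the prototype pairs passing this test.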
Then, guided by the radius R of the constructed hypersphere, a self-adaptive cluster labeling strategy invokes either the CDCL algorithm or the cone cluster labeling algorithm. Since the training data set is significantly reduced before the dual problem is solved, and the constraint conditions are loosened by the self-adaptive labeling strategy, the proposed FASVC algorithm achieves significant improvements in storage utilization and time efficiency, making it suitable for large-scale data sets. Furthermore, the FASVC algorithm is independent of the penalty factor C and insensitive to the other parameters. Various large-scale benchmark results, including text clustering and P2P traffic classification, demonstrate the effectiveness and efficiency of the proposed algorithm.

(6) The support vector machine has been suggested to be one of the best classifiers for text categorization. Owing to the principle of structural risk minimization, its performance is directly related to the separability of the data set, i.e., the margin between different categories. An effective way of increasing a data set's separability is therefore expected to improve text categorization. From the viewpoint of information theory, vectorizing text can be regarded as a form of text compression, so keeping as much information as possible is important for categorization performance. However, traditional text representation methods abandon or neglect too much information, because of their excessive dependence on document frequency, their global strategy for term (feature) weighting, and their disregard of text structure information, all of which reduce separability.
In view of these issues, the thesis develops several improved methods from multiple perspectives. 1) First, we give a novel definition of the category contribution degree of a term, and based on it propose a category contribution enhanced (CCE) weighting scheme that accounts for both a term's category contribution and its distributional differences among categories, thereby avoiding sole dependence on document frequency. 2) Second, a self-adaptive strategy for partitioning text into content blocks is presented. With this strategy, text structure information can be formulated through the importance of the content blocks and carried by different terms. 3) Third, the thesis defines both the class tendency and the class basis of each term, and proposes an enhanced scheme that integrates multiple class tendencies to strengthen the terms' separating capacities. Moreover, a new algorithm, co-contributions of terms on class tendency for vectorizing text (C2TCTVT), is developed; it retains the distributed information of class tendencies that traditional algorithms with a global strategy neglect. The C2TCTVT algorithm not only compresses the traditionally high-dimensional, sparse text vector into a genuinely low-dimensional, compact one, but also carries the multiple class tendencies on the text vector. Experimental results demonstrate that both the separability of the text vectors and the inductive value of the collected support vectors are significantly improved, while the C2TCTVT algorithm achieves accuracy comparable to traditional methods. 4) Finally, to improve the measurement of the local importance of terms, the thesis presents two groups of weighted term frequency methods with embedded text structure information, recommended as substitutes for plain term frequency.
In particular, a great improvement in the performance of SVM-based text categorization can be achieved when these two methods are combined with the proposed CCE scheme.
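As a hedged illustration of the weighting idea behind the CCE scheme in 1): one simple proxy for a term's category contribution degree is how far its occurrence distribution over categories departs from uniform. The entropy-based measure below is our own stand-in, not the dissertation's definition:

```python
import math

def category_concentration(counts_by_category):
    # Returns 0.0 when the term is spread evenly over categories (no
    # discriminative value) and 1.0 when it occurs in a single category only.
    # A CCE-style scheme weights terms by such inter-category distributional
    # differences instead of relying on document frequency alone.
    total = sum(counts_by_category.values())
    k = len(counts_by_category)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts_by_category.values() if c > 0)
    return 1.0 - entropy / math.log(k)   # normalized by the maximum entropy
```

A term like a stop word scores near 0 and gets down-weighted, while a term concentrated in one category scores near 1, which is the separability-increasing effect the CCE scheme aims at.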
Keywords/Search Tags: content security, support vector machine, cluster analysis, support vector clustering, text categorization