
An Improved Classification Algorithm Of SVM For Learning Unbalanced Datasets

Posted on: 2011-01-26  Degree: Master  Type: Thesis
Country: China  Candidate: B Yao  Full Text: PDF
GTID: 2178360305454941  Subject: Network and information security
Abstract/Summary:
Classification is an important task in data mining. Many classification algorithms have been developed and successfully applied in a wide range of fields. However, these algorithms often fail on the imbalanced data sets that arise in real problems, where one class contains a large number of samples while the other classes contain very few. When traditional classification algorithms are applied to imbalanced data sets, the accuracy on the minority class is much lower than the overall accuracy, so these algorithms are of limited use. Attempts to handle imbalanced data sets have been made in domains such as fraudulent telephone call detection, medical treatment, text classification, and the detection of oil spills in satellite images, and recently more and more researchers, both domestic and foreign, have turned their attention to the classification of imbalanced data sets.

As a relatively new machine learning method, the support vector machine (SVM) has a simple structure and good generalization ability. The algorithm is based on statistical learning theory and the principle of structural risk minimization, which avoids the local-optimum problem by converting training into a convex optimization problem. SVMs have been studied extensively and have shown remarkable success in many applications. However, this success is very limited when an SVM is applied to learning from imbalanced data sets. Moreover, to find an optimal solution the SVM must compute an n × n kernel matrix at a time complexity of O(n²); when the sample size is large, training slows down, making the algorithm unsuitable for large-scale data. This thesis analyzes in detail why the traditional algorithm performs poorly on imbalanced data sets.
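The quadratic cost mentioned above can be made concrete with a minimal pure-Python sketch (illustrative only; the function name and the choice of an RBF kernel are my assumptions, not taken from the thesis). Building the full kernel matrix requires one kernel evaluation for every pair of samples, i.e. O(n²) work:

```python
import math

def rbf_kernel_matrix(samples, gamma=0.5):
    """Build the full n x n RBF kernel matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2).

    Every pair of samples must be compared, so the cost is O(n^2) kernel
    evaluations -- the training bottleneck for large n that the abstract
    refers to.
    """
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            K[i][j] = math.exp(-gamma * sq_dist)
    return K

# Three 2-D samples: K is symmetric with ones on the diagonal.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
K = rbf_kernel_matrix(X)
```

Doubling n quadruples the number of entries, which is why shrinking the training set (as SHS does) pays off directly in training time.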
On the basis of convex-hull theory and SMOIS over-sampling, and with the aim of increasing training speed, this thesis proposes a new algorithm, SHS. SHS scales down the training set by computing the hull vectors of the original training data. The basic process is as follows:

(1) The minority class is mapped into the high-dimensional feature space rather than handled in the input space. It is then over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to some or all of its k nearest minority-class neighbors; depending on the amount of over-sampling required, neighbors are chosen at random from the k nearest. Repeating this operation N times yields synthetic samples which, together with the original samples, form the reconstructed minority set. For the negative class, convexity theory is used to compute the hull (shell) vectors, which become the new negative training set. These two steps complete the reconstruction of the original data set.

(2) A classifier is then constructed by training on the reconstructed data set to derive the classification model. During optimization the kernel matrix must be computed, which involves inner products between the newly synthesized samples and the original training samples, among the newly synthesized samples themselves, and between the test samples and the newly synthesized samples. The corresponding kernel-matrix calculation and kernel-function formulas are given in Chapter IV.

G-means, ROC, and execution time are chosen as evaluation criteria. Experimental results show that both the classification accuracy and the classification speed of SHS improve on the SMOIS algorithm.
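The interpolation step described above can be sketched in plain Python. This is a hypothetical input-space version for illustration (the function name and parameters are my own, and classic SMOTE works in the input space); the thesis instead performs the interpolation in the kernel-induced feature space:

```python
import random

def smote_like_oversample(minority, k=2, n_new=4, seed=42):
    """Generate synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbors (SMOTE-style).

    Illustrative input-space sketch only; the thesis performs this step in
    the high-dimensional feature space via the kernel function.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbor = rng.choice(others[:k])
        t = rng.random()  # random position on the segment joining x and neighbor
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, neighbor)))
    return synthetic
```

Each synthetic point is a convex combination of two minority samples, so the new points stay inside the region the minority class already occupies while raising its sample count toward balance.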
Over-sampling in the high-dimensional feature space reduces the degree of imbalance between the positive and negative classes, and because the shell vectors make up only a small part of the full training set, the algorithm greatly reduces the size of the training set and, as a result, its time and space complexity.

The major innovations and contributions of this thesis are as follows:

(1) A novel over-sampling method for imbalanced data sets. To address the problems that arise when generating synthetic examples with the SMOTE method, we generate them in the feature space rather than the input space. To reduce the size of the training set and speed up training, we compute the shell vectors of the negative class. Combining these two sampling techniques with support vector machines, we propose SHS, a new classification algorithm for imbalanced data sets.

(2) Application and analysis of SHS in text classification. Because text data is large in scale, high-dimensional, and imbalanced in class distribution, we first preprocess it with feature selection and then apply the classification method. Such an analytical scheme is well suited to the automatic and intelligent analysis of text data.

Research on imbalanced data sets is still at an early stage. At present there are also no clear guidelines for choosing the kernel function and its parameters; most of the time both are chosen by experience. Further research is still needed on classifying extremely imbalanced and large-scale data sets.
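The shell-vector idea, reducing the negative class to the vertices of its convex hull, can be illustrated with a small 2-D sketch. The choice of Andrew's monotone-chain algorithm is my own for illustration; the thesis does not specify a hull algorithm, and real feature spaces are higher-dimensional:

```python
def hull_vectors(points):
    """Return the convex-hull ('shell') vertices of a 2-D point set using
    Andrew's monotone-chain algorithm.

    Illustrative 2-D sketch: training on these hull vectors instead of all
    negative samples is how SHS shrinks the training set.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means no left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain (it repeats the other chain's start).
    return lower[:-1] + upper[:-1]

# A square with one interior point: the interior point is discarded.
shell = hull_vectors([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

Interior points such as (1, 1) are dropped, so the classifier trains on far fewer negative samples while the outer boundary of the class, which is what determines the separating hyperplane, is preserved.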
Keywords/Search Tags:SMOIS algorithms, Support vector machines, Hull vector, Imbalanced data sets, Text classification