
Research On Improved Support Vector Machine Based On Category Imbalanced Dataset

Posted on: 2018-10-09 | Degree: Master | Type: Thesis
Country: China | Candidate: B B Zhang | Full Text: PDF
GTID: 2348330518453951 | Subject: Computer technology
Abstract/Summary:
Rapid advances in computer technology have led to explosive growth in accumulated data. Data mining technology emerged and developed rapidly in order to make full use of these data in practical work and scientific research. In many practical studies the dataset is category imbalanced: the number of samples in one category differs greatly from the number of samples in the other. The small-sample class is typically the more valuable one, so classification of category imbalanced datasets is a hot research topic in data mining. Traditional machine learning algorithms often lower the recognition rate of the small-sample class, which greatly reduces the overall classification performance of the classifier.

The Support Vector Machine (SVM) is a method grounded in statistical learning theory and has a solid theoretical foundation. On category balanced datasets SVM achieves higher classification performance than other classification algorithms, but it is somewhat deficient on imbalanced data. To address the difficulty of classifying category imbalanced datasets, this thesis proposes an SVM-based classification method, the Demarcation Threshold Support Vector Machine (DP-SVM), which improves on SVM in the following two aspects.

1. Processing of the aliasing data at the classification boundary. This thesis concentrates on the classification boundary between the two classes of samples, since the data at the boundary are the most important for constructing the SVM. Most previous work either deleted the data at the classification boundary outright or simply added them to the small-sample class, which underestimates the impact of the aliasing (overlapping) data on the classification accuracy of the small-sample class. For this reason, the aliasing data at the classification boundary are partitioned and processed carefully in this thesis.

2. Pruning of the two classes of support vectors. This thesis applies different processing strategies depending on the relationship between the number of samples in the small-sample class and the number of support vectors in the large-sample class. When the number of samples in the small-sample class is balanced with the number of support vectors in the large-sample class, a soft margin is introduced to obtain the optimal hyperplane. Otherwise, two strategies are provided, the SMOTE algorithm and a method based on principal component analysis, and the better of the two is chosen when necessary. When the small-sample class is relatively rare and its number of support vectors is far smaller than that of the large-sample class, the samples with comparatively larger weights are selected from the small-sample class so that the number of selected samples plus the number of support vectors in the small-sample class is balanced with the number of support vectors in the large-sample class, and a classifier is then constructed.
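
The SMOTE step mentioned in aspect 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `smote` and `balance_to_target`, the neighbour count `k`, and the idea of passing the number of large-sample-class support vectors as `target` are all assumptions made for illustration. The sketch only shows the core SMOTE idea of creating synthetic small-sample-class points by interpolating between a sample and one of its nearest neighbours in the same class.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (core SMOTE idea)."""
    if n_synthetic == 0:
        return []
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random point on the segment from base towards the neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def balance_to_target(minority, target, k=3, seed=0):
    """Hypothetical helper: oversample the minority class up to `target`,
    e.g. the number of support vectors found in the large-sample class."""
    deficit = max(0, target - len(minority))
    return list(minority) + smote(minority, deficit, k=k, seed=seed)


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
balanced = balance_to_target(minority, target=10)
print(len(balanced))
```

Because every synthetic point lies on a segment between two existing small-sample-class points, the oversampled class stays inside the region the original samples already occupy. How the result is then combined with the soft margin or the principal component analysis strategy is left to the thesis itself.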
Keywords/Search Tags: support vector machine, aliasing data, soft margin, category imbalance, support vectors