Support Vector Machine Based On Multiple Datasets

Posted on: 2020-04-05  Degree: Master  Type: Thesis
Country: China  Candidate: Z H Ge
GTID: 2370330572969690  Subject: Applied Statistics
Abstract/Summary:
The 21st century is the era of big data. The rapid development of computer technology has greatly facilitated the acquisition and storage of data, enabling many departments to generate massive amounts of data every day. Big data is usually a combination of data from different sources, subjects, or formats, and data sets differ from one another because their sources differ. When the same problem is studied, however, the relationships between variables and labels in different data sets are correlated to some degree. Given that big data is characterized by source differences, high dimensionality, and sparsity, one of the goals and challenges of big data analysis is to mine the heterogeneity and homogeneity between data sets while also reducing dimension and removing noise.

Multi-source data analysis has already been studied and applied in fields such as biostatistics and personal credit information. Text classification also faces the problem of multiple data sources, for example in personal spam identification and multi-domain sentiment classification. There is a large body of research on text classification in China and abroad. The mainstream approach is to train a classification model based on statistics or machine learning, and methods such as the support vector machine and boosting give accurate and stable predictions. Few studies, however, have considered the impact of multi-source data on classification problems.

Building on the original support vector machine, this thesis applies the integrative analysis method for multi-source data and proposes a support vector machine model with a group penalty. A sign-based penalty is added to the composite MCP (Minimax Concave Penalty), or cMCP, penalty. The signs of the coefficients of variables shared across data sets are encouraged to agree, so that the heterogeneity and homogeneity between data sets are captured as fully as possible within each variable group. The resulting cMCPs-SVM model (cMCP penalty + sign-based penalty) is used to analyze text classification under multi-source data. The method belongs to the class of bi-level (two-layer) variable selection methods. The group coordinate descent method is used to solve the optimization problem, and out-of-sample accuracy, true positive rate (TPR), and AUC (Area Under Curve) are used as the criteria for evaluating classification performance.

In three sets of simulation experiments, the cMCPs-SVM model was compared with the cMCP-SVM model and with MCP-SVM models fitted to each sub-dataset separately, and both variable selection and classification performance were evaluated. The cMCPs-SVM model was found to have an advantage, and the greater the similarity among the data sets, the more pronounced that advantage.
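
The two penalty components named in the abstract can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the exact formulas, so the sketch uses the standard MCP definition, a composite MCP built as an outer MCP applied to the sum of inner MCPs over each variable's coefficients across data sets, and a smoothed sign-contrast term. The function names, the coefficient layout `B` (rows are variables, columns are data sets), the outer parameter `gamma_outer`, and the smoothing constant `tau` are all hypothetical.

```python
import numpy as np


def mcp(t, lam, gamma):
    """Standard MCP value for |t|: lam*|t| - t^2/(2*gamma) up to gamma*lam, then constant."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)


def composite_mcp(B, lam, gamma_inner, gamma_outer):
    """Composite MCP (assumed form): outer MCP of the summed inner MCPs per variable.

    B has shape (n_variables, n_datasets); each row is one variable's
    coefficients across the data sets and is treated as one group.
    """
    inner = mcp(B, lam, gamma_inner).sum(axis=1)   # one value per variable group
    return float(mcp(inner, lam, gamma_outer).sum())


def sign_contrast(B, lam2, tau=1e-3):
    """Sign-based penalty (assumed form): penalize sign disagreement across data sets.

    sign(b) is replaced by the smooth surrogate b / sqrt(b^2 + tau^2);
    the sum runs over unordered pairs of data sets for each variable.
    """
    s = B / np.sqrt(B ** 2 + tau ** 2)
    diff = s[:, :, None] - s[:, None, :]           # all ordered pairs of data sets
    return float(lam2 * 0.5 * np.sum(diff ** 2))   # 0.5 converts ordered to unordered pairs


def cmcps_penalty(B, lam, lam2, gamma_inner=3.0, gamma_outer=3.0):
    """Total penalty that would be added to the SVM loss in this sketch."""
    return composite_mcp(B, lam, gamma_inner, gamma_outer) + sign_contrast(B, lam2)
```

In this sketch, a variable whose coefficients have the same sign in every data set contributes almost nothing to the sign-contrast term, while opposite signs are penalized, which is the behaviour the abstract describes as encouraging similar coefficient signs for shared variables.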
Keywords/Search Tags:Multi-source datasets, integrative analysis, SVM