Support Vector Machine Based On Multiple Datasets

Posted on:2020-04-05

Degree:Master

Type:Thesis

Country:China

Candidate:Z H Ge

Full Text:PDF

GTID:2370330572969690

Subject:Applied Statistics

Abstract/Summary:

The 21st century is the era of big data. The rapid development of computer technology has greatly facilitated the acquisition and storage of data, so that many organizations now generate massive amounts of data every day. Big data usually combines data from different sources, subjects, or formats, and data sets differ from one another because their sources differ. When the same problem is studied, however, the relationships between variables and labels in different data sets are correlated to some degree. Because big data is characterized by source differences, high dimensionality, and sparsity, mining the heterogeneity and homogeneity between data sets while reducing dimension and removing noise is one of the goals and challenges of big data analysis.

Multi-source data analysis has already been studied and applied in fields such as biostatistics and personal credit information. Text classification also faces multiple data sources, for example in personal spam identification and multi-domain sentiment classification. Text classification has been studied extensively in China and abroad. The mainstream approach is to train a classification model based on statistics or machine learning; methods such as support vector machines and boosting give accurate and stable predictions. However, few studies have considered the impact of multi-source data on classification problems.

Building on the original support vector machine, this thesis applies integrative analysis of multi-source data and proposes a support vector machine model with a group penalty. A sign-based penalty is added to the composite MCP (minimax concave penalty), or cMCP, penalty. It encourages the coefficients of variables shared across data sets to have similar signs, so that the heterogeneity and homogeneity between data sets are extracted as fully as possible within each variable group. The resulting cMCPs-SVM model (cMCP penalty + sign-based penalty) is used to analyze text classification under multi-source data. The method is a bi-level variable selection method. The group coordinate descent method is used to solve the optimization problem, and out-of-sample accuracy, true positive rate (TPR), and AUC (area under the curve) are used as the criteria for evaluating classification performance. In three sets of simulation experiments, the cMCPs-SVM model was compared with the cMCP-SVM model and the sub-dataset MCP-SVM model, and both variable selection and classification performance were evaluated. The cMCPs-SVM model was found to have an advantage, and the greater the similarity among the data sets, the more pronounced the advantage.
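
As a rough illustration of the kind of penalty the abstract describes, the sketch below evaluates a composite MCP term plus a sign-based term for a small coefficient table. The abstract does not give the thesis's formulas, so everything here is an assumption: the MCP uses its standard textbook form, the composite penalty applies an outer MCP to the sum of inner MCPs with outer parameter `b = K * a * lam / 2`, and the sign-based term simply counts pairwise sign disagreements. The function names and the default `a=3.0` are hypothetical.

```python
from itertools import combinations


def mcp(t, lam, gamma):
    """Standard minimax concave penalty evaluated at a scalar t."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat region: no further penalty growth


def composite_mcp(group, lam, a, b):
    """Outer MCP applied to the sum of inner MCPs over one variable group."""
    inner = sum(mcp(beta, lam, a) for beta in group)
    return mcp(inner, lam, b)


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_penalty(group, lam2):
    """Assumed sign-based term: penalize pairs of data sets whose
    coefficients for the same variable have different signs."""
    return 0.5 * lam2 * sum(
        (sign(u) - sign(v)) ** 2 for u, v in combinations(group, 2)
    )


def cmcps_penalty(coefs, lam1, lam2, a=3.0):
    """Total penalty; coefs[j] lists variable j's coefficient in each data set."""
    total = 0.0
    for group in coefs:
        b = len(group) * a * lam1 / 2.0  # assumed outer MCP parameter
        total += composite_mcp(group, lam1, a, b) + sign_penalty(group, lam2)
    return total


if __name__ == "__main__":
    # Three variables observed in two data sets (made-up numbers).
    coefs = [[0.8, 0.5], [0.4, -0.3], [0.0, 0.0]]
    print(cmcps_penalty(coefs, lam1=1.0, lam2=0.5))
```

In this sketch the second variable, whose coefficients disagree in sign across the two data sets, picks up the extra sign-based cost, while the all-zero third variable contributes nothing. In an actual fit, a term like this would be added to the SVM loss and minimized, for instance by group coordinate descent as the abstract mentions.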

Keywords/Search Tags:

Multi-source datasets, integrative analysis, SVM

Related items

1	Integrative Unsupervised Learning Based On Multi-Source Data
2	Integrative statistical methods for the analysis of transcriptomic and metabolomic data
3	Integrative Structural Modeling Of Large Biomolecules:Methods Development And Applications
4	De Novo Prediction Of Drosophila's Cis-regulatory Modules Through Integrative Analysis Of ChIP Datasets
5	Research On Methods And Integration Applications Of Polygonal Object Matching On Multi-scale Datasets
6	Multi-omics Integrative Analysis Of Aspergillus Niger Fermentation Process For Glucoamylase Production And Exploration On Metabolic Engineering Of Enzyme Production Based On Enhancing Amino Acid Synthetic Pathways
7	Statistical Methods For Analyzing Small RNA Sequencing Datasets Based On miRNA/isomiR Profilings
8	Development Of Hourly AOD Dataset Based On Geostationary Satellites And Fusion Of Multi-source AOD Datasets
9	Study On The Landslide Susceptibility Evaluation Method Based On Multi-source Data And Multi-scale Analysis
10	The Research Of Integrative Analysis In Heterogenous Panel Data Model And Its Application