Feature Selection And Application Of Ultra High Dimensional Data With Spherical Variables Based On CC-SIS

Posted on: 2024-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: M J Wang
Full Text: PDF
GTID: 2530307079461534
Subject: Statistics
Abstract/Summary:
With the advent of the big data era, high-dimensional data have become increasingly prevalent in fields such as medicine, biology, and economics. Such datasets often contain spherical variables that carry crucial information, for example disease onset time, wind direction, and other temporal data. Fast, effective, and stable feature screening for ultra-high-dimensional data containing spherical variables is therefore of considerable practical importance. This study addresses feature selection in high-dimensional data with spherical variables and examines the efficacy of the Conditional Correlation Sure Independence Screening (CC-SIS) method with a random forest kernel.

The research consists of two parts. (1) Using the random forest algorithm as an adaptive kernel function, a random forest kernel is integrated into the CC-SIS framework to give the random forest kernel CC-SIS method, whose convergence and selection accuracy are then investigated through numerical experiments. (2) The method is extended to real data: the screened data are classified with a model, and the feature screening efficacy is evaluated using five-fold cross-validation. Numerical experiments examine the feature selection performance of the random forest kernel CC-SIS method on real-world datasets, as well as the performance and effectiveness of a fusion model that combines the varying coefficient model, the logistic regression (LR) model, and the convolutional neural network (CNN) model for classification.

Feature selection performance is assessed by comparing, through simulations and experiments, several CC-SIS methods based on different kernel functions with the classical SIS method. The results show that, for high-dimensional data with spherical variables, the random forest kernel CC-SIS method achieves the highest feature selection accuracy. The von Mises-Fisher (vMF) kernel and Gaussian kernel CC-SIS methods follow with slightly lower accuracy, while the EP kernel CC-SIS method and the SIS method achieve only moderate accuracy.

For real-world high-dimensional data, the classification accuracy obtained on the selected features is the key measure of how well feature selection has worked. Robust classification accuracy on real data therefore depends not only on an appropriate feature selection method but also on an appropriate classification model. Logistic regression and neural network models classify accurately on datasets where feature selection has worked well, but their performance degrades on datasets where feature selection is only moderately successful. In contrast, the varying coefficient model remains stable and classifies accurately on the datasets produced by every CC-SIS method, whatever the kernel function, and by the SIS method. By integrating the three models, the fusion model improves on the classification accuracy of the varying coefficient model while preserving its stability, yielding a substantial improvement in classification performance.
Keywords/Search Tags: Ultra-high-dimensional data, Feature screening, Spherical data, Kernel function, Fusion model
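
The abstract does not give the screening statistic itself, so the following is only a minimal sketch of kernel-based conditional correlation screening in the usual CC-SIS form: estimate the conditional correlation between each feature and the response by kernel smoothing, average its square over the sample, and keep the top-ranked features. It rests on several assumptions that are not from the thesis: the spherical variable is a single angle `u` in radians acting as the conditioning variable, the kernel is a von Mises kernel (the circular case of vMF) rather than the random forest kernel, and the names `von_mises_weights`, `cc_sis_scores`, `screen`, and the concentration `kappa` are hypothetical.

```python
import numpy as np


def von_mises_weights(u, kappa=4.0):
    """Row-normalised von Mises kernel weights between all pairs of angles.

    Assumption: u holds angles in radians; kappa is a hypothetical
    concentration parameter playing the role of an inverse bandwidth.
    """
    diff = u[:, None] - u[None, :]
    # Subtracting 1 keeps the exponent <= 0; the constant cancels on normalising.
    w = np.exp(kappa * (np.cos(diff) - 1.0))
    return w / w.sum(axis=1, keepdims=True)


def cc_sis_scores(X, y, u, kappa=4.0):
    """Mean squared kernel-estimated conditional correlation of each column of X with y."""
    W = von_mises_weights(u, kappa)            # (n, n) smoothing matrix
    # Kernel-smoothed conditional moments evaluated at every observed u_i.
    Ex, Ex2 = W @ X, W @ (X ** 2)              # (n, p) each
    Ey, Ey2 = W @ y, W @ (y ** 2)              # (n,) each
    Exy = W @ (X * y[:, None])                 # (n, p)
    cov = Exy - Ex * Ey[:, None]
    var_x = np.maximum(Ex2 - Ex ** 2, 1e-12)   # guard against division by zero
    var_y = np.maximum(Ey2 - Ey ** 2, 1e-12)
    rho2 = cov ** 2 / (var_x * var_y[:, None])
    return rho2.mean(axis=0)                   # one score per feature


def screen(X, y, u, d, kappa=4.0):
    """Indices of the d features with the largest screening scores, best first."""
    scores = cc_sis_scores(X, y, u, kappa)
    return np.argsort(scores)[::-1][:d]


if __name__ == "__main__":
    # Synthetic illustration only: feature 0 has an angle-dependent effect on y.
    rng = np.random.default_rng(0)
    n, p = 200, 50
    u = rng.uniform(0.0, 2.0 * np.pi, n)
    X = rng.standard_normal((n, p))
    y = (2.0 + np.cos(u)) * X[:, 0] + 0.5 * rng.standard_normal(n)
    print(screen(X, y, u, d=5))
```

Swapping in another kernel only changes how the smoothing matrix `W` is built; the moment estimates and the ranking step stay the same, which is one way to read the abstract's comparison of CC-SIS variants across kernel functions.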