Research On Feature Analysis Technology For Small Sample Data

Posted on: 2022-06-10

Degree: Master

Type: Thesis

Country: China

Candidate: Z J Peng

GTID: 2518306524490894

Subject: Master of Engineering

Abstract/Summary:

Small-sample data analysis is both a focus and a difficulty in the field of data mining. Small-sample data usually suffers from small sample size, missing values, and class imbalance. Missing data not only causes loss of sample information and makes sample quality hard to guarantee, but also prevents many statistical learning and machine learning methods from being applied to the data set. Data quality determines the results of statistical analysis: if missing data is not handled properly, the final analysis is unlikely to be representative. On the other hand, when a classifier is trained directly on imbalanced data, the large difference in class proportions means that traditional performance indicators can no longer be used to evaluate the classification results, classifier performance drops sharply, and it is difficult to construct a classifier that performs well.

This thesis studies these missing-data and imbalanced-data problems in depth. It improves the MissForest filling algorithm to raise the accuracy and speed of data filling, and it combines data resampling with ensemble learning classification to process imbalanced data, which improves classification accuracy.

The thesis first discusses the basic theory of missing and imbalanced data and analyzes the causes of the related problems. For missing data, it introduces two commonly used families of filling methods, those based on statistical learning and those based on machine learning, and focuses on a comparative study of the machine learning methods. By analyzing the interrelationships among the attributes of the data, it improves the traditional machine learning filling algorithm and proposes a correlation-based improved MissForest filling algorithm. On a data set of information about people of concern in a specific area, this algorithm achieves a better filling effect than traditional algorithms under different missing rates.

For imbalanced data, the thesis works mainly at the data level and compares and analyzes a variety of data resampling methods. In view of the characteristics of small-sample data, it selects the hybrid sampling method SMOTE+Tomek for data preprocessing and optimizes the proportion of each class in the data set. In the subsequent classification stage, it mainly uses ensemble learning, which combines the training results of multiple base learners to optimize the classification of imbalanced data, and runs comparative experiments against other algorithms. The LightGBM algorithm that was finally selected gives a better classification effect on the data set of people of concern in a specific area.

Finally, according to project requirements, the thesis integrates the missing-data filling and data classification modules into a small-sample data processing software system and demonstrates the system's functions. The filling performance and classification performance are systematically tested, and the results meet the expected goals, verifying the effectiveness and applicability of the algorithms used in this thesis for feature analysis of small-sample data.
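The hybrid SMOTE+Tomek resampling named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses only the Python standard library, the function names (`smote`, `remove_tomek_links`, `smote_tomek`) and the parameters `k` and `seed` are hypothetical, and it assumes numeric features, Euclidean distance, oversampling the minority class up to the majority count, and removal of both members of each Tomek link.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours.
    Assumes at least two minority points."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        others = [j for j in range(len(minority)) if j != i]
        neighbours = sorted(
            others, key=lambda j: euclidean(minority[i], minority[j])
        )[:k]
        mate = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(
            tuple(b + gap * (m - b) for b, m in zip(minority[i], mate))
        )
    return synthetic


def remove_tomek_links(X, y):
    """Drop every sample in a Tomek link: a pair of opposite-class points
    that are each other's nearest neighbour (one pass only)."""
    n = len(X)
    nn = [
        min((j for j in range(n) if j != i),
            key=lambda j: euclidean(X[i], X[j]))
        for i in range(n)
    ]
    linked = {i for i in range(n) if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(n) if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class to the majority count (an assumed
    target ratio), then clean the class boundary with Tomek links."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k, rng)
    X_res = list(X) + new_points
    y_res = list(y) + [minority_label] * len(new_points)
    return remove_tomek_links(X_res, y_res)


if __name__ == "__main__":
    # Toy data: 8 majority points (label 0) and 3 minority points (label 1).
    X = [(float(i), 0.0) for i in range(8)] + [
        (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)
    ]
    y = [0] * 8 + [1] * 3
    X_res, y_res = smote_tomek(X, y, minority_label=1)
    print("class counts after resampling:", y_res.count(0), y_res.count(1))
```

Oversampling first and cleaning second means the Tomek step can also discard synthetic points that land too close to the majority class, which is the usual motivation for pairing the two. The resampled data would then be passed to the classifier; the 1:1 target ratio used here is only a placeholder for the class proportions the thesis says it optimizes.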

Keywords/Search Tags:

small sample, data filling, imbalanced data, data classification

Related items

1	Optimized Mahalanobis-Taguchi System Classification Method For High-Dimensional-Small-Sample-Size Imbalanced Data And Its Application Research
2	Neural Networks For Small Sample Data Classification Integrated With Decentralized Technology
3	Research On Training Set Construction Method In Pattern Classification
4	Research On Classification Of Imbalanced Telecom Customer Data
5	Neural Network Approaches For Imbalanced Data Classification
6	Researches On Oversampling Methods For Imbalanced Data
7	Two-class Imbalanced Big Data Classification Based On Data Reduction And Ensemble Learning
8	Classification Of Imbalanced Sample Based On Stream Data
9	Research On Time Series Data Classification Method Under Small Sample Condition
10	The Research Of Imbalanced Data Classification