Research On Feature Analysis Technology For Small Sample Data

Posted on: 2022-06-10 | Degree: Master | Type: Thesis
Country: China | Candidate: Z J Peng | Full Text: PDF
GTID: 2518306524490894 | Subject: Master of Engineering
Abstract/Summary:
Small sample data analysis is both a focus and a difficulty in the field of data mining. Small sample data usually suffers from small sample size, missing values, and class imbalance. Missing data not only causes loss of sample information and makes sample quality hard to guarantee, but also prevents many statistical learning and machine learning methods from being applied to the data set. Because data quality determines the results of statistical analysis, the final analysis is unlikely to be representative if missing data is not handled properly. Imbalanced data poses a separate problem: when a classifier is trained directly on it, the large difference in class proportions means that traditional performance indicators cannot be used to evaluate the classification results, the performance of the classifier is greatly reduced, and it is difficult to construct a classifier with good performance.

This thesis studies both problems in depth. It improves the MissForest filling (imputation) algorithm to raise the accuracy and speed of data filling, and it combines data resampling with ensemble learning classification to process imbalanced data and improve classification accuracy.

The thesis first discusses the basic theory of missing and imbalanced data and analyzes the causes of the related problems. For missing data, it introduces two commonly used families of filling methods, those based on statistical learning and those based on machine learning, and focuses on a comparative study of the machine learning methods. By analyzing the interrelationships among the internal attributes of the data, it improves the traditional machine learning filling algorithm and proposes a correlation-based improved MissForest filling algorithm. On an information data set of people concerned in a specific area, this algorithm achieves a better filling effect than traditional algorithms under different missing rates.

For imbalanced data, the thesis works mainly at the data level, comparing and analyzing a variety of data resampling methods. Given the characteristics of small sample data, it selects the hybrid sampling method SMOTE+Tomek for data preprocessing and optimizes the proportion of each class in the data set. In the subsequent classification stage it mainly uses ensemble learning algorithms, which combine the training results of multiple base learners to optimize the classification of imbalanced data, and it runs comparative experiments against other algorithms. The LightGBM algorithm that was finally selected gives a better classification effect on the data set of people concerned in a specific area.

Finally, according to project requirements, the thesis integrates the missing-data filling and data classification modules into a small sample data processing software system and demonstrates the system's functions. Systematic tests of filling performance and classification performance show that the results meet the expected goals, verifying the effectiveness and applicability of the algorithms used in this thesis for feature analysis of small sample data.
Keywords/Search Tags: small sample, data filling, imbalanced data, data classification
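
The SMOTE+Tomek hybrid sampling named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation, which the abstract does not describe. The function names (`smote`, `tomek_links`, `smote_tomek`), the use of Euclidean distance, `k=3`, oversampling until the classes are equal in size, dropping both members of each Tomek link, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours.
    Needs at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are each other's nearest
    neighbour and carry different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: euclidean(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def smote_tomek(points, labels, minority_label, k=3, seed=0):
    """Oversample the minority class to the majority size with SMOTE,
    then drop both members of every Tomek link (an assumed cleaning rule)."""
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_majority = len(points) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k, seed)
    all_points = list(points) + new_points
    all_labels = list(labels) + [minority_label] * len(new_points)
    drop = {i for pair in tomek_links(all_points, all_labels) for i in pair}
    keep = [i for i in range(len(all_points)) if i not in drop]
    return [all_points[i] for i in keep], [all_labels[i] for i in keep]


# Hypothetical toy data: 6 majority points (label 0), 2 minority points (label 1).
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
minority = [(10.0, 10.0), (11.0, 11.0)]
points = majority + minority
labels = [0] * len(majority) + [1] * len(minority)

balanced_points, balanced_labels = smote_tomek(points, labels, minority_label=1)
print(Counter(balanced_labels))  # class counts after resampling
```

The two toy classes are far apart, so no Tomek links form and the resampled set ends up with equal class counts. On overlapping classes the cleaning step would remove borderline pairs and the counts could differ slightly. For real work a maintained implementation, such as `SMOTETomek` in the imbalanced-learn package, would normally be preferable to this sketch.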