Small-scale Data Classification On Deep Forest

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Zhang
GTID: 2428330611464283
Subject: Computer application technology
Abstract/Summary:
With the rapid development of digital technology, a large amount of data has been generated and stored in all walks of life. Accurate classification of this massive data is the basis for effective subsequent analysis. Because of personal privacy and security concerns, only a small amount of stored data can be obtained in some industries with high information confidentiality, and the labor cost of labeling a large amount of data is too high, so the available data is very limited. Research shows that deep learning models need a lot of training data and are prone to overfitting on some small-scale data tasks. Research on small-scale dataset classification therefore has far-reaching significance. Because of its high interpretability and its automatic determination of the number of cascade layers, the deep forest model has obvious advantages in small-dataset classification tasks. Because of their small sample size, small datasets usually suffer from problems such as class imbalance and poor diversity. Class imbalance affects the ability of a random forest to learn accurate distinguishing features between classes. Poor data diversity prevents the model from learning the overall distribution of the original data, which may cause the deep forest model to overfit and result in poor classification performance. This paper makes an in-depth analysis of these two problems as follows.

(1) To solve the problem of class imbalance in small datasets, the strategy of building trees by class in multi-grained scanning is studied, and the Skip Connection Forest (SCForest) model is proposed. Adding skip connections to the cascade forest effectively alleviates feature disappearance or feature explosion when the feature vector propagates backward. Five types of classifiers are used in each cascade layer to improve ensemble diversity, and the standard deviation of the first k important features is used as an enhanced feature, which optimizes the transmission of effective features during model learning. The experimental results show that the proposed SCForest model can effectively avoid the influence of class imbalance in the classification of small datasets, especially high-dimensional and multi-class datasets, which improves the generalization ability of the model on small datasets.

(2) To solve the problem of poor diversity in small datasets, a generative adversarial network, which performs well at generating artificial sample data, is used to obtain weakly labelled generated data with the same distribution as the original data. Based on SCForest, the Joint Learning Forest (JLForest) model is proposed. The JLForest model dynamically updates the weak labels of the generated data by cascading through the previous i layers until the labels reach a certain degree of confidence. A joint loss function is designed so that the cascade forest can be trained with both the original data and the generated data. The experimental results show that the classification effect of generated data as additional data is slightly inferior to that of real data as additional data, and that JLForest can obtain the best classification performance on these datasets when an appropriate data generation rate is set for each small dataset.

In this paper, the deep forest model is studied for the problem of small-dataset classification. By using the strategy of building trees according to classes, we propose SCForest to solve the problem of class imbalance. By further improving the cascade forest, we improve the transmission efficiency of effective features. Then, based on the SCForest model, we propose the JLForest model, which uses a joint training strategy to increase data diversity by adding generated samples. Experiments show that the JLForest model can improve the classification accuracy on small datasets by adding a certain amount of generated data. This method provides a new solution for special industries that can only obtain a small amount of training data. Based on the data classification results, enterprises can carry out subsequent customer behavior analysis and precision marketing.
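
The enhanced feature mentioned in (1), the standard deviation of the first k important features, can be illustrated with a minimal sketch. The abstract does not say how the statistic is computed, so the following is only one plausible reading: for each sample, take the values of the k features with the highest importance scores and append their population standard deviation to the feature vector. The function names `top_k_std_feature` and `augment`, and the use of a plain list of importance scores, are assumptions made for illustration and are not taken from the thesis.

```python
from statistics import pstdev


def top_k_std_feature(sample, importances, k):
    """Population standard deviation of the k most important feature values.

    Hypothetical helper: `importances` holds one score per feature, for
    example as produced by a tree ensemble. This is an assumed reading of
    the abstract, not the thesis implementation.
    """
    if len(sample) != len(importances):
        raise ValueError("sample and importances must have the same length")
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and the number of features")
    # Indices of the k highest-scoring features; ties keep index order.
    top = sorted(range(len(sample)), key=lambda i: -importances[i])[:k]
    return pstdev([sample[i] for i in top])


def augment(sample, importances, k):
    """Return a copy of the feature vector with the enhanced feature appended."""
    return list(sample) + [top_k_std_feature(sample, importances, k)]


if __name__ == "__main__":
    sample = [2.0, 4.0, 4.0, 10.0]
    importances = [0.1, 0.4, 0.3, 0.2]
    print(augment(sample, importances, k=3))
```

In a cascade, a vector augmented this way would be passed to the next layer together with the class-probability outputs of the current layer; how SCForest combines these in detail is not stated in the abstract.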
Keywords/Search Tags: Small-scale Dataset, Deep Forest, Generative Adversarial Network, Diversity, Generated Data