| With the rapid development of the big data era, the volume of data is growing quickly, and this data contains a great deal of valuable information. Managing it effectively requires accurate classification, and a key aspect of data classification is the selection of attributes, i.e. feature selection. Feature selection on balanced datasets is well researched, but in practice more and more data are found to follow a long-tail distribution: head labels contain many samples while tail labels contain few, which poses a great challenge to feature selection. Because of this imbalance, feature selection tends to focus too much on the features of the sample-rich head labels and to ignore those of the sample-poor tail labels, which lowers the classification accuracy of tail labels and skews the classification results. Because tail labels account for a large proportion of all labels and contain a large amount of potentially valuable information, learning from tail data deserves attention. To address this problem, this paper constructs a feature selection model for long-tail data, Long-tailed Data-driven Feature Selection (LDFS), to improve overall classification accuracy on long-tail data. First, because head labels have many samples and these samples share many common features, the model extracts the common features of head labels to learn the head labels. Second, because tail labels have few samples and these samples have specific features, the model is extended to extract the specific features of tail labels to learn the tail labels, improving their recognition accuracy. Finally, the correlation between head labels and tail labels is embedded in the model to explore the complex relationship between features and labels. The Alternating Direction Method of Multipliers (ADMM) is used to find the optimal solution of the model. The proposed feature selection model is applied to long-tail data and takes both head labels and tail labels into account, avoiding the bias in feature selection results caused by the excessive number of head-label samples. Because the Wikipedia corpus also follows a long-tail distribution, the model is applied to Wikipedia for text classification. Micro-F1, Macro-F1 and Hamming loss are used as the main indicators of classification performance. Experimental results show that, compared with some classical feature selection methods, LDFS achieves better results on the different evaluation metrics. It also shows good generalization performance when combined with a variety of classifiers. In general, the proposed method takes the characteristics of both head labels and tail labels into account and learns each directly according to those characteristics. Moreover, the correlation between head labels and tail labels is added to facilitate exploring the relationship between labels, so that the model not only performs well but also applies well to different long-tail datasets. |
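
The three evaluation metrics named above respond differently to a head-label bias, which is why they are useful together on long-tail data. The following is a minimal sketch using the standard multi-label definitions of Micro-F1, Macro-F1 and Hamming loss; the function names and the toy label matrices (`Y_TRUE`, `Y_PRED`) are illustrative assumptions and are not taken from the paper or its experiments.

```python
from typing import List, Tuple

# Rows are samples, columns are labels, entries are 0 or 1.
Matrix = List[List[int]]


def _counts(y_true: Matrix, y_pred: Matrix, label: int) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one label."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        t, p = t_row[label], p_row[label]
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    tp = fp = fn = 0
    for j in range(len(y_true[0])):
        a, b, c = _counts(y_true, y_pred, j)
        tp, fp, fn = tp + a, fp + b, fn + c
    return _f1(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then average, so every label weighs the same."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label entries that are predicted incorrectly."""
    wrong = sum(
        t != p
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
    )
    return wrong / (len(y_true) * len(y_true[0]))


# Hypothetical toy data: label 0 is a head label (6 samples),
# labels 1 and 2 are tail labels (1 sample each).
Y_TRUE: Matrix = [[1, 0, 0]] * 6 + [[0, 1, 0], [0, 0, 1]]
# A head-biased predictor that assigns every sample to the head label.
Y_PRED: Matrix = [[1, 0, 0]] * 8

print(f"Micro-F1:     {micro_f1(Y_TRUE, Y_PRED):.3f}")
print(f"Macro-F1:     {macro_f1(Y_TRUE, Y_PRED):.3f}")
print(f"Hamming loss: {hamming_loss(Y_TRUE, Y_PRED):.3f}")
```

On this toy data the head-biased predictor still reaches a Micro-F1 of 0.75, while its Macro-F1 drops to about 0.29 because both tail labels score zero. This illustrates why a per-label average is informative when tail labels matter.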