College education poverty alleviation is an important part of China's targeted poverty alleviation policy. At present, national education poverty alleviation work consists mainly of qualifying poor students and dividing them into poverty levels. Information management systems have made it much easier to collect data on poor students, but these data have not been processed effectively, so large volumes of data have accumulated. The data also suffer from quality problems such as missing values, noise and redundancy. Managing poor student data scientifically and efficiently, and then using data mining methods to improve the accuracy of poor student identification and poverty level division, makes it possible to provide more targeted funding according to poverty level, which is of important practical significance for realizing national targeted poverty alleviation through education.

Most existing methods for classifying poor students rely on manual identification or traditional machine learning algorithms. Manual identification is often unscientific, inaccurate and inefficient. Traditional machine learning algorithms seldom consider that the poverty levels differ in importance, which leads to low classification accuracy for students at high poverty levels and, in turn, to unfair funding. This paper takes the data of poor students in Guangxi as its research object, studies the preprocessing of these data and the classification of poverty levels, and proposes corresponding solutions. The main contents are as follows:

(1) To address the quality problems of the Guangxi poor student data, such as a large number of missing values, unevenly distributed attributes and noise, a data preprocessing method based on feature selection, called DPFS, is proposed. The method has four stages: data preparation, feature range division, feature combination and missing-number screening. The data are first prepared. A feature selection algorithm is then used to divide the feature range of the prepared data. Next, features are combined according to how uniformly the data attributes are actually distributed. Finally, in the missing-number screening stage, the optimal data set is selected according to the maximum missing number and the classification accuracy. Experimental results show that this method effectively improves the quality of the data sets, improves classifier performance, and lays a data foundation for the subsequent training of the poor student grade classification model.

(2) To address data imbalance in the identification and classification of poor students, a cost-sensitive classification method is studied. Traditional machine learning classification algorithms do not consider that misclassifying different classes carries different costs. When classifying, they favor the majority class and neglect the minority class, so the classification accuracy of the minority class, here students at high poverty levels, is low. Compared with other traditional machine learning classification algorithms, the CART decision tree algorithm performs better. This paper therefore proposes a cost-sensitive poor student grade classification method, called CSPSC, which introduces cost into the traditional CART algorithm and defines a new evaluation function. By assigning appropriate costs and weights to the minority class, the classifier pays more attention to the minority class when classifying. On this basis, a new grade classification model for poor college students is built. Experimental results show that, compared with the traditional CART algorithm, the CSPSC method improves the classification accuracy for high-level poor students by about 6%. It therefore classifies high-level poor students more accurately, helps resolve unfairness in education poverty alleviation work, and provides a reference for targeted poverty alleviation through education and for targeted funding.
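
The general idea behind introducing cost into CART can be illustrated with a class-weighted Gini impurity. The sketch below is not the CSPSC evaluation function, which this abstract does not specify; it is a minimal, commonly used stand-in in which each sample counts with the weight of its class. The function names, the labels "general" and "high", and the weight values are all hypothetical.

```python
from collections import Counter


def weighted_gini(labels, class_weight):
    """Gini impurity in which each sample counts with the weight of its class."""
    counts = Counter(labels)
    total = sum(class_weight[c] * n for c, n in counts.items())
    if total == 0:
        return 0.0
    return 1.0 - sum((class_weight[c] * n / total) ** 2 for c, n in counts.items())


def best_threshold(values, labels, class_weight):
    """Return (threshold, impurity) for the split `value <= threshold` that
    minimizes the weight-averaged Gini impurity of the two child nodes."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        w_left = sum(class_weight[y] for y in left)
        w_right = sum(class_weight[y] for y in right)
        impurity = (w_left * weighted_gini(left, class_weight)
                    + w_right * weighted_gini(right, class_weight)) / (w_left + w_right)
        if impurity < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, impurity)
    return best


# Hypothetical node: three "general" students and one "high" poverty student.
node = ["general", "general", "general", "high"]
print(weighted_gini(node, {"general": 1, "high": 1}))  # 0.375
print(weighted_gini(node, {"general": 1, "high": 3}))  # 0.5
```

Raising the weight of the minority class makes a node that mixes it with the majority class look less pure, so the tree is pushed to keep splitting until the minority class is separated. This is the sense in which a cost-sensitive tree pays more attention to high-level poor students.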