| To promote fairness in education, the state has been working to improve its support system for poverty-stricken students so that every student can enjoy the basic right to education. However, the current process for determining eligibility for student grants has many problems. Because universities have no direct contact with students' families, they can judge eligibility only from the written materials that students submit. Some students use social connections to obtain grant quotas fraudulently, by forging poverty certificates or exaggerating their families' poverty level, so that some genuinely poor students do not receive state funding. As a result, many schools find so-called "falsely identified" poor students every year, which has aroused widespread public concern. With the arrival of the big-data era, more and more problems that are difficult to solve with traditional methods are being approached with Internet thinking, which offers new solutions. This paper trains a machine learning model on two years of student-generated spending, study, and living-habit data from one college, to help administrators understand students' actual consumption and economic level while at school and to provide important clues for finding "hidden poverty" or "falsely identified" students. The work carried out in this paper is as follows: (1) Based on the dataset of student consumption and behavior, this paper performs a statistical analysis, which shows differences between the two groups of students in total consumption, consumption patterns, student ranking, and other respects. Feature engineering is then carried out on the basis of this analysis: the raw data undergo missing-value handling, one-hot encoding, and normalization, and many derived features are constructed along time, place, and other dimensions. Then, the 68 features with the highest scores 
are selected by the stability selection method to form the experimental sample set. Finally, to mitigate the class imbalance in the dataset's labels, the SMOTE algorithm is used to oversample the minority class. (2) On the dataset produced by feature engineering, this paper runs preliminary experiments with naive Bayes, support vector machine, neural network, random forest, and XGBoost models, and evaluates them with AUC, a metric originally designed for binary classification. The results show that XGBoost is the best single model on this dataset. Grid search is then used to find the optimal values of the main XGBoost parameters. To address the sensitivity of a single model to the dataset, this paper proposes a hybrid XGBoost model based on the Bagging idea. Finally, the results show that the Bagging & XGBoost hybrid model is more robust. |
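
The SMOTE oversampling, AUC evaluation, and Bagging steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the paper's implementation: the function names (`smote`, `auc`, `fit_centroid`, `bagged_scores`), the parameter defaults, and the toy centroid learner are all assumptions made for illustration, and the centroid learner merely stands in for XGBoost so that the example stays self-contained.

```python
import random
from math import dist


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours.

    Assumes `minority` holds at least two rows (lists of floats).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels must be 0 or 1.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid(X, y):
    """Toy base learner (a stand-in, NOT XGBoost).

    Scores a row by how much closer it is to the positive-class centroid
    than to the negative-class centroid.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([r for r, label in zip(X, y) if label == 1])
    neg = centroid([r for r, label in zip(X, y) if label == 0])
    return lambda row: dist(row, neg) - dist(row, pos)


def bagged_scores(X, y, X_test, fit, n_models=10, seed=0):
    """Average the scores of n_models learners trained on bootstrap resamples."""
    if set(y) != {0, 1}:
        raise ValueError("y must contain both classes 0 and 1")
    rng = random.Random(seed)
    totals = [0.0] * len(X_test)
    for _ in range(n_models):
        # Bootstrap: draw len(X) rows with replacement; redraw if a class is missing.
        while True:
            idx = [rng.randrange(len(X)) for _ in X]
            if {y[i] for i in idx} == {0, 1}:
                break
        predict = fit([X[i] for i in idx], [y[i] for i in idx])
        for j, row in enumerate(X_test):
            totals[j] += predict(row)
    return [t / n_models for t in totals]
```

In a pipeline shaped like the one described, `smote` would be applied to the training split only, `fit` would wrap an XGBoost classifier rather than the toy learner, and `auc` would be computed on held-out students. Averaging over bootstrap resamples is what reduces the dependence of the final score on any single training sample, which is the robustness property the abstract attributes to the hybrid model.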