
A Study On Risk Classification Of Inspection And Quarantine Of Imports Based On Machine Learning

Posted on: 2020-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhao
Full Text: PDF
GTID: 2404330623463618
Subject: Computer technology
Abstract/Summary:
To protect public health, reduce the spread of pests, and safeguard the quality of imported goods, inspection and quarantine departments must inspect, quarantine, and supervise entry and exit animals and plants and their products, import and export commodities, and means of transport. With the growth of global trade, traditional inspection methods are limited by human and material resources and can no longer keep pace with the global trade environment. In the era of big data, there is an urgent need to use declaration information and historical inspection results to determine the risk category of newly declared inbound goods automatically, thereby helping inspectors target their inspections and work more efficiently. A classification model built on inspection and quarantine data with machine learning methods can locate potential risks among massive volumes of imported goods quickly and accurately.

Based on inspection and quarantine declaration data and historical inspection result data, the thesis studies machine learning techniques for data preprocessing, classification model construction, unbalanced dataset processing, and feature dimension reduction. The main research work is as follows:

1) In data preprocessing, the dataset is processed comprehensively according to the characteristics of inspection and quarantine data. Binary classification models are then built with machine learning methods and their classification performance is compared. On top of the existing classification algorithms, an error correction method based on data distribution is proposed. It tries to improve classification performance by analyzing the spatial distribution of the data after dimension reduction, then finding and changing the classification results that fall in the interval of misclassification.

2) Four existing unbalanced data processing methods are each used to construct a fully balanced training set, in order to analyze the impact of a fully balanced dataset on the classification algorithm. Two combined methods are then used to construct datasets with different proportions of positive and negative samples, and the influence of these proportions on the classification algorithm is compared, so that the ratio of positive to negative samples best suited to the best classification model can be found.

3) Four existing dimension reduction methods are used to compare the effects of different feature subsets on classification performance. After the disadvantages of each method are analyzed, a three-step feature dimension reduction method is proposed to compare the effects of different feature subsets and find the optimal one.

The experimental results are as follows. First, the error correction method based on data distribution is applied to improve the existing methods. The highest F1 score is 0.9720, which is better than the results of the five existing methods. Second, to address the impact of unbalanced sample data on the algorithm, two experimental approaches are adopted: constructing fully balanced training data, and constructing training data with different proportions of positive and negative samples. The combined method is used to find the ratio of positive to negative samples best suited to the classification model. When the ratio of positive to negative samples is 1:5, the highest F1 score reaches 0.9731, which indicates that the combined method can further improve classification performance. Finally, to improve the efficiency of the algorithm, four existing feature dimension reduction techniques are studied and the pros and cons of each are analyzed. A stepwise feature dimension reduction method is then used: features are first ranked by a combined ordering of the chi-square test and information gain, and principal component analysis is then used to extract features. With 500 features, the F1 score is 0.9688, which is better than that obtained on the original dataset, and the model training time is reduced from 1846 seconds to 537 seconds. This shows that the method can effectively obtain a small feature subset with better results while improving the efficiency of the algorithm.
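
The 1:5 sample ratio and the F1 scores reported above can be illustrated with a minimal Python sketch. It is not the thesis's combined method, whose details the abstract does not give. It uses plain random undersampling of the negative class as a stand-in, and it assumes that the positive class (label 1) is the minority class. The function names `f1_score` and `undersample_to_ratio` and all parameters are hypothetical.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def undersample_to_ratio(rows, labels, neg_per_pos=5, positive=1, seed=0):
    """Keep every positive sample and randomly keep at most
    neg_per_pos negatives per positive (1:5 by default).

    Assumption: random undersampling only; the thesis's combined
    method is not specified in the abstract.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == positive]
    neg_idx = [i for i, y in enumerate(labels) if y != positive]
    n_neg = min(len(neg_idx), neg_per_pos * len(pos_idx))
    keep = pos_idx + rng.sample(neg_idx, n_neg)
    rng.shuffle(keep)  # avoid all positives appearing first
    return [rows[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    # Toy data: 10 positives and 200 negatives (a 1:20 imbalance).
    labels = [1] * 10 + [0] * 200
    rows = list(range(len(labels)))
    new_rows, new_labels = undersample_to_ratio(rows, labels)
    print(new_labels.count(1), new_labels.count(0))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In practice the resampling would be applied to the training split only, so that the F1 score is measured on data with the original class distribution.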
Keywords/Search Tags: inspection and quarantine, risk classification, machine learning, correction based on distribution (CBD), combination method of unbalanced sample processing, stepwise method of feature dimension reduction