With the rapid development of China’s economy in recent years, China’s digital infrastructure has also steadily improved. Thanks to China’s demographic dividend and strong national policy support for entrepreneurship, a variety of micro-level data sets have been generated. Some of these data sets can serve as a base to which multiple other data sets are linked, so that otherwise isolated data sets can be combined into a new data set that carries more information. Such data sets act as a bridge and are particularly important. At present, however, apart from a small number of cases with unified codes, associations between data sets rely mainly on text matching. Owing to the still-developing state of informatization in China, the limited informatization skills of earlier personnel, and the relative independence of each data set, problems such as incomplete data structures, irregular input, and inconsistent recording methods have arisen. This greatly hinders combining multiple data sets into new ones that support research from new perspectives. Analyzing and matching the related fields of two data sets is therefore extremely important. At present, the main approach to this type of problem is to standardize the text and then perform similarity matching. This approach has an obvious drawback: it is powerless when data are missing. Matching techniques between databases therefore need further study and improvement. In view of this, this article builds on current text matching methods by introducing machine learning: taking the Chinese customs database and the Chinese industrial enterprise database as its basis, it uses machine learning classification algorithms to make up for the shortcomings of text-based matching. This article focuses on the following aspects of applying machine learning classification algorithms
in the matching of the two databases: (1) The customs database is divided into 12 monthly sub-databases, and its records are kept in chronological order, which does not fit matching by company name. To address this, this article uses database technology to first integrate the 12 sub-databases into a complete annual database, and then re-aggregates and transforms it by company name, unifying the statistical period and statistical dimensions of the two databases. (2) The enterprise name field in the customs database contains irregular and missing entries. To address this, this article obtains formatted company names through text conversion operations such as converting full-width characters to half-width characters and removing meaningless characters, which improves the success rate of text matching against the industrial enterprise database. (3) This article uses a supervised classification algorithm, which requires a labeled data set as training data. To obtain one, this article uses exact text matching and fuzzy matching based on the Levenshtein distance algorithm to build a data set labeled by company name, and then transforms it through feature engineering into a training data set suitable for machine learning. (4) To explore the effectiveness of machine learning classification algorithms in database matching, this article takes the Chinese customs database and the Chinese industrial enterprise database as its basis, establishes through practical work a general data preprocessing procedure, and finally obtains three classification models suitable for the Chinese customs database.
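
The name normalization in (2) and the labeling in (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in this article: the function names are hypothetical, "meaningless characters" is assumed to mean whitespace and punctuation, full-width to half-width conversion is approximated with Unicode NFKC normalization, and the 0.85 similarity threshold is an arbitrary placeholder.

```python
import re
import unicodedata

# Assumption: "meaningless characters" = whitespace, punctuation, underscores.
_NOISE = re.compile(r"[\W_]+")


def normalize_name(name: str) -> str:
    """Fold full-width characters to half-width (via NFKC) and strip noise."""
    folded = unicodedata.normalize("NFKC", name)
    return _NOISE.sub("", folded).lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def label_pair(customs_name: str, industrial_name: str,
               threshold: float = 0.85) -> int:
    """Return 1 if the two names are treated as the same company, else 0.

    The threshold is a hypothetical value chosen for illustration only.
    """
    a = normalize_name(customs_name)
    b = normalize_name(industrial_name)
    if not a or not b:
        return 0  # a missing name cannot be labeled from text alone
    if a == b:
        return 1  # exact match after normalization
    return 1 if similarity(a, b) >= threshold else 0
```

Pairs labeled this way could then be turned into feature vectors for a supervised classifier. Pairs with a missing name are labeled 0 here, which is the case where text matching alone is powerless and where the classification approach is meant to help.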