| With the popularization and development of information technology, big data has accumulated in many fields. To use this data effectively, data mining technology is widely applied. The data warehouse is one of the foundations of data mining. ETL (Extract, Transform, and Load) is responsible for data extraction, cleaning, transformation, and loading in the data warehouse, and so determines the warehouse's data quality. Duplicate, missing, and erroneous data in the data sources greatly reduce data quality, which in turn seriously affects the efficiency of data mining and the accuracy of analysis. Data cleaning, as the main method of improving data quality, is an important part of ETL. To improve the flexibility and efficiency of ETL, the ECL-TL (Extract-Clean-Load-Transform-Load) framework is proposed, and data cleaning methods are systematically studied. The specific research contents are as follows: (1) For the design of the ETL framework, this paper designs and implements the ECL-TL framework. Data cleaning and data transformation are completely separated by introducing a middle library, which greatly reduces coupling. At the same time, the framework provides an efficient data cleaning solution that encapsulates the algorithms, rules, and evaluation libraries related to data cleaning. (2) For cleaning duplicate records, a method for cleaning completely duplicate records based on an equivalence relation is proposed. Two schemes are designed according to the size of the data. Experiments show that this method has high cleaning efficiency. To improve the accuracy and efficiency of similar duplicate record detection, a method for detecting similar duplicate records based on attribute hierarchy is proposed. In this method, data sets are clustered and similar records are filtered layer by layer. (3) For dealing with missing data, abnormal data, logically erroneous data, and inconsistent data, a low-quality data cleaning method based on information value quality evaluation is proposed. The method filters out low-quality data through information value quality evaluation, and the handling of the above data problems is treated collectively as the cleaning of low-quality data. The ECL-TL framework has been applied to the performance appraisal system of a police station. The experiments show that the proposed ECL-TL framework has good reliability and stability. In addition, the data cleaning method is effective for cleaning data in the public security system. |
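
The idea of cleaning completely duplicate records through an equivalence relation can be illustrated with a minimal sketch. The abstract does not give the details of the two size-dependent schemes, so the code below is only an assumed in-memory variant: two records are treated as equivalent when all of their attribute values are equal, the data set is partitioned into equivalence classes, and one representative per class is kept. The function names `equivalence_classes` and `remove_exact_duplicates` are hypothetical and do not come from the paper.

```python
from typing import Hashable, Iterable, Sequence


def equivalence_classes(
    records: Iterable[Sequence[Hashable]],
) -> dict[tuple, list[int]]:
    """Partition record indices into classes of fully identical records.

    Assumption: the equivalence relation is "every attribute value is equal",
    so the tuple of attribute values identifies the class.
    """
    classes: dict[tuple, list[int]] = {}
    for index, record in enumerate(records):
        classes.setdefault(tuple(record), []).append(index)
    return classes


def remove_exact_duplicates(
    records: Sequence[Sequence[Hashable]],
) -> list[tuple]:
    """Keep the first record of each equivalence class, in original order."""
    classes = equivalence_classes(records)
    # Each class lists its indices in ascending order, so the first entry
    # is the earliest occurrence of that record.
    first_indices = sorted(indices[0] for indices in classes.values())
    return [tuple(records[i]) for i in first_indices]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        ("Zhang", "Officer", 2019),
        ("Li", "Sergeant", 2018),
        ("Zhang", "Officer", 2019),
    ]
    print(remove_exact_duplicates(sample))
```

Because class membership is decided by hashing the whole record, this sketch runs in a single pass over the data. It assumes the data set fits in memory; a larger data set would need a different scheme, which the abstract mentions but does not describe.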