
Research And Implementation Of Data Cleaning Technology Based On Knowledge Graph

Posted on: 2022-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: K M Luo
Full Text: PDF
GTID: 2518306524990019
Subject: Master of Engineering
Abstract/Summary:
With the advent of the big data era, data has become a core asset of enterprises. Analyzing and mining the potential value of data plays an important role in business development and key enterprise decisions. Data integration is the basis of data mining and analysis. During data integration, heterogeneous data from multiple sources may suffer from data quality problems such as missing and inconsistent values, and data cleaning is one of the important means of ensuring data quality. Data cleaning technology relies on a large amount of external knowledge to guide the cleaning process. However, because the available external knowledge is small in scale and inefficient to construct, the efficiency of data cleaning is limited. A knowledge graph, by contrast, offers large-scale knowledge and rich semantics, so studying the use of knowledge graphs for data cleaning is of great theoretical significance and value.

To address missing data and erroneous data in data integration, a knowledge-graph-based data cleaning system, KGDC, is designed and implemented. The system is divided into three modules: pattern matching, pattern repair, and inference repair. The modules are related as follows: pattern matching first obtains the explicit and implicit relationships of the data columns; pattern repair and inference repair are then applied to the data table according to the pattern matching result, completing the overall cleaning of the data table.

The research contents are as follows:

First, the application of knowledge graphs to data cleaning is studied. Because data cleaning depends on external knowledge, a relationship model between the knowledge graph and the data to be cleaned is established, which lays the foundation for repairing errors in data tables.

Second, for cleaning scenarios in which multiple data tables are correlated, a data cleaning method that treats the related data tables as a whole is proposed. The method uses the foreign key constraints between tables to establish explicit and implicit relationships between them and to clean the data. This method is more efficient than single-table cleaning.

Third, a multi-table inference repair method based on knowledge reasoning is proposed. When explicit repair produces an empty candidate set or a multi-valued candidate set and the data cannot be repaired directly, explicit relation transformation or candidate attribution confirmation is used to clean the data, which further improves the efficiency of multi-table cleaning.

Finally, the proposed data cleaning method is tested through a single-table repair test, a multi-table repair test, and a multi-table inference repair test. The results show that, compared with single-table repair, the more correlation there is between data tables, the greater the improvement in cleaning efficiency. For multi-table inference repair, the more associated knowledge the knowledge graph contains, the more pronounced the improvement in cleaning efficiency.
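
As a rough illustration of the candidate-set idea described in the abstract (and not the KGDC implementation, which is not given here), the sketch below fills a missing cell by looking up candidates in a toy knowledge graph. Everything in it is an assumption made for illustration: the triple representation, the `located_in` relation, the sample data, the function names `candidates` and `repair_cell`, and the way candidates are narrowed using values taken from a foreign-key-related table.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# All entities and relations here are illustrative assumptions.
KG = {
    ("Chengdu", "located_in", "China"),
    ("Paris", "located_in", "France"),
    ("Paris", "located_in", "United States"),  # deliberately ambiguous entity
}


def candidates(kg, subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return {o for s, r, o in kg if s == subject and r == relation}


def repair_cell(kg, row, key_col, target_col, relation, related_values=None):
    """Try to fill a missing row[target_col]; return (row, status).

    status is one of "unchanged", "repaired", "empty", "ambiguous".
    """
    if row.get(target_col) is not None:
        return row, "unchanged"

    cands = candidates(kg, row[key_col], relation)

    # Hypothetical confirmation step: when several candidates exist, keep
    # only those that also occur in a table related by a foreign key.
    if len(cands) > 1 and related_values:
        cands = cands & set(related_values)

    if len(cands) == 1:
        repaired = {**row, target_col: next(iter(cands))}
        return repaired, "repaired"

    status = "empty" if not cands else "ambiguous"
    return row, status
```

In this sketch a single candidate is written into the cell directly, while an empty or multi-valued candidate set leaves the cell untouched and reports why, which is the situation the abstract says inference repair is meant to handle. For example, `repair_cell(KG, {"city": "Paris", "country": None}, "city", "country", "located_in")` reports an ambiguous result, whereas passing `related_values=["France", "Germany"]` narrows the candidates to one value.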
Keywords/Search Tags: Data cleaning, Knowledge graph, Error repair, Knowledge reasoning, Foreign key