Research On Detection And Repair Of Inconsistent Data Based On Semantic Correlations

Posted on: 2020-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Song
Full Text: PDF
GTID: 2428330578957211
Subject: Computer technology
Abstract/Summary:
A large number of web tables are stored on the Internet. These tables contain rich semantic information, but their data is often inconsistent, and the resulting errors can cause varying degrees of trouble for users of the tables. Researchers have proposed a number of data cleaning algorithms to clean inconsistent data in web tables, but existing algorithms have limitations. On the one hand, when cleaning inconsistent data they use only a small amount of the semantic information in the web tables and require manually specified constraints, which results in poor flexibility and additional resource overhead. On the other hand, problems such as incomplete error detection reduce the quality of data cleaning. To address these limitations, this thesis proposes an inconsistent data cleaning method based on semantic correlations.

For the first problem, this thesis proposes an algorithm for constructing semantic correlations from web tables. The algorithm first uses pre-trained word vectors to represent the column labels in a web table, then identifies the key columns, those carrying the most important semantic information in the table, through overall semantic relevance, and finally uses hierarchical semantic relevance to construct the semantic correlations between column labels. Experiments show that the semantic correlations proposed here can serve as effective constraint information to assist data cleaning algorithms.

For the second problem, this thesis first uses word vectors to preprocess spelling errors in the web tables. Second, to reduce the impact of cross-interference items on the detection and repair of inconsistent data, the web table is partitioned into blocks using the key columns. Finally, with the semantic correlations as constraint information, each partitioned table is cleaned using the idea of the maximum independent set, and the cleaned tables are then merged and cleaned again. Experiments show that the proposed algorithm achieves better cleaning results on the data sets and outperforms existing data cleaning algorithms.
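
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation; everything concrete in it is an assumption made for illustration. The three-dimensional label vectors are toy stand-ins for pre-trained word vectors, "overall semantic relevance" is approximated as the mean cosine similarity between one column label and all the others, the correlation from the key column to the "capital" column is assumed rather than constructed hierarchically, and a greedy heuristic stands in for the maximum independent set step. All function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pick_key_column(label_vectors):
    """Assumed scoring: the key column is the label with the highest
    mean cosine similarity to every other column label."""
    def overall(label):
        others = [v for k, v in label_vectors.items() if k != label]
        return sum(cosine(label_vectors[label], v) for v in others) / len(others)
    return max(label_vectors, key=overall)


def block_by_key(rows, key):
    """Partition rows into blocks that share the same key-column value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    return dict(blocks)


def conflict_edges(block, dependent):
    """Rows in one block conflict when they disagree on the dependent column."""
    return [(i, j)
            for i in range(len(block))
            for j in range(i + 1, len(block))
            if block[i][dependent] != block[j][dependent]]


def greedy_independent_set(n, edges):
    """Greedy heuristic (lowest degree first); it returns an independent set
    but is not guaranteed to be a maximum one."""
    neighbours = {i: set() for i in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    chosen = set()
    for v in sorted(range(n), key=lambda v: (len(neighbours[v]), v)):
        if neighbours[v].isdisjoint(chosen):
            chosen.add(v)
    return chosen


def repair_block(block, dependent):
    """Treat the rows in the independent set as consistent and copy their
    shared dependent value onto every row of the (non-empty) block."""
    keep = greedy_independent_set(len(block), conflict_edges(block, dependent))
    trusted = block[min(keep)][dependent]
    return [dict(row, **{dependent: trusted}) for row in block]


# Toy data, invented for this sketch.
label_vectors = {
    "country": [0.7, 0.7, 0.1],
    "capital": [0.9, 0.1, 0.1],
    "population": [0.1, 0.9, 0.1],
}
rows = [
    {"country": "France", "capital": "Paris", "population": "67M"},
    {"country": "France", "capital": "Paris", "population": "67M"},
    {"country": "France", "capital": "Lyon", "population": "67M"},
    {"country": "Japan", "capital": "Tokyo", "population": "125M"},
]

key = pick_key_column(label_vectors)
blocks = block_by_key(rows, key)
cleaned = [row for block in blocks.values() for row in repair_block(block, "capital")]
```

Because rows in an independent set share no conflict edge, they all agree on the dependent column, which is why any one of them can supply the trusted value. Blocking by the key column first keeps rows about different entities from being compared with each other, which is the role the abstract assigns to block preprocessing.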
Keywords/Search Tags: Distributed Representation, Semantic Correlation, Web Table, Data Cleaning, Consistency