Domain-oriented Web Data Integration

Posted on: 2018-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: C J Geng
Full Text: PDF
GTID: 2428330545498578
Subject: Engineering
Abstract/Summary:
The era of Big Data is upon us. With the rapid development of the Internet and the explosive growth of Web data, ever more valuable information can be obtained from the Web, and one of the main tasks that must precede Web data analysis is data integration. However, Web data come from a large number of different publishers, and much of it is released in isolation even within the same field. This poses unprecedented challenges for domain-oriented Web data integration and makes it increasingly important. Web data integration differs from traditional data integration in several dimensions: the number of data sources, even for a single domain, has grown into the tens of thousands; the data sources are extremely heterogeneous in structure, with considerable variety even for substantially similar entities; and the data sources differ widely in quality, so duplicated and conflicting data are often encountered during integration. Given these characteristics, the main issues in Web data integration include establishing the Web data pattern and the domain integration data pattern, mapping between domain patterns and Web data patterns, and matching entities during integration. Most current research addresses these problems independently; building on that work, this thesis studies how the parts can be combined. Based on domain requirements and the characteristics of Web data, we study the pattern layer and the instance layer of data integration separately. The pattern layer focuses on establishing patterns and the mappings between them; the instance layer focuses on entity blocking and matching. The work is tied to an actual project and uses that project's real data sets. The main research contents and related work are as follows:

(i) For the pattern layer, we introduce a domain-oriented Web data integration architecture, which reflects the relationships among its components from Web data extraction through Web data integration. We also establish the Web data pattern and data model, as well as the domain data pattern and data model, so that a unified data pattern can be established for data from different sources. Based on these concepts, we propose a method for mapping between the Web data pattern and the domain data pattern, together with a data-level integration method, to resolve conflicts at the pattern layer and the data layer during integration.

(ii) For the instance layer, we compare the complexity of entity blocking methods and use the optimal blocking method to reduce the search space of entity matching. For the situation in which Web domain entities are recognized incrementally and change constantly, we give a rule model of entity matching based on Second-Order Markov Logic and propose an algorithm that extracts a matching function from the model. This can be combined with the optimal blocking strategy, and the implicit relationships among entities can be used for matching within each block, which addresses the data uncertainty caused by constantly updated entities and attributes. We also design an entity matching scheme based on random forest and compare it with the scheme based on the second-order Markov logic network.

(iii) A real estate information platform and an integrated application system were developed to meet actual requirements, and they verify the effectiveness of our model and algorithm for the pattern layer. For the instance layer, our results show that the rule model and the new algorithm have a lower matching cost in entity blocking, and that the proposed method achieves higher entity matching precision than the baseline while maintaining good scalability. The results also show that the random-forest-based entity matching model performs slightly worse than the model based on the second-order Markov logic network, mainly because the latter considers the connections between entities and applies them to the matching decision.
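
To illustrate why entity blocking reduces the search space of entity matching, the following is a minimal Python sketch, not the blocking method used in the thesis. The records, the field names (`name`, `district`), and the choice of `district` as the blocking key are all hypothetical and chosen only for illustration. The sketch compares the number of candidate pairs with and without blocking.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical real estate records; names and districts are made up for illustration.
records = [
    {"id": 1, "name": "Sunshine Garden", "district": "Haidian"},
    {"id": 2, "name": "Sunshine Gardens", "district": "Haidian"},
    {"id": 3, "name": "River View", "district": "Chaoyang"},
    {"id": 4, "name": "Riverview", "district": "Chaoyang"},
    {"id": 5, "name": "Lake Villa", "district": "Haidian"},
]


def block_by(records, key):
    """Group records into blocks that share the same value of the blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(blocks):
    """Generate candidate pairs only among records that fall in the same block."""
    pairs = []
    for members in blocks.values():
        pairs.extend((a["id"], b["id"]) for a, b in combinations(members, 2))
    return pairs


# Without blocking, every pair of records must be compared.
all_pairs = list(combinations([r["id"] for r in records], 2))

# With blocking (assumed key: district), only same-block pairs are compared.
blocked_pairs = candidate_pairs(block_by(records, "district"))

print(len(all_pairs), len(blocked_pairs))
```

In this toy data the blocking key cuts the candidate pairs from all pairwise comparisons to only those within each district. A matcher (rule-based or learned) would then run only on the surviving pairs. A poorly chosen key can also separate true matches into different blocks, which is the trade-off a blocking strategy has to balance.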
Keywords/Search Tags: Web data integration, entity matching, second-order Markov logic network, matching function, matching rules