
Research On Overlap Estimation Technology For Web Databases

Posted on: 2010-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Miao
Full Text: PDF
GTID: 2178360275459248
Subject: Computer application technology
Abstract/Summary:
Integrating data from the Deep Web is an important task. In this work, redundant information and duplicate database records are common problems, and handling them often determines whether the integration succeeds or fails. Estimating the overlap rate between Web databases helps to optimize the resolution of redundant information and the removal of duplicate records, and makes the integration work less blind. The thesis contains three main parts:

1. In the second chapter, we propose approaches for estimating the overlap between Web databases in the ideal case: a naive approach and an improved approach built on it. The naive approach covers the whole estimation flow from the first step to the last but ignores the complexity of the Web. The improved approach raises the efficiency of sampling and estimation by sampling with high-frequency words.

2. To address the record matching problem between Web databases, which the ideal-case approaches ignore, we propose an entity recognition method for use in estimating the overlap rate between Web databases. Based on the characteristics of Deep Web query interfaces and of the returned records, we introduce domain knowledge and pre-processing into entity recognition and calculate the similarity of Web database records. From an engineering point of view, the method reduces the complexity of recognition and improves its accuracy and efficiency.

3. To further improve the adaptability of the estimation approach, we propose a correction for the overlap rate estimate. Using regression analysis, we establish a relationship between the similarity of the databases and the bias of the estimate. This relationship lets us predict the bias from the databases' similarity, and thus provides a range for the real overlap rate between the Web databases.

We carried out many experiments to verify the theories and methods proposed in this thesis, identify the problems that need further in-depth study, and discuss the direction and prospects of research and development in the field.
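
As an illustration of the general idea only, the following minimal Python sketch estimates an overlap rate from a sample of records, using approximate string similarity in place of exact matching. It is not the thesis's method: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.9 threshold are all assumptions made for this example, and the abstract does not specify how sampling, entity recognition, or bias correction are actually implemented.

```python
from difflib import SequenceMatcher


def normalize(record: str) -> str:
    """Hypothetical pre-processing step: lowercase and collapse whitespace."""
    return " ".join(record.lower().split())


def is_same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two records as the same entity if their similarity is high enough.

    The similarity measure and the threshold are assumptions for this sketch.
    """
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def estimate_overlap_rate(sample_from_a, records_of_b, threshold: float = 0.9) -> float:
    """Fraction of the sampled records from database A that also appear in B."""
    if not sample_from_a:
        raise ValueError("sample_from_a must not be empty")
    matched = sum(
        1
        for rec_a in sample_from_a
        if any(is_same_entity(rec_a, rec_b, threshold) for rec_b in records_of_b)
    )
    return matched / len(sample_from_a)
```

In practice the records of a Web database are reachable only through its query interface, so `records_of_b` would be replaced by queries issued against that interface, and the sample from A would itself be drawn through queries (for example, with high-frequency words, as the abstract describes).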
Keywords/Search Tags: Deep web, web database, overlap, estimate, high-frequency word