
The Research On Real-time Data Integration Based On Reverse Cleaning And Data Accuracy Evaluation

Posted on: 2013-02-11 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Tang
GTID: 2248330395484911 | Subject: Software engineering
Abstract/Summary:
With the continuous development of information technology, data is constantly emerging and is increasingly heterogeneous and autonomous. How enterprises can analyze data more efficiently and correctly has become an important issue. Timeliness demands ever higher update frequencies, yet traditional data integration methods generally update daily, weekly, or even monthly, which can hardly satisfy current demand. As for data quality, although a large amount of data is available, its quality remains a persistent difficulty. To support more accurate decision-making, many companies spend considerable manpower and resources on improving the quality of their data, but with little success.

Heterogeneous data integration and data quality management should work together seamlessly: data integration is a continuous process, and so is data quality management. At present, the greatest difficulties in building a data integration system are the real-time updating of data and data accuracy.

For real-time updating, this paper uses adapters and a real-time thread that checks timestamps to achieve a real-time data loading mode built on the traditional ETL process. Once the original data has been updated, the updated data is loaded into the data center in real time. For data quality, this paper presents a reverse data cleaning method. It uses the data source tree built during the data integration process to quickly locate the original data when reverse cleaning is executed. The original data is then reverse cleaned, matched, and modified, which improves the quality of the original data and provides high-quality data for the platform.

Moreover, for data quality assessment, this paper proposes the DAA algorithm, a data quality assessment method based on Bayesian networks and the PC algorithm.
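The timestamp-driven loading described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `SourceRecord`, `IncrementalLoader`, and `poll_once` are hypothetical, the "data center" is an in-memory dictionary, and the sketch assumes each source record carries a last-modified timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass
class SourceRecord:
    """Hypothetical source row; updated_at is assumed to be kept by the source."""
    key: str
    payload: dict
    updated_at: datetime


class IncrementalLoader:
    """Loads only records whose timestamp is newer than the last load."""

    def __init__(self) -> None:
        self.data_center: Dict[str, dict] = {}  # stand-in for the data center
        self.watermark = datetime.min           # newest timestamp loaded so far

    def poll_once(self, source: Iterable[SourceRecord]) -> List[str]:
        """Compare timestamps against the watermark and load what changed."""
        changed = [r for r in source if r.updated_at > self.watermark]
        for record in changed:
            self.data_center[record.key] = record.payload
        if changed:
            self.watermark = max(r.updated_at for r in changed)
        return [r.key for r in changed]
```

In a setup like the one the abstract describes, a background thread would presumably call something like `poll_once` at a short interval; the polling loop is omitted here to keep the sketch small.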
In the DAA method, the data set is built into a network, the PC algorithm is used to eliminate the edges between independent nodes, and the average degree of the network is then calculated. The method can judge the accuracy level of two data sets. For experimental verification, we apply DAA to two known Bayesian networks to verify its validity. The method has some significance for the fields of artificial intelligence and knowledge discovery.

The heterogeneous data integration model proposed in this paper was applied in the Local Search Service Project. We designed and implemented a prototype of the business data integration system. We use the prototype to validate the data integration process and the reverse cleaning process, and then use field-proven calculations to compare the accuracy of the original data before and after integration and of the reverse-cleaned data. The experimental results show that the quality of the data after integration is significantly higher than that of the original data, with an average increase of 14.8%, and that the quality of the original data improved by 5.15% after reverse cleaning. These results demonstrate the effectiveness of the data integration process and the reverse cleaning process.
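The DAA steps named in the abstract (build a network over the data set, remove edges between independent nodes, compute the average degree) might be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's algorithm: the full PC algorithm also tests conditional independence given subsets of other variables, whereas this sketch only consults a pairwise independence oracle. The function names and the `is_independent` callback are hypothetical, and the abstract does not say how the average degree is mapped to an accuracy level, so that step is left out.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Sequence, Set


def prune_independent_edges(
    nodes: Sequence[Hashable],
    is_independent: Callable[[Hashable, Hashable], bool],
) -> Set[FrozenSet[Hashable]]:
    """Start from a complete undirected graph and keep only the edges
    whose endpoints the (assumed) oracle does not judge independent."""
    return {
        frozenset((a, b))
        for a, b in combinations(nodes, 2)
        if not is_independent(a, b)
    }


def average_degree(nodes: Sequence[Hashable], edges: Set[FrozenSet[Hashable]]) -> float:
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    return 2 * len(edges) / len(nodes) if nodes else 0.0
```

For example, with four attributes `A`, `B`, `C`, `D` and an oracle that treats only the pairs `A`-`B` and `B`-`C` as dependent, two edges survive pruning and the average degree is 2 * 2 / 4 = 1.0. In practice the oracle would be a statistical test on the data set, which this sketch does not supply.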
Keywords/Search Tags: Data Integration, ETL process, Real-time, Reverse Cleaning, Data Quality, Accuracy Assessment