Study And Implementation Of A Data Quality Control System For Semistructured Data

Posted on: 2009-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q Fang
Full Text: PDF
GTID: 2178360308979379
Subject: Computer software and theory
Abstract/Summary:
With the development of information processing technology, organizations in every field have built many computer information systems and accumulated large amounts of historical data that are very important to them. For data to support an enterprise's daily operations and decision-making effectively, it must be accurate and reflect the real-world situation. Correcting data errors is an important part of avoiding wrong decisions and reducing decision risk, so data quality control is essential to data management.

In the past, most research on data quality control has focused on data stored in databases, that is, structured data. Owing to practical constraints, however, semistructured text data remains the main format in which enterprises preserve historical data. To address this gap, this thesis studies quality control for semistructured data in depth, and designs and implements a data quality control system for semistructured data.

First, existing research on data quality control and the characteristics of semistructured data are analyzed. Based on these characteristics, a semistructured data quality control model is proposed. The model provides detection, problem-data processing, and quality evaluation functions for semistructured data, together with a data abstraction method that resolves the heterogeneity problem of semistructured data well.

Then, processing methods are proposed for the three types of problem data in the model: incomplete data, inconsistent data, and erroneous data. For incomplete data, the notion of a decisive field is introduced on top of the traditional incomplete data detection algorithm, reflecting the differing importance of the fields in a record. The fields to be checked are then sorted by importance, which reduces the number of unnecessary checks and thereby improves the efficiency of the algorithm.
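The abstract does not spell out the incomplete-data algorithm, so the following is only a minimal sketch of one way the idea could work. It assumes that a missing decisive field rejects a record outright, and that the remaining fields are checked in descending order of importance until the accumulated weight of missing fields passes a cutoff. The field names, weights, the `max_missing_weight` cutoff, and the status labels are all hypothetical.

```python
from typing import Mapping, Optional, Set, Tuple

# Assumed placeholder strings that count as "no value" (hypothetical).
EMPTY_MARKERS = {"", "null", "n/a"}


def is_missing(value: Optional[str]) -> bool:
    """Treat None and common placeholder strings as missing."""
    return value is None or value.strip().lower() in EMPTY_MARKERS


def check_completeness(
    record: Mapping[str, Optional[str]],
    weights: Mapping[str, float],
    decisive: Set[str],
    max_missing_weight: float = 0.3,
) -> Tuple[str, int]:
    """Return (status, number of fields inspected).

    Decisive fields are inspected first, then the rest in descending
    importance, so a bad record is usually caught after a few checks.
    """
    ordered = sorted(weights, key=lambda f: (f not in decisive, -weights[f]))
    missing_weight = 0.0
    for inspected, field in enumerate(ordered, start=1):
        if not is_missing(record.get(field)):
            continue
        if field in decisive:
            # A missing decisive field makes the record unusable: stop here.
            return "rejected", inspected
        missing_weight += weights[field]
        if missing_weight > max_missing_weight:
            # Enough important content is missing: stop early.
            return "incomplete", inspected
    return "complete", len(ordered)


# Hypothetical field weights for an ocean observation record.
weights = {"station_id": 0.4, "timestamp": 0.3, "temperature": 0.2, "remark": 0.1}
decisive = {"station_id"}

no_station = {"station_id": "", "timestamp": "2009-01-01",
              "temperature": "12.5", "remark": "ok"}
print(check_completeness(no_station, weights, decisive))
```

Because the decisive field is checked first, the record with an empty `station_id` is rejected after a single inspection instead of a full pass over all four fields.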
For erroneous data, a detection method based on business rules is proposed. To keep retrieval efficient when the set of rules and algorithms is large, a two-level classification search strategy is proposed. For inconsistencies between fields within a record, a regular expression method is adopted, which resolves the problem very well.

Finally, the semistructured data quality control system is designed and implemented, and then applied in an ocean data environment. The application verifies the availability and effectiveness of the system.
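As an illustration only, the sketch below shows how rule-based error detection with a classified rule lookup, and a regular-expression consistency check between fields, might look. It reads the search strategy as a two-level index (data category, then field name), which is an assumption, not the thesis's documented design. The class `RuleIndex`, the helper names, the sample ID pattern, and the ocean-data field names and ranges are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Mapping

Rule = Callable[[str], bool]  # a business rule returns True when the value passes


class RuleIndex:
    """Rules grouped by data category, then by field name (assumed layout)."""

    def __init__(self) -> None:
        self._rules: Dict[str, Dict[str, List[Rule]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def add(self, category: str, field: str, rule: Rule) -> None:
        self._rules[category][field].append(rule)

    def find(self, category: str, field: str) -> List[Rule]:
        # Two dictionary lookups instead of a scan over every stored rule.
        return self._rules.get(category, {}).get(field, [])


def in_range(low: float, high: float) -> Rule:
    """Build a rule that accepts numeric text within [low, high]."""
    def rule(value: str) -> bool:
        try:
            return low <= float(value) <= high
        except ValueError:
            return False
    return rule


def find_errors(index: RuleIndex, category: str, record: Mapping[str, str]) -> List[str]:
    """Return the names of fields that violate at least one rule."""
    return [
        field
        for field, value in record.items()
        if not all(rule(value) for rule in index.find(category, field))
    ]


# Hypothetical convention: a sample ID embeds the station code and the year.
SAMPLE_ID = re.compile(r"^(?P<station>[A-Z]\d{2})-(?P<year>\d{4})-\d+$")


def is_consistent(record: Mapping[str, str]) -> bool:
    """Check that sample_id agrees with the station and date fields."""
    match = SAMPLE_ID.match(record["sample_id"])
    return (
        match is not None
        and match["station"] == record["station"]
        and record["date"].startswith(match["year"])
    )


index = RuleIndex()
index.add("hydrology", "temperature", in_range(-5.0, 40.0))
index.add("hydrology", "salinity", in_range(0.0, 45.0))

record = {"station": "A01", "date": "2008-07-15", "sample_id": "A01-2008-17",
          "temperature": "55.2", "salinity": "34.1"}
print(find_errors(index, "hydrology", record), is_consistent(record))
```

Fields with no registered rules pass by default, so only `temperature` is reported for this record, while the sample ID, station, and date agree with one another.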
Keywords/Search Tags: semistructured data, data quality, control model, data quality control method, control system