Font Size: a A A

Research On Theories And Algorithms For Data Currency

Posted on:2017-02-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:M H LiFull Text:PDF
GTID:1108330503469766Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Along with the arrival of big data era, the usability of data has aroused a lot of attentions. The quality of real-life data deteriorates rapidly with the elapse of time, which makes the data items in the database outdating and invalid. It is reported that the obsolete data has seriously affected national economic and society, and moreover, may reduce the usability in other dimensions, such as consistency, accuracy and completeness. Therefore, it highlights the needs of maintaining the currency of data. Present research on data currency is still immethodical, and faces great challenges. First, in many applications, the timestamps of data are either absent or imprecise, which makes it hard to determine the absolute data currency, i.e., the currency with reference to a given time point. Second,there are various requirements of data currency for different query and applications, and the absolute data currency cannot be determined in many scenarios. Thus, it is essential to determine the data currency related to query and users. Third, it is necessary to improve data currency, i.e. to repair these obsolete values into up-to-date ones, after the determination of data currency, but all the methods for repairing data cannot effectively repair the obsolete values. Fourth, it is hard, or even impossible to repair obsolete data based on single data source in some applications. The data sources may have overlaps, but always have different attributes. Thus, we may need to link multiple tables in order to get all the required current values. To address the challenges outlined above, this thesis tries to provide a series of theories and algorithms for solving the key problems of data currency.The research issues and contributions of this thesis are summarized as follows.(1) This thesis studies the problems of expressing and determining absolute data currency. To remove the limitations of timestamp-based and rule-based currency determination method, this thesis proposes a new class of uncertain currency rules and the corresponding data currency model. The uncertain rules and the corresponding model can express uncertain semantics, and determine the quantitative data currency at a given time point. The thesis first studies the fundamental problems of uncertain currency rules,such as axiomatization, implication and satisfiability. Then, a model for quantitatively determine data currency is given, and the currency of data items, tuples, and data instance are formally defined. After that, an algorithm for determining the currency of data items is proposed. Currency graph is proposed to represent the currency orders. Based on the concepts of currency graphs, the algorithm can determine the currency of all the data items in polynomial time. Finally, the efficiency and effectiveness of the proposed methods are verified using real-life data sets.(2) This thesis investigates the methods of relative currency determination. Redundant records and currency constraints can be helpful when evaluating relative data currency in case of absolute data currency cannot be determined or cannot effectively express the requirements. Based on redundant records and currency constraints, this thesis studies the problem of relative data currency, and provides the model and algorithms for determining different types of relative data currency. First, this thesis defines query related data currency. When evaluating query related data currency, all the queries are classified as two categories, which are Current Value Query and Currency Sequence Query. For each query category, this thesis gives the definition of the currency of query result, as well as the average currency of the entire query category. Second, the users are categorized into three categories according the preference of queries, and the corresponding user related data currency is studied. Experimental results on real and synthetic datasets are given to analyze the effect of parameters and the efficiency of algorithms.(3) This thesis provides rule-based models and algorithms for improving data currency. It is essential to to repair these obsolete values into up-to-date ones. Previous methods of data repairing are based on either quality rules or statistical techniques, but both of the two types of methods have their limitations. The rule-based method fall short in their ability to detect and represent the complex relationships of different values. The methods based on statistical techniques need to learn complex conditional probabilities, and may lose some important domain knowledge which can give a helpful guidance of data repairing. To overcome the shortages of the previous methods, this thesis focuses on combining quality rules and statistical techniques to improve data currency. A new class of Currency Repairing Rules(CRR for short) is proposed to express both domain knowledge and statistical information. Domain knowledge is expressed by the rule pattern, and the statistical information is described by the conditional probability distribution corresponding to each rule. The problem of generating minimized CRRs is studied in both static and dynamic world. In the static world, the problem of generating minimized CRR patterns is proved to be NP-hard, and two approximate algorithms are provided to solve the problem. In dynamic world, methods are provided to update the CRRs without recomputing the whole CRR set in case of data being changed. In some special cases, the updates can be finished in O(1) time. In both cases, the methods for learning conditional probabilities for each CRR pattern are provided. Based on the CRRs, the problems of finding optimal repairing plans with and without cost budget is studied. It is proved that the problem of finding the optimal repairing plan without cost budget can be solved in polynomial time, but the problem of finding the optimal repairing plan with cost budget is NP-hard. Methods are provided to solve both the two cases. The experiments based on both real and synthetic data sets show that the proposed methods are efficient and effective.(4) This thesis studies the query-based methods for improving data currency. In many circumstances, such as data integration or web databases, tables are distributively stored in different data sources. A table may have overlaps with other sources, but the frequency of updates on each data source is different. When a user asks for a table, it cannot be ensured all the most current values will be returned since the source may not update in time. Thus, we need to leverage the currency orders and integrity constraints,and construct a conjunctive query on multiple tables in order to get all the required current values. The query is formalized as currency preserving query. This thesis studies the problem of generating the currency preserving query for a target table based on currency orders and integrity constraints. First, the formal definition of currency preserving query is provided. Then, the schema currency graph is proposed to represent the distributed tables and their relationships. Based on the model, the thesis proves that a currency preserving query is equivalent to a terminal tree in the graph. After that, the problem of the minimized currency preserving query is discussed. The problem is proved to be NP-hard,and the heuristics strategies are provided to solve the problem in different cases. Finally,experiments are conducted on both synthetic and real data sets to verify the effectiveness and efficiency of the proposed techniques.
Keywords/Search Tags:Data Quality, Data Usability, Data Currency, Data Repairing, Data Quality Rules
PDF Full Text Request
Related items