Research On The Compression-based Approximate Query Method For Massive Incomplete Data

Posted on: 2017-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: G H Liu
Full Text: PDF
GTID: 2308330482499722
Subject: Computer software and theory
Abstract/Summary:
In recent years, the rapid development of information technology has led to explosive growth in data volume. Massive data carries a wealth of information, but it also brings many quality problems. Missing data is a common one, and it seriously limits the value of the data in applications. The treatment of massive incomplete data has attracted wide attention and produced some results. Existing methods clean the massive incomplete data before using it, then compress and manage the cleaned data. For massive data, however, traditional data cleaning costs too much, and the data cannot be completely repaired. Improving operational efficiency therefore calls for new ways of handling massive incomplete data. To that end, this thesis proposes a compression-based approximate query method for massive incomplete data (AQ-MI). The method collects statistics on frequently used data operations to derive deterministic query conditions and uncertain query conditions, which are used to build indexes when the data is compressed and stored. It adopts a missing-data marking strategy: data with missing attribute values is marked in the original data, so that complete and incomplete components are distinguished and keep the correct identity in the query answer. The thesis also proposes a double compression mechanism. In the first stage, the method evaluates the deterministic and uncertain query conditions for each tuple of the marked data and builds the corresponding indexes. In the second stage, an attribute partition strategy partitions the attributes of the index file, and an encoding dictionary compresses the index file, which further reduces storage space.
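The marking and dictionary-encoding ideas described above can be illustrated with a minimal Python sketch. It is not the AQ-MI implementation: the `MISSING` marker, the function names `mark_missing` and `dictionary_encode`, and the sample rows are all assumptions made for illustration, and the abstract does not specify how the index or the attribute partitions are actually built.

```python
MISSING = "?"  # hypothetical marker for a missing attribute value


def mark_missing(rows):
    """Replace None with MISSING and flag whether each tuple is complete."""
    marked = []
    for row in rows:
        values = tuple(MISSING if v is None else v for v in row)
        marked.append((values, MISSING not in values))
    return marked


def dictionary_encode(column):
    """Map each distinct value to a small integer code.

    Returns (dictionary, codes), where codes[i] is the code of column[i].
    """
    dictionary = {}
    codes = []
    for value in column:
        # setdefault assigns the next free code the first time a value is seen
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


# Hypothetical sample data: None stands for a missing attribute value.
rows = [("Beijing", 30, "A"), ("Shanghai", None, "B"), ("Beijing", 30, None)]
marked = mark_missing(rows)

# Treat each attribute as its own partition and encode it separately.
columns = list(zip(*(values for values, _ in marked)))
encoded = [dictionary_encode(col) for col in columns]
```

Encoding each attribute separately keeps the dictionaries small when a column has few distinct values, which is the usual motivation for dictionary compression. The marker is stored like any other value, so incomplete tuples remain identifiable after encoding.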
This thesis presents an approximate query method that works without decompression. The method analyzes the query condition, builds a query index, and performs selection and projection on the compressed index file to obtain the compressed addresses of the queried data, from which it returns approximate query results over the incomplete data. Finally, the thesis proposes hard and soft optimization strategies for the multiple compression of incomplete data. Hard optimization compresses each tuple only once at the cost of additional cache blocks, and it ensures the integrity of query results. Soft optimization uses a weight distribution strategy that combines subjective and objective weights based on attribute importance: it calculates the importance of each attribute and assigns weights, on which the compression focuses. Extensive simulation experiments show that the AQ-MI method can quickly locate the compressed position of the queried data, improve query efficiency, greatly reduce storage space, and ensure the integrity of query results.
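As a rough illustration of selection without decompression, the Python sketch below evaluates an equality predicate directly on a dictionary-encoded column. It is a sketch under stated assumptions, not the method from the thesis: `MISSING_CODE`, `select_positions`, and the sample column are hypothetical, and the split into "certain" and "possible" matches is one plausible reading of how an approximate answer over incomplete data might be reported.

```python
MISSING_CODE = -1  # hypothetical code reserved for a marked missing value


def select_positions(codes, dictionary, wanted):
    """Evaluate `attribute == wanted` on encoded data.

    Returns (certain, possible): positions whose code matches, and
    positions whose value is missing and therefore might match.
    """
    target = dictionary.get(wanted)  # None if the value never occurs
    certain, possible = [], []
    for pos, code in enumerate(codes):
        if code == MISSING_CODE:
            possible.append(pos)
        elif target is not None and code == target:
            certain.append(pos)
    return certain, possible


# Hypothetical encoded "city" column with one missing entry.
city_dictionary = {"Beijing": 0, "Shanghai": 1}
city_codes = [0, 1, MISSING_CODE, 0]

certain, possible = select_positions(city_codes, city_dictionary, "Beijing")
```

The predicate value is translated to its code once, and the scan then compares integers only, so the stored values are never expanded. The returned positions play the role of the compressed addresses mentioned above.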
Keywords/Search Tags:incomplete data, approximate query, data compression, index, encoding dictionary