Font Size: a A A

Novel Data Management Technologies For Histogram-based Applications

Posted on:2014-07-19Degree:DoctorType:Dissertation
Country:ChinaCandidate:J XuFull Text:PDF
GTID:1318330482456116Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Histograms are proposed by Karl Pearson, who has been credited with estab-lishing the discipline of mathematical statistics in the 20th century. Having been developed for many years, the histogram nowadays becomes one of the most im-portant mathematical tools for estimating the probability distributions of statistical objects. During the last few decades, histograms have been widely used in many fields of information science. For example, they are utilized to publish statistical data, model probabilistic data and represent features of multimedia data. However, those practical histogram-based applications are faced with many challenges. First, a recent study finds that the direct release of some statistical summary information, such as histograms, of sensitive data also encounters a risk of revealing individuals' privacy. Second, existing similarity query processing techniques on histograms can only support some simple distance functions (such as the Lp norms), when quanti-fying the dissimilarity between histograms. The query results based on such simple distance functions usually fail to match human expectation. To overcome these problems, this dissertation targets on developing novel data management technolo-gies for histogram-based applications, which will make theoretical significance and practical value for further promoting these applications.Based on typical application scenarios, this dissertation proposes novel data management technologies for histogram-based applications from three aspects, namely the histogram construction, the histogram indexing and storage, and the histogram similarity search. As for the histogram construction aspect, researchers from the information security domain discovered that histograms directly built on sensitive datasets might leak users' privacy. Being a new privacy protection model, Differentially Privacy has attracted many attentions, since it does not have any stringent restriction on attackers' background or attacking techniques. Therefore, designing differentially private histograms with high data utility becomes a research hotspot in the data security domain. A differentially private histogram is deemed to protect individual's sensitive information from being revealed. In this disserta-tion, we focus on building differentially private histograms with high data utility. As for the histogram index, storage and similarity search aspects, most existing re-search results from computer vision domain show that the Earth Mover's Distance (EMD) is significant in returning meaningful histogram similarity search results that are more consistent with human perception to the similarities. However, the calculation of EMD owns a cubic time complexity, which makes the histogram sim-ilarity search based on EMD quite challenging. In this dissertation, we target on the Earth Mover's Distance, and present EMD-oriented index and storage structure for histograms. On the basis of such index and storage structure, some efficient and effective EMD-based histogram similarity search algorithms are proposed under the context of both databases and data streams. The index method and similarity search algorithms presented in this dissertation apparently improve the EMD-based histogram similarity query processing. The main contents and contributions of this dissertation are listed below.1. Two differentially private histogram construction methods, namely Mean-NoiseFirst and Mean-StructureFirst, are proposed based on the constructing technique of V-Optimal histograms. By building the V-Optimal histogram structure on the original count sequence, similar consecutive counts are merged into a single histogram bin. The merging of consecutive similar counts can not only smooth the zero-mean Laplace noise added to each count, but also reduce the scale of noise injected to each count to satisfy the constraint of differen-tial privacy. Since the average count within each histogram bin is used to approximate the set of counts in that bin, both of the Mean-NoiseFirst and the Mean-StructureFirst are named as the mean-based differentially private histogram construction methods. By performing lots of experiments using sev-eral real-world datasets, both of the Mean-NoiseFirst method and the Mean-StructureFirst method apparently increase the accuracy of the constructed differentially private histograms compared with all state-of-the-art methods.2. Two median-based differentially private histogram construction methods, namely Median-NoiseFirst and Median-StructureFirst, are proposed based on the framework of Mean-NoiseFirst and Mean-StructureFirst methods. By uti-lizing the median count within each histogram bin to approximate the set of counts in that bin, the median-based histogram construction methods are not sensitive to the extreme noise added to each count. Experimental re-sults show that the Median-StructureFirst method obviously outperforms the Mean-StructureFirst method, and successfully increases the accuracy of the constructed histogram by an average of 1 time compared with all state-of-the-art methods. While the privacy protection degree is high, the Median- NoiseFirst method is better than its counterpart mean-based solution, i.e., Mean-NoiseFirst. The Mean-NoiseFirst method on the other hand outper-forms the Median-NoiseFirst method when the privacy protection degree is low. Particularly, when answering unit length range count queries, the his-togram built by the Median-NoiseFirst method is 5 times better than the histogram constructed by the state-of-the-art Dwork method, considering the precision of query results.3. An index and storage scheme for histograms on the basis of Earth Mover's Distance (EMD) is proposed. Utilizing the primal-dual theory in linear pro-gramming, each histogram is mapped into a 1D domain of real numbers. After that, a B+tree-based index structure is built naturally in the 1D mapping domain. In order to support the similarity search based on the B+tree in-dex structure, we build a connection between the B+tree keys of histograms and the Earth Mover's Distance between histograms. Being different from the EMD-oriented index structure proposed in a previous work, our B+tree-based index structure can effectively prune unpromising records for the query histogram without making false negative errors.4. Two efficient and effective EMD-based histogram similarity search algorithms, namely the range query algorithm and the k-nearest neighbor query algorithm, are proposed under the context of databases, with the aid of the B+tree-based index structure proposed in this dissertation. By employing the B+tree-based index structure, the algorithms successfully filter out a large amount of irrele-vant histogram records, and therefore boost the EMD-based similarity search procedure. Extensive experiments show that our two proposals successfully accelerate the EMD-based similarity query processing by on average 1 time compared with the state-of-the-art method.5. An efficient and effective EMD-based processing scheme for histogram contin-uous similarity search is proposed under the context of data streams. A group of feasible solutions derived from the dual problem of the EMD are employed to eliminate unpromising histogram records in the data stream. Moreover, two optimization strategies, namely adaptive feasible solution update and multi-query optimization, are proposed based on new properties of the data stream. These two strategies further improve the processing speed of the continuous similarity search. A demonstration system is also designed under the appli- cation scenario of online video frame copy detection, with the support of our EMD-based histogram continuous similarity search algorithm. The demon-stration system ensures a real time processing rate of 40 frames per second even if there are 80 continuous similarity queries being concurrently processed in the system. Moreover, due to the utilization of the EMD as the underlying similarity function, the demonstration system has a strong ability of identify-ing similar frames in the data stream.In conclusion, this dissertation focuses on the problem of constructing differ-entially private histograms, and the problems of building effective histogram index (storage) structure and efficient histogram similarity search algorithms based on the Earth Mover's Distance. Several efficient and effective solutions are proposed to-wards these problems. Theoretical analysis and experimental results show that our proposals apparently outperform those related works.
Keywords/Search Tags:histogram, privacy protection, differential privacy model, V- Optimal histogram, similarity search, Earth Mover's Distance, primal-dual theory in linear programming, B~+ tree index
PDF Full Text Request
Related items