Font Size: a A A

Research On Querying Missing Data

Posted on:2018-08-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:X Y MiaoFull Text:PDF
GTID:1318330518473526Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of computer,Internet,communication and positioning technologies,the volume of data is increasing rapidly.The data has several properties,such as high-dimension,multi-source,heterogeneous,uncertain/incomplete,and so on.Missing data having incompleteness/uncertainty is ubiqutious,and widely exists in many aspects of research fields and society life,e.g.,communication,transportation,and economics,etc.Query processing is a fundamental problem in computer science,which is useful in almost all kinds of computer applications.Consequently,it has become a major challenge to efficiently and intelligently perform query processing,aiming to obtain valuable information and serve the society accordingly.However,existing techniques of query processing mostly focus on incomplete data.There exists big differences between the properties of complete data and incomplete data,and thus,the traditional techniques for querying complete data cannot(directly)support querying missing data.Motivated by these,in this paper,we systematically explore query processing technologies on missing data.They mainly include as follows:1)Query incomplete data.Similarity search plays an important role in many real-life applications such as GIS and multimedia retrieval.Most of the state-of-the-art methods for similarity search only focus on complete data.However,the data set probably contains missing information in many practical scenarios,e.g.,the missing attribute values of the data objects.Hence,the similarity relationship between those objects cannot be measured by the conventional similarity metrics.To this end,we develop two effective index techniques,and propose an exact algorithm and an approximate algorithm,to support k nearest neighbor search on incomplete data.Furthermore,the preference query has a large application base in decision making,profile-based services,recommendation systems,etc.For instance,in a movie recommendation system,the ratings of the movies are usually incompllete.The system can select top-k movies with the best ratings by top-k dominating queries and thus provide them to moviegoers.As a result,in this thesis,we develop skyband pruning,upper bound score pruning,bitmap index,as well as the compression techniques.Based on them,we present several algorithms to address the top-k dominating query in the context of incomplete data.Moreover,we explore the effect of bitmap compression techniques on the query efficiency and the bitmap space consumption,and further discuss the tradeoff between the query effiency and the bitmap space consumption.2)Querying uncertain graphs.Although there exist many efforts on the query processing over uncertain graphs,it cannot meet the emerging complex and diverse query requirements.For example,reverse knearest neighbor search on uncertain graphs plays an important role in real-life applications such as optimal resource allocation,essential protein identification,and so forth.Moreover,there exists no solution to this problem.In view of this,we systematically study the problem of reverse k nearest neighbor search on uncertain graphs(UG-RkNN).We develop three pruning heuristics to minimize the uncertain graph scale,the number of possible graphs,and the number of candidate data nodes.In addition to presenting an exact algorithm to support the UG-RkNN problem,we propose a novel sampling algorithm,which integrates the heuristic of pruning graph scale and an adaptive stratified sampling method.The variance of this method is as at least good as that of random sampling method.Our presented exact algorithm is in average five orders of magnitude better than the existing approaches.Moreover,our proposed sampling algorithm is at least three orders of magnitude better than our exact algorithm.3)Optimizing probabilistic query quality.In many real-life scenarios,the data involves with uncertainty.It makes the probabilistic query prone to return a group of uncertain query answers,and thereby the query result set contains much noise,which significantly degrades the query quality.Therefore,we present a novel framework to optimize the quality of probabilistic skyline query and probabilistic similarity search.The goal of the framework is to find a group of uncetain objects to clean under a given budget such that the query quality is maximized.A newly structure,termed as ASI,is employed to accelate the quality computation.Based on two effective pruning heuristics,we proposed a branch and bound algorithm,a greedy algorithm,and a sampling algorithm to address the object selection problem.The efficiency of the qulaity computation method using ASI structure is demonstrated to be 2-3 orders of magnitude better than the one that does not use ASI structure.4)An application system using queries on missing data.Based on the above proposed techniques,we develop a restaurant recommendation system using preference queries on incomplete information.This system is capable of friendly recommending desirable restaurants based on preference queries that take the incomplete ratings information into consideration.It is based on an extended PostgreSQL database that integrates two types of preference queries,namely,skyline and top-k dominating queries over incomplete data.The system incorporates three functionality modules including friendly and convenient query submission,flexible and useful result explanation,and timely and incremental dataset interaction.
Keywords/Search Tags:Missing Data, Incomplete/Uncertain Data, Uncertain Graphs, Query Processing, Query Quality, Systems
PDF Full Text Request
Related items