Font Size: a A A

Research On Set Similarity And Set Containtment In Main Memory Database

Posted on:2013-10-11Degree:DoctorType:Dissertation
Country:ChinaCandidate:L Y JiaFull Text:PDF
GTID:1228330395975802Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Set query and set join are important topics and attract a lot of attention. They consist oftwo parts: set containtment (query or join) and set similarity (query and join). Set query andset join have important research value and application prospect in many fields, such asdatabase, datamining, information retrieval and biological information systems.Index is a key mean to inprove the efficiency of set query and set join. Many indexessupporting set-value attributes have been designed and many algorithms have been proposedon these indexs. But many of these indexs and algorithms are only efficiency on some ofpredicates, or support all predicates but with a low efficiency. To address the issue of setquery and set join efficiently in main memory database, based on related works, the mainwork of this paper includes the following:1. This paper gives a survey of related works and points the advantages anddisadvantages of related indexes and algorithms supporting set query and set join. Toovercome the problem of some indexes and algorithms only efficient on one or servalpredicates, this paper gives the definition of set containtment and set join under an unifiedframework and realizes all query predicates on an unified index structure.2. A new trie based index--ETI is proposed to facilate T-overlap query by expanding thestructure of trie node. By transforming T-overlap problem to finding T-nodes on ETI, T-Similarity algorithm is proposed. The following three steps is used to expending T-overlap toother similarity querys:1) Mapping similarity thresholdτto T-overlap threshold.2)Proposing T-SimilarityExact based on T-Similarity to compute exact overlap value.3)Verifying candidates. The definition of record order is been defined and the impacts on ETIand T-similarity of different record orders are reviewed. The efficiency of ETI and relatedalgorithms is verified by extensive experiments.3. To treat set containtment query, Expand ETI by split NIL of ETI nodes into two parts:ENNIL and NNIL. Efficient methods, such as suffix-filtering and value-judging are proposedto speed up subset query. Additional methods, such as null-ENIL-judging, single-path-filtering and value-inequality-judging are also proposed to speed up equlity query. Efficientsuperset query algorithm is realized by traversing from root to leaf in ETI. The ETI based setcontaintment algorithms are compared with inverted index based algorithms.4. The inverted index based T-overlap algorithms usually employ a pruning-and-verification framework. A large number of candidates are generated to be verified, thusundoubtedly reducing the query efficiency. In this paper, a dynamic trie index, DTI is designed and efficient T-overlap algorithm, Dtrie-allpair is proposed. Dtrie-allpair can filterrecords with threshold smaller than T by adopting length-filtering. For records with thresholdno less that T, Dtrie-allpair can guarantee that only one result will be generated for asimilarity pair by first querying the record in DTI and then indexing it. Besides element order,record order and combine order are also proposed to evaluate the impact on efficiency of setjoin. There are no candidates generated for Dtrie-allpair and extensive experiments show thatit outperforms Allpair and PPJoin which are the-state-of-the-art algorithms.5. The CPU based inverted index algorithms are low efficiency when treating T-overlapquery. Most algorithms are only efficient at a certain small scope of threshold. To addressthese problems, the GPU based T-Overlap algorithms are investigated. ScanCount arechoosed as underlying algorithm. Depending on queries processed in parallel or serially, twoserial algorithms (GS-Serial and GS-Serial-Atomic) and one grouped parallel algorithm (GS-Parallel-Group) are proposed. To decrease the overhead of GPU memory, GS-Parallel-Group divides queries into groups, thus have nearly optimal performance while keeping arelative small memory size. The results are obtained directly in GPU by efficient GPUprimitives designed. The efficiency of these algorithms and reasonable size of group areinvestigated by extensive experiments.6. The inverted index based set containtment query algorithms execute a large number ofcomparisions thus reducing the efficiency. In this paper, set containtment query is researchusing GPU. Element-oriented, single-kernel list intersection algorithms are proposed toaddress subset and equality query. Efficient GPU primitives are designed to get resultsdirectly in GPU. GS-Parallel-Group is modified to adapt superset query. Experiments showthat GPU based algorithms have a high speedup over CPU based algorithms.
Keywords/Search Tags:Set query, Set join, Set containtment, T-overlap, Set similarity, Expanded trieindex, Dynamic trie index, Inverted index, GPU
PDF Full Text Request
Related items