Font Size: a A A

Research On Subtree Results Generation Of XML Keyword Search

Posted on:2014-02-26Degree:MasterType:Thesis
Country:ChinaCandidate:B WangFull Text:PDF
GTID:2268330422466783Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Keyword search over XML data has attracted a lot of research efforts in the lastdecade, where a core problem is how to efficiently answer a given keyword query.However, most existing methods addressing efficiency focus on computing qualifiedroot nodes, such as SLCA or ELCA nodes, as efficient as possible. In fact,constructing subtrees result is not a trivial task. It is also the core factor that affect thekeyword search system performance. This paper is going to solve these problemsbelow.Firstly, most existing methods adopt three-step strategy to get the subtree result:(Step1) scan all nodes in the set of inverted lists to get SLCA or ELCA results.(Step2)get a set of relevant keyword nodes(RKNs) for each SLCA or ELCA reslut v byrescanning all these nodes again.(Step3) construct a PathSubtree for each RKN setand the prune it.By studying the XML keyword subtree results and their generationalgorithm, we found the three problems:(1) existing methods cannot correctlyidentify RKNs for ELCA semantics;(2) existing methods need to buffer lots ofuseless nodes when construct the subtree;(3) there is no method which satisfy withSLCA and ELCA can adapt the various subtree genneration constraint.Secondly, for the first problem, we formally propose the definition of RKN andthe RKN-Base algorithm, which can correctly tell whether a given node is an RKN ofsome ELCA node by sequentially scanning the set of inverted lists once. AsRKN-Base cannot avoid processing all useless nodes, we then propose an optimizedalgorithm, namely RKN-Optimized, which computes RKN sets based on the set ofELCA nodes, rather than the set of inverted lists as RKN-Base does. As a result,RKN-Optimized avoids processing useless nodes, and reduces the time complexity.Thirdly, for the second problem, we propose a top-down algorithm named TDMthat constructs each MSubtree by visiting nodes of the subtree constructed based onan RKN set level-by-level from left to right, such that to avoid visiting as manyuseless nodes as possible.Fourthly, for the third problem, we propose a generic top-down subtreeconstruction strategy named "TDx", where"generic" means that our method can beused to get either one of MSubtree, RTF and TMSubtree results and our method isindependent of the query semantics, i.e., SLCA orELCA;"top-down" means that ourmethod constructs each restrict subtree by visiting nodes of the corresponding pathsubtree level-by-level from left to right, such that to avoid visiting as many useless nodes as possible. Then we take parallel processing strategy for our method toimprove the overall performance.At last, the experimental results show that our method is much more efficientthan existing ones according to various metrics.
Keywords/Search Tags:xml, keyword search, ELCA, SLCA, Subtree Reslut, parallel
PDF Full Text Request
Related items