Research On Sequence-based Indexing And Query Processing Technology For Uncertain XML

Posted on: 2015-09-18  Degree: Master  Type: Thesis
Country: China  Candidate: P Wang  Full Text: PDF
GTID: 2298330422990189  Subject: Computer application technology
Abstract/Summary:
XML is a data standard released by the W3C in February 1998. As a subset of SGML backed by the W3C, it is becoming the standard for data interchange, and it is used throughout data storage and data interchange: data storage in Web applications, application configuration files, data sharing between applications, and so on. The objective world is complex, so data processing inevitably has to deal with uncertain data. Driven by the advantages of XML itself, using XML to store uncertain data has become a current trend in the development of XML technology. Uncertain data stored as XML with probabilistic information is called uncertain XML, and querying uncertain XML has become a hot spot of current research.

At present, twig pattern matching over uncertain XML uses either binary structural joins or holistic matching. Binary structural joins seriously hurt query efficiency because they produce many useless intermediate results. Holistic matching concentrates the query process too much, which makes probabilistic threshold filtering inconvenient to apply, so it cannot use that filtering efficiently to improve query efficiency. To address these problems, this thesis applies sequence-based matching to uncertain XML queries. Two sequence-based uncertain XML twig pattern matching algorithms, PrTRIM and H-PrTRIM, are proposed by improving the LCS-TRIM algorithm.

Compared with an ordinary XML document, an uncertain XML document carries additional probability information, which must be processed correctly during a query. This thesis builds an index called PSI by extracting structural information and content information from the uncertain XML document. PSI allows probability information to be processed correctly, for example when recognizing exclusive distribution nodes and calculating the probabilities of query results. Subsequence matching and structure matching in a query also rely on PSI.

Some query results over uncertain XML have probabilities too low to be of practical value. Such results can be filtered out with a probabilistic threshold: a probability value given in the query, which requires the probability of every query result to be greater than or equal to it. In this thesis, probabilistic threshold filtering can be carried out three times in a query. This ensures that the query results satisfy the probabilistic threshold and improves query efficiency at the same time.

The experiments compare PrTRIM and H-PrTRIM in three respects: the effect of the query statement, of the probabilistic threshold, and of the document size on query efficiency. The experimental results are then analyzed. They show that for small documents and structurally simple query statements, the efficiency of H-PrTRIM is close to, but still higher than, that of PrTRIM. For large documents and structurally complex query statements, H-PrTRIM is more efficient than PrTRIM. In sum, H-PrTRIM has the advantage for large documents and complex structural queries.
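
The idea of probabilistic threshold filtering mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the PrTRIM or H-PrTRIM algorithm and it does not use the PSI index; it is a plain tree traversal under simplifying assumptions. The `Node` class, the `matches_above_threshold` function, and the example tree are hypothetical. Each node is assumed to carry the probability that it exists given its parent, these probabilities are assumed to be independent, and exclusive distribution nodes are not modeled. Under those assumptions a node's probability is the product of the probabilities along its root-to-node path, and because every factor is at most 1, a subtree can be skipped as soon as the running product falls below the threshold.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical element of an uncertain XML tree (illustration only)."""
    tag: str
    prob: float = 1.0  # assumed: probability of existing, given the parent exists
    children: List["Node"] = field(default_factory=list)


def matches_above_threshold(
    node: Node, tag: str, threshold: float, path_prob: float = 1.0
) -> Iterator[Tuple[Node, float]]:
    """Yield (node, probability) for nodes named `tag` whose path
    probability is greater than or equal to `threshold`."""
    path_prob *= node.prob
    if path_prob < threshold:
        # Probabilities never exceed 1, so no descendant can recover:
        # the whole subtree is pruned without being visited.
        return
    if node.tag == tag:
        yield node, path_prob
    for child in node.children:
        yield from matches_above_threshold(child, tag, threshold, path_prob)


# Hypothetical document: one <b> is reached through an uncertain <a>.
example_tree = Node("root", 1.0, [
    Node("a", 0.5, [Node("b", 0.5)]),
    Node("b", 0.9),
])

for found, p in matches_above_threshold(example_tree, "b", 0.3):
    print(found.tag, p)
```

With a threshold of 0.3, the `b` under `a` (path probability 0.5 × 0.5) is dropped and only the direct child remains. Filtering during the traversal rather than only at the end is what lets a threshold reduce work as well as trim the result set.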
Keywords/Search Tags: Uncertain XML, Sequence, Twig pattern, Probabilistic threshold