Font Size: a A A

Parallel XML parsing

Posted on:2010-01-04Degree:Ph.DType:Dissertation
University:State University of New York at BinghamtonCandidate:Pan, YinfeiFull Text:PDF
GTID:1448390002477097Subject:Computer Science
Abstract/Summary:
XML has been widely adopted across a spectrum of applications. Its parsing efficiency, however, remains a concern and can be a bottleneck. With the prevalence of multicore CPUs, parallelization to improve parsing performance is one promising approach. The XML document can either be completely parsed to an in-memory tree after the whole XML document is available, or be parsed on the fly while the XML document is streaming in. Each alternative requires a different parallelization strategy. This dissertation thus investigates the parallelization approaches of both DOM-style XML parsing and SAX-style XML parsing.;For DOM-style parsing, we first figured out an overall solution to decomposing the XML document into well-formed fragments at well-defined points according to the output of an initial preparsing phase. Then, we focused on the parallelization of the preparsing stage, which is the major bottleneck of the DOM-style parsing approach. A meta-DFA approach for the parallelization of the preparsing stage is described. We further optimize the meta-DFA design into a new approach named SFT (Simultaneous Finite Transducer). The SFT approach greatly reduces the complexity of the generated meta-DFA, and thus simplifies all possible preparsing paths. We also developed a method to speculate on just some of starting machine states with high probability. With the parallelized preparsing stage, the parallel DOM-style parsing is thus much more scalable.;For SAX-style parsing, we first present an approach to process the incoming XML stream chunk by chunk. Each chunk will go through four stages. These stages combine the pipelined parallelism with the data parallelism to achieve speedup while still guaranteeing the sequential callback requirement of SAX-style parsing. With our investigation on the characteristics of SAX-style parsing, we further introduce a new method to build a streaming tree from the XML stream, and do SAX-style parsing through parallel traversal of this streaming tree. The streaming tree represents the XML document with each node representing an XML document's start element, end element, or character content, and thus also corresponding to each SAX callback event.;The experiments for the above approaches on our testing machine and XML benchmark documents show good speedup and scalability.
Keywords/Search Tags:Parsing, XML document, Approach, XML stream, Parallel
Related items