
Keywords Filtering On XML Data

Posted on: 2012-12-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C J Zhang
Full Text: PDF
GTID: 1118330371465413
Subject: Computer software and theory
Abstract/Summary:
XML is a standard for exchanging and storing information on the Web. In many important applications, such as RSS, users submit keywords to express what they want, and the system filters an XML stream and returns the matching XML segments. With the explosive growth of web data, filtering useful information out of massive data has become a valuable problem.

This thesis studies two aspects, efficiency and modeling, of SLCA-based keywords filtering on deterministic XML and probabilistic XML. For deterministic XML, previous index-based XML keyword search methods are not suitable for XML keywords filtering, so effective one-scan algorithms for XML filtering are proposed. The use of MapReduce distributed computing to make keywords filtering over massive XML data more efficient is also discussed. For probabilistic XML, the only previous work on keyword queries is based on the probabilistic model PrXML{ind,mux}, which focuses on independent and mutually exclusive relations between sibling nodes. Modeling more general dependency relations between sibling nodes, and querying on such a model, are also meaningful problems.

The contributions of this thesis are summarized as follows:

● The SLCA (Smallest Lowest Common Ancestor) semantics on probabilistic XML is defined. The information that comes from child nodes is defined as "tab", and the operators "·", "+" and "×" are defined on "tab". A keywords filtering method for the probabilistic XML models YrXM{ind,mux} and PrXML{ind,mux} is given. Compared with similar work on PrXML{ind,mux}, our method is more effective.
● A new probabilistic XML model, PrXML-BN, based on Bayesian networks is proposed. The mapping from an XML document to a Bayesian network is defined. SLCA semantics is introduced into the Bayesian network to support SLCA-based keywords filtering. Two optimizations, node reduction and query result caching, are given to improve the algorithm's efficiency.
● An effective, portable SLCA computing service is designed for deterministic XML keywords filtering. It does not depend on any labeling scheme. It processes queries and produces SLCA results while the data is being parsed, without an index. No redundant intermediate results are generated, so it is more efficient.
● A distributed keywords filtering method is constructed to process massive XML data effectively. The filtering task is divided into small tasks by partitioning the data with Hadoop. Because XML is semi-structured, an XML data partitioning strategy is given to prevent Hadoop from transparently splitting XML data.

This thesis conducts a comprehensive study of XML keywords filtering. A portable SLCA computing service is reported and integrated into a parallel system to handle large data sizes. SLCA semantics on probabilistic XML is defined, and keywords filtering on PrXML{ind,mux} is studied. A new probabilistic XML model, PrXML-BN, based on Bayesian networks is proposed, and SLCA-based keywords filtering is conducted on it. Information extraction and data quality for massive uncertain data will be studied in the future.
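
To make the idea of index-free, one-scan SLCA filtering on deterministic XML more concrete, here is a minimal Python sketch. It is not the algorithm from the thesis, which the abstract does not spell out. It only illustrates the usual SLCA definition: an element is a result when its subtree contains every keyword and no child subtree already does. The function name slca_one_scan, the Dewey-style labels, and the sample document are assumptions made for illustration. Keywords are matched against each element's direct text only; tail text in mixed content is ignored.

```python
import io
import xml.etree.ElementTree as ET


def slca_one_scan(xml_source, keywords):
    """Return (label, tag) pairs for SLCA elements found in a single pass.

    Hypothetical helper for illustration only. Labels are Dewey-style
    tuples of child positions, with the root labelled (0,).
    """
    terms = [k.lower() for k in keywords]
    if not terms:
        return []
    full = (1 << len(terms)) - 1   # bitmask with one bit per keyword
    stack = []                     # frame per open element: [label, mask, blocked, n_children]
    results = []
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            if stack:
                parent = stack[-1]
                label = parent[0] + (parent[3],)
                parent[3] += 1
            else:
                label = (0,)
            stack.append([label, 0, False, 0])
            continue
        # "end" event: the element's own text is complete at this point.
        label, mask, blocked, _ = stack.pop()
        words = (elem.text or "").lower().split()
        for i, term in enumerate(terms):
            if term in words:
                mask |= 1 << i
        # Report only if no child subtree already covered all keywords.
        if mask == full and not blocked:
            results.append((label, elem.tag))
        if stack:
            stack[-1][1] |= mask
            stack[-1][2] = stack[-1][2] or mask == full
        elem.clear()               # drop the finished subtree's content
    return results


# Assumed sample stream, loosely modelled on an RSS-like feed.
doc = b"""<feed>
  <item><title>XML filtering</title><body>stream processing</body></item>
  <item><title>XML stream filtering</title></item>
  <item><title>databases</title></item>
</feed>"""

print(slca_one_scan(io.BytesIO(doc), ["xml", "stream"]))
```

In this sketch the first item is reported because its two children together cover both keywords, while in the second item the title alone covers them, so the title is reported and its ancestors are suppressed. The stack holds one frame per open element, so working memory grows with document depth and with the number of results, not with document size.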
Keywords/Search Tags: XML Filtering, Probabilistic XML, Cloud Computing, Keywords Filtering, SLCA, Bayesian Network