Research On Key Techniques Of XML-based Content Routing

Posted on: 2007-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Wang
Full Text: PDF
GTID: 1118360215459715
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technologies, many event-driven applications have emerged, such as content-based publish/subscribe systems, selective dissemination of information, content-based XML routing, and news distribution. In these applications, a stream of XML documents is sent from a set of data producers to a set of data consumers. Consumers subscribe to the data by means of filters and then receive a copy of every document that satisfies those filters. This style of routing is called content-based routing, because documents are routed according to their content rather than a destination address. However, existing content-based technologies still have problems with filtering efficiency and with support for heterogeneous events.

XML has become the de facto standard for data exchange over the Internet because it is self-describing, extensible, and convenient for exchange. This thesis takes XML documents as the published events and XPath expressions as the subscriptions of multiple users, and focuses on several key techniques of content routing.

To process published XML events, a novel HXFA method with an optimized rewriting method is presented, followed by a theoretical analysis. Thousands of HXFAs are then combined into a filtering engine. Compared with the YFilter engine, the proposed method shows satisfactory results in efficiency and scalability.

The advantages and shortcomings of existing XML similarity models are analysed. On this basis, the thesis extends the Vector Space Model and proposes a novel H_path model that uses an ontology and supports. The construction algorithm and its complexity analysis are also given. The approach first extracts frequent sequences from the document collection as features, and then uses the ontology to judge the semantic features between the tags of XML documents. Furthermore, the model merges repeated nodes and paths through these semantic features. Finally, a distance calculation based on supports is put forward. Compared with the tree edit script model, this model not only provides a description of each document but also has an advantage in time cost.

Based on the H_path model, a clustering method using improved PSO is given. First, the document collection is mapped into the problem space of the particle model. Then, the CIP method is applied for clustering. Furthermore, weighing time against accuracy, a mixed clustering method based on PSO is applied to XML categorization to improve clustering convergence and accuracy.

When H_paths are extracted from a large collection of heterogeneous documents, their dimensionality has already been reduced to some extent; however, the curse of dimensionality remains. To address this problem, a novel preprocessing strategy is proposed. Independent Component Analysis is applied to reduce the dimensionality of the document matrix, and the document vectors are then clustered in the reduced Euclidean space spanned by the independent components. The method has two merits: first, it removes correlated redundancy and uncovers the underlying latent variables of XML structures, which improves clustering quality; second, it reduces dimensionality, which compresses the search space at low cost.

Finally, the architecture of a publish/subscribe system is presented, showing how the proposed key techniques work within it.
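
The routing idea described at the start of the abstract (XML documents as events, path expressions as subscriptions) can be illustrated with a minimal Python sketch. It is not the HXFA engine of the thesis: it checks each filter independently using the limited XPath subset of the standard library's `xml.etree.ElementTree`, and the consumer names, filters, and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical subscriptions: consumer name -> path filter.
# ElementTree supports only a subset of XPath, so the filters are
# written relative to the root element of each document.
SUBSCRIPTIONS = {
    "alice": ".//article[@topic='sports']",
    "bob": ".//article/price",
    "carol": ".//article[@topic='finance']",
}


def route(document, subscriptions):
    """Return the consumers whose filter matches at least one node.

    The document is delivered according to its content only; no
    destination address is involved.
    """
    root = ET.fromstring(document)
    return [
        name
        for name, path in subscriptions.items()
        if root.find(path) is not None
    ]


doc = "<feed><article topic='sports'><title>Final score</title></article></feed>"
print(route(doc, SUBSCRIPTIONS))  # ['alice']
```

Evaluating every filter separately, as above, costs time proportional to the number of subscriptions per document. That cost is the kind of efficiency problem that motivates combined filtering engines such as YFilter and the engine proposed in the thesis.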
Keywords/Search Tags: XML, Content Routing, Publish/subscribe, Particle Swarm Optimization, Hedge Automata