| With the rapid development of the Internet, XML has become the Internet's most popular language for data exchange and storage, and extracting valuable information from large collections of XML documents is now an active research topic. Research on XML document clustering often focuses on improving the document model in order to compute the similarity of XML documents more efficiently. Several similarity calculation models currently exist for XML documents, such as the SET/BAG model, the vector space model (VSM), and tree models, and each model has a variety of similarity calculation methods. This paper reviews the basics of text clustering and its applications, analyzes commonly used text clustering algorithms along with their advantages and disadvantages, introduces the basic models and methods for computing XML document similarity, and analyzes the advantages and disadvantages of these methods. It then proposes an improved similarity calculation method based on the SET/BAG model. The method converts each node of an XML document into an object composed of the object name, its parent object, its set of attributes, and its weight relative to its parent. This expresses the structural information of the XML document more completely, and the method also adjusts the weights of duplicate nodes to reduce their influence on the similarity calculation. Experiments were carried out on a real data set and a manually collected data set, and the clustering results were evaluated using recall and precision. Compared with the tree edit distance method and the node comparison method, the experimental results show that clustering with the proposed similarity calculation method, based on the improved SET/BAG model, achieves good results. |
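The node-to-object idea described above can be illustrated with a small sketch. This is not the paper's actual formulation: the function names `node_bag` and `similarity`, the geometric `damping` rule for duplicate nodes, and the weighted-Jaccard comparison are all assumptions chosen only to make the idea concrete.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def node_bag(xml_text, damping=0.5):
    """Turn an XML document into a weighted bag of node objects.

    Each node becomes the key (name, parent name, sorted attribute names).
    Assumption: the k-th repeat of the same key contributes damping**k,
    so duplicate nodes count for less than first occurrences.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, None)]
    while stack:
        elem, parent_name = stack.pop()
        key = (elem.tag, parent_name, tuple(sorted(elem.attrib)))
        counts[key] += 1
        for child in elem:
            stack.append((child, elem.tag))
    # Collapse raw counts into damped weights: 1 + d + d^2 + ...
    return {
        key: sum(damping ** i for i in range(c))
        for key, c in counts.items()
    }


def similarity(bag_a, bag_b):
    """Weighted Jaccard similarity between two bags (an assumed measure)."""
    keys = set(bag_a) | set(bag_b)
    shared = sum(min(bag_a.get(k, 0.0), bag_b.get(k, 0.0)) for k in keys)
    total = sum(max(bag_a.get(k, 0.0), bag_b.get(k, 0.0)) for k in keys)
    return shared / total if total else 1.0
```

Under these assumptions, a document with two `book` elements under `lib` and a document with a single one still score as fairly similar, because the repeated node adds only half of its original weight.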