| With the rapid development of network technology, the amount of XML data is growing quickly, especially in web data publishing, data exchange among organizations, and e-commerce. XML has become the standard for data representation, storage, and exchange. Applications that identify and integrate XML data therefore have a strong need for entity identification techniques. Current research on entity identification for XML data relies mainly on distance measures and similarity functions, and usually ignores the optimization of the identification process. In practice, however, different sources represent data in different ways and XML data are often dirty: two similar XML data items do not necessarily refer to the same entity, and two XML data items that refer to the same entity may not be similar. Moreover, different sources contain many irrelevant entities, so the identification process incurs a great deal of wasted cost and leaves considerable room for optimization. This paper proposes a method based on semantic rules for entity identification of XML data, together with an optimization based on double clusters. First, we introduce the "tree-similarity structure", which uses the semantics of tree nodes and of XPath; it consists of two comparable nodes connected by a comparison operator and an XPath constraint. Second, we infer "identification-tree-similarity structures" from the tree-similarity structures and reasoning rules, and use them to match entities. Identification-tree-similarity structures can ensure matching quality even when the sources are dirty. Third, we reduce the scale of the data sources: we build an index for each XML tree in the two XML data sets and then, according to index similarity, put similar trees into the same cluster; if a tree does not belong to any cluster, we perform no further operations on it. 
Finally, experimental results show that the proposed method outperforms existing algorithms in efficiency without loss of accuracy, and that the optimization method is effective. |
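
The third step (index each XML tree, cluster trees by index similarity, and skip trees that fall into no cluster) can be illustrated with a minimal Python sketch. The abstract does not specify the index or the similarity measure, so everything concrete here is an assumption made for illustration: the index is taken to be the set of (tag path, token) pairs in a tree, similarity is Jaccard similarity, and the names `build_index`, `jaccard`, `cluster_trees`, and the `threshold` value are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import product


def build_index(xml_text):
    """Assumed index: the set of (tag path, lowercased token) pairs in a tree."""
    root = ET.fromstring(xml_text)
    index = set()

    def walk(node, path):
        here = f"{path}/{node.tag}"
        for token in (node.text or "").lower().split():
            index.add((here, token))
        for child in node:
            walk(child, here)

    walk(root, "")
    return index


def jaccard(a, b):
    """Assumed similarity measure between two indexes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_trees(set_a, set_b, threshold=0.5):
    """Group trees from two sources whose indexes are similar enough.

    Returns (clusters, skipped). Each cluster is a list of (source, position)
    ids; skipped lists the trees that matched nothing and are left alone.
    """
    ids = [("A", i) for i in range(len(set_a))]
    ids += [("B", j) for j in range(len(set_b))]
    parent = {x: x for x in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    idx_a = [build_index(t) for t in set_a]
    idx_b = [build_index(t) for t in set_b]
    for (i, ia), (j, ib) in product(enumerate(idx_a), enumerate(idx_b)):
        if jaccard(ia, ib) >= threshold:
            parent[find(("A", i))] = find(("B", j))

    groups = {}
    for x in ids:
        groups.setdefault(find(x), []).append(x)
    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    skipped = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, skipped


set_a = [
    "<book><title>Data Integration</title><year>2010</year></book>",
    "<book><title>Cooking Basics</title><year>1999</year></book>",
]
set_b = ["<book><title>Data Integration</title><year>2011</year></book>"]
clusters, skipped = cluster_trees(set_a, set_b)
```

In this toy input the two "Data Integration" trees share two of four index entries and land in one cluster, while the unrelated tree is skipped, so rule-based matching would only need to run inside the cluster. The sketch compares every pair of indexes and therefore does not itself show how the index comparison would be made efficient at scale.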