Study On Clustering For XML Document Collection

Posted on: 2016-09-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J Liu
GTID: 1228330467498644
Subject: Computer software and theory
Abstract/Summary:
XML (eXtensible Markup Language) documents are a typical form of semi-structured data. In 1998, the World Wide Web Consortium (W3C) published the XML 1.0 standard, which defines the XML format, and also proposed the Document Type Definition (DTD) standard. XML data share the general properties of semi-structured data: hierarchical structure, self-description, and dynamic variability. With the development of computer networks, semi-structured formats such as XML are now widely used, and how to mine interesting latent knowledge from massive semi-structured databases is an active research topic.
In data mining over XML document collections, clustering is one of the key research problems. The main problem of XML document clustering is how to evaluate the distance between XML documents and group documents with similar characteristics into a cluster; the main application of such groups of similar documents is data analysis. Before XML documents can be clustered, one must decide how to measure, effectively and accurately, the similarity (distance) between two XML documents or XML data items. When the domain information and knowledge carried by the document content is also considered, correctly measuring the similarity of each element in an XML data set becomes more complicated. The similarity measure therefore directly affects the quality of the XML document clustering results.
This dissertation first surveys the XML document clustering algorithms proposed to date. It begins with a brief introduction to the core issue of XML document clustering, namely the XML document similarity measure. It then groups existing XML similarity measures into tree-edit-distance measures, information-retrieval measures, and other measures.
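To make the notion of a structural distance between two XML documents concrete, the following is a minimal sketch in Python. It is not one of the measures surveyed in the dissertation: it assumes a much simpler stand-in, the Jaccard distance between the sets of root-to-element tag paths, and the function names `tag_paths` and `path_distance` are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def path_distance(xml_a, xml_b):
    """Jaccard distance between tag-path sets (assumed stand-in measure).

    0.0 means the two documents use exactly the same tag paths;
    1.0 means they share none.
    """
    a, b = tag_paths(xml_a), tag_paths(xml_b)
    return 1.0 - len(a & b) / len(a | b)
```

A distance of this kind ignores element order and repetition, which is one reason the literature turns to tree edit distance and other richer measures.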
Then, in the third chapter, we propose an XML document clustering method based on hierarchical (layer) information, called CXLI (Clustering XML with Layer Information). CXLI first builds a structure table for each XML document and uses it to eliminate repeated and nested structures. Layer information of the XML document is then used to constrain the basic edit operations, and a dynamic programming method computes the similarity between XML documents from their layer information. In this measure, the hierarchical level of XML data is treated as a similarity factor, and the insertion and deletion of subtrees across different levels is prohibited. The CXLI measure can be used in any application that needs an XML similarity measure. Finally, agglomerative hierarchical clustering is used to cluster the XML document data sets. Clustering experiments on the ACM SIGMOD data sets and on synthetic data sets show that the proposed method is effective.
Next, to further improve the accuracy of the clustering results, we present an XML document clustering method based on boosting theory. We first discuss the basic principles by which boosting improves clustering quality, especially for weak clustering algorithms. We then propose an algorithm named ICBQ that effectively improves the quality of XML document clustering. Experiments show that the method has good efficiency and accuracy and that boosting effectively improves XML document clustering results: on both real and synthetic data sets, applying the proposed method significantly improves the results of the clustering methods of Nierman, Dalamagas, and Flesca.
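The abstract does not give the CXLI algorithm itself, so the following Python sketch only illustrates the general idea under simplifying assumptions: tags are compared only at equal depth (a loose analogue of forbidding edits across levels), each level is compared with a standard dynamic-programming edit distance, and the documents are then grouped by average-linkage agglomerative clustering. The structure table and the boosting step are not modeled, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def levels(xml_text):
    """Return one tag sequence per depth, in breadth-first document order."""
    frontier = [ET.fromstring(xml_text)]
    out = []
    while frontier:
        out.append([n.tag for n in frontier])
        frontier = [c for n in frontier for c in n]
    return out


def edit_distance(s, t):
    """Classic dynamic-programming edit distance between two tag sequences."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def layered_distance(xml_a, xml_b):
    """Normalized sum of per-level edit distances (assumed simplification)."""
    la, lb = levels(xml_a), levels(xml_b)
    depth = max(len(la), len(lb))
    la += [[]] * (depth - len(la))  # pad the shallower document with empty levels
    lb += [[]] * (depth - len(lb))
    cost = sum(edit_distance(a, b) for a, b in zip(la, lb))
    size = sum(max(len(a), len(b)) for a, b in zip(la, lb))
    return cost / size


def agglomerative(docs, k):
    """Average-linkage agglomerative clustering; returns k lists of indices."""
    d = [[layered_distance(a, b) for b in docs] for a in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(p, q):
        return sum(d[i][j] for i in p for j in q) / (len(p) * len(q))

    while len(clusters) > k:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)  # merge the closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<person><name/><email/></person>",
    "<person><name/><phone/></person>",
]
print(agglomerative(docs, 2))
```

On this toy collection the two `book` documents and the two `person` documents end up in separate clusters. Comparing levels independently keeps the sketch short; a faithful implementation would operate on subtrees and follow the editing constraints defined in the dissertation.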
Keywords/Search Tags:Data mining, Semi-structured data, Document clustering, Layer, Boosting