
Research On The XML Document Clustering And Performance Evaluation

Posted on: 2016-10-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T N Ding
Full Text: PDF
GTID: 1228330467998643
Subject: Computer software and theory
Abstract/Summary:
With the development of computer network technology and the growth in the number of Internet users, semi-structured data have come into wide use. The eXtensible Markup Language (XML), proposed by the World Wide Web Consortium (W3C), is a typical semi-structured data format. XML is widely used because of its hierarchical, self-describing, and dynamically variable characteristics. Since Microsoft Office 2007, Microsoft Office has stored documents in the XML-based Office Open XML format, and on the Linux operating system OpenOffice likewise stores office documents in the XML-based OpenDocument format. The World Wide Web Consortium has also indicated that, in the next generation of the Web, XML is to replace HTML as the standard exchange format for pages.

Faced with huge amounts of XML document data, how to mine the knowledge that users are interested in from large XML document databases has become a hot research topic in the data mining field. XML document clustering is one of the problems studied in XML document data mining. It mainly studies how to group XML documents with similar characteristics into one cluster, and it is mainly used to analyze data sets of XML documents that share similar characteristics.

In this paper, we first study a clustering method for static XML document data. For static XML document data sets, this paper proposes a document clustering method based on the frequent pattern tree structure of the XML document data set. First, it proposes an encoding of the XML document tree structure (tree structure coding). Then, using the frequent patterns mined from the XML data as features, it clusters the XML document data set with the cosine similarity measure and agglomerative hierarchical clustering. Because the set of frequent patterns is a subset of the original XML document data set, measuring similarity on the significant frequent patterns markedly reduces the time cost of the XML document similarity computation.
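The combination of frequent-pattern features, cosine similarity, and agglomerative clustering described above can be sketched roughly as follows. This is a minimal illustration, not the dissertation's algorithm: the tag-path patterns, the binary vector encoding, the average-linkage rule, and the function names `cosine`, `to_vector`, and `agglomerative` are all assumptions made for the example, and the tree structure coding and frequent pattern mining steps are not reproduced.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def to_vector(doc_patterns, vocabulary):
    """Binary vector: 1 if the document contains the frequent pattern (assumed encoding)."""
    return [1 if p in doc_patterns else 0 for p in vocabulary]

def agglomerative(vectors, k):
    """Repeatedly merge the two most similar clusters (average linkage, assumed) until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical frequent patterns (tag paths) per document, invented for illustration.
docs = [
    {"book/title", "book/author"},
    {"book/title", "book/author", "book/year"},
    {"order/item", "order/price"},
    {"order/item", "order/price", "order/date"},
]
vocabulary = sorted(set().union(*docs))
vectors = [to_vector(d, vocabulary) for d in docs]
clusters = agglomerative(vectors, k=2)
print(clusters)
```

Because each document is compared only on the mined patterns rather than on its full tree, the vectors stay short, which is the source of the reduced similarity cost that the abstract claims.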
Experiments show that the static clustering algorithm produces good clustering results and scales well.

This paper then studies a clustering method for dynamic XML document data sets. Based on the characteristics of dynamic XML document data sets, it first proposes TDOM, a versioned model with time parameters that records the history of changes to XML document data as the documents evolve. It then defines the significant frequent change patterns of XML documents and proposes a method for mining significant frequent change patterns from TDOM data sets. Finally, it proposes a clustering method for dynamic XML documents based on significant, frequently changing structures. Experiments show that this algorithm can cluster dynamic XML document data sets according to their dynamic characteristics, with good clustering results and good scalability.

Performance evaluation is one of the open problems in the data mining and machine learning fields. We note that nearly all existing evaluation measures ignore the predicted probabilities, which are highly significant in the evaluation of clusters. In this paper, we construct a weighted confusion matrix to reflect the information in the predicted probabilities. In addition, based on the weighted confusion matrix, traditional evaluation measures such as accuracy, precision, recall, and F-measure are redefined to take predicted probabilities into account. Finally, properties of the redefined evaluation measures are investigated. Experimental results show that the redefined evaluation measures are superior to traditional ones in terms of discrimination.

In this paper, we also propose an ROC-based measure for the performance of clustering models.
First, we propose the concept of a weighted correct pair map. Then, based on the weighted correct pair map, we propose a new evaluation measure. The attractive features of the measure are that it is insensitive to imbalanced class distributions and sufficiently discriminating. Experimental results demonstrate that the proposed measure is reliable. The work presented in this paper may stimulate new research in classification model design, such as the design of new optimization-based clustering or ranking models.
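To make the weighted confusion matrix idea from the abstract more concrete, here is a minimal sketch under one plausible reading: each instance adds its predicted probability, rather than a count of 1, to the cell for its true and predicted labels. The abstract does not give the dissertation's actual definition, so this weighting rule, the function names `weighted_confusion` and `precision_recall_f1`, and the toy data are all assumptions.

```python
def weighted_confusion(y_true, y_pred, confidence, labels):
    """Assumed weighting: add the predicted probability (not 1) to cell [true][pred]."""
    m = {t: {p: 0.0 for p in labels} for t in labels}
    for t, p, c in zip(y_true, y_pred, confidence):
        m[t][p] += c
    return m

def precision_recall_f1(m, positive):
    """Usual precision/recall/F-measure formulas, applied to weighted cell sums."""
    tp = m[positive][positive]
    fp = sum(m[t][positive] for t in m if t != positive)
    fn = sum(m[positive][p] for p in m[positive] if p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data, invented for illustration: two models make identical hard predictions.
labels = ["a", "b"]
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "a"]
sure_when_right = [0.9, 0.6, 0.9, 0.6]  # confident on its correct predictions
sure_when_wrong = [0.6, 0.9, 0.6, 0.9]  # confident on its wrong predictions

p_right, r_right, f_right = precision_recall_f1(
    weighted_confusion(y_true, y_pred, sure_when_right, labels), "a")
p_wrong, r_wrong, f_wrong = precision_recall_f1(
    weighted_confusion(y_true, y_pred, sure_when_wrong, labels), "a")
print(p_right, p_wrong)
```

Both toy models would score the same on an unweighted confusion matrix, since their hard predictions are identical. Under the assumed weighting, the model that is more confident when it is correct scores higher, which is one way to read the abstract's claim about better discrimination.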
Keywords/Search Tags: Data mining, XML, Document clustering, Performance evaluation, Confusion matrix, ROC curve