Research On The XML Document Clustering And Performance Evaluation

Posted on:2016-10-08

Degree:Doctor

Type:Dissertation

Country:China

Candidate:T N Ding

Full Text:PDF

GTID:1228330467998643

Subject:Computer software and theory

Abstract/Summary:

With the development of computer network technology and the growth in the number of Internet users, semi-structured data have come into wide use. The eXtensible Markup Language (XML), proposed by the World Wide Web Consortium (W3C), is a typical semi-structured data format. XML is widely used because of its hierarchical, self-describing, and dynamically variable characteristics. Since Microsoft Office 2007, Microsoft Office has stored documents in the XML-based Office Open XML format, and on the Linux operating system OpenOffice likewise stores office documents in the XML-based OpenDocument format. The W3C has also indicated that, in the next generation of the Web, XML is to replace HTML as the standard exchange format for pages.

Faced with huge amounts of XML document data, how to mine the knowledge that users are interested in from large XML document databases has become a hot research topic in the data mining field. XML document clustering is one of the research problems in XML document data mining. It mainly studies how to group XML documents with similar characteristics into a cluster, and it is mainly used to analyze data sets of XML documents that share similar characteristics.

In this paper, we first study a clustering method for static XML document data. For static XML document data sets, this paper proposes a document clustering method based on a frequent pattern tree structure. First, a tree-structure encoding for XML documents is put forward. Then, using the frequent patterns mined from the XML data as features, the XML document data set is clustered with the cosine similarity measure and agglomerative hierarchical clustering. Because the frequent patterns form a subset of the original XML document data set, measuring similarity over the significant frequent patterns rather than over the full documents significantly reduces the time consumed by similarity computation. Experiments show that the algorithm has good clustering results and good scalability.

Then, this paper studies a clustering method for dynamic XML document data sets. Based on the characteristics of dynamic XML document data sets, a TDOM model with version and time parameters is first proposed to record the change history of XML document data as they change dynamically. Next, the definition of a significant frequent change pattern of XML documents is given, together with a method for mining significant frequent change patterns from the TDOM data. Finally, a clustering method for dynamic XML documents based on significant frequently changing structures is proposed. Experiments show that this algorithm can use the dynamic characteristics of a dynamic XML document data set to accomplish the clustering task, and that it has good clustering results and good scalability.

Performance evaluation is one of the open problems in the data mining and machine learning fields. We note that nearly all existing evaluation measures ignore the predicted probabilities, which are highly significant when evaluating clusters. In this paper, we construct a weighted confusion matrix to reflect the information carried by predicted probabilities. In addition, based on the weighted confusion matrix, traditional evaluation measures such as accuracy, precision, recall, and F-measure are redefined to take predicted probabilities into account. Finally, properties of the redefined evaluation measures are investigated. Experimental results show that the redefined evaluation measures are superior to traditional ones in terms of discrimination.

In this paper, we also propose an ROC-based measure for the performance of clustering models. First, we propose the concept of a weighted correct pair map. Then, based on the weighted correct pair map, we propose a new evaluation measure. The attractive features of the measure are that it is insensitive to imbalanced class distributions and sufficiently discriminating. Experimental results demonstrate that the proposed measure is reliable. The work presented in this paper may stimulate new research in classification model design, such as designing new optimization-based clustering or ranking models.
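The static clustering step named in the abstract (frequent-pattern features, cosine similarity, agglomerative hierarchical clustering) can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the binary feature vectors, the average-linkage rule, the toy data, and the function names `cosine` and `agglomerative` are all assumptions made for illustration, since the abstract does not specify the encoding or the linkage criterion.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agglomerative(vectors, k):
    """Merge the two most similar clusters (average linkage, assumed) until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        # Mean pairwise cosine similarity between members of clusters a and b.
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # combinations yields index pairs (x, y) with x < y, so deleting y is safe.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical data: each row marks which of four frequent patterns occur in a document.
docs = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
print(agglomerative(docs, 2))
```

Because each document is reduced to a short vector over the mined patterns, every similarity computation touches only as many entries as there are patterns, which is one plausible reading of why the abstract reports reduced similarity-measurement time.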

Keywords/Search Tags:

Data mining, XML, Document clustering, Performance evaluation, Confusion matrix, ROC curve

Related items

1	The Application Of Data Mining Technology In The Performance Evaluation Of Higher Vocational College Teachers
2	Design And Implement Of Web Document Clustering System
3	Research Of Clustering Analysis And Its Application In Document Mining
4	Study On Clustering For XML Document Collection
5	Study Of ATR Algorithm Performance Evaluation Method Based On ROC Curve
6	High performance text document clustering
7	The Application Of Multilayer Document Categorization In User's Preference Mining And Processing
8	Research On Web Data Mining Technology Based On XML
9	Web Document Clustering Based On Knowledge Granularity
10	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework