XML Data Mining Based On Frequent Pattern Tree

Posted on: 2010-11-01  Degree: Master  Type: Thesis
Country: China  Candidate: Y S Wu  Full Text: PDF
GTID: 2178360275994227  Subject: Computer software and theory
Abstract/Summary:
Data mining is defined as the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from large volumes of incomplete, noisy, and ambiguous data. It is an efficient way to address the problem of being "data rich, information poor".

XML is a simple, very flexible text format derived from SGML, and it has become a standard for data representation and exchange over the Internet. More and more data are stored in XML format, and a great deal of information and a variety of patterns are hidden in these data. Hence, there is increasing demand for efficient methods that extract latent and valuable information from XML data, namely XML data mining.

However, as a kind of semi-structured data, XML data are large, complex, and heterogeneous, are modeled by trees, and cannot easily be mapped into a relational framework. Traditional data mining methods designed for relational databases, such as Apriori, therefore cannot be applied to XML data directly. Developing efficient and scalable methods for XML data mining is thus an important challenge.

This thesis first introduces the basic theory of traditional data mining, the basic theory of XML, the features of XML documents, and the technical specifications related to XML.

Second, it introduces the concepts related to frequent subtree mining and some of the existing frequent subtree mining algorithms.

Third, building on an analysis of the frequent subtree mining algorithms FREQT and Freqttree, it proposes a novel algorithm, PDOM. The algorithm adopts rightmost expansion and recursively updates the set of candidate nodes to reduce their number, which keeps the number of candidate patterns small. It also computes the support of candidate pattern trees incrementally, which improves the efficiency of the algorithm.
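As an illustration of rightmost expansion in general (not PDOM itself, whose candidate-node pruning and incremental support counting are not specified in this abstract), the following minimal Python sketch enumerates the one-node extensions of an ordered labeled pattern tree. The encoding of a tree as a preorder list of (depth, label) pairs and the function name `rightmost_extensions` are assumptions made for this example.

```python
from typing import List, Tuple

Node = Tuple[int, str]   # (depth, label); the root has depth 0 (assumed encoding)
Pattern = List[Node]     # nodes of an ordered tree listed in preorder


def rightmost_extensions(pattern: Pattern, labels: List[str]) -> List[Pattern]:
    """Return all patterns obtained by adding one node via rightmost expansion.

    The new node becomes the last (rightmost) child of some node on the
    rightmost path, i.e. the path from the root to the last preorder node.
    In the (depth, label) encoding this means appending a node whose depth
    lies between 1 and depth(last node) + 1.
    """
    if not pattern:
        # An empty pattern can only grow into a single-node tree.
        return [[(0, label)] for label in labels]

    last_depth = pattern[-1][0]
    extensions = []
    for depth in range(1, last_depth + 2):
        for label in labels:
            extensions.append(pattern + [(depth, label)])
    return extensions


# Hypothetical usage: extend the two-node pattern a -> b over the labels {a, b}.
candidates = rightmost_extensions([(0, "a"), (1, "b")], ["a", "b"])
```

Because every ordered tree has exactly one preorder sequence, each candidate is generated once. A real miner would restrict `labels` and the attachment depths to those that actually occur at rightmost positions in the data, which is where candidate-set reduction of the kind described above would apply.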
The thesis proves the correctness of PDOM by theorem and analyzes the algorithm experimentally.

Finally, taking into account the tree structure of XML, the thesis proposes BFPC, an algorithm for classifying XML documents based on frequent pattern trees. It makes use of both document content and structure. First, it uses the tf * idf method to extract representative features from the non-structural information, that is, the content of the XML documents. Then it uses the frequent pattern tree algorithm to extract the frequent pattern trees of each class as representatives of that class and assigns a weight to each frequent pattern tree. The thesis also proposes a pattern tree matching algorithm, Pmatch, which implements pattern tree matching by means of rightmost match sets. In the testing phase, BFPC uses Pmatch and keyword matching to compute the scores of the test document and decide which class it belongs to. Experimental results show that BFPC achieves higher accuracy.
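The content-feature step can be sketched as follows. This is a minimal Python example of one common tf * idf variant (length-normalized term frequency multiplied by the logarithm of inverse document frequency); the abstract does not state the exact weighting BFPC uses, so the formula, the function name `tf_idf`, and the sample tokens are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute a tf * idf weight for every term of every tokenized document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical usage: text content extracted from two XML documents.
sample = [["xml", "tree", "tree"], ["xml", "query"]]
sample_weights = tf_idf(sample)
```

In this variant a term that occurs in every document (here "xml") receives weight 0, so only terms that help discriminate between documents remain as candidate keywords for a class.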
Keywords/Search Tags: XML mining, Frequent pattern tree, Pattern tree matching