Clustering XML Documents Based On Density And Fuzzy Set

Posted on: 2013-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: J S Gong
GTID: 2248330371983594
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, people have access to far more information than before. At the same time, they cannot process such large amounts of information, which creates a contradiction between the desire for knowledge and lagging processing technology.

Data mining is a non-trivial process that extracts useful and potentially valuable information and knowledge from large, inconsistent, noisy, and complex datasets. As data mining has developed, it has been combined with technologies from other areas. This combination broadens the applications of data mining and in turn promotes its development. The eXtensible Markup Language (XML) is a standard published by the World Wide Web Consortium (W3C). It is not only a description language but also a standard for data exchange over networks. Research on XML data mining at home and abroad has been fruitful, covering frequent pattern mining, association rule mining, classification, clustering, and so on.

This paper focuses on clustering XML documents and aims to propose a more reasonable clustering method. First, because XML documents are high-dimensional data, frequent patterns are mined and used as the feature set. Then the similarity between two documents is calculated with the SIM algorithm and used as a measure of distance. Finally, the XML documents are clustered with the DFC algorithm on the basis of the similarity matrix. The main work of this paper is as follows:
1) Analyze the advantages and disadvantages of the various types of frequent substructures and their corresponding algorithms, and use frequent embedded subtrees as the feature set of an XML document;
2) Propose a method for calculating the similarity of XML documents that is suitable for clustering them;
3) Propose a clustering algorithm that combines density-based clustering with fuzzy clustering.
This clustering method combines the complementary advantages of soft and hard clustering. The results of this method are more reasonable than those obtained before.

This paper is divided into six parts, as follows:
(1) Chapter one is the introduction. It briefly describes the background and significance of clustering XML documents, reviews the state of research at home and abroad, and gives the main idea and structure of the paper.
(2) Chapter two covers the basics of XML. It presents XML technology, including the characteristics of XML and the elements of an XML document, and introduces the XML document tree.
(3) Chapter three covers frequent pattern mining, which comprises three kinds: frequent itemset mining, frequent sequence mining, and frequent subtree mining. It describes the classical mining algorithms. Subtrees are divided into bottom-up subtrees, induced subtrees, and embedded subtrees.
(4) Chapter four is about XML similarity. It first introduces data preprocessing and then gives a new similarity method based on the structural properties of XML documents, the SIM algorithm.
(5) Chapter five covers XML document clustering. It proposes a new method for clustering XML documents, the DFC algorithm, which combines soft clustering and hard clustering.
(6) Chapter six gives the conclusion and future prospects. It sums up the work of the paper and sets out its problems and deficiencies. I will continue this research in the future.

Owing to the limits of my ability, this paper has many deficiencies that demand further research. First, it considers only static XML document mining and ignores the dynamic characteristics of XML documents, for example frequently changing structure, frozen structure, and so on. Second, if the different weights of the features were fully considered, the clustering results would be more reasonable. Finally, the clustering results themselves need to be improved. I will pay more attention to the above aspects and improve the research in this paper.
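
The pipeline described in the abstract (feature sets, a similarity matrix, then a combined hard and soft clustering step) can be illustrated with a small sketch. The abstract does not define the SIM or DFC algorithms, so this is not the author's method: Jaccard similarity stands in for SIM, a DBSCAN-like grouping of dense "core" documents stands in for the hard step, and normalized average similarity stands in for the fuzzy memberships. All function names, the path-like feature strings, and the thresholds `eps` and `min_pts` are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Stand-in similarity between two feature sets (NOT the thesis's SIM)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(docs):
    """Symmetric matrix of pairwise similarities; the diagonal is 1.0."""
    n = len(docs)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = jaccard(docs[i], docs[j])
    return sim


def core_clusters(sim, eps, min_pts):
    """Hard step (assumed, DBSCAN-like): group density-connected core documents.

    A document is a core document if at least min_pts documents
    (itself included) have similarity >= eps to it.
    """
    n = len(sim)
    is_core = [sum(s >= eps for s in row) >= min_pts for row in sim]
    clusters, seen = [], set()
    for start in range(n):
        if not is_core[start] or start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if is_core[j] and j not in seen and sim[i][j] >= eps:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


def fuzzy_memberships(sim, clusters):
    """Soft step (assumed): membership of every document in every cluster.

    Each score is the mean similarity to a cluster's core members;
    scores are normalized so that each row sums to 1.
    """
    if not clusters:
        return [[] for _ in sim]
    result = []
    for row in sim:
        scores = [sum(row[j] for j in c) / len(c) for c in clusters]
        total = sum(scores)
        if total == 0:
            result.append([1.0 / len(clusters)] * len(clusters))
        else:
            result.append([s / total for s in scores])
    return result


# Hypothetical feature sets; in the thesis these would be frequent embedded subtrees.
docs = [
    {"book/title", "book/author", "book/price"},
    {"book/title", "book/author", "book/year"},
    {"book/title", "book/author", "book/price", "book/year"},
    {"order/item", "order/customer", "order/total"},
    {"order/item", "order/customer", "order/date"},
    {"order/item", "book/title"},
]
sim = similarity_matrix(docs)
clusters = core_clusters(sim, eps=0.5, min_pts=2)
memberships = fuzzy_memberships(sim, clusters)
```

In this toy input the last document shares one feature with each group, so the hard step leaves it out of both clusters while the soft step gives it a partial membership in each. That is one way to read the abstract's claim that soft and hard clustering complement each other.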
Keywords/Search Tags: data mining, clustering, XML documents, frequent subtree, similarity, density, fuzzy set