A Novel Clustering Method For Dynamic XML Documents

Posted on: 2013-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhang
Full Text: PDF
GTID: 2248330395959376
Subject: Computer software and theory
Abstract/Summary:
The rapid development of information technology has led to explosive growth of data on the Web, and how to obtain the desired knowledge from this huge volume of data accurately and efficiently has become a popular research topic. Web mining is the process of extracting potential, valuable knowledge or patterns from Web information; its main techniques, such as classification, clustering, and feature selection, have developed considerably. Cluster analysis occupies an important position in Web mining. Clustering partitions a set of objects into several classes according to a given similarity measure and given criteria, so that objects within the same class are as similar as possible and objects in different classes are as dissimilar as possible. Clustering problems arise in many domains, in practical applications as well as theoretical research, and across disciplines such as mathematics, statistics, biology, and computer science. As a preprocessing stage of Web mining, clustering can group the data in order to improve the efficiency and precision of mining.
Most Web pages exist as HTML text, but as Web data has become more diverse and complex, HTML documents can no longer meet the requirements of information processing and information exchange. XML is a standard put forward by the W3C; because of its flexibility, openness, and self-descriptive nature, it has gradually become the mainstream Web data format and exchange standard. Research on XML documents requires many related concepts and structural definitions in order to obtain suitable representations and calculation methods.
Because the structure that carries a document's knowledge often does not change over the course of the document's history, this thesis proposes a representation called the frozen structure. An XML document is represented by a vector model over a set of frozen structures, and the weighted Jaccard similarity coefficient is used to cluster XML documents on the basis of the frozen structures that stay relatively stable through their historical changes. Experiments show that XML documents can be clustered on the basis of frozen structures, and that the documents in each resulting cluster share similar structures that seldom change. Analysing the clusters enables deeper study of the XML documents, which can in turn be applied, from another angle, to practical work and exploration in various domains.
First, this thesis summarizes and analyses the relevant XML standards and points out the shortcomings of the clustering algorithms commonly used in document clustering. Based on the advantages and disadvantages of these algorithms, one or several are chosen as the clustering scheme. The thesis then focuses on the key problem in XML document clustering: the document similarity measure. It analyses the classic edit distance method and the edge-set-based XML document similarity measure, and, building on the vector space model, proposes an XML document vector model that combines tags and paths. Vector features are weighted according to their level in the document tree, so the model can express the nesting semantics of XML elements. In a worked example of document similarity calculation, the method is compared with the edit distance method and the edge-set-based measure, and the results show that it is better at distinguishing documents that are hard to tell apart.
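As a rough illustration of a tag-and-path vector model with level-based weights and of the weighted Jaccard coefficient, the following Python sketch builds a path vector for each document and compares two documents. The weighting rule (1/depth per occurrence), the function names `path_vector` and `weighted_jaccard`, and the sample documents are assumptions made for illustration; the abstract does not specify them.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def path_vector(xml_text):
    """Map each root-to-element tag path to a level-based weight.

    Assumption: a path at depth d contributes 1/d per occurrence, so
    elements near the root count more. The abstract does not give the
    actual weighting scheme.
    """
    vector = defaultdict(float)

    def walk(element, prefix, depth):
        path = f"{prefix}/{element.tag}"
        vector[path] += 1.0 / depth
        for child in element:
            walk(child, path, depth + 1)

    walk(ET.fromstring(xml_text), "", 1)
    return dict(vector)


def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of per-feature minima over sum of maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Two empty vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


# Hypothetical documents that differ only in one deeply nested element.
doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><author><email/></author></book>"
similarity = weighted_jaccard(path_vector(doc_a), path_vector(doc_b))
```

Because the two sample documents differ only at the deepest level, where the assumed weights are smallest, their similarity stays high; a difference closer to the root would lower it more.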
The definition of the frozen structure is then given, together with the main problems it raises and a method for measuring frozen structures. Finally, tests on a real document set and a synthetic document set achieve the expected results.
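The abstract does not reproduce the formal definition of a frozen structure. One simple reading, assumed here purely for illustration, is a tag path that is present in every historical version of a document. Under that assumption, the frozen part of a document can be sketched in Python as the intersection of the path sets of its versions; the names `tag_paths` and `frozen_paths` and the sample versions are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Collect the set of root-to-element tag paths in one document."""
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def frozen_paths(versions):
    """Return the paths present in every version.

    Assumption: "frozen" means "never removed across the history";
    the thesis may use a different or richer definition.
    """
    path_sets = [tag_paths(v) for v in versions]
    return set.intersection(*path_sets) if path_sets else set()


# Hypothetical history of one document: <price> and <review> come and go.
versions = [
    "<book><title/><price/></book>",
    "<book><title/><price/><review/></book>",
    "<book><title/><review/></book>",
]
frozen = frozen_paths(versions)
```

The resulting set of stable paths could then serve as the feature set over which documents are compared.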
Keywords/Search Tags: XML, clustering, VSM, ensemble learning, FS