Research And Application Of XML Document Classification Method

Posted on: 2010-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Tian
Full Text: PDF
GTID: 2178360302460803
Subject: Computer application technology
Abstract/Summary:
XML (Extensible Markup Language), a common standard for data exchange and transmission, carries rich information, and data mining on XML has become a new research focus in web mining. This thesis focuses on methods for classifying XML documents. XML documents have structural characteristics that plain text documents lack, so most techniques and algorithms used in text mining are not well suited to XML mining. This thesis therefore concentrates on the structural characteristics of XML documents. First, a model called the Frequency-Path model is proposed to represent XML documents. The model preserves the labels of the corresponding nodes and also records the frequency of identical paths, so it considerably reduces the size of the tree path model without losing meaningful information. Second, based on the Frequency-Path model, a similarity calculation method called WLCS (Weighted Longest Common Subsequence) is proposed. The longest common subsequence method is used to match paths, and position weight vectors, which take node positions into account, are used to calculate path similarity. Experimental results on a real data set show better recall and accuracy than existing methods. Third, a new method for vectorizing the structure of XML documents is proposed, also based on the Frequency-Path model. During vectorization, an improved IG algorithm based on path frequency is used in combination with WLCS. Finally, the XML document classification method is applied in a full-text search system for the Dalian Public Security Bureau.
Keywords/Search Tags: XML Document Classification, Structure Similarity, Position Weight Vector
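To make the first two contributions concrete, here is a minimal Python sketch of the general idea. It is not the thesis's implementation: the abstract does not give the exact definitions, so the helper names (`frequency_paths`, `position_weight`, `path_similarity`), the choice of root-to-leaf paths, and the 1/(1+position) weighting are all assumptions made for illustration. The sketch collapses repeated tag paths into counts, then scores two paths with a longest common subsequence in which matches nearer the root count more.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def frequency_paths(xml_text):
    """Map each root-to-leaf tag path to how often it occurs.

    Assumption: a "path" is the tuple of tags from the root to a leaf.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:          # leaf: record one occurrence of this path
            counts[path] += 1
        for child in children:
            walk(child, path)

    walk(root, ())
    return counts


def position_weight(i, j):
    """Hypothetical position weight: matches closer to the root count more."""
    return 1.0 / (1 + max(i, j))


def path_similarity(p, q):
    """Weighted LCS score of two tag paths, normalized to [0, 1]."""
    if not p or not q:
        return 0.0
    # dp[i][j] = best weighted score using p[:i] and q[:j]
    dp = [[0.0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if p[i - 1] == q[j - 1]:
                matched = dp[i - 1][j - 1] + position_weight(i - 1, j - 1)
                best = max(best, matched)
            dp[i][j] = best
    # Largest score any alignment of paths this long could reach
    norm = sum(position_weight(k, k) for k in range(max(len(p), len(q))))
    return dp[len(p)][len(q)] / norm


doc = "<lib><book><title/><author/></book><book><title/><author/></book></lib>"
paths = frequency_paths(doc)
for path, freq in paths.items():
    print("/".join(path), freq)
print(path_similarity(("lib", "book", "title"), ("lib", "book", "author")))
```

In this sketch the two `book` subtrees collapse to two distinct paths, each with frequency 2, which is the kind of size reduction the abstract attributes to the model. Two paths that share a prefix near the root score higher than two that agree only near the leaves; whether the thesis weights positions in this direction is not stated in the abstract.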