
Research On Web Data Mining Technology Based On XML

Posted on: 2009-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: D W Xue
Full Text: PDF
GTID: 2178360272479442
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Internet has become an important resource for transmitting and sharing information. Because Web data is semi-structured, heterogeneous, and massive, traditional data mining techniques cannot be applied directly to Web data sources. Web data mining means extracting potentially useful models from Web documents or Web activities. Because XML is structured and extensible, research that combines XML with Web data mining has become popular.

In this thesis, Web data extraction is studied first, and a Web data extraction method based on an expanded DOM tree is proposed. Visual and link features are added to the nodes of the DOM tree so that the repeat degree and novelty degree of each node can be calculated. By calculating the novelty degree of subtrees across similar pages, the object data can be identified and extracted automatically. The extracted results are saved as XML documents.

Second, XML document clustering is studied. The tree structure of an XML document is transformed into a level structure, and elements at different levels are assigned different values. The clustering algorithm computes the level similarity between an XML document and each existing cluster, then assigns the document to the cluster with the maximum level similarity. The level structure preserves the relationship between levels while simplifying the document's tree structure, which reduces the time cost of the similarity calculation.

Finally, the algorithms are verified by experiments, the results are analyzed, and the advantages and weaknesses of the algorithms are discussed.
Keywords/Search Tags: Web data mining, Web data extraction, XML document clustering, expanded DOM tree, level similarity
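
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything specific here is an assumption made for illustration: the per-level value is taken to be `decay ** depth`, the level similarity is taken to be a weighted Jaccard overlap of tag names at each depth, and a hypothetical `threshold` decides when a document starts a new cluster instead of joining the most similar one. The function names `level_structure`, `level_similarity`, and `cluster` are likewise invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def level_structure(xml_text):
    """Flatten an XML tree into a list of tag-name sets, one set per depth."""
    root = ET.fromstring(xml_text)
    levels = defaultdict(set)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        levels[depth].add(node.tag)
        stack.extend((child, depth + 1) for child in node)
    return [levels[d] for d in range(len(levels))]


def _level_at(levels, depth):
    """Return the tag set at a depth, or an empty set past the last level."""
    return levels[depth] if depth < len(levels) else set()


def level_similarity(a, b, decay=0.5):
    """Weighted per-level Jaccard overlap (assumed formula, not the thesis's).

    Levels closer to the root get larger weights: weight = decay ** depth.
    """
    score = total = 0.0
    for depth in range(max(len(a), len(b))):
        weight = decay ** depth
        la, lb = _level_at(a, depth), _level_at(b, depth)
        union = la | lb
        if union:
            score += weight * len(la & lb) / len(union)
        total += weight
    return score / total if total else 0.0


def cluster(docs, threshold=0.6):
    """Assign each document to the existing cluster with maximum similarity.

    A new cluster is opened when no cluster reaches `threshold` (assumption).
    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"levels": [...], "members": [...]}
    for index, xml_text in enumerate(docs):
        levels = level_structure(xml_text)
        best, best_sim = None, 0.0
        for candidate in clusters:
            sim = level_similarity(levels, candidate["levels"])
            if sim > best_sim:
                best, best_sim = candidate, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(index)
            # Represent the cluster by the union of its members' levels.
            depth = max(len(levels), len(best["levels"]))
            best["levels"] = [
                _level_at(levels, d) | _level_at(best["levels"], d)
                for d in range(depth)
            ]
        else:
            clusters.append({"levels": levels, "members": [index]})
    return [c["members"] for c in clusters]
```

Because each document is reduced to one set of tag names per depth, comparing a document with a cluster costs time proportional to the number of distinct tags rather than requiring a full tree-to-tree comparison, which is the kind of saving the abstract attributes to the level structure.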