Research On Related Problems Of Web Mining

Posted on: 2010-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: L L Kuang
Full Text: PDF
GTID: 2178360278458942
Subject: Computer application technology
Abstract/Summary:
With the growth of the Internet, an "information explosion" has become inevitable. People urgently need techniques that can acquire knowledge from the Web quickly and effectively; Web mining arose against this background and has become a research hotspot. Unlike traditional data mining, Web mining deals with semi-structured and unstructured data, which makes mining difficult. Recently, XML has gradually become a standard for organizing and exchanging data on the new generation of the Internet. XML can combine structured data from different sources, which effectively addresses this difficulty. How to mine valuable information from XML data has therefore become an exploratory and challenging research subject.

The main contributions of the thesis are as follows:

First, the fundamental concepts and properties of Web mining and XML are discussed, and the complexity of Web mining and the application of XML in Web mining are analyzed.

Second, HTML-to-XML transformation in Web mining is studied. The shortcomings of existing transformation algorithms are analyzed, and an HTML-XML transformation model based on DOM and JTidy is designed and implemented. Tests show that the new transformation model is feasible and highly adaptable.

Finally, the fundamental concepts and properties of frequent subtree mining are studied, with emphasis on Tree Growth (TG), an embedded subtree mining algorithm based on the pattern growth principle. The shortcomings of the TG algorithm are analyzed, and an improved algorithm is proposed to address them. Building on the improved algorithm and introducing the idea of partitioning, a new algorithm named Partition Tree Growth (PTG) is put forward. Theoretical analysis and simulation tests show that the PTG algorithm can handle the memory problem that arises when mining large datasets and works effectively.
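The abstract does not describe how PTG applies partitioning, so the following is only a minimal sketch of one common partition scheme, not the thesis's algorithm. It assumes a two-phase approach: mine each partition separately to collect locally frequent patterns, then make one more pass over the full dataset to count only those candidates. To stay short, it restricts patterns to the smallest embedded subtrees, namely ancestor-descendant label pairs. The tree representation and all function names are hypothetical.

```python
from collections import Counter
from math import ceil

# Hypothetical tree representation: (label, [child, child, ...]).

def embedded_pairs(tree):
    """Return the set of (ancestor_label, descendant_label) pairs in one tree."""
    pairs = set()

    def walk(node, ancestors):
        label, children = node
        for ancestor in ancestors:
            pairs.add((ancestor, label))
        for child in children:
            walk(child, ancestors + [label])

    walk(tree, [])
    return pairs

def count_support(trees, candidates=None):
    """Count, per pair, how many trees contain it (each tree counts once)."""
    counts = Counter()
    for tree in trees:
        pairs = embedded_pairs(tree)
        if candidates is not None:
            pairs &= candidates  # only count the surviving candidates
        counts.update(pairs)
    return counts

def partitioned_frequent_pairs(trees, min_support, partition_size):
    """Two-phase partition mining; min_support is a fraction in (0, 1]."""
    # Phase 1: only one partition needs to be held in memory at a time.
    candidates = set()
    for start in range(0, len(trees), partition_size):
        part = trees[start:start + partition_size]
        local_min = ceil(min_support * len(part))
        local = count_support(part)
        candidates |= {p for p, c in local.items() if c >= local_min}

    # Phase 2: one pass over all trees, counting the candidates only.
    global_min = ceil(min_support * len(trees))
    total = count_support(trees, candidates)
    return {p: c for p, c in total.items() if c >= global_min}

trees = [
    ("html", [("body", [("p", [])])]),
    ("html", [("body", [("div", [("p", [])])])]),
    ("html", [("head", [])]),
    ("html", [("body", [("p", [])])]),
]
frequent = partitioned_frequent_pairs(trees, min_support=0.75, partition_size=2)
```

The scheme relies on the fact that a pattern frequent in the whole dataset must be frequent in at least one partition, so the candidate set from phase 1 cannot miss a globally frequent pattern. Note that in the second tree the pair ("body", "p") is an ancestor-descendant relation rather than a parent-child one, which is what distinguishes embedded from induced subtrees.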
Keywords/Search Tags: Web mining, HTML-XML transformation, Tree Growth algorithm, Frequent subtree, Pattern growth, PTG (Partition Tree Growth) algorithm