
Studies On Dimension Reduction And Classification Methods For Web Mining

Posted on: 2006-07-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Sun
Full Text: PDF
GTID: 1118360182983600
Subject: Computer application technology
Abstract/Summary:
The purpose of Web mining research is to extract novel knowledge from huge amounts of Web data and to build highly effective Web mining systems. The basic approach is to apply data mining methods to data collected from the World Wide Web (WWW). This thesis focuses on dimension reduction and classification methods for Web mining tasks such as Web-page classification, Web-page summarization, and personalized Web search. Several Web mining algorithms are put forward in this work. The main contributions include:

1) A novel dimension reduction method, Supervised Latent Semantic Indexing (SLSI), was proposed to represent documents for text classification tasks. Compared with the traditional LSI approach, SLSI not only captures the semantic concepts in a document collection but also exploits the discriminative information among different categories. It was shown that SLSI leads to drastic dimension reduction while maintaining classification accuracy.

2) A 3-order dimension reduction model, CubeSVD, was proposed for mining the click-through data collected on search engine servers. Click-through data is usually very sparse and contains objects of multiple types, among which complicated relations may exist. The proposed CubeSVD approach is based on the Higher-Order Singular Value Decomposition (HOSVD) technique. Experimental results indicated that CubeSVD was able to capture the hidden relations among these objects, which were then used to improve personalized Web search.

3) A novel Web-page summarization algorithm, ALSA, was proposed based on dimension reduction techniques. The main idea of ALSA is to extract human knowledge on query usage from the click-through data. Furthermore, using the click-through data and manually annotated Web pages, a thematic lexicon construction method was put forward for Web-page summarization.

4) A composite kernel optimization method, GECKO, was proposed for Web-page classification tasks. Since Web pages commonly contain heterogeneous features, a composite kernel combination method was used to leverage these features. In this work, the kernel combination problem was optimized by solving a generalized eigenvalue problem, and the optimized kernel matrix was then used to train a classification model. It was shown that the GECKO algorithm has good generalization performance.

5) Implicit links constructed from the click-through data were studied for Web-page classification tasks. When a user uses a search engine, he or she may click on several pages after submitting a query, so implicit links between those Web pages can be constructed. In this work, both implicit link construction methods and virtual document building approaches based on implicit links were proposed. Furthermore, two kinds of classification algorithms were used to compare the implicit links and the hyperlinks defined in this work. Experimental results indicated that implicit links could be used to improve Web-page classification.

Several of the proposed algorithms were used to help develop a Web mining prototype system, WebME (Web Mining Environment), which was one part of the national 973 project hosted by the data mining group. The work in this thesis was also validated to be quite useful in real applications.
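To make the HOSVD technique named in contribution 2 more concrete, the following is a minimal sketch of a plain truncated HOSVD applied to a 3-order tensor. It is not the thesis's CubeSVD procedure, which this abstract does not spell out. The (user, query, page) axis order, the function names, the chosen ranks, and the toy click counts are all assumptions made for illustration only.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricize a tensor: rows index `mode`, columns index all other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """n-mode product: multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)


def truncated_hosvd(tensor, ranks):
    """Return (core, factors) for a truncated HOSVD with per-mode ranks."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding span that mode.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Map the core tensor back to the original space."""
    approx = core
    for mode, u in enumerate(factors):
        approx = mode_product(approx, u, mode)
    return approx


# Hypothetical toy data: clicks[user, query, page] = number of clicks.
clicks = np.zeros((3, 2, 4))
clicks[0, 0, 0] = 1
clicks[1, 0, 0] = 1
clicks[1, 0, 1] = 1
clicks[2, 1, 3] = 1

core, factors = truncated_hosvd(clicks, ranks=(2, 2, 2))
smoothed = reconstruct(core, factors)  # dense low-rank approximation
```

In a sketch like this, entries of `smoothed` that were zero in `clicks` can become nonzero, which is one way a low-rank tensor approximation can surface associations that were never observed directly in sparse data.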
Keywords/Search Tags: Web Mining, Web-Page Classification, Dimension Reduction, World Wide Web, Click-Through Data Mining