Studies On Dimension Reduction And Classification Methods For Web Mining

Posted on:2006-07-01

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J T Sun

GTID:1118360182983600

Subject:Computer application technology

Abstract/Summary:

The purpose of Web mining research is to extract novel knowledge from huge amounts of Web data and to build highly effective Web mining systems. The basic approach is to apply data mining methods to data collected from the World Wide Web (WWW). This thesis focuses on dimension reduction and classification methods for Web mining tasks such as Web-page classification, Web-page summarization and personalized Web search. Several Web mining algorithms were put forward in this work. The main contributions include:

1) A novel dimension reduction method, Supervised Latent Semantic Indexing (SLSI), was proposed to represent documents for text classification tasks. Compared with the traditional LSI approach, SLSI has the advantage of capturing the semantic concepts in document collections while also utilizing the discriminative information among different categories. It was shown that the SLSI approach led to drastic dimension reduction while maintaining classification accuracy.

2) A third-order dimension reduction model, CubeSVD, was proposed for mining the click-through data collected on search engine servers. Click-through data is usually very sparse and contains multiple types of objects, among which complicated relations may exist. The proposed CubeSVD approach is based on the Higher-Order Singular Value Decomposition (HOSVD) technique. Experimental results indicated that CubeSVD was able to capture the hidden relations among these objects, which were then used to improve personalized Web search.

3) A novel Web-page summarization algorithm, ALSA, was proposed based on dimension reduction techniques. The main idea of ALSA was to extract human knowledge on query usage from the click-through data. Furthermore, using the click-through data and manually annotated Web pages, a thematic lexicon construction method was put forward for Web-page summarization.

4) A composite kernel optimization method, GECKO, was proposed for Web-page classification tasks. Since Web pages commonly contain heterogeneous features, a composite kernel combination method was used to leverage these features. In this work, the kernel combination problem was optimized by solving a generalized eigenvalue problem. The optimized kernel matrix was then used to train a classification model. It was shown that the GECKO algorithm has good generalization performance.

5) Implicit links constructed from the click-through data were studied for Web-page classification tasks. When a user uses a search engine, he or she may click on several pages after submitting a query. Implicit links between Web pages can thus be constructed. In this work, both implicit link construction methods and virtual document building approaches based on implicit links were proposed. Furthermore, two kinds of classification algorithms were used to compare the implicit links and the hyperlinks defined in this work. Experimental results indicated that implicit links could be used to improve Web-page classification.

Several of the proposed algorithms were used to help develop a Web mining prototype system, WebME (Web Mining Environment), which was one part of the national 973 project hosted by the data mining group. The work in this thesis was also validated as useful in real applications.
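The idea behind implicit links in contribution 5 can be illustrated with a minimal sketch. The abstract does not specify how the thesis actually constructs the links, so the rule below is an assumption made only for illustration: two pages are linked when the same user clicks both after the same query, and the link weight is the number of such co-clicks. The function name `build_implicit_links`, the `(user, query, page)` record layout and the sample log are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_implicit_links(click_log):
    """Count co-clicks between pages (assumed rule, not the thesis's method).

    click_log: iterable of (user, query, page) tuples.
    Returns a Counter mapping an unordered page pair (sorted tuple)
    to the number of (user, query) sessions in which both were clicked.
    """
    # Group clicked pages by (user, query); a set ignores repeat clicks.
    sessions = {}
    for user, query, page in click_log:
        sessions.setdefault((user, query), set()).add(page)

    # Every pair of pages clicked in the same session gets one co-click.
    links = Counter()
    for pages in sessions.values():
        for a, b in combinations(sorted(pages), 2):
            links[(a, b)] += 1
    return links


# Hypothetical click-through records, for illustration only.
click_log = [
    ("u1", "jaguar", "cars.example/jaguar"),
    ("u1", "jaguar", "zoo.example/jaguar"),
    ("u2", "jaguar", "cars.example/jaguar"),
    ("u2", "jaguar", "dealer.example/xk"),
    ("u3", "sports car", "cars.example/jaguar"),
    ("u3", "sports car", "dealer.example/xk"),
]
links = build_implicit_links(click_log)
```

Under this assumed rule, pages that are repeatedly clicked together receive heavier links than pages that co-occur once, which is one plausible way such links could feed a classifier alongside ordinary hyperlinks.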

Keywords/Search Tags:

Web Mining, Web-Page Classification, Dimension Reduction, World Wide Web, Click-Through Data Mining

Related items

1	Data Mining Research In Web Information Retrieval And Classification
2	Classification Algorithm Of Data Mining
3	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
4	Application Of Data Mining In Intrusion Detection System
5	Based On Web Content Mining, Web Page Classification And Filtering Research And Applications
6	Research And Application On The Decision Tree Classification Algorithm Of Data Mining
7	Chinese Web Page Classification Based On Web Page Features
8	The Research Of The Clustering Mining Based On The Web Usage Data Preprocess
9	Dimension reduction algorithms in data mining, with applications
10	Graph based click-stream mining for categorizing browsing activity in the World Wide Web