
Web Page Classification Oriented To Web Personalization System

Posted on: 2008-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: K F Yuan
Full Text: PDF
GTID: 2178360215490587
Subject: Computer software and theory
Abstract/Summary:
As a globally distributed information service, the World Wide Web collects a massive amount of information, most of which is stored in text-based pages that are vast in number, miscellaneous, and unorganized. Working from these pages, traditional Web service systems such as search engines can hardly provide satisfactory service to every user. Web personalization techniques were developed to solve this problem: they improve user satisfaction by providing customized services that match each user's interests. On the WWW, personalization systems based on page content have great application potential.

Automatic webpage classification is an important research area in data mining and a significant application of natural language processing. As a key technology of high practical value, webpage classification, which automatically labels web pages by topic, is one of the important foundations of web information retrieval and web personalization services. This dissertation studies technologies related to automatic webpage classification in web personalization service systems. The main research content and results are as follows.

First, after analyzing traditional feature extraction methods, the dissertation presents a composite webpage feature weighting method. The method is based on HTML tag analysis and integrates Chinese word-length features for Chinese web environments, which improves the accuracy of page description and provides the basis for a good classification result.

Second, for large training corpora, the dissertation proposes an improved classification algorithm named Cluster-Tree Support Vector Machine (CT-SVM). The algorithm reduces the corpus through hierarchical clustering, and thus saves SVM training time on large corpora while maintaining a certain level of classification quality.

Third, the dissertation applies a text semantic similarity algorithm to the construction of the SVM kernel function. This algorithm, called SHM, uses How-Net word semantic calculation to compute the similarity of feature words, and uses maximum-weight matching on the bipartite graph of two documents to compute the similarity of pages. Because they integrate text semantic information, SVM classifiers with the SHM kernel perform better than those with common kernel functions.

Finally, the dissertation reports validation experiments on the methods above. The classification results show that the methods are effective on a real webpage corpus downloaded from the Internet. The methods proposed in this dissertation have theoretical and practical value in areas such as web personalization services, knowledge extraction, news distribution, mail filtering, and information supervision.
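
The page-similarity step described for SHM can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the word-similarity table `TOY_WORD_SIM` is a hypothetical stand-in for How-Net word similarity, the names `word_sim` and `page_similarity` are invented for illustration, the normalization by the longer word list is an assumption, and the matching is found by brute force, which is only practical for very short word lists. The abstract does not state how the score is turned into an SVM kernel, so that step is left out.

```python
from itertools import permutations

# Hypothetical word-to-word similarities in [0, 1]. A real system would
# compute these from How-Net; this toy table is only an assumption.
TOY_WORD_SIM = {
    frozenset(("football", "soccer")): 0.9,
    frozenset(("match", "game")): 0.8,
    frozenset(("football", "game")): 0.4,
    frozenset(("match", "soccer")): 0.3,
}


def word_sim(a, b):
    """Similarity of two feature words: 1.0 if identical, else a table lookup."""
    if a == b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((a, b)), 0.0)


def page_similarity(words_a, words_b):
    """Score two pages by a maximum-weight matching of their feature words.

    Each word of the shorter list is paired with a distinct word of the
    longer list so that the summed word similarity is maximal. All
    pairings are enumerated (brute force). The total is divided by the
    length of the longer list (an assumed normalization) so the score
    stays in [0, 1].
    """
    if not words_a or not words_b:
        return 0.0
    short, long_ = sorted((words_a, words_b), key=len)
    best = max(
        sum(word_sim(s, l) for s, l in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best / len(long_)


if __name__ == "__main__":
    page_1 = ["football", "match"]
    page_2 = ["soccer", "game"]
    print(page_similarity(page_1, page_2))
```

For realistic vocabulary sizes the brute-force search would need to be replaced by a polynomial-time assignment algorithm such as the Hungarian method; the scoring idea stays the same.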
Keywords/Search Tags: Automatic Webpage Classification, Web Personalization, Support Vector Machine (SVM), How-Net Semantic Analysis, Hierarchical Cluster Tree