Research And Implementation On A Web Page Classification System

Posted on: 2014-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
GTID: 2248330398472114
Subject: Computer technology
Abstract/Summary:
With the rapid spread of the Internet, billions of websites and web pages give users convenient access to a vast body of information resources, covering practically every topic people can imagine. Against this background, classifying web pages by topic, building a corresponding URL classification database, and generating classified information resources have become a significant research problem. On the one hand, classification can be used to filter out harmful web pages, which helps clean up the network environment, or to implement web access control in accordance with security policies and users' intentions. On the other hand, it can provide users with a classified information directory and support web page classification management and Internet information recommendation, giving users faster and more efficient query results and thereby improving information access and the quality of information processing. Because the main body of a web page is textual description, the most popular web page classification techniques today take text classification as their research direction: they design a suitable representation of the web page and then apply a text classification algorithm to it. According to reports, all overseas professional security equipment providers (e.g. McAfee, Blue Coat, Websense) currently have their own web platforms for real-time online queries of classification results. In contrast, domestic security equipment providers have not yet paid attention to providing a web platform for real-time online queries against the classification results databases of their own "green network" businesses.
At the same time, to showcase the "green network" classification database while providers promote their "green network" business, it becomes necessary to offer users a query system built on a web page classification results platform. Driven by the project requirements of a security equipment provider, and aiming to match the service quality standards of overseas professional network security equipment providers, this study investigates and preliminarily implements a web interface for a web page classification system. The main achievement of this study is the design and implementation of a web page classification system based on the B/S (browser/server) architecture. The system builds on an SVM classifier, which gives good classification results, and uses the LAMP (Linux + Apache + MySQL + PHP) web development platform. The study completed the following main objectives:
1. It analyzes the background of the web page classification project, the current state of research, and the direction of this research.
2. It systematically studies and analyzes the key technologies and related theory in the web page classification process, including text categorization preprocessing techniques such as web page preprocessing (web page denoising), word segmentation, feature selection, and text representation methods.
3. It carries out a requirements analysis for the web page classification system, covering functional requirements, performance requirements, and interface style.
4. It analyzes and selects the design and implementation schemes for the web page classification system. It also studies the advantages of Nutch and chooses an implementation scheme for the data acquisition module based on the Nutch crawler. It studies the advantages of SVM and designs a solution for the classification module based on LIBSVM.
Following the SVM classification process, the design of the implementation schemes for the system's main functional modules is completed. The Nutch crawler collects web data of a certain scale; after formatting, these data are classified by a web page classifier obtained by training LIBSVM on manually labeled web pages. The resulting classification results database is then integrated into a web platform developed on the LAMP architecture, completing the overall implementation of the web page classification system. Finally, the paper summarizes its work and discusses future prospects.
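The "formatting" step between crawling and LIBSVM classification can be illustrated with a minimal sketch. It assumes pages have already been denoised and segmented into tokens, and it uses TF-IDF weighting, which is an assumption made here for illustration; the abstract does not state which text representation the thesis adopts. The function names and the toy labels are hypothetical. The output follows the standard LIBSVM sparse text format, `<label> <index>:<value> ...`, with 1-based feature indices in ascending order.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct term to a 1-based feature index (LIBSVM indices start at 1)."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(terms, start=1)}


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency: log(N / df) + 1 for every vocabulary term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / doc_freq[term]) + 1.0 for term in vocab}


def to_libsvm_line(label, tokens, vocab, idf):
    """Encode one tokenized page as a LIBSVM line with TF-IDF feature values."""
    counts = Counter(t for t in tokens if t in vocab)  # ignore out-of-vocabulary terms
    total = sum(counts.values())
    # Sort by feature index, since LIBSVM expects ascending indices.
    features = sorted((vocab[t], (c / total) * idf[t]) for t, c in counts.items())
    return " ".join([str(label)] + [f"{i}:{v:.6f}" for i, v in features])


# Toy, hand-labeled "pages" (hypothetical data): label 1 = sports, label 2 = finance.
pages = [
    ["news", "sports", "football"],
    ["news", "finance", "stock", "stock"],
]
labels = [1, 2]

vocab = build_vocabulary(pages)
idf = idf_weights(pages, vocab)
lines = [to_libsvm_line(y, doc, vocab, idf) for y, doc in zip(labels, pages)]
print("\n".join(lines))
```

Lines produced this way could be written to a training file for LIBSVM's command-line tools; in a real system the vocabulary would first be pruned by a feature selection step, which this sketch omits.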
Keywords/Search Tags: web page classification, web page purification, feature selection, text classification, support vector machine (SVM)