Research on Web Classification Based on URL Features

Posted on: 2012-12-15; Degree: Master; Type: Thesis
Country: China; Candidate: X Li
GTID: 2218330338463484; Subject: Computer software and theory
Abstract/Summary:
The internet provides a great deal of resources and information; however, they are scattered and difficult to manage because of their wide distribution and highly dynamic nature. Web page classification can effectively address these problems. Feature selection is one of the most important steps in the classification process. Traditional features are chosen from the page text, anchor text, title text, and so on, which is time-consuming to process. Feature redundancy and high feature dimensionality are also common problems in web page classification. How to identify a page's category quickly while improving classification accuracy and reducing feature dimensionality has therefore become a problem that needs to be resolved.

This thesis systematically analyzes the background, current state of development, and research significance of web page classification, studies its key technologies, and, building on existing research results, makes the following main contributions:

The URL is the unique identifier of a web page, and classifying pages directly from URL features saves the time otherwise spent processing page text. The thesis analyzes the structure of URLs and proposes extracting features from them with n-grams: n-gram segmentation yields a series of URL substrings, making full use of the information contained in the URL. Classification experiments were run with the Weka toolkit. Comparing different values of n shows that this approach is much faster than extraction and classification based on traditional text features, and can achieve higher precision. Experiments comparing n-gram feature extraction with traditional URL feature extraction lead to the conclusion that the n-gram method performs better. When there is no time constraint, combining n-gram features with text features gives better results than using either separately.
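To make the n-gram idea above concrete, the following is a minimal sketch of one way URL substrings could be turned into features. It is not the thesis's implementation: the abstract does not say whether n-grams are taken over characters or tokens, so this sketch assumes character n-grams computed inside each alphanumeric token of the host, path, and query. The function names (url_tokens, char_ngrams, url_ngram_features) and the sample URL are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens.

    Assumption: the scheme (http, https) carries no category
    information, so only host, path, and query are kept.
    """
    parts = urlsplit(url.lower())
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]


def char_ngrams(token, n):
    """Return all character n-grams of a token.

    Tokens shorter than n are kept whole so that short but
    informative tokens are not silently dropped (an assumption).
    """
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_ngram_features(url, n=3):
    """Count the character n-grams found in every token of the URL."""
    counts = Counter()
    for tok in url_tokens(url):
        counts.update(char_ngrams(tok, n))
    return counts


if __name__ == "__main__":
    # Hypothetical URL used only for illustration.
    feats = url_ngram_features("http://www.example.com/sports/news", n=4)
    for gram, count in sorted(feats.items()):
        print(gram, count)
```

The resulting counts could then be written out as a sparse feature vector for a classifier. Varying n changes how many distinct features are produced, which is the trade-off the experiments with different values of n explore.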
Keywords/Search Tags: URL, web classification, feature selection, n-gram