Research and Implementation of a Large-Scale Webpage Classification Algorithm Based on Spectral Hashing

Posted on: 2017-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: D D Tian
GTID: 2348330536467475
Subject: Computer technology
Abstract/Summary:
With the advent of the information age, the Internet has become widely used in work and daily life thanks to its convenience, speed, and abundance of information. While it brings convenience, it is also filled with all kinds of harmful information. Left uncontrolled and unsupervised, this information will harm the development of teenagers and threaten national security, social harmony, and stability. Network service management and control based on web service classification is one of the effective ways to supervise Internet information. Its core technology is web page classification, which has been widely studied. Against this background, and supported by a project of the Development and Reform Commission, this thesis focuses on large-scale web page classification. Building on current web page classification technology, we improve existing methods, design a large-scale Chinese web page classification algorithm based on spectral hashing, and implement and test the algorithm. The main research content is as follows.

First, a method based on keyword matching is proposed to pre-classify web pages. By studying the structural features of web pages, we find that the category of a web page is closely related to the text in its <head> tag. After preprocessing and word segmentation, we extract the set of words in the <head> tag and match it against a pre-classification keyword table; if a match is found, the classification result is output directly. This method skips feature selection, vector representation of the web page, and the classification algorithm itself. Because it involves only string matching, it can greatly improve classification efficiency.

Second, we draw on the method of comprehensive weight calculation and apply it to feature selection, which yields the CW-FS feature selection method. The weight takes into account the distribution of feature items within a class and between classes, the position of feature items in the web page, and the length of the feature word. The method can therefore select feature items that carry a large amount of information and have strong discriminating power.

Third, the high dimensionality of web page vectors is the main factor limiting the efficiency of web page classification. We propose reducing the dimension of the original web page vector by spectral hashing in order to lower the cost of the classification computation and improve efficiency. The experimental results show that the proposed method improves the efficiency of web page classification significantly with only a small loss of accuracy.

Finally, we combine the above optimizations to design and implement a large-scale Chinese web page classification algorithm based on spectral hashing. A comparison with the KNN algorithm shows that the proposed algorithm significantly reduces the time and memory overhead of classification with only a small loss of classification accuracy, so that classification efficiency increases significantly.
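
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and it rests on several assumptions: the keyword table `PRECLASS_KEYWORDS` and all function names are hypothetical, whitespace splitting stands in for Chinese word segmentation, and the binary codes are assumed to have been produced beforehand by spectral hashing (the hashing step itself is not shown). The sketch first tries to match words from the <head> tag against the keyword table and, if nothing matches, falls back to a KNN vote over binary codes compared by Hamming distance.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical keyword table (category -> keywords); the thesis does not list its table.
PRECLASS_KEYWORDS = {
    "sports": {"football", "basketball", "league"},
    "finance": {"stock", "fund", "bank"},
}


class HeadTextParser(HTMLParser):
    """Collects words from text inside <head> and from <meta content="..."> there."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            content = dict(attrs).get("content") or ""
            # Whitespace/comma splitting is a stand-in for real word segmentation.
            self.words.extend(content.lower().replace(",", " ").split())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        if self.in_head:
            self.words.extend(data.lower().split())


def preclassify(html):
    """Return the best-matching category for the <head> words, or None if no match."""
    parser = HeadTextParser()
    parser.feed(html)
    head_words = set(parser.words)
    hits = {cat: len(head_words & kws) for cat, kws in PRECLASS_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def hamming(a, b):
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


def knn_on_codes(query_code, labeled_codes, k=3):
    """Majority vote among the k labeled codes nearest to the query in Hamming distance."""
    nearest = sorted(labeled_codes, key=lambda item: hamming(query_code, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classify(html, query_code, labeled_codes, k=3):
    """Try keyword pre-classification first; fall back to KNN on binary codes."""
    return preclassify(html) or knn_on_codes(query_code, labeled_codes, k)
```

Storing each code as an integer keeps the distance computation to one XOR and a bit count, which is the kind of saving in time and memory that short binary codes are meant to provide over KNN on the original high-dimensional vectors.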
Keywords/Search Tags:KNN, spectral hashing, web page classification, method improvement