Model Of Integrating With Web Quality Features For Webspam Detection And Model Verification

Posted on: 2017-05-06  Degree: Master  Type: Thesis
Country: China  Candidate: H L Zhu  Full Text: PDF
GTID: 2308330485988543  Subject: Software engineering
Abstract/Summary:
Web spam detection models have focused mainly on content and link features and have paid little attention to web page quality. The openness of the Internet means that page quality is uneven: some "inferior web pages" carry none of the traditional spam features or information, yet they offer little useful information, and traditional web spam detection methods cannot detect them effectively. This thesis proposes a feature model for spam detection that builds on earlier web spam feature models and integrates web quality features. Validating the model requires an experimental data set, built from real online pages, that corresponds to the model. To this end, the thesis crawls web pages from the Internet, extracts their content, extracts features, and labels the pages manually to construct the final labeled data set. Validation also requires an effective ranking method to verify the model. From the many available document ranking and classification methods, the Rankboost algorithm is selected to rank pages by quality and to detect spam pages. Rankboost combines multiple weak ranking results to obtain a more accurate result. The thesis converts page ranking into pairwise ranking: a basic classification algorithm gives the relative order of two pages, the ranking error is fed back to update the weight distribution over the samples used by the weak rankers, and the final result is the weighted sum of the weak rankers. Spam pages are then detected by analyzing the ranking results. Experiments show that this feature model detects spam pages effectively.
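The pairwise boosting loop described in the abstract can be illustrated with a minimal sketch. It follows the commonly published Rankboost formulation, not necessarily the thesis's own implementation. The weak rankers here are single-feature thresholds, and the two-feature toy pages (a content score and a quality score) are hypothetical values chosen only for illustration.

```python
import math

def rankboost(features, pairs, rounds=5):
    """Train on (lo, hi) index pairs, where page hi should rank above page lo."""
    # Start with a uniform weight distribution over the preference pairs.
    dist = [1.0 / len(pairs)] * len(pairs)
    # Assumed weak rankers: h(x) = 1 if x[f] > theta else 0.
    candidates = [(f, theta)
                  for f in range(len(features[0]))
                  for theta in sorted({x[f] for x in features})]
    model = []
    for _ in range(rounds):
        best = None
        for f, theta in candidates:
            # r measures how well this weak ranker orders the weighted pairs.
            r = sum(d * ((features[hi][f] > theta) - (features[lo][f] > theta))
                    for d, (lo, hi) in zip(dist, pairs))
            if best is None or abs(r) > abs(best[0]):
                best = (r, f, theta)
        r, f, theta = best
        r = max(min(r, 1 - 1e-9), -1 + 1e-9)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        if alpha == 0:
            break
        model.append((alpha, f, theta))
        # Feedback step: pairs that are still misordered gain weight.
        dist = [d * math.exp(alpha * ((features[lo][f] > theta)
                                      - (features[hi][f] > theta)))
                for d, (lo, hi) in zip(dist, pairs)]
        total = sum(dist)
        dist = [d / total for d in dist]
    return model

def score(model, x):
    """Final ranking score: weighted sum of the weak rankers."""
    return sum(alpha * (x[f] > theta) for alpha, f, theta in model)

# Hypothetical pages: [content_score, quality_score].
pages = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.2], [0.2, 0.1]]
# Pages 0 and 1 are good, page 2 is low quality, page 3 is spam.
pairs = [(2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
model = rankboost(pages, pairs)
scores = [score(model, p) for p in pages]
```

Pages with the lowest scores are the spam candidates. How the cut-off is chosen is not stated in the abstract, so it is left out of the sketch.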
Keywords/Search Tags:spam detection, web quality ranking, Rankboost algorithm, pairwise ranking