Keyword Based Garbage Pages Discrimination Research

Posted on: 2016-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: M M Du
Full Text: PDF
GTID: 2308330479490443
Subject: Management Science and Engineering
Abstract/Summary:
Search engines have become the main channel through which Internet users obtain information, but their results include garbage (spam) pages. These pages not only waste search engine resources but also degrade the search experience of Internet users. Garbage pages have the following characteristics: they contain many irrelevant jump links; their content is disorganized and provides no useful information; and they use SEO cheating.

Existing research mainly uses natural language processing (NLP) and machine learning to identify these pages. Semantic analysis requires a large corpus and complicated tagging work in advance, and machine learning methods also depend on NLP preprocessing. NLP is time-consuming and difficult to carry out. This research therefore aims to establish a set of indicators that describe a web page and to identify garbage pages by applying statistical discrimination to those indicators, avoiding the difficulties of NLP and machine learning. The indicators describe a web page along three dimensions: the page as a whole, its text, and its links. Three commonly used statistical discriminant methods (Fisher discriminant analysis, logistic regression, and Bayes discrimination) are used to identify garbage pages. The soundness of the indicator set is then validated by the validity and accuracy of these discrimination methods.

To verify that using statistical methods to discriminate spam pages is operable in practice, the research analyzes web pages automatically by computer: the page text is segmented with Chinese word segmentation technology, and the numerical value of each indicator is then counted.
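
The indicator-plus-discrimination idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the two link-dimension indicators (`links_per_char`, `anchor_text_share`), the helper names, and the four toy pages are all assumptions invented for illustration, and the Chinese word segmentation step and the keyword indicators are omitted. Logistic regression, one of the three methods named in the abstract, is fitted here by plain gradient descent.

```python
from html.parser import HTMLParser
import math


class PageStats(HTMLParser):
    """Collects raw counts from which the (hypothetical) indicators are derived."""

    def __init__(self):
        super().__init__()
        self.link_count = 0      # number of <a> start tags
        self.text_chars = 0      # visible characters, whitespace-trimmed per text node
        self.anchor_chars = 0    # visible characters that sit inside <a>...</a>
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._anchor_depth:
            self.anchor_chars += n


def page_indicators(html):
    """Return [links_per_char, anchor_text_share]; both names are assumptions."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    text = max(stats.text_chars, 1)  # avoid division by zero on empty pages
    return [stats.link_count / text, stats.anchor_chars / text]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """True means the page is classified as a garbage page."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5


# Toy, hand-written pages (label 1 = garbage page); not data from the thesis.
pages = [
    ("<p>" + "Useful article text about a topic. " * 5 + "</p><a href='/next'>next</a>", 0),
    ("<h1>Title</h1><p>" + "Detailed explanation with real content. " * 4
     + "</p><a href='/a'>home</a><a href='/b'>about</a>", 0),
    ("<a href='x'>buy</a>" * 10 + "<p>hi</p>", 1),
    ("<a href='y'>cheap pills</a>" * 8 + "<p>ok</p>", 1),
]
X = [page_indicators(html) for html, _ in pages]
y = [label for _, label in pages]
w, b = fit_logistic(X, y)
print([predict(w, b, x) for x in X])
```

In a fuller version, each page would contribute one row of indicators covering all three dimensions, and the same rows could be passed to a Fisher or Bayes discriminant to compare accuracy across methods.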
Keywords/Search Tags: statistical discrimination, garbage page, index system, keywords