Characteristics of HTML in the Deep Web

Posted on: 2016-01-06
Degree: M.S.
Type: Thesis
University: University of Maryland, Baltimore County
Candidate: Eckert, Matthew
Full Text: PDF
GTID: 2478390017976298
Subject: Computer Science
Abstract/Summary:
This paper explores the HTML characteristics of the deep web by gathering HTML tag frequencies on web pages using three different web crawling techniques. The first technique used the most popular websites listed by Alexa as the seed for the web crawler and randomly selected a sample of web pages to include in the statistics. The second technique gathered web pages by randomly generating shortened URLs and visiting the pages that those shortened URLs redirected to. The third technique traversed the deep web, going through .onion websites and domains by randomly generating an IP address. Statistics from these three web crawling techniques are gathered and compared in this paper.
Keywords/Search Tags:Deep web, Web crawling, Web pages, Randomly generating
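
The abstract describes gathering HTML tag frequencies across sampled pages but does not say what tooling was used. As an illustration only, here is a minimal sketch of how such counts could be collected with Python's standard library. The names TagCounter, tag_frequencies, and aggregate are hypothetical and do not come from the thesis, and the sketch assumes the page HTML has already been downloaded as text.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts every opening tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <P> and <p> are counted together.
        # Self-closing tags such as <br/> also reach this hook by default.
        self.counts[tag] += 1


def tag_frequencies(html_text):
    """Return a Counter mapping tag name to occurrences in one page."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def aggregate(pages):
    """Sum per-page tag counts over an iterable of HTML strings."""
    total = Counter()
    for page in pages:
        total.update(tag_frequencies(page))
    return total


if __name__ == "__main__":
    sample_pages = [
        "<html><body><p>one</p><P>two</P><br/></body></html>",
        "<html><body><div><a href='#'>link</a></div></body></html>",
    ]
    for tag, count in aggregate(sample_pages).most_common():
        print(tag, count)
```

Running aggregate once per crawl sample (Alexa-seeded, shortened-URL, and .onion) would yield three frequency tables that can then be compared, which is the kind of comparison the abstract describes.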