Characteristics of HTML in the Deep Web

Posted on:2016-01-06

Degree:M.S.

Type:Thesis

University:University of Maryland, Baltimore County

Candidate:Eckert, Matthew

GTID:2478390017976298

Subject:Computer Science

Abstract/Summary:

This paper explores the HTML characteristics of the deep web by gathering HTML tag frequencies on web pages using three different web crawling techniques. The first technique used the most popular websites listed by Alexa as seeds for the web crawler and randomly selected a sample of web pages to include in the statistics. The second technique gathered web pages by randomly generating shortened URLs and visiting the pages to which those URLs redirected. The third technique traversed the deep web, going through .onion websites and domains by randomly generating an IP address. Statistics from these web crawling techniques are gathered and compared in this paper.
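The tag-frequency measurement described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, which is not shown here; it assumes pages have already been fetched as strings and uses Python's standard `html.parser` module. The names `TagCounter`, `tag_frequencies`, and `relative_frequencies` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each HTML start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        # Self-closing tags such as <br/> also reach this handler once
        # through the default handle_startendtag implementation.
        self.counts[tag] += 1


def tag_frequencies(html):
    """Return a Counter mapping tag name to occurrence count for one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def relative_frequencies(counts):
    """Convert raw counts to shares of all tags, so crawls of different sizes compare."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample page; a real crawl would supply fetched HTML instead.
    sample = "<html><body><p>a</p><p>b</p><BR/><a href='x'>l</a></body></html>"
    counts = tag_frequencies(sample)
    for tag, share in sorted(relative_frequencies(counts).items()):
        print(f"{tag}: {counts[tag]} ({share:.2f})")
```

Summing the per-page counters for each crawl (for example with `Counter.update`) and then normalizing would give one frequency table per crawling technique, which is the kind of comparison the abstract describes.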

Keywords/Search Tags:

Deep web, Web crawling, Web pages, Randomly generating

Related items

1	Research On Large-scale Crawling On Web Forums
2	Query Selection in Deep Web Crawling
3	Classification System Based On The Theme Of Information Acquisition In The Pages
4	Research On Crawling Deep Web Information
5	Research On Algorithm Of Crawling Ajax Dynamic Web Pages Based On User Interface State Changes
6	Domain-Oriented Incremental Deep Web Crawling
7	Research On Discovering Domain-Specific Deep Web Entries Based On Focused Crawling And Ontology
8	Crawling the Web: Discovery and maintenance of large-scale Web data
9	Research On Efficient Web Information Crawling Strategy
10	Research On Generating SQL Statements Through Natural Language Based On Deep Learning