
Domain-Oriented Incremental Deep Web Crawling

Posted on: 2013-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Zhang
Full Text: PDF
GTID: 2248330374482607
Subject: Computer software and theory
Abstract/Summary:
As Internet technology develops rapidly and scientific and technical knowledge grows by the day, people's demand for exploring the network keeps increasing. As a result, the number of websites and pages on the Web is growing at an explosive rate. By the way its information is accessed, the Web can be divided into the Surface Web and the Deep Web; the Deep Web not only contains far more and richer information, but that information is also better structured and more thematic than the Surface Web's. With the growing demand for analytical applications such as market intelligence analysis and public opinion analysis, Deep Web data must be integrated so that useful knowledge can be analyzed and mined from the integrated data. Deep Web crawling is the first step of Deep Web data integration and provides the data support for it. Only on the basis of a large collection of web pages can data extraction and integration produce information that is accurate and satisfies users' demands. Obtaining large numbers of pages requires improving crawling efficiency under limited resources, ensuring not only the quantity of the pages but also their freshness. Deep Web incremental crawling therefore has great application value and practical significance: it improves crawling efficiency and saves much time and effort.

Many research fields are related to the Deep Web, such as data source discovery, Deep Web crawling, extraction, and data fusion. Although researchers have recently done much work in these areas, some problems remain unsolved. Several problems in Deep Web crawling are listed below.

1. A search form contains many descriptive labels and form elements. Choosing an approach to parse these labels and elements so that they can be matched accurately is a problem.

2. When an extracted search form is decomposed, it yields many form elements and labels. Because form design lacks a unified development standard, the decomposed elements and labels do not stand in a one-to-one relationship; a person may understand the relationship easily, but a machine cannot. Automatically and accurately matching form elements with labels, making the resulting attributes correspond correctly to the table attributes of the backend database, and querying data records efficiently is a challenge.

3. When users fill in and submit a form, they obtain result pages after the backend database is queried. But if a duplicate form, or one similar in meaning, is submitted within a short time, the result is duplicate pages. The goal is to obtain pages that are new or whose content has changed; unchanged pages are not part of the goal. Only in this way can the crawler be improved.

In view of the above problems, this thesis conducts systematic and thorough research. The main research work and contributions are as follows.

1. For a given search form, following the designer's design intent and users' habits, LEX is used to analyze the search form and describe each form element and descriptive tag with a specific expression; in the end, every form element is given several alternative descriptive tags (see the first sketch after this list).
2. This thesis proposes a two-step approach that combines machine learning methods to match form elements and tags. In each step, experiments verify which machine learning method is more suitable and more accurately filters out wrong tags, according to the characteristics of the candidate tags; finally, some exceptional cases are handled (see the second sketch after this list).

3. This thesis proposes a Deep Web incremental crawling approach based on URL classification and introduces the crawler module, the form extraction module, the form submission frequency calculator module, and others. By content, Deep Web pages are divided into list pages and leaf pages: the crawler mainly crawls leaf pages incrementally, while list pages mainly serve to assist the crawling (see the third sketch after this list).
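The first sketch below illustrates how candidate descriptive tags might be collected for each search-form element. The thesis tokenizes the form source with LEX; purely for illustration, Python's standard-library HTMLParser stands in for that lexer here, and the two pairing heuristics (text immediately preceding an element, and an explicit <label for> association) are assumptions rather than the thesis's actual rules.

```python
# A minimal sketch of assigning candidate descriptive tags to form
# elements. HTMLParser stands in for the thesis's LEX-based lexer;
# the two pairing heuristics below are assumptions for illustration.
from html.parser import HTMLParser

class FormTagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_text = ""      # most recent visible text run
        self.label_for = {}      # element id -> <label for=...> text
        self.candidates = {}     # element name -> candidate tag list
        self._open_label = None  # 'for' attribute of an open <label>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self._open_label = a.get("for")
        elif tag in ("input", "select", "textarea"):
            name = a.get("name") or a.get("id") or tag
            tags = []
            if self.last_text:                  # heuristic 1: preceding text
                tags.append(self.last_text)
            linked = self.label_for.get(a.get("id"))
            if linked:                          # heuristic 2: <label for=...>
                tags.append(linked)
            # de-duplicate while preserving order
            self.candidates[name] = list(dict.fromkeys(tags))
            self.last_text = ""                 # do not reuse text for next element

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open_label is not None:
            self.label_for[self._open_label] = text
            self._open_label = None
        self.last_text = text

form = ('<form>Title: <input name="ti"> '
        '<label for="au">Author</label><input id="au" name="au"></form>')
p = FormTagCollector()
p.feed(form)
print(p.candidates)   # {'ti': ['Title:'], 'au': ['Author']}
```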
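The second sketch shows the shape of the machine-learning matching step: every (form element, candidate tag) pair is turned into a feature vector, and a trained classifier keeps or rejects the pairing. The feature set, the decision-tree model, and the tiny training rows are illustrative assumptions; the thesis compares learning methods experimentally rather than fixing one.

```python
# A hedged sketch of classifier-based element/tag matching. The features,
# training rows, and choice of decision tree are assumptions only.
from sklearn.tree import DecisionTreeClassifier

def pair_features(element, tag):
    # Assumed features: token distance, whether the tag precedes the
    # element, whether a <label for> link exists, tag length in words.
    return [
        abs(element["pos"] - tag["pos"]),
        1 if tag["pos"] < element["pos"] else 0,
        1 if tag.get("for") == element.get("id") else 0,
        len(tag["text"].split()),
    ]

# Tiny hand-made training set: rows are pair features, labels say
# whether the pair is a correct element/tag match.
X = [[1, 1, 0, 1], [9, 0, 0, 3], [2, 1, 1, 1], [7, 1, 0, 4]]
y = [1, 0, 1, 0]
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

element = {"id": "au", "pos": 10}
candidates = [{"text": "Author", "pos": 9, "for": "au"},
              {"text": "Search our catalog", "pos": 2}]
kept = [c for c in candidates
        if clf.predict([pair_features(element, c)])[0] == 1]
print([c["text"] for c in kept])
```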
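The third sketch is a minimal model of URL-classification-driven incremental crawling, assuming that URLs matching a paging pattern are list pages and all others are leaf pages, and that a leaf page's revisit interval is halved when its content hash changes and doubled (up to a cap) when it does not. The patterns, intervals, and adjustment factors are assumptions; the thesis's form submission frequency calculator is not reproduced here.

```python
# A minimal sketch of incremental crawling driven by URL classification
# and per-page change frequency. Regexes, intervals, and factors are
# illustrative assumptions.
import hashlib
import re
import time

LIST_PATTERN = re.compile(r"[?&](page|p|start)=\d+|/list/", re.I)

def classify(url):
    """Treat paging-style URLs as list pages, everything else as leaf pages."""
    return "list" if LIST_PATTERN.search(url) else "leaf"

class Scheduler:
    def __init__(self, base_interval=3600.0):
        self.base = base_interval
        self.state = {}   # url -> (content_hash, interval, next_due)

    def record_fetch(self, url, body):
        digest = hashlib.sha1(body.encode()).hexdigest()
        old = self.state.get(url)
        interval = self.base
        if old is not None:
            old_digest, old_interval, _ = old
            # Changed page: revisit sooner; unchanged: back off, capped.
            interval = (old_interval / 2 if digest != old_digest
                        else min(old_interval * 2, 30 * self.base))
        self.state[url] = (digest, interval, time.time() + interval)

    def due(self, now=None):
        now = time.time() if now is None else now
        return [u for u, (_, _, t) in self.state.items() if t <= now]

s = Scheduler()
print(classify("http://example.com/list/?page=2"))            # list
s.record_fetch("http://example.com/item/42", "<html>v1</html>")
s.record_fetch("http://example.com/item/42", "<html>v2</html>")  # changed
print(s.state["http://example.com/item/42"][1])               # 1800.0
print(s.due())   # [] until the revisit interval elapses
```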
Keywords/Search Tags: Search Form, Deep Web, Incremental Crawling, URL Classification, Change Frequency