
Domain-Oriented Incremental Deep Web Crawling

Posted on: 2013-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Zhang
Full Text: PDF
GTID: 2248330374482607
Subject: Computer software and theory
Abstract/Summary:
As Internet technology develops rapidly and scientific and technical knowledge grows by the day, people's demand for exploring the network keeps increasing. As a result, the number of websites and pages on the Web is growing at an explosive rate. By the way its information is accessed, the Web can be divided into the Surface Web and the Deep Web; the Deep Web not only contains far more and richer information, but that information is also better structured and more thematic than the Surface Web's. With the growing demand for analytical applications such as market intelligence analysis and public opinion analysis, Deep Web data must be integrated so that useful knowledge can be analyzed and mined from the integrated data. Deep Web crawling is the first step of Deep Web data integration and provides the data support for it. Only on the basis of a large collection of web pages can data extraction and integration produce information that is accurate and satisfies users' demands. Obtaining large numbers of pages requires improving crawling efficiency under limited resources, ensuring not only the quantity of the pages but also their freshness. Deep Web incremental crawling therefore has great application value and practical significance: it improves crawling efficiency and saves much time and effort.

Many research fields are related to the Deep Web, such as data source discovery, Deep Web crawling, extraction, and data fusion. Although researchers have recently done much work in these areas, some problems remain unsolved. Several problems in Deep Web crawling are listed below.

1. A search form contains many descriptive labels and form elements. Choosing an approach to parse these labels and elements so that they can be matched accurately is a problem.

2. When an extracted search form is decomposed, it yields many form elements and labels. Because form design lacks a unified development standard, the decomposed elements and labels do not stand in a one-to-one relationship; a person may understand the relationship easily, but a machine cannot. Automatically and accurately matching form elements with labels, making the resulting attributes correspond correctly to the table attributes of the backend database, and querying data records efficiently is a challenge.

3. When users fill in and submit a form, they obtain result pages after the backend database is queried. But if a duplicate form, or one similar in meaning, is submitted within a short time, the result is duplicate pages. The goal is to obtain pages that are new or whose content has changed; unchanged pages are not part of the goal. Only in this way can the crawler be improved.

In view of the above problems, this thesis conducts systematic and thorough research. The main research work and contributions are as follows.

1. For a given search form, following the designer's design intent and users' habits, LEX is used to analyze the search form and describe each form element and descriptive tag with a specific expression; in the end, every form element is given several alternative descriptive tags (see the first sketch after this list).
2. This thesis proposes a two-step approach that combines machine learning methods to match form elements and tags. In each step, experiments verify which machine learning method is more suitable and more accurately filters out wrong tags, according to the characteristics of the candidate tags; finally, some exceptional cases are handled (see the second sketch after this list).

3. This thesis proposes a Deep Web incremental crawling approach based on URL classification and introduces the crawler module, the form extraction module, the form submission frequency calculator module, and others. By content, Deep Web pages are divided into list pages and leaf pages: the crawler mainly crawls leaf pages incrementally, while list pages mainly serve to assist the crawling (see the third sketch after this list).
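The first sketch below illustrates how candidate descriptive tags might be collected for each search-form element. The thesis tokenizes the form source with LEX; purely for illustration, Python's standard-library HTMLParser stands in for that lexer here, and the two pairing heuristics (text immediately preceding an element, and an explicit <label for> association) are assumptions rather than the thesis's actual rules.

```python
# A minimal sketch of assigning candidate descriptive tags to form
# elements. HTMLParser stands in for the thesis's LEX-based lexer;
# the two pairing heuristics below are assumptions for illustration.
from html.parser import HTMLParser

class FormTagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_text = ""      # most recent visible text run
        self.label_for = {}      # element id -> <label for=...> text
        self.candidates = {}     # element name -> candidate tag list
        self._open_label = None  # 'for' attribute of an open <label>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self._open_label = a.get("for")
        elif tag in ("input", "select", "textarea"):
            name = a.get("name") or a.get("id") or tag
            tags = []
            if self.last_text:                  # heuristic 1: preceding text
                tags.append(self.last_text)
            linked = self.label_for.get(a.get("id"))
            if linked:                          # heuristic 2: <label for=...>
                tags.append(linked)
            # de-duplicate while preserving order
            self.candidates[name] = list(dict.fromkeys(tags))
            self.last_text = ""                 # do not reuse text for next element

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open_label is not None:
            self.label_for[self._open_label] = text
            self._open_label = None
        self.last_text = text

form = ('<form>Title: <input name="ti"> '
        '<label for="au">Author</label><input id="au" name="au"></form>')
p = FormTagCollector()
p.feed(form)
print(p.candidates)   # {'ti': ['Title:'], 'au': ['Author']}
```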
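The second sketch shows the shape of the machine-learning matching step: every (form element, candidate tag) pair is turned into a feature vector, and a trained classifier keeps or rejects the pairing. The feature set, the decision-tree model, and the tiny training rows are illustrative assumptions; the thesis compares learning methods experimentally rather than fixing one.

```python
# A hedged sketch of classifier-based element/tag matching. The features,
# training rows, and choice of decision tree are assumptions only.
from sklearn.tree import DecisionTreeClassifier

def pair_features(element, tag):
    # Assumed features: token distance, whether the tag precedes the
    # element, whether a <label for> link exists, tag length in words.
    return [
        abs(element["pos"] - tag["pos"]),
        1 if tag["pos"] < element["pos"] else 0,
        1 if tag.get("for") == element.get("id") else 0,
        len(tag["text"].split()),
    ]

# Tiny hand-made training set: rows are pair features, labels say
# whether the pair is a correct element/tag match.
X = [[1, 1, 0, 1], [9, 0, 0, 3], [2, 1, 1, 1], [7, 1, 0, 4]]
y = [1, 0, 1, 0]
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

element = {"id": "au", "pos": 10}
candidates = [{"text": "Author", "pos": 9, "for": "au"},
              {"text": "Search our catalog", "pos": 2}]
kept = [c for c in candidates
        if clf.predict([pair_features(element, c)])[0] == 1]
print([c["text"] for c in kept])
```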
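The third sketch is a minimal model of URL-classification-driven incremental crawling, assuming that URLs matching a paging pattern are list pages and all others are leaf pages, and that a leaf page's revisit interval is halved when its content hash changes and doubled (up to a cap) when it does not. The patterns, intervals, and adjustment factors are assumptions; the thesis's form submission frequency calculator is not reproduced here.

```python
# A minimal sketch of incremental crawling driven by URL classification
# and per-page change frequency. Regexes, intervals, and factors are
# illustrative assumptions.
import hashlib
import re
import time

LIST_PATTERN = re.compile(r"[?&](page|p|start)=\d+|/list/", re.I)

def classify(url):
    """Treat paging-style URLs as list pages, everything else as leaf pages."""
    return "list" if LIST_PATTERN.search(url) else "leaf"

class Scheduler:
    def __init__(self, base_interval=3600.0):
        self.base = base_interval
        self.state = {}   # url -> (content_hash, interval, next_due)

    def record_fetch(self, url, body):
        digest = hashlib.sha1(body.encode()).hexdigest()
        old = self.state.get(url)
        interval = self.base
        if old is not None:
            old_digest, old_interval, _ = old
            # Changed page: revisit sooner; unchanged: back off, capped.
            interval = (old_interval / 2 if digest != old_digest
                        else min(old_interval * 2, 30 * self.base))
        self.state[url] = (digest, interval, time.time() + interval)

    def due(self, now=None):
        now = time.time() if now is None else now
        return [u for u, (_, _, t) in self.state.items() if t <= now]

s = Scheduler()
print(classify("http://example.com/list/?page=2"))            # list
s.record_fetch("http://example.com/item/42", "<html>v1</html>")
s.record_fetch("http://example.com/item/42", "<html>v2</html>")  # changed
print(s.state["http://example.com/item/42"][1])               # 1800.0
print(s.due())   # [] until the revisit interval elapses
```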
Keywords/Search Tags: Search Form, Deep Web, Incremental Crawling, URL Classification, Change Frequency