
Research and Implementation of a Focused Crawler Based on the Ant Colony Algorithm

Posted on: 2011-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: J G Cui
GTID: 2178360308959168
Subject: Computer software and theory
Abstract/Summary:
The Internet has changed the way people think, live, and work. On the one hand, it makes a wide variety of information easier to access; on the other hand, finding exactly the information one needs among billions of web pages is like looking for a needle in a haystack. The emergence of search engines changed this completely: people can quickly locate what they want in an ocean of information. However, as the volume of information on the Internet keeps growing while network, storage, and computing resources remain limited, traditional search technology finds it increasingly difficult to meet users' needs, and its limitations have become increasingly prominent. There is therefore an urgent need for search technology that is more intelligent, more accurate, and more specialized, so that online information can be presented more effectively. Vertical search engines emerged as a potential solution to the limitations of traditional search engines. Vertical search is a development trend in information retrieval, and its core technology, the focused (topic-specific) crawler, has become one of the focuses of current research. A focused crawler traverses the Web but selectively fetches pages related to a specific topic and avoids fetching unrelated pages.
Focused crawling narrows the search to a portion of the Web: the crawler fetches only the websites of selected domains (or topics), on which a topic-oriented vertical search engine can then be built. Focused crawling can therefore save a great deal of hardware and network resources, improve the precision and quality of search results, and help keep crawled content up to date.

This thesis first introduces the basic theory of search engines, which leads to vertical search engines. Second, it studies the theory of focused crawling, with emphasis on hyperlink analysis using the PageRank algorithm and on concepts such as topic relevance. Finally, it analyzes in depth the theory behind guiding a focused crawler with the ant colony algorithm, concentrating on the ant colony algorithm and its implementation in Java, followed by the analysis of server logs and Web log mining.

The main difficulties with current focused crawler search strategies are:
(1) The overall distribution of information resources in the Web search space is unknown to a focused crawler, so the direction of crawling cannot be predicted well.
(2) Most current focused crawlers guide crawling by analyzing link anchor text and the topic relevance of page content; they have no heuristic guidance strategy.
(3) To fetch high-quality relevant pages first, researchers have designed a number of heuristic strategies and related algorithms. Although such heuristic search strategies can use knowledge of how information resources are distributed in certain domains to make an estimate and infer an approximate search direction, these methods are computationally expensive and complex, and no suitable learning algorithm has been found to guide the training process.

This thesis proposes a focused crawling technique based on the ant colony algorithm: the browsing paths of a group of users are mined from Web log information and used to better guide the focused crawler. Finally, a focused crawler is implemented experimentally, and the ant-colony-based focused crawler is compared with a traditional focused crawler. The comparison leads to the conclusion that the ant colony algorithm guides the focused crawler better.
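
The abstract does not give the update rule used in the thesis, so the following is only a minimal Java sketch of how browsing paths mined from Web logs might guide a focused crawler in an ant-colony style. The class name `PheromoneTable`, the parameters `evaporationRate` and `deposit`, and the scoring formula `(1 + pheromone) * relevance` are all assumptions made for illustration; they are not taken from the thesis or from Heritrix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pheromone table over links, fed by user browsing paths mined from Web logs. */
class PheromoneTable {
    private final Map<String, Double> pheromone = new HashMap<>();
    private final double evaporationRate; // assumed parameter, expected in [0, 1)
    private final double deposit;         // assumed amount laid on each traversed link

    PheromoneTable(double evaporationRate, double deposit) {
        this.evaporationRate = evaporationRate;
        this.deposit = deposit;
    }

    /** Key that identifies the directed link from one page to another. */
    private static String edge(String from, String to) {
        return from + " -> " + to;
    }

    /** Evaporate every existing trail, then reinforce each link on one browsing path. */
    void reinforce(List<String> path) {
        pheromone.replaceAll((link, value) -> value * (1.0 - evaporationRate));
        for (int i = 0; i + 1 < path.size(); i++) {
            pheromone.merge(edge(path.get(i), path.get(i + 1)), deposit, Double::sum);
        }
    }

    /** Current pheromone on a link; 0.0 if no user path has ever crossed it. */
    double get(String from, String to) {
        return pheromone.getOrDefault(edge(from, to), 0.0);
    }

    /** Order the out-links of a page by (1 + pheromone) * topic relevance, highest first. */
    List<String> rank(String from, Map<String, Double> relevanceByLink) {
        List<String> links = new ArrayList<>(relevanceByLink.keySet());
        links.sort(Comparator.comparingDouble(
                (String to) -> (1.0 + get(from, to)) * relevanceByLink.get(to)).reversed());
        return links;
    }
}
```

In this sketch, each user session extracted from the server log would be passed to `reinforce` as an ordered list of URLs, and the crawler would call `rank` with its own topic-relevance estimates to decide which out-links to fetch first. Adding 1 to the pheromone keeps links that no user has visited from being scored zero, so relevance alone can still rank them.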
Keywords/Search Tags: Vertical search, focused crawler, PageRank algorithm, ant colony algorithm, Heritrix