
Vertical Search Engine Crawler System

Posted on: 2009-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Wu
Full Text: PDF
GTID: 2208360248453029
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of the Internet and the growth of Web information, people find it increasingly difficult to locate what they need in this sea of data. Search engines have become one of the most popular services because they help users find information quickly in the vastness of the Internet. Finding information among massive numbers of pages by the traditional method requires the user to browse site directories step by step, spending a great deal of energy and time; in practice the task is nearly impossible.

Internet information has grown explosively: a few years ago, global search engines indexed only tens of millions of pages, and the figure has now reached roughly one billion. As the number of pages increases, the quality of search services declines. Query result sets have reached a massive scale, with as many as 100,000 results, much of it junk or duplicated information. Users find it harder and harder to filter out the content they need accurately in a short time, and quickly finding the required information becomes difficult. Search services therefore need refinement: they must become more professional and more effective.

A vertical search engine provides valuable information and related services for a particular domain, a specific group of people, or a specific need. The main technologies involved in vertical search engines are: crawlers, structured Web information extraction (or metadata collection), word segmentation and indexing, and information processing. This paper studies the crawler system of a vertical search engine and develops that system.

A network crawler (also called a web spider or web robot) finds pages by following links on the Web. Starting from one page (usually the home page) of a site, it reads the page's contents to find the addresses of other linked pages, then follows those links to find still more pages, and so on in a loop until all pages of the site have been crawled.
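The crawl loop described above can be sketched as a breadth-first traversal: a frontier queue of URLs to visit and a set of URLs already seen. This is a minimal illustration, not the thesis's implementation; a hypothetical in-memory link graph stands in for real HTTP fetches, and all names are illustrative.

```java
import java.util.*;

// Minimal sketch of the breadth-first crawl loop: start from a seed
// page, read its links, enqueue unseen ones, and repeat until the
// frontier is empty. A real crawler would fetch pages over HTTP; here
// an in-memory link graph stands in for the web.
public class CrawlSketch {
    static final Map<String, List<String>> linkGraph = Map.of(
        "home", List.of("a", "b"),
        "a", List.of("b", "c"),
        "b", List.of("c"),
        "c", List.of());

    public static List<String> crawl(String seed) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            order.add(url);                          // stand-in for "fetch the page"
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                // skip already-seen links
                    frontier.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("home"));           // [home, a, b, c]
    }
}
```

Using a queue gives breadth-first order, so pages close to the home page are crawled first; swapping the queue for a stack would give depth-first order instead.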
If the entire Internet is viewed as one Web site, a crawler can, on this principle, crawl all of its pages. A crawler system needs distributed and concurrent techniques, a link-selection algorithm, and a duplicate-link elimination (filtering) algorithm.

Colored Petri Nets (CPN) are a class of high-level Petri nets and one of the best tools for modeling and analyzing distributed concurrent systems. A CPN model is executable, which supports dynamic simulation. The color set of a CPN place can be an arbitrarily complex data type, which greatly reduces the complexity of the model. CPN has a hierarchical page structure, so a system can be refined step by step from whole to part, coarse to fine. CPN is not only a graphical modeling tool but also a formal mathematical tool. In this paper, the crawler system is modeled with CPN and its correctness is verified.

However, CPN is a development tool for describing and analyzing a system model, not a tool for computer implementation. Because the ultimate goal of this paper is to develop an executable crawler system, the CPN model must be transformed into a computer program. The crawler system is developed with object-oriented technology, the mainstream software development approach today, and UML is the most widely used modeling tool for object-oriented systems. UML is a well-defined, expressive, powerful, and universally applicable modeling language that incorporates new ideas, methods, and technologies from software engineering. Its scope is not limited to object-oriented analysis and design; it supports the whole software development process, beginning with requirements analysis. Use cases are extracted, and a use-case diagram is produced, on the basis of the CPN model. The static structure of the system, chiefly its important classes, is designed from the use-case diagram and the CPN model, and the key parts of the system are illustrated with state diagrams.

Java is used to implement the software because of its good cross-platform characteristics.
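The duplicate-link elimination filter mentioned above can be sketched as a thread-safe set of already-seen URLs shared by concurrent crawl threads. The class and method names below are illustrative assumptions, not taken from the thesis:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a duplicate-link elimination filter: a thread-safe set of
// URLs already handed to the crawler. shouldCrawl returns true exactly
// once per distinct URL, so concurrent crawl threads never fetch the
// same page twice.
public class LinkFilter {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Atomically records the URL; true means "new URL, go crawl it".
    public boolean shouldCrawl(String url) {
        return seen.add(url);
    }

    public static void main(String[] args) {
        LinkFilter filter = new LinkFilter();
        System.out.println(filter.shouldCrawl("http://example.com/a")); // true
        System.out.println(filter.shouldCrawl("http://example.com/a")); // false
        System.out.println(filter.shouldCrawl("http://example.com/b")); // true
    }
}
```

Because `Set.add` on a concurrent set is atomic, the check-and-record step needs no explicit locking even when many crawler threads call it at once.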
The system can be developed on Windows and then deployed to run on Linux. It uses a MySQL database for data storage and runs on the Linux platform. As the data-acquisition subsystem of the agricultural vertical search engine project of Beijing DaZheng Language Knowledge Services Ltd, it crawls 92 sites: 82 news and information sites and 10 supply-and-demand sites. The crawler opens 10 threads to crawl the news sites and 3 threads to crawl the supply-and-demand sites. In the first full crawl, the news sites yielded an average of 15,000 pages per hour and the supply-and-demand sites about 4,000 pages per hour, for a daily average of about 400,000 pages (at the faster speed). Apart from the Alibaba supply-and-demand information, which took 10 days to complete, the full crawl of all sites captured a total of 4.1 million pages. In subsequent daily incremental updates, information newly published by the target sites can be crawled within half an hour, about 8,000 records per day.
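The threading layout reported above (10 threads for news sites, 3 for supply-and-demand sites) can be sketched with a fixed-size thread pool per site category, each pool draining its own URL queue. The fetch-and-store step is simulated, and the queue contents and method names are illustrative assumptions, not from the thesis:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of per-category crawl pools: one fixed pool of worker threads
// per site category, each pulling URLs from a shared queue until it is
// empty. The real system would fetch each URL and store the page in
// MySQL; here the "work" is just recording the URL.
public class ThreadedCrawl {
    static List<String> drain(Queue<String> urls, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<String> done = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = urls.poll()) != null) {
                    done.add(url);               // stand-in for fetch + store
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return new ArrayList<>(done);
    }

    public static void main(String[] args) throws Exception {
        Queue<String> newsUrls = new ConcurrentLinkedQueue<>(
            List.of("n1", "n2", "n3", "n4"));
        List<String> crawled = drain(newsUrls, 10);  // 10 threads, as for news sites
        System.out.println(crawled.size());          // 4
    }
}
```

Because `ConcurrentLinkedQueue.poll` is thread-safe and returns `null` when the queue is empty, each worker can loop until exhaustion without extra coordination; a separate call with 3 threads would model the supply-and-demand pool.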
Keywords/Search Tags: Vertical Search Engine, Crawler, CPN, UML, Object-Oriented, Java