Research And Implementation Of The Topic Web Crawlers

Posted on:2012-09-25

Degree:Master

Type:Thesis

Country:China

Candidate:J Lin

Full Text:PDF

GTID:2178330335452649

Subject:Computer Science and Technology

Abstract/Summary:

As the information age advances, the many kinds of information distributed across the Internet affect every aspect of human life. Today, people can look up almost any target information by browsing the web themselves. At the same time, the volume of information on the Internet is expanding rapidly, which highlights the problem of how to find target information conveniently by browsing.

As information becomes more diverse, the general search engine, which helps people look up information on the Internet, shows several shortcomings: low precision, stale content, and uneven distribution of information. The topic search engine is a newer approach that provides valuable information resources and related services for specific domains, user groups, or needs. As the information collection component of a topic search engine, the topic crawler fetches the web pages related to the users' interests.

This thesis analyzes the design and implementation of the topic crawler in six chapters. The first chapter describes the development of the search engine and the role of the crawler within it, and analyzes the status and significance of the research. The second chapter gives the theoretical basis for research on the topic crawler. It first introduces the relevant search engine theory, then identifies the differences between the general crawler and the topic crawler and the features of each, and on that basis describes their architecture and basic working principles. The third chapter studies and improves the key technologies of the topic crawler, including text feature extraction, search strategies, and web page filtering techniques. It also proposes an improvement to PageRank based on topic relevance. The fourth chapter gives a general analysis of the system design and implementation of the topic crawler, which consists of modules for page fetching, page analysis, Chinese word segmentation, and URL management. The fifth chapter presents the interface and operating details of the topic crawler system and describes the experiments carried out on it. The experimental results and data are then used to demonstrate the soundness and effectiveness of the research. The last chapter summarizes the preceding chapters and discusses the contributions and limitations of this thesis.

Experimental results show that the topic crawler achieves a better harvest rate and runs stably. It also reduces crawl time and storage space and keeps pages up to date. In addition, it provides users with higher search precision and less redundant information.

Keywords/Search Tags:

Topic Crawl, Search Strategy, Relevancy Model, Precision, Recall

Related items

1	Research And Implementation Of Subject-oriented Dual-bound Web Page Crawling Methods
2	An analysis of search failures in online library catalogs
3	Study On Improved Best-First Algorithm About Focused Crawler's Search Algorithm
4	Crawl Schedule Research For Real-time Vertical Search Engine
5	Research On Model Of Hot Topic Opinion Mining In Virtual Communities
6	Research On Topic Web Page Crawling Strategy For Vertical Search Engine
7	Crawl Technology Research For Real-time Vertical Search Engine
8	Research And Implementation Of Topic-Oriented Search Engine Based On Lucene
9	The Design Of The Model Of Natural Language Processing And Intelligent Search Engine
10	Research And Implementation On Key Techniques Of Topic Search Engine