Design and Implementation of a Topic-Focused Crawler Based on HTML Tags

Posted on: 2010-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
Full Text: PDF
GTID: 2208360275483759
Subject: Software engineering
Abstract/Summary:
A crawler is an indispensable component of a search engine and one of its essential techniques: it is a system that automatically fetches and downloads web pages from the Internet on the search engine's behalf. A general search engine's crawler usually starts crawling from several seed URLs. A subject-based (topic-focused) crawler does the same, but it can also evaluate the links and the content of the pages it fetches. It does not aim for maximum coverage. Instead, it aims to collect web pages relevant to a particular subject, filter out irrelevant ones, and save the collected pages to a database for later querying. The subject-based crawler has become a hot research topic in the field of search engine technology and has had a profound influence on search in specialized fields.

The thesis first introduces the techniques relevant to the subject-based crawler. It then presents the fundamental principles and workflow of the general crawler and the subject-based crawler and analyzes their differences, followed by the web page search strategy and parsing algorithm.

For determining whether a web page is relevant to the subject, the thesis first introduces the traditional text-based parsing algorithm and explains its flaws. It then proposes a relevance algorithm based on "HTML tags", which assigns different weights to text according to the type of HTML tag that encloses it, so as to ensure accurate subject classification. In practice, the weights of the HTML tags can be adjusted to achieve better results.

For the design of the subject-based crawler, the thesis analyzes the overall design of the system and describes its design and implementation. It analyzes the overall workflow of the system and its division into sub-modules, and describes the specific design and implementation of each sub-module.
The thesis also analyzes the more important techniques used in designing the sub-modules, with the aim of keeping coupling between sub-modules low and the crawler's execution efficiency high. The "HTML tags" method is used to improve the accuracy of subject classification and the recall rate of web pages.

Finally, the thesis analyzes the data crawled by the subject-based crawler to demonstrate that it can improve search accuracy to some extent.
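
The tag-weighting idea summarized above can be illustrated with a minimal sketch. It is not the algorithm from the thesis, whose weights and matching rules are not given in this abstract. The weight values, the simple term-matching rule, and the names TAG_WEIGHTS, TagWeightedScorer, and relevance are all assumptions made for illustration. The sketch uses only Python's standard html.parser module.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights (assumption, not from the thesis): text inside
# these tags counts more than ordinary body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "a": 2.0, "b": 1.5, "strong": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}  # their content is not page text


class TagWeightedScorer(HTMLParser):
    """Accumulates tag-weighted counts of topic terms and of all words."""

    def __init__(self, topic_terms):
        super().__init__()
        self.topic_terms = {term.lower() for term in topic_terms}
        self.open_tags = []   # stack of currently open tags we care about
        self.matched = 0.0    # weighted count of words that are topic terms
        self.total = 0.0      # weighted count of all words

    def handle_starttag(self, tag, attrs):
        # Track only weighted or skipped tags; this also avoids piling up
        # void elements such as <br> that never receive an end tag.
        if tag in TAG_WEIGHTS or tag in SKIPPED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.open_tags):
            return
        # Use the largest weight among the enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.total += weight
            if word in self.topic_terms:
                self.matched += weight


def relevance(html, topic_terms):
    """Return the weighted share of topic terms in the page, from 0.0 to 1.0."""
    scorer = TagWeightedScorer(topic_terms)
    scorer.feed(html)
    scorer.close()
    return scorer.matched / scorer.total if scorer.total else 0.0


if __name__ == "__main__":
    page = "<html><head><title>Python crawler</title></head><body><p>cooking recipes</p></body></html>"
    print(relevance(page, ["crawler"]))
```

A focused crawler built along these lines could keep a page only when its score exceeds a chosen threshold. Both the threshold and the weights would need tuning on real data, which matches the abstract's remark that the tag weights can be altered in practice.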
Keywords/Search Tags:topic focused crawler, search engine, HTML tags, anchor text