Research And Implementation Of The Strategy-Extensible Search Engine

Posted on: 2006-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: B S Liu
Full Text: PDF
GTID: 2168360155474113
Subject: Software engineering
Abstract/Summary:
Traditional keyword-based search engines that try to collect and index every Web page usually return many results that users do not care about. One solution is to collect only the relevant information by using machine learning algorithms and interaction with users. This is called focused crawling, and it can improve the relevance and freshness of query results. General-purpose search engines also cannot collect the vast amount of information in the hidden Web, so much useful information cannot be found through them. Hidden-web crawlers are designed specifically for such content: they heuristically construct appropriate queries to hidden-web sources and collect the pages returned, which offers a good way to improve the recall of search engines.

Recall and precision are two major criteria for evaluating the performance of search engines. In this thesis, we study focused crawling and hidden-web crawling, two technologies that can improve these criteria, and we design and implement an integrated search engine called Webob.

First, we design the Webob-Crawler. This thesis argues that a search engine combining focused crawling and hidden-web crawling can achieve ideal recall and precision. Based on the architecture of traditional search engines, we design an open crawler architecture that supports both focused crawling and hidden-web crawling. We also introduce the concept of a task and construct two task-evolution models. The thesis describes the structure and algorithms of the Webob-Crawler.

Second, we study text categorisation algorithms and implement an extensible text classifier that supports many algorithms, including Rocchio, Naïve Bayes, and KNN.

The full-text index and the user query interface are also important components of a search engine. We study the theory of information retrieval and describe a full-text index system, based on Lucene, that supports Chinese word segmentation and document abstract generation.

Finally, we introduce the integration of the crawler, the text classifier, and the full-text indexer to construct a complete search engine. This search engine can be used for the study and development of Web mining and search.
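
The abstract names recall and precision as the two evaluation criteria but does not define them. As a minimal sketch using the standard information-retrieval definitions (not code from the thesis), precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that are returned. The function name `precision_recall` and the document ids below are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: iterable of document ids the engine returned
    relevant:  iterable of document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages returned, 5 relevant pages exist, 3 overlap.
p, r = precision_recall(["a", "b", "c", "x"], ["a", "b", "c", "d", "e"])
print(p, r)  # 0.75 0.6
```

Read against the abstract, focused crawling mainly targets the first number (fewer irrelevant pages among those returned), while hidden-web crawling targets the second (more of the relevant pages become reachable at all).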
Keywords/Search Tags: search engine, information retrieval, focused crawling, hidden-web crawling