Research And Implementation Of The Strategy-Extensible Search Engine

Posted on:2006-10-29

Degree:Master

Type:Thesis

Country:China

Candidate:B S Liu

Full Text:PDF

GTID:2168360155474113

Subject:Software engineering

Abstract/Summary:

Traditional keyword-based search engines, which try to collect and index every Web page, usually return many results that users do not care about. One solution to this problem is to collect only the relevant information by using machine-learning algorithms and interaction with users. This is called focused crawling, and it can improve both the relevance and the freshness of query results. On the other hand, general-purpose search engines cannot collect the vast amount of information in the hidden web, so much useful information cannot be found through them. Hidden-web crawlers are designed specifically for this content: they heuristically construct appropriate queries to hidden-web sources and collect the pages returned. Hidden-web crawlers are therefore a good way to improve the recall of search engines.

Recall and precision are the two major measures for evaluating the performance of search engines. In this thesis, we study focused crawling and hidden-web crawling, two technologies that can improve these measures, and we design and implement an integrated search engine called Webob. Firstly, we design the Webob-Crawler. This thesis presents the view that a search engine that combines focused crawling and hidden-web crawling can achieve ideal recall and precision. Based on the architecture of traditional search engines, we design an open crawler architecture that supports both focused crawling and hidden-web crawling. We also introduce the concept of a task and construct two task-evolution models, and we describe the structure and algorithms of Webob-Crawler. Secondly, we study text categorisation algorithms and implement an extensible text classifier that supports many algorithms, including Rocchio, Naïve Bayes and KNN.

Full-text indexing and the user query interface are also important components of a search engine. We study the theory of information retrieval and describe a full-text index system, based on Lucene, that supports Chinese word segmentation and document summary generation. Finally, we describe the integration of the crawler, the text classifier and the full-text indexer into a complete search engine. This search engine can be used for the study and development of web mining and searching.
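
As an illustration of the focused crawling idea described in the abstract, the following is a minimal Python sketch of a best-first crawl. It is not the Webob-Crawler algorithm, which the abstract does not specify. The in-memory PAGES graph, the function names, the keyword-overlap relevance score and the threshold are all hypothetical stand-ins for a real fetcher and a trained text classifier.

```python
import heapq


def relevance(text, topic_terms):
    """Toy relevance score: fraction of topic terms that occur in the text.

    Hypothetical stand-in for a trained classifier (e.g. Rocchio or Naive Bayes).
    """
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)


def focused_crawl(pages, seeds, topic_terms, threshold=0.5, limit=10):
    """Best-first crawl over an in-memory graph: url -> (text, outlinks).

    Returns the URLs judged relevant, in the order they were visited.
    """
    # heapq is a min-heap, so priorities are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < limit:
        _, url = heapq.heappop(frontier)
        text, outlinks = pages.get(url, ("", []))
        score = relevance(text, topic_terms)
        if score >= threshold:
            collected.append(url)
        # Assumption: a link inherits the score of the page it was found on.
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collected


# Hypothetical five-page web used only for demonstration.
PAGES = {
    "seed": ("web crawler search index", ["a", "b"]),
    "a": ("focused crawler for search engine", ["c"]),
    "b": ("cooking recipes and food", ["d"]),
    "c": ("search engine index crawler design", []),
    "d": ("more recipes", []),
}

if __name__ == "__main__":
    print(focused_crawl(PAGES, ["seed"], ["crawler", "search"]))
```

In this sketch, links found on off-topic pages are still visited, but only after links found on on-topic pages. A real focused crawler would typically also bound the number of pages fetched.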

Keywords/Search Tags:

search engine, information retrieval, focused crawling, hidden-web crawling

Related items

1	Crawling and searching the hidden Web
2	Study On Focused Crawling Technique For Vertical Search Engine
3	Focused Web Crawling Technology
4	Design And Implementation Of A Focused Search Engine
5	Research And Application On Focused Crawling Search Engine Based On The Lucene
6	Research On Focused Crawling Technique For Vertical Search Engine
7	Design And Implementation Of User-customized Desktop Search Engine
8	Focused Web Crawling Strategy Based On Formal Concept Analysis
9	Research On Focused Hidden Web Crawler
10	Spider Crawling On Mobile Search Research And Implementation Strategy