The Design and Implementation of a Campus Network Search Engine System Using Python-Based Technology

Posted on: 2016-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: D W Geng
Full Text: PDF
GTID: 2308330479451047
Subject: Computer technology
Abstract/Summary:
With the development of the digital campus, the volume of information on campus networks has grown explosively, and finding and locating information has become more difficult. The most common approach today is to use the custom search feature of a general search engine. However, general search engines often fail to include the most recently published news, and campus sites that sit on second-level domains or are reached directly by IP address are not easily indexed by them, so using a traditional general search engine to find information on the campus network has become increasingly inconvenient. To solve these problems, this paper takes the Yanshan University campus network as its object of study and, based on an analysis of the principles of the core modules and the operating workflow of search engines, implements a prototype Python-based search engine for the campus network on the Linux platform. First, this paper outlines the workflow of a search engine, introduces several key technologies, and analyzes the BM25 model that is widely used in current search engines. The crawler module is built with Scrapy, an open source Python crawler framework, and Beautiful Soup, a page-parsing library. This paper points out that Scrapy's built-in URL deduplication approach can lead to heavy memory consumption when crawling sites at large scale, and proposes using a Bloom filter for URL deduplication as an improvement. Second, this paper uses Whoosh, a full-text indexing and search library, to develop the index and search modules of the prototype, and proposes using the open source jieba word segmentation component to improve Whoosh's Chinese word segmentation.
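The Bloom filter idea for URL deduplication can be illustrated with a minimal sketch. This is not the thesis's implementation: the class name `BloomFilter`, the helper `should_crawl`, the SHA-256 double-hashing scheme, and the example URLs are all assumptions chosen for illustration. The point is only that memory use is fixed by the bit array size rather than growing with every stored URL.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter for URL strings (illustrative sketch only)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


def should_crawl(url, seen):
    """Return False if the filter reports the URL as seen, else record it.

    A small fraction of new URLs may be skipped because of false positives.
    """
    if url in seen:
        return False
    seen.add(url)
    return True


if __name__ == "__main__":
    seen = BloomFilter(expected_items=1_000_000)
    for url in ["http://www.example.edu/news/1", "http://www.example.edu/news/1"]:
        print(url, should_crawl(url, seen))
```

In a Scrapy project, one plausible way to use such a filter is a custom duplicate filter class selected through the `DUPEFILTER_CLASS` setting; the trade-off is that a small, configurable share of unseen URLs will be wrongly treated as duplicates.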
The user interface is implemented with Flask, a Python web framework, so that users can access the campus search engine through a web front end. Finally, this paper tests the prototype system; the test results are slightly better than those of general search engines.
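For readers unfamiliar with the BM25 model analyzed in this work, the scoring idea can be sketched in a few lines. This is a generic Okapi BM25 illustration, not code from the thesis: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the sample documents are all assumptions. Documents are assumed to be already tokenized and non-empty.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF variant that never goes negative (an assumption here).
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        ["campus", "library", "opening", "hours"],
        ["campus", "news", "library", "library"],
        ["sports", "news"],
    ]
    print(bm25_scores(["library"], docs))
```

Documents that do not contain any query term score zero, and repeated occurrences of a term raise the score with diminishing returns, which is the behavior that distinguishes BM25 from plain term-frequency weighting.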
Keywords/Search Tags:Campus Search Engine, Scrapy, Whoosh, URL deduplication, Chinese word segmentation