The Design And Implementation Of A Search Engine System On The Campus Network Using Python-Based Technology

Posted on:2016-03-22

Degree:Master

Type:Thesis

Country:China

Candidate:D W Geng

GTID:2308330479451047

Subject:Computer technology

Abstract/Summary:

With the development of the digital campus, the volume of information on campus networks is growing explosively, and finding and locating information is becoming more difficult. The common approach today is to use the custom search feature of a general search engine. However, general search engines often fail to include the most recently published news, and sites that sit on second-level domains of the campus network, or that are accessed directly by IP address, are not readily covered by them. As a result, searching the campus network with a traditional general search engine is becoming increasingly inconvenient. To solve these problems, this paper takes the Yanshan University campus network as its object of study and, based on an analysis of the principles of the core modules and the running processes of search engines, implements a prototype Python-based search engine for the campus network on the Linux platform. First, this paper gives a brief overview of the search engine workflow, introduces several key technologies, and analyzes the BM25 model that is currently widely used in search engines. The crawler module is built with Scrapy, an open source crawler framework based on Python, and Beautiful Soup, a page parsing library. This paper points out that the original URL deduplication approach of the Scrapy framework can lead to serious memory consumption when crawling sites on a large scale, and proposes using a Bloom filter for URL deduplication as an improvement. Second, this paper uses Whoosh, a full-text indexing and searching library, to develop the index and search module of the prototype search engine, and proposes using the open source jieba word segmentation component to improve the Chinese word segmentation capability of Whoosh.
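The BM25 ranking model mentioned above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the sample documents are all assumptions chosen for illustration, and documents are assumed to be tokenized already (for Chinese text, a segmenter such as jieba would supply the tokens).

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Illustrative Okapi BM25; k1 and b are common default choices,
    not values taken from the thesis.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" keeps the value non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents, already tokenized.
docs = [
    ["campus", "network", "search"],
    ["library", "opening", "hours"],
    ["campus", "news", "campus", "events"],
]
scores = bm25_scores(["campus"], docs)
```

In this sample, the document that never mentions the query term scores zero, and the document that mentions it twice ranks above the one that mentions it once, even after length normalization.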
This paper uses the Flask framework, which is based on Python, to implement the user interface, allowing users to access the campus search engine through a web front end. Finally, the prototype system was tested, and the results show that it performs slightly better than general search engines.
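The Bloom-filter approach to URL deduplication proposed in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the class `BloomFilter`, the helper `seen_before`, the bit-array size, the number of hash functions, and the example URL are all hypothetical. Scrapy's default duplicate filter keeps a fingerprint for every request in an in-memory set, whereas a Bloom filter uses a fixed-size bit array at the cost of a small false-positive rate (an unseen URL is occasionally reported as seen).

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for strings (illustrative parameters)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes: 2**20 bits is 128 KiB.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes positions from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def seen_before(bloom, url):
    """Return True if `url` was probably seen already; otherwise record it."""
    if url in bloom:
        return True
    bloom.add(url)
    return False


bloom = BloomFilter()
first = seen_before(bloom, "http://example.edu/news/1")   # hypothetical URL
second = seen_before(bloom, "http://example.edu/news/1")
```

Memory use stays constant no matter how many URLs are added, which is the property that matters for large crawls. In a Scrapy project, logic like this would typically be wrapped in a custom duplicate filter class selected through the `DUPEFILTER_CLASS` setting; that wiring is omitted here.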

Keywords/Search Tags:

Campus Search Engine, Scrapy, Whoosh, URL deduplication, Chinese word segmentation

Related items

1	The Application And Research Of Chinese Word Segmentation And Web Deduplication In News Vertical Search Engine
2	The Campus Network Core Search Engine Technology - Chinese Word Segmentation
3	Research And Development Of Digital Resources Search Engine Technologies On Campus Network
4	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
5	Research On Chinese Word Segmentation Of Search Engine
6	Applied Research Of Chinese Word Segmentation In Agricultural Vertical Search Engine
7	The Research And Realization Of Chinese Word Segmentation System Applies In Chemical Professional Search Engine
8	Design And Implementation Of The Campus Vertical Search Engine Based On Scrapy
9	Study And Implementation On Chinese Word Segmentation Algorithm Of Search Engine Based On Nutch
10	The Research And Application Of Chinese Word Segmentation Technology In Search Engine