Design and Implementation of the Campus Vertical Search Engine Based on Scrapy

Posted on: 2021-03-31    Degree: Master    Type: Thesis
Country: China    Candidate: W Ma    Full Text: PDF
GTID: 2428330602986160    Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of campus digitization, the amount of campus information on the Internet keeps growing, and it has become harder for users to retrieve the campus information they need. At present, most teachers and students rely on the search function of the school website to look up information. General-purpose search engines, however, cover a very wide range of information categories and fields, which makes subject-focused searching difficult; moreover, some campus websites run only on the local area network, so their content cannot be indexed by general search engines at all. To address these problems, this thesis takes campus websites as the research object and, exploiting their structured characteristics, designs a campus vertical search engine based on Scrapy. The search engine consists of three main functional modules: page download, index retrieval, and search query. Its purpose is to provide convenient, fast, and specialized search services for teachers and students on campus and to promote the development of the digital campus.

The research work in this thesis includes:

(1) A customized web crawler is developed on the Scrapy framework. Scrapy's built-in method for removing duplicate links is analyzed and found to consume a large amount of memory when crawling website information on a large scale, so a Bloom filter is integrated into the Scrapy framework to improve the crawler's ability to remove duplicate links (see the sketch after this abstract). In addition, drawing on experience from practical use, a solution is proposed for the case in which the crawler is banned by the target website's server because of overly frequent access.

(2) The PageRank algorithm is improved to raise the ranking quality of search results. PageRank and HITS are studied and compared; because PageRank offers higher computational efficiency and can handle a larger amount of data, it is adopted as the ranking algorithm. PageRank's ranking results nevertheless have shortcomings such as a bias toward older pages, evenly divided weight values, and topic drift. A time factor is therefore added to improve page scores, a weight factor is added to improve the weights of out-linked pages, and a topic-relevance factor is added to reduce the "topic drift" phenomenon in search results (a sketch of such a weighted iteration follows below).

(3) The Whoosh search library is used to build the index, and the Jieba word segmenter is introduced into the index retrieval module to improve word segmentation. Flask is used to implement the search query module of the search engine, so that entering a query sentence quickly returns search results with high topic relevance and provides users with a good search service (see the indexing and query sketch below).

(4) Finally, the system is tested and analyzed. Tests of the Bloom filter applied to the Scrapy crawler show that it greatly reduces the memory occupied by the running program, and tests of the improved PageRank algorithm show that it improves the ranking quality of the retrieval results.
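The following is a minimal sketch of how a Bloom filter can replace Scrapy's default set-based duplicate-link removal, as described in point (1). The bit-array size, hash count, and class names are illustrative assumptions, not values taken from the thesis; the thesis's own Bloom-filter implementation may differ.

```python
import hashlib

from scrapy.dupefilters import RFPDupeFilter


class SimpleBloomFilter:
    """Bit-array Bloom filter built on hashlib, so no extra dependency is needed."""

    def __init__(self, bit_count=1 << 24, hash_count=7):
        self.bit_count = bit_count
        self.hash_count = hash_count
        self.bits = bytearray(bit_count // 8)

    def _positions(self, key):
        # Derive hash_count bit positions by salting the key with different seeds.
        for seed in range(self.hash_count):
            digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.bit_count

    def add(self, key):
        """Insert key; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


class BloomDupeFilter(RFPDupeFilter):
    """Drop-in dupefilter: enable via DUPEFILTER_CLASS = '<your_module>.BloomDupeFilter'."""

    def __init__(self, path=None, debug=False, *args, **kwargs):
        super().__init__(path, debug, *args, **kwargs)
        self.bloom = SimpleBloomFilter()

    def request_seen(self, request):
        # Returning True tells Scrapy to drop the request as a duplicate.
        return self.bloom.add(self.request_fingerprint(request))
```

Because the Bloom filter stores only bits rather than full fingerprints, memory use stays roughly constant as the number of crawled links grows, at the cost of a small, tunable false-positive rate.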
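Below is a hedged sketch of the improved PageRank idea in point (2): instead of dividing a page's score evenly among its out-links, each link is weighted by the target page's time (freshness) factor and topic-relevance factor. The way the factors are combined (a simple product) and the default values are assumptions for illustration only; the thesis does not give its exact formula in this abstract.

```python
def improved_pagerank(out_links, time_factor, topic_relevance,
                      damping=0.85, iterations=50):
    """out_links: {url: [linked urls]}; time_factor / topic_relevance: {url: score in (0, 1]}."""
    # Collect every page that appears either as a source or as a link target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    def link_weight(target):
        # Weight an out-link by the freshness and topic relevance of its target;
        # 0.5 is an arbitrary default for pages with no recorded factor.
        return time_factor.get(target, 0.5) * topic_relevance.get(target, 0.5)

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = out_links.get(page, [])
            total = sum(link_weight(t) for t in targets)
            if total == 0:
                # Dangling page: spread its score evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
                continue
            for t in targets:
                new_rank[t] += damping * rank[page] * link_weight(t) / total
        rank = new_rank
    return rank
```

Compared with standard PageRank, fresher and more topic-relevant target pages receive a larger share of each source page's score, which is one way to counteract the bias toward old pages and the topic-drift effect mentioned above.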
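Finally, a minimal sketch of point (3): building a Whoosh index whose TEXT fields use Jieba's ChineseAnalyzer for segmentation, plus a small Flask route that answers queries against it. The field names, the "indexdir" directory, the page-dictionary format, and the /search endpoint are illustrative assumptions about how the crawler output and query module might be wired together.

```python
import os

from flask import Flask, jsonify, request
from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Use Jieba for Chinese word segmentation inside Whoosh's text analysis chain.
analyzer = ChineseAnalyzer()
schema = Schema(url=ID(stored=True, unique=True),
                title=TEXT(stored=True, analyzer=analyzer),
                content=TEXT(stored=True, analyzer=analyzer))


def build_index(pages, index_dir="indexdir"):
    """pages: iterable of dicts with 'url', 'title', 'content' produced by the crawler."""
    os.makedirs(index_dir, exist_ok=True)
    ix = create_in(index_dir, schema)
    writer = ix.writer()
    for page in pages:
        writer.add_document(url=page["url"], title=page["title"], content=page["content"])
    writer.commit()


app = Flask(__name__)


@app.route("/search")
def search():
    # Parse the user's query sentence and return the top matching pages.
    query_text = request.args.get("q", "")
    ix = open_dir("indexdir")
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(query_text)
        hits = searcher.search(query, limit=10)
        results = [{"url": hit["url"], "title": hit["title"]} for hit in hits]
    return jsonify(results)
```

In a complete system the results returned by Whoosh would additionally be re-ranked with the improved PageRank scores before being shown to the user.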
Keywords/Search Tags:Vertical search engine, Scrapy, PageRank algorithm