
Design And Implementation Of Top-Scholar Talents Database System Based On Distributed Crawler

Posted on: 2019-06-30  Degree: Master  Type: Thesis
Country: China  Candidate: J W Liu  Full Text: PDF
GTID: 2428330545497842  Subject: Computer technology
Abstract/Summary:
In the era of informatization and big data, new knowledge is a main driving force in the development of society, and scholars are the principal agents of academic research. Many colleges and universities in China and abroad, especially first-class institutions and first-class disciplines, host a large number of scholars with considerable academic influence. Collecting, preserving, tallying, and displaying their academic achievements, and building a complete and accurate talent pool of top scholars, lets users analyze these scholars' research capabilities more comprehensively and objectively and keep track of their academic output and academic influence. This has practical value for talent introduction and training at major universities. The top academic talents database system presented in this thesis is designed to meet the talent-introduction requirements of the Personnel Department of Xiamen University. Guided by ranking lists of institutions and disciplines in China and abroad, the system crawls the Internet to collect information that is as accurate and complete as possible about scholars at major universities and colleges around the world, stores it in a database, filters and sorts the scholars by a reasonable weighted ranking, and finally displays the results to users. The thesis first introduces the background of the project and reviews the techniques behind distributed crawlers, including the Kafka message queue, in-memory cache databases, deduplication strategies, and distributed storage. It then discusses the design and implementation of the whole system. On the business side, the system's innovation is that, instead of covering the entire web as traditional search engines do, it searches the literature using the university list and subject as keywords to pin down first authors' names, and then collects further data by name and organization, which effectively narrows the scope of crawling and improves its efficiency. On the technical side, the system optimizes the traditional Bloom filter algorithm for its own business requirements, uses Kafka's partitioning mechanism to schedule multi-threaded crawler tasks, and reduces the scope of crawling by using different crawl-ordering methods. The result is a distributed system that can efficiently capture information on a large number of scholars from the Internet. Finally, experiments on memory optimization and optimal crawl ordering are designed, and their results are analyzed.
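The abstract does not describe how the Bloom filter was optimized, so the following is only a minimal sketch of the traditional Bloom filter it starts from, applied to URL deduplication in a crawler. The class name, the `seen_before` helper, the SHA-256 double-hashing scheme, and the example URL are all illustrative assumptions, not details taken from the thesis.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter (assumed baseline, not the thesis's optimized variant)."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        ln2 = math.log(2)
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(false_positive_rate) / ln2**2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so strides vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def seen_before(bloom: BloomFilter, url: str) -> bool:
    """Return True if the URL was (probably) crawled already; otherwise record it."""
    if url in bloom:
        return True
    bloom.add(url)
    return False


if __name__ == "__main__":
    bloom = BloomFilter(expected_items=1000, false_positive_rate=0.01)
    url = "https://example.edu/scholar/1"  # hypothetical URL
    print(seen_before(bloom, url))  # first visit
    print(seen_before(bloom, url))  # repeat visit
```

A filter like this never reports a crawled URL as new, but it can occasionally report a new URL as already crawled, so the trade-off between memory and false-positive rate is the natural target for the kind of memory optimization experiments the abstract mentions.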
Keywords/Search Tags:Academic talent database, Distributed crawler, Deduplication strategy