
Design And Implementation Of A Distributed Crawler System For Campus Recruitment Themes

Posted on: 2019-04-19    Degree: Master    Type: Thesis
Country: China    Candidate: J Q Zhang    Full Text: PDF
GTID: 2428330572959982    Subject: Engineering
Abstract/Summary:
With the number of college graduates increasing year by year, graduates have become the main group searching for jobs on the Internet, and their employment has attracted great attention from society. At present there are many recruitment websites, and the recruitment information they publish is often redundant, inefficient to query, and of low reliability. Job seekers spend a great deal of time identifying untruthful postings and filtering duplicates, which reduces the efficiency of job hunting. After an in-depth study of crawler-related technologies and algorithms, this thesis deploys a crawler system on a Hadoop distributed platform to crawl campus recruitment information efficiently and to meet the requirements of the campus recruitment topic. Multiple recruitment websites are crawled in parallel to address the scattered and heterogeneous distribution of campus recruitment postings. Regular-expression-based filtering of topic-irrelevant URL links restricts the crawling range to the domains of three recruitment websites. The similarity between a web page title and a set of feature words is calculated to lower the PageRank score of false recruitment information. The system thus crawls recruitment information comprehensively, reduces the amount of irrelevant information, and provides a convenient interactive interface.

The main work is as follows:

(1) Design of the distributed crawler system for campus recruitment themes. To obtain campus recruitment information, the campus recruitment links on each recruitment website must be crawled. To improve the efficiency of job information extraction, an efficient parallel computing framework is used to extract job information in parallel. To keep crawling focused on the campus recruitment theme, the crawled URL links are filtered. To make it easy for users to query campus recruitment information, a search and query service is provided. Based on these functional requirements, the system is divided into a crawler module, an index module, and a retrieval module.

(2) Implementation of the distributed crawler system for campus recruitment themes. The crawler module is built on the open-source Nutch framework: it performs URL filtering with a regular-expression-based method and is extended through Nutch's plug-in mechanism to compute a posting-reliability PageRank score that fuses the similarity between page titles and feature words (a sketch of this scoring idea is given below). The index module uses the Solr framework to index the crawled campus recruitment data and configures the IK-Analyzer tokenizer for Solr to preprocess campus recruitment web page documents and improve query accuracy. The retrieval module implements a user interaction interface based on JSP and CSS, which makes querying convenient for users.
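The abstract does not give the exact fusion formula, so the following Java sketch only illustrates the general idea: the similarity between a page title and a set of campus-recruitment feature words is combined with a PageRank-style link score, so that off-topic or likely false postings receive a lower reliability score. The feature-word list, the Jaccard similarity measure, the weighting factor ALPHA, and the class name ReliabilityScorer are assumptions made for this illustration, not details from the thesis; in the actual system such a score would be computed inside a Nutch scoring plug-in rather than as a standalone class.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch (not the thesis code): fuse title/feature-word
    // similarity with a PageRank-style link score into a reliability score.
    public class ReliabilityScorer {

        // Hypothetical campus-recruitment feature words; the thesis would use its own list.
        private static final Set<String> FEATURE_WORDS = new HashSet<>(Arrays.asList(
                "campus", "recruitment", "graduate", "2019", "hiring", "position"));

        // Hypothetical weight balancing topic similarity against link-based importance.
        private static final double ALPHA = 0.6;

        // Jaccard similarity between the title's tokens and the feature-word set.
        static double titleSimilarity(String title) {
            Set<String> tokens = new HashSet<>(Arrays.asList(title.toLowerCase().split("\\W+")));
            tokens.remove("");
            if (tokens.isEmpty()) {
                return 0.0;
            }
            Set<String> intersection = new HashSet<>(tokens);
            intersection.retainAll(FEATURE_WORDS);
            Set<String> union = new HashSet<>(tokens);
            union.addAll(FEATURE_WORDS);
            return (double) intersection.size() / union.size();
        }

        // Fused reliability score: low title similarity pulls down the link score.
        static double reliabilityScore(String title, double pageRankScore) {
            return ALPHA * titleSimilarity(title) + (1 - ALPHA) * pageRankScore;
        }

        public static void main(String[] args) {
            // An on-topic campus-recruitment title versus an off-topic one, same link score.
            System.out.println(reliabilityScore("ACME 2019 Campus Recruitment - Graduate Position", 0.8));
            System.out.println(reliabilityScore("Cheap flights and hotel deals", 0.8));
        }
    }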
Keywords/Search Tags:Campus recruitment, Distributed crawling, Information reliability score