
Design And Implementation Of A Distributed Crawler System For Campus Recruitment Themes

Posted on: 2019-04-19    Degree: Master    Type: Thesis
Country: China    Candidate: J Q Zhang    Full Text: PDF
GTID: 2428330572959982    Subject: Engineering
Abstract/Summary:
With the number of college graduates increasing year by year, graduates have become the main group searching for jobs on the Internet, and their employment has attracted great attention from society. At present there are many recruitment websites, and the recruitment information they publish is often redundant, inefficient to query, and of low reliability. Job seekers spend a great deal of time identifying untruthful postings and filtering duplicates, which reduces the efficiency of job hunting. After an in-depth study of crawler-related technologies and algorithms, this thesis deploys a crawler system on a Hadoop distributed platform to crawl campus recruitment information efficiently and to meet the requirements of the campus recruitment topic. Multiple recruitment websites are crawled in parallel to address the scattered and heterogeneous distribution of campus recruitment postings. Regular-expression-based filtering of topic-irrelevant URL links restricts the crawling range to the domains of three recruitment websites. The similarity between a web page title and a set of feature words is calculated to lower the PageRank score of false recruitment information. The system thus crawls recruitment information comprehensively, reduces the amount of irrelevant information, and provides a convenient interactive interface.

The main work is as follows:

(1) Design of the distributed crawler system for campus recruitment themes. To obtain campus recruitment information, the campus recruitment links on each recruitment website must be crawled. To improve the efficiency of job information extraction, an efficient parallel computing framework is used to extract job information in parallel. To keep crawling focused on the campus recruitment theme, the crawled URL links are filtered. To make it easy for users to query campus recruitment information, a search and query service is provided. Based on these functional requirements, the system is divided into a crawler module, an index module, and a retrieval module.

(2) Implementation of the distributed crawler system for campus recruitment themes. The crawler module is built on the open-source Nutch framework: it performs URL filtering with a regular-expression-based method and is extended through Nutch's plug-in mechanism to compute a posting-reliability PageRank score that fuses the similarity between page titles and feature words (a sketch of this scoring idea is given below). The index module uses the Solr framework to index the crawled campus recruitment data and configures the IK-Analyzer tokenizer for Solr to preprocess campus recruitment web page documents and improve query accuracy. The retrieval module implements a user interaction interface based on JSP and CSS, which makes querying convenient for users.
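The abstract does not give the exact fusion formula, so the following Java sketch only illustrates the general idea: the similarity between a page title and a set of campus-recruitment feature words is combined with a PageRank-style link score, so that off-topic or likely false postings receive a lower reliability score. The feature-word list, the Jaccard similarity measure, the weighting factor ALPHA, and the class name ReliabilityScorer are assumptions made for this illustration, not details from the thesis; in the actual system such a score would be computed inside a Nutch scoring plug-in rather than as a standalone class.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch (not the thesis code): fuse title/feature-word
    // similarity with a PageRank-style link score into a reliability score.
    public class ReliabilityScorer {

        // Hypothetical campus-recruitment feature words; the thesis would use its own list.
        private static final Set<String> FEATURE_WORDS = new HashSet<>(Arrays.asList(
                "campus", "recruitment", "graduate", "2019", "hiring", "position"));

        // Hypothetical weight balancing topic similarity against link-based importance.
        private static final double ALPHA = 0.6;

        // Jaccard similarity between the title's tokens and the feature-word set.
        static double titleSimilarity(String title) {
            Set<String> tokens = new HashSet<>(Arrays.asList(title.toLowerCase().split("\\W+")));
            tokens.remove("");
            if (tokens.isEmpty()) {
                return 0.0;
            }
            Set<String> intersection = new HashSet<>(tokens);
            intersection.retainAll(FEATURE_WORDS);
            Set<String> union = new HashSet<>(tokens);
            union.addAll(FEATURE_WORDS);
            return (double) intersection.size() / union.size();
        }

        // Fused reliability score: low title similarity pulls down the link score.
        static double reliabilityScore(String title, double pageRankScore) {
            return ALPHA * titleSimilarity(title) + (1 - ALPHA) * pageRankScore;
        }

        public static void main(String[] args) {
            // An on-topic campus-recruitment title versus an off-topic one, same link score.
            System.out.println(reliabilityScore("ACME 2019 Campus Recruitment - Graduate Position", 0.8));
            System.out.println(reliabilityScore("Cheap flights and hotel deals", 0.8));
        }
    }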
Keywords/Search Tags:Campus recruitment, Distributed crawling, Information reliability score