Design And Implementation Of Recruitment Information Parallel Extraction

Posted on: 2017-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: D Wang
GTID: 2308330503964114
Subject: Computer technology
Abstract/Summary:
As the number of recruitment websites grows, they have become an important channel for job seekers to find jobs. Every day, thousands of job postings, large in volume and varied in form, appear on different recruitment websites. To obtain the position information they need accurately and comprehensively, job seekers must query these websites or use search tools. Querying individual websites makes it difficult to find all relevant jobs and adds the workload of switching between sites. General search tools, by comparison, return comprehensive position information but suffer from redundancy and leave users to sift through the results. Job seekers therefore need a dedicated search tool for position information.

Because position information is dispersed and large in volume, a distributed system is used to extract it in parallel, which improves extraction speed. A regular (pattern-based) extraction method is applied to handle the diversity and disorder of the information. In addition, an efficient method is needed to filter redundant information. Together, this work provides a friendly interactive interface for searching position information and extracts comprehensive information with little redundancy. The main contents are as follows:

(1) The overall structure of the position information parallel extraction system. Position URLs are crawled from recruitment websites as the data source for information extraction. To obtain the specific position information, it is extracted from the page that each position URL points to. Because the extracted position information contains redundancy, identical records must be filtered out before the data is stored in the database. A search service lets users obtain position information conveniently.
Therefore, the system mainly comprises a position information crawler module, an information extraction module, a de-duplication module, and a search module.

(2) Design and implementation of the parallel extraction system for position information. Because position information is dispersed and large in volume, the crawler module adopts distributed Nutch to crawl URLs; multiple nodes work simultaneously, so links to position information are crawled quickly. Because position information is diverse and complex, the extraction module combines the regular (pattern-based) method with a parallel computing framework. The regular method can accurately parse complex information structures, and parallel computing is highly efficient, so together they achieve accurate and efficient extraction. To reduce redundancy, the de-duplication module combines the MD5 algorithm with Spark and filters identical position information iteratively. Spark's in-memory computation is well suited to iterative workloads, which improves the efficiency of de-duplication. In addition, a search function retrieves position information from the database.

(3) Functional and performance testing of the position information parallel extraction system. The F-measure, a combined evaluation of recall and precision, reached 97.6% for position information extraction, and the de-duplication rate was 100%. The test results show that extraction accuracy and the de-duplication rate meet user needs. Finally, based on the test results, the shortcomings of the system are pointed out, along with directions for further optimization.
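
The extraction and de-duplication steps described in (2) can be illustrated with a small single-process sketch. It is not the thesis implementation: it assumes that the "regular" method refers to regular expressions, it uses plain Python in place of Nutch and Spark, and the field names, page layout, and patterns are hypothetical, since the abstract does not list the actual rules. It only shows the idea of extracting fields by pattern and discarding records whose MD5 fingerprint has already been seen.

```python
import hashlib
import re

# Hypothetical field patterns; the abstract does not give the real rules.
FIELD_PATTERNS = {
    "title": re.compile(r"Title:\s*(.+)"),
    "company": re.compile(r"Company:\s*(.+)"),
    "city": re.compile(r"City:\s*(.+)"),
}


def extract_position(page_text):
    """Apply each pattern to the page text and collect the matched fields."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_text)
        record[name] = match.group(1).strip() if match else ""
    return record


def fingerprint(record):
    """MD5 digest of the lower-cased field values, joined in a fixed order."""
    normalized = "|".join(record[name].lower() for name in sorted(record))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each fingerprint and drop later repeats."""
    seen = set()
    unique = []
    for record in records:
        digest = fingerprint(record)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    pages = [
        "Title: Java Engineer\nCompany: Acme\nCity: Shanghai",
        "Title: java engineer \nCompany: ACME\nCity: Shanghai",  # same job
        "Title: Data Analyst\nCompany: Acme\nCity: Beijing",
    ]
    unique = deduplicate([extract_position(page) for page in pages])
    print(len(unique))  # 2
```

In a distributed setting the same fingerprint could serve as a grouping key, so that records with equal digests meet on one node; how the thesis arranges this with Spark is not stated in the abstract.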
Keywords/Search Tags: Nutch crawler, distributed system, regular information extraction, HBase database, MD5 algorithm