The Design And Implementation Of Verification Code Recognition Module In "Tianyancha" Distributed Crawl System

Posted on: 2018-10-04  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Zhang  Full Text: PDF
GTID: 2348330512998438  Subject: Software engineering
Abstract/Summary:
"Tianyancha" is a comprehensive enterprise information query platform and a professional tool for mining relationships between enterprises. It covers business information, legal proceedings, trademarks and patents, foreign investment, bidding, dishonesty records, operating abnormalities, corporate annual reports, recruitment, and news for more than 80 million enterprises nationwide, and it is kept up to date with the website of the State Administration for Industry and Commerce. "Tianyancha" visualizes the relationships between entities over the Internet and provides users with comprehensive, reliable enterprise data analysis that can help them uncover hidden business interests. It is suited to finance professionals, lawyers, journalists, and business people who need to keep abreast of business conditions and gain insight into business information.

However, when public information is crawled from the Internet, many types of verification code are encountered, such as completing Chinese idioms, pinyin, arithmetic, and English alphanumeric characters. Manual identification and traditional recognition methods cannot keep up with the demands of large-scale data crawling. It is therefore necessary to design a system that recognizes verification codes efficiently, speeding up information acquisition and providing support for future data mining.

This project is based on a real application project at the company. It derives from the verification code recognition system of "Tianyancha" and is designed and implemented with deep learning technology. The work described in this thesis includes completing the requirements analysis of the verification code recognition system and designing its technical architecture. The system's functionality is decomposed into three relatively independent parts: the verification code training subsystem based on deep learning, the verification code identification service subsystem, and the crawler application subsystem. For each of the three parts, the outline design, detailed design, and implementation were completed. The original Spring and Redis technology architecture was adapted to match the upgraded design, and functional testing of the system was also completed.

The results of this thesis have been successfully applied in production on the "Tianyancha" platform, where the verification code recognition rate is high and crawling efficiency is greatly improved. A software copyright application for the software described in the thesis has also succeeded. The results further suggest that machine learning, and deep learning in particular, has great application prospects in the field of verification code recognition and is worth further study.
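The three-part decomposition described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the thesis's implementation: every name here (`train_stub_model`, `RecognitionService`, `crawl`, and the page dictionary layout) is hypothetical, and the "training" step simply memorizes labelled samples in place of the deep network the thesis trains.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names in this sketch are hypothetical; the abstract does not
# describe the real interfaces between the three subsystems.


def train_stub_model(samples: Dict[bytes, str]) -> Callable[[bytes], str]:
    """Stand-in for the training subsystem.

    A real system would train a deep network; this stub only memorizes
    the labelled samples so the example stays self-contained.
    """
    table = dict(samples)

    def predict(image: bytes) -> str:
        # Unknown images yield an empty answer.
        return table.get(image, "")

    return predict


@dataclass
class RecognitionService:
    """Stand-in for the verification code identification service subsystem."""

    model: Callable[[bytes], str]

    def recognize(self, image: bytes) -> str:
        return self.model(image)


def crawl(pages: List[dict], service: RecognitionService) -> List[str]:
    """Stand-in for the crawler application subsystem.

    Each page is a dict with a "url" and, optionally, a "captcha" image
    plus the "expected" answer that would unlock the page.
    """
    fetched = []
    for page in pages:
        captcha = page.get("captcha")
        if captcha is not None:
            answer = service.recognize(captcha)
            if answer != page["expected"]:
                continue  # wrong answer: the page stays locked
        fetched.append(page["url"])
    return fetched
```

The point of the separation is that the crawler only depends on `recognize`, so the model behind the service can be retrained or replaced without touching the crawling code.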
Keywords/Search Tags:Deep Learning, Verification Code Recognition, Distributed Crawler, Caffe, Deep Neural Network