
Design And Implementation Of Customizable Crawler Engine In Content Convergent Subsystem

Posted on: 2019-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Wang
Full Text: PDF
GTID: 2348330545455580
Subject: Computer Science and Technology
Abstract/Summary:
The new media business under Web 2.0 is no longer confined to producing its own media material; it often obtains material by crawling a large number of media resource websites with a crawler engine. The content convergence subsystem collects data from many websites through a customizable crawler and supplies the crawled data to the China Broadcasting Cloud platform. However, to obtain rich material the crawler engine must vertically crawl a large number of websites: the media sites are numerous, their structures differ from one another, their pages are complex, their data take many forms, and their structures change over time. These problems impose a heavy development burden on system developers and a heavy management burden on system users. Because the business logic of web crawlers keeps changing, common crawler frameworks present a high entry threshold for users.

Based on the characteristics of the content convergence subsystem and the specific user requirements, this thesis designs and implements a customizable crawler engine. The engine shields system users from direct contact with the crawler's business code and provides a lightweight mechanism for expressing data-grabbing logic through descriptive files. Using these description files, system users can rapidly update, batch-manage, and manage in real time the business logic of the crawler engine, and can control crawler execution by flexibly combining system-defined data-capture rules.

To achieve these functions, the thesis first carries out requirement analysis and research on the key technical issues. It establishes that the system must differ from a conventional stand-alone crawler framework and requires a scalable, elastic architecture, and it determines the architecture and working mode of the customizable crawler engine. It then analyzes, from the perspective of the crawler system's applications, which rules should be opened to users: limiting the scope of crawler execution, avoiding anti-crawler measures, extracting data from crawled pages, and performing post-crawl operations. The crawler engine implements these rules through a rule parser and a rule runner. In addition, the framework's HTTP proxy and page de-duplication mechanisms are analyzed. After the requirement analysis and the discussion of key issues, the thesis presents the design and implementation of the customizable crawler engine, verifies the correctness of its functions through testing, and concludes with a summary of the work.
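The abstract itself contains no code, but the description-file mechanism it outlines can be sketched briefly. The following is a minimal illustration, assuming a hypothetical JSON description file with fields such as name, allowed_domains, start_urls, a link-follow selector, and per-field extraction selectors; none of these names come from the thesis, and the real engine's anti-crawler settings, HTTP proxy handling, and page de-duplication are omitted here.

```python
# Minimal sketch: building a Scrapy spider from a descriptive file.
# The description-file schema below is a hypothetical example, not the
# format defined in the thesis.
import json
import scrapy
from scrapy.crawler import CrawlerProcess

DESCRIPTION = json.loads("""
{
  "name": "news_demo",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/news"],
  "follow_css": "a.article-link::attr(href)",
  "fields": {
    "title":   "h1.title::text",
    "publish": "span.date::text",
    "body":    "div.content ::text"
  },
  "download_delay": 1.0
}
""")

def build_spider(desc):
    """Translate a description dict into a Scrapy spider class."""

    class ConfiguredSpider(scrapy.Spider):
        name = desc["name"]
        allowed_domains = desc["allowed_domains"]
        start_urls = desc["start_urls"]
        custom_settings = {"DOWNLOAD_DELAY": desc.get("download_delay", 0)}

        def parse(self, response):
            # Follow the article links selected by the description file.
            for href in response.css(desc["follow_css"]).getall():
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            # Extract each configured field with its CSS selector.
            item = {
                field: " ".join(response.css(sel).getall()).strip()
                for field, sel in desc["fields"].items()
            }
            item["url"] = response.url
            yield item

    return ConfiguredSpider

if __name__ == "__main__":
    process = CrawlerProcess(settings={"USER_AGENT": "demo-bot"})
    process.crawl(build_spider(DESCRIPTION))
    process.start()
```

In the engine described by the thesis, such description files would be handled by the rule parser and rule runner mentioned above; the sketch only shows the basic mapping from a description file to an executable spider.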
Keywords/Search Tags:web crawler, new media, scrapy, software as a service