Research And Application Of CMS Identification System Based On Web Crawler

Posted on: 2018-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2348330518959434
Subject: Computer technology
Abstract/Summary:
The volume of network resources is growing exponentially, and as network technology has matured, the content management system (CMS) has become widely known across the Internet. A CMS is built on a modular design concept and makes it easy to create news sites, social blogs, anime and game sites, video and movie sites, other specialized sites, or integrated web portals. Because many CMSs are open source, their versions are updated constantly, and new systems keep emerging, the range of CMSs that enterprises and individuals use to build sites keeps widening. Some see this as a flourishing of competing approaches; others see a market in which good and bad products are mixed together.

For network professionals, technology selection is a key step in any Internet project. Whether they are creating a basic site, redesigning a web application, carrying out a competitive analysis, or doing early requirements planning, they have to make a sound decision in a diverse technical environment. Technology selection is therefore also an essential step for users who rely on open source CMSs to build sites. In response to this market demand, this thesis discusses the requirements of CMS technology selection, presents a feasibility analysis and a requirements analysis based on statistics about CMSs, and then designs a CMS identification system that gives users the information and functions they need to complete CMS technology selection.

The CMS identification system consists of a crawler client and a web application server. The system's research data comes from the crawler client, so the crawler client is the focus of this thesis. The thesis extends the open source crawler framework go_spider, presents the in-depth customization and implementation of a distributed crawler client that collects data for CMS identification and statistical analysis, and evaluates the performance of the crawler system. Afterwards, the client of the CMS identification system is developed on top of the data collected by the crawler.

The main work is as follows:

First, crawler technology is described, including common crawler frameworks, common crawl strategies, and the URL de-duplication algorithms used during collection. Golang concurrent programming techniques and Redis distributed data storage are also examined. Then the feasibility analysis, requirements analysis, and overall architecture design of the CMS identification system are carried out to provide the basis for the detailed design and development of the system.

Second, starting from the go_spider crawler framework and the data requirements of the CMS identification system, the framework's functions are extended and a customized CMS identification crawler is designed, with detailed analysis and design of the data collection requirements, the crawl strategy, data storage, and other functions.

Third, the implementation of the CMS identification crawler client is described, covering the scheduler module, the middleware processing module, the data download module, the parser module, and the data storage module, and the running system is evaluated.

Fourth, the data collected by the crawler client is used to develop the web server functions of the CMS identification system. These functions include CMS type identification, access to market share analysis data for mainstream CMSs, and queries for the Alexa top 20 sites that use the same CMS and the Alexa top 20 sites of the same site type.

The CMS identification system designed in this thesis not only implements a crawler client based on distributed crawler technology; its web server also addresses the problem of CMS technology selection in the current market, which gives the work both research significance and practical application value.
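
The abstract mentions URL de-duplication and Golang concurrency but does not say how the thesis implements them. As an illustration only, the following is a minimal sketch of one common approach: normalize each URL, then record it in a mutex-protected set shared by concurrent goroutines. The names `SeenSet`, `Normalize`, `Add`, and `Len` are hypothetical and are not taken from go_spider or from the thesis, and the in-memory map stands in for the Redis-backed distributed storage the thesis describes.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// SeenSet is a hypothetical concurrency-safe set of normalized URLs.
// A distributed crawler would keep this state in shared storage instead;
// a local map is used here only to keep the sketch self-contained.
type SeenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewSeenSet returns an empty set.
func NewSeenSet() *SeenSet {
	return &SeenSet{seen: make(map[string]struct{})}
}

// Normalize lowercases the scheme and host, drops the fragment, and
// treats an empty path as "/", so trivially different spellings of the
// same address map to one key.
func Normalize(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.RawFragment = ""
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), nil
}

// Add records the URL and reports whether it was new.
// Unparseable URLs are rejected and reported as not new.
func (s *SeenSet) Add(raw string) bool {
	key, err := Normalize(raw)
	if err != nil {
		return false
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

// Len returns the number of distinct URLs recorded so far.
func (s *SeenSet) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.seen)
}

func main() {
	urls := []string{
		"http://example.com/",
		"HTTP://Example.com",
		"http://example.com/#news",
		"http://example.com/blog",
	}
	set := NewSeenSet()
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) { // one goroutine per URL, as crawler workers might do
			defer wg.Done()
			set.Add(u)
		}(u)
	}
	wg.Wait()
	fmt.Println(set.Len()) // prints 2
}
```

The mutex keeps the check-and-insert step atomic, so two workers that discover the same link at the same moment cannot both treat it as new. At larger scale a Bloom filter or a shared key-value store is a common substitute for the map, trading exactness or locality for memory and distribution.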
Keywords/Search Tags:content management system, CMS identification system, Web Crawler, Golang, Distributed storage