
Design And Implementation Of Inventory Data Collection Platform For Student Accommodation

Posted on: 2018-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: L S Li
GTID: 2348330512497653
Subject: Software engineering
Abstract/Summary:
Beijing UNINOVA LTD is an O2O Internet start-up that provides student accommodation rental information services for overseas students in the United Kingdom. In the Internet business model, a company must on the one hand offer services with a good user experience and on the other hand acquire accommodation information quickly and accurately. Currently, accommodation data is obtained through cooperation with Unite-Students officials by email, or from business competitors, and staff then update the rental information manually. This is inefficient and carries a high administrative cost; moreover, in the peak rental season, room availability and tenancy terms change frequently. The business therefore needs an automatic way to synchronize accommodation information between different platforms, so that the information it holds is current and accurate.

Writing a web crawler to collect web data is an effective means of doing this. Across accommodation platforms, the information structure of the web pages is similar, but the HTML presentation differs. Given this need for customized crawling, the key problems in this project are reducing the workload, and therefore the cost, of writing crawlers; designing the system architecture; controlling the complexity of the crawler modules; decoupling module functions; and cleaning, structuring, and importing data.

During my internship, I took part in developing the accommodation backend data center. With reference to a legacy project, an unfinished Pyspider web crawler application, a new system based on Scrapy was developed. The main hosting-site backend is called Livety, while the data center is called Sharingan. The main backend selects which accommodation data to show in the front end and manages users. Sharingan acts as an accommodation database that stores, processes, and manages the data scraped from different platforms, and as a spider cloud platform that deploys and schedules spiders. The two backends communicate through a message system, which keeps the system loosely coupled.

The development work included:
(1) Modeling the accommodation relational database. A structured data storage model was formulated, which provides the foundation and standard for structuring and importing data.
(2) Designing the architecture of the data center. Based on the overall requirements and the experience gained from the legacy spider system, a general model for web page crawling and scraping was set up. The architecture of the new system, the frameworks and technologies used by the project, and the integration scheme for the function modules were determined. As a result, the development requirements and the general design of the system architecture and modules were clearly defined.
(3) Implementing the concrete function modules, and developing and integrating the subsystems of the data center, including the Scrapy spiders' Fragment modules, processor modules, validator modules, spider scheduling and monitoring modules, the database import module, and the message system. As a result, a preliminary, workable integrated system was built.
(4) Unit testing, integration testing, and system testing of the related modules to ensure that the system operates correctly. Program errors in the system and its modules were found and corrected through testing.

Since launch, the system has been running well. It scrapes data from several platforms to provide accommodation data to the internal presentation system. Its extensibility lays the foundation for a highly versatile data crawling center that can serve more data consumers.
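The processor and validator stages named in item (3) can be illustrated with a minimal sketch. It is not taken from the thesis: it uses plain Python rather than Scrapy, and the field names (`platform`, `room_name`, `price`), the function names, and the price format are all hypothetical, chosen only to show how a raw scraped item might be cleaned and then checked before being imported into the database.

```python
import re
from typing import Callable, Dict, List, Optional

Item = Dict[str, object]


def strip_whitespace(item: Item) -> Item:
    """Processor: trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def parse_weekly_price(item: Item) -> Item:
    """Processor: turn a price string such as '£135.00 per week' into a float.

    The price format is an assumption made for this sketch.
    """
    out = dict(item)
    match = re.search(r"\d+(?:\.\d+)?", str(item.get("price", "")))
    out["price"] = float(match.group()) if match else None
    return out


def validate(item: Item) -> List[str]:
    """Validator: return a list of problems; an empty list means the item is valid."""
    errors = []
    for field in ("platform", "room_name", "price"):  # hypothetical required fields
        if item.get(field) in (None, ""):
            errors.append(f"missing {field}")
    return errors


# Processors run in order; each takes an item and returns a cleaned copy.
PROCESSORS: List[Callable[[Item], Item]] = [strip_whitespace, parse_weekly_price]


def run_pipeline(raw: Item) -> Optional[Item]:
    """Clean a raw scraped item; return None if it fails validation."""
    item = raw
    for process in PROCESSORS:
        item = process(item)
    return None if validate(item) else item
```

Keeping each processor and validator as a small independent function is one way to reflect the decoupling goal stated in the abstract: supporting a new platform would then mostly mean writing a new extraction step, while cleaning and validation are reused.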
Keywords/Search Tags:Web Crawler, Web Data Extraction, Student Accommodation, Data Center, Message System