
Research On Beijing Housing Prices Based On Web Crawlers

Posted on: 2019-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: M Zheng
Full Text: PDF
GTID: 2428330545456440
Subject: Computer technology
Abstract/Summary:
With the advent of the information revolution, the rapid development of the Internet, and changing lifestyles, the Internet has become a necessity of daily life. Given the vast amount of resources online, learning to use search engines well brings great convenience. When we search with a few keywords, the search engine returns a large amount of related information, and the technology behind this process is the web crawler. Search engines rely on crawlers to collect relevant information from the massive body of data on the network and to respond quickly. However, with the arrival of the data era, the amount of information on the Internet is enormous, and as technology has evolved, anti-crawler measures have become increasingly important and aggressive. Extracting the information we need has become more and more difficult.

In addition, "house prices" is now one of the most frequently heard terms in daily life and closely affects everyone, so house price data merits extensive study. Applying web crawler technology to house prices is one of the topics of this thesis. The first step is to obtain house price data, which requires determining the data source, that is, the target website. After a comparative analysis of several well-known real estate information sites, "enjoy home" was chosen as the crawl target. The next question is how to retrieve the house price data, which can be done with web crawler technology.

For the crawler itself, this thesis uses a new web crawler framework called elastic-spider, a distributed crawler framework developed in Java. It is also one of the key research topics of this thesis. The web crawler frameworks in wide use today include Nutch, Scrapy, and Crawler4j, but each has shortcomings. Nutch's support for customized crawling is very weak, and its crawling efficiency is low when the cluster is too small; Scrapy crawls slowly; and Crawler4j does not support crawling dynamic web pages, i.e., it does not support AJAX requests. The elastic-spider framework addresses these problems and has three major advantages. First, the framework is asynchronous, so execution efficiency is very high. Second, it supports distributed crawling, so the failure of a single node in the cluster does not make the entire service unavailable. Third, it is highly extensible: modules such as downloading, parsing, and storage can all be customized by developers.

The crawler implemented in this research is built on the elastic-spider framework. It crawled a total of 1,250 Beijing house price records in 25 minutes, an average of 50 records per minute, while getting past the target site's anti-crawler strategy, so the overall crawl speed is still efficient. The crawler was also deployed on seven physical machines and is highly stable: the crawler service does not become unavailable when one of the physical machines in the cluster goes down. Finally, the house price data is combined with data mining techniques through two methods commonly used in that field, decision tree analysis and the KNN classification algorithm. After data preprocessing, analysis and modeling, and prediction, the final research results were obtained.
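As an illustration of the KNN classification step mentioned in the abstract, the following is a minimal sketch in Python. It is not the thesis's implementation: the feature choice (floor area in square meters, distance to the city center in kilometers), the "high"/"low" price-band labels, the function name knn_predict, and the six listings are all hypothetical and made up for the example. A real pipeline would also scale the features before computing distances.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (feature_tuple, label) pairs
    query: feature tuple with the same length as the training features
    """
    # Rank training points by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the neighbors' labels (use an odd k to avoid ties
    # when there are two classes).
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy listings: (area_sqm, distance_to_center_km) -> price band.
listings = [
    ((60.0, 3.0), "high"),
    ((55.0, 4.0), "high"),
    ((70.0, 5.0), "high"),
    ((90.0, 25.0), "low"),
    ((100.0, 30.0), "low"),
    ((85.0, 28.0), "low"),
]

print(knn_predict(listings, (65.0, 6.0), k=3))
print(knn_predict(listings, (95.0, 27.0), k=3))
```

With these toy points, the first query sits among the central listings and the second among the outlying ones, so each takes the label of its own group.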
Keywords/Search Tags:Web Crawlers, Elastic-spider, House Price, Data Mining