For a long time, indemnificatory housing has been a major livelihood issue of public concern. The state adheres to the policy that housing is for living in, not for speculation, in order to ensure the healthy and stable development of the real estate market. Based on a survey of the second-hand housing market in Tianjin, this thesis finds that the information provided by the various real estate intermediary systems is inconsistent, which easily creates information asymmetry during transactions and causes economic losses to buyers. To avoid such problems, this thesis combines statistical methods with actual market conditions to run a regression analysis on the factors that make up second-hand house prices, and finally predicts the transaction price of second-hand houses in Tianjin with an integrated model. This makes the evaluation of second-hand house prices more scientific, standardized and reasonable, and offers practical guidance to buyers.

First, the data source. Offline visits to and surveys of real estate intermediary companies show that Beike (Shell) and Lianjia offer good service and a high professional level, are widely recognized in the industry, and have extensive business coverage. An online comparison of large intermediary platforms, including Anjuke, Xingfuli, Fangtianxia and Lianjia, shows that Lianjia has the most complete information. Therefore, this thesis selects Tianjin Lianjia data as the basic data source for the analysis and modeling of second-hand houses.

Second, indicator data processing and analysis. Because the crawled data come in a variety of formats, Spark SQL is first used to clean them so that they conform to a standardized indicator format. A systematic analysis of the indicator data then shows a significant linear relationship between the transaction price of a house and its building area, listing price, house type, geographical location, educational resources, and other factors.

Third, feature selection and algorithm modeling. Feature importance is compared first, and the indicators most strongly correlated with the transaction price are selected as the feature variables for modeling. Multiple linear regression, XGBoost, LightGBM and random forest are then used to build models and obtain training and prediction results. According to MAE and other evaluation criteria, LightGBM performs well among the individual models but is not stable enough. After investigation, this thesis finally uses a unified model to integrate all the individual models; the empirical results show that the integrated model generalizes better and performs more stably.

Finally, the research conclusions are summarized. Comparison and verification of the model predictions show that the integrated model has strong generalization ability, explains new data well, and offers guidance to buyers. The analysis also finds that the stable development of real estate depends on both market and policy factors. Government agencies need to direct resources toward regions with unbalanced development so that regions develop in a balanced way; real estate agents need to promote the standardization of system information and guide sellers toward reasonable pricing; and consumers need to investigate thoroughly and make reasonable choices to avoid economic losses.
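
The comparison of individual models with an integrated model under MAE can be illustrated with a minimal sketch. The abstract does not specify how the models are integrated, so equal-weight averaging is assumed here as one simple possibility. The prices and predictions below are made-up placeholder numbers, not results from this thesis, and the helper names `mae` and `average_ensemble` are hypothetical.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def average_ensemble(predictions):
    """Equal-weight average of per-sample predictions across models.

    ASSUMPTION: simple averaging stands in for the unspecified
    integration method; stacking or weighted averaging are alternatives.
    """
    return [mean(column) for column in zip(*predictions.values())]


# Placeholder data for illustration only (NOT taken from the thesis).
y_true = [100.0, 150.0, 200.0, 250.0]
model_predictions = {
    "linear_regression": [110.0, 140.0, 210.0, 240.0],
    "xgboost": [95.0, 155.0, 190.0, 260.0],
    "lightgbm": [102.0, 148.0, 205.0, 245.0],
    "random_forest": [90.0, 160.0, 195.0, 255.0],
}

# Score every individual model, then the averaged ensemble.
scores = {name: mae(y_true, preds) for name, preds in model_predictions.items()}
ensemble_pred = average_ensemble(model_predictions)
scores["ensemble_mean"] = mae(y_true, ensemble_pred)

# Print models from lowest to highest MAE.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {score:.3f}")
```

In practice the per-model predictions would come from fitted regressors evaluated on held-out listings, and the stability of each model could be checked by repeating the comparison across several train/test splits.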