Research On IP City-level Geolocation Based On Random Forest

Posted on: 2021-03-16; Degree: Master; Type: Thesis
Country: China; Candidate: Y Q Lei; Full Text: PDF
GTID: 2428330620963462; Subject: Applied Statistics
Abstract/Summary:
Since the start of the 21st century, the Internet has developed rapidly and become an indispensable tool in people's daily lives. As the Internet has spread, online services and network communications have become the norm. Personalized services on the Internet, such as targeted advertising, automatic selection of web page language, real-time local news push, and tracing the source of network security incidents, all require IP geolocation technology, which determines a geographical location from each network host's unique IP address. Although many excellent IP geolocation techniques exist, each has limitations, such as the low accuracy of network measurements and the inability to accurately measure the relationship between variables. This paper therefore proposes an IP city-level geolocation method based on data mining. The method uses the IP address itself as the feature source and trains a classifier with the random forest algorithm to obtain good prediction results.

The paper studies and analyzes the classic existing IP geolocation methods, points out their shortcomings, and proposes an IP city-level geolocation model based on random forests. First, in the model design, a fusion of databases from different sources is proposed in order to obtain a high-precision IP training set, and a database fusion algorithm that uses a heap structure is introduced. The algorithm focuses mainly on fusing the attributes of each database's IP records. In the experiment, two different combinations of databases were selected. Comparative analysis showed that the second group of experiments gave better results: province information for a specific group of records could be identified, and the city recognition rate rose 19-fold.

Second, the paper extracts IP data for 13 cities in Hubei Province, all belonging to the same operator, from the new fused database as training samples, and discards the traditional approach of using network measurement information such as delay and hop count as features. The four bytes of the IP address are used as the four features for model training, producing a single decision tree classifier and a random forest classifier. Experimental results show that the random forest model outperforms the decision tree model to a certain extent, with a prediction accuracy of 97.89%. Finally, comparative analysis shows that machine learning classification with the IP address itself as the feature is feasible. In addition, for geolocation based on domestic IP data, the random forest algorithm outperforms the Naive Bayes algorithm to a certain extent.
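The feature construction described in the abstract, with the four bytes of the IP address used as four features, can be illustrated with a minimal sketch. This is not the thesis's code: the function name `ip_to_features`, the sample addresses (taken from reserved documentation ranges), and the city labels are all hypothetical and are there only to show the shape of the training data.

```python
import ipaddress


def ip_to_features(ip):
    """Split a dotted-quad IPv4 address into its four byte values.

    Hypothetical helper: the abstract only says that the four bytes of the
    address serve as the four features; the name and signature are assumed.
    """
    # IPv4Address.packed returns the address as 4 raw bytes in network order.
    return list(ipaddress.IPv4Address(ip).packed)


# Made-up labeled samples (documentation-range addresses, placeholder labels).
samples = [
    ("192.0.2.17", "city_a"),
    ("198.51.100.200", "city_b"),
    ("203.0.113.5", "city_a"),
]

# Feature matrix with one row of four integers per address, plus a label list.
X = [ip_to_features(ip) for ip, _ in samples]
y = [city for _, city in samples]
```

Under these assumptions, `X` and `y` have the row-per-sample layout that common classifier libraries expect. For example, scikit-learn's `RandomForestClassifier` could be fitted with `fit(X, y)`, although the abstract does not say which implementation the thesis used.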
Keywords/Search Tags: IP geolocation, IP database, feature, random forest