| Social media platforms have become an important venue for a wide range of activities such as communication, making friends, and trade. Location analysis of social media data has been a research hotspot in data analysis in recent years. User location inference on social media platforms refers to inferring users’ geographical locations from their public social media data. Research on this technology can connect virtual network domains with real-world regions and provide core technical support for government departments in work bearing on the national economy and people’s livelihood, such as sensitive target positioning, public health monitoring, public opinion tracking and tracing, regional behavior analysis, and targeted commercial promotion. However, because location information in social media data is sparse and noisy, supporting data for user location inference is difficult to obtain at scale, and key techniques such as user text feature extraction, characterization of user social relationship features, and mining of latent correlations among multi-source location information are difficult to realize accurately. As a result, the accuracy of existing user location inference methods on social media platforms is poor and does not meet practical demand. This paper focuses on the construction of a high-precision, versatile location word dataset for a target region, the rapid and accurate acquisition of user and blog data in the target region, and the accurate inference of user location in different application scenarios. The main work and innovations of this paper are as follows: 1. This paper briefly describes the research background and significance of user location inference on social media platforms, introduces mainstream social media platforms and their user data, and summarizes the domestic and international research status of user location
inference technologies on social media platforms in terms of commonly used datasets, acquisition of supporting data for location inference, and user location inference algorithms. On this basis, it points out that existing location inference technology for social media users struggles to meet practical needs because of the lack of supporting data, the difficulty of accurately extracting text features, the unexplored structural characteristics of the relationship graph, and the improper use of higher-order neighbor information. The paper therefore pursues four lines of research: reliable acquisition of supporting data, mining of text location features, analysis of social graph structure, and deep data fusion. 2. Geographical location words are words associated with geographical locations, and they are important supporting data for text processing and location inference of social media users. However, geographical location word datasets constructed by existing methods suffer from noisy data, poor universality, and a narrow range of word categories. A method for constructing location word datasets based on electronic maps and government platforms (EMGP) is proposed. The method collects Point of Interest (POI) data in the target area from electronic maps, obtains location word data in the target area from government platform information and administrative divisions, and constructs geographical location word datasets covering multiple categories. Combining automatic programs with manual processing, it aligns the multi-source data with administrative divisions and expands, deduplicates, and filters the geographical location words, improving the reliability and universality of the dataset. Based on this method, a Chinese geographical location word dataset (GeoCN) is constructed, which contains the Chinese geographical
location word lists of 34 provincial-level administrative regions, 392 prefecture-level administrative regions, and 3,160 county-level administrative regions in China, together with the mapping between 1,763,476 geographical location words and their corresponding locations. 3. Acquiring data on resident users in a specified area provides a large number of sample users with known locations for research on social media user location inference, and is important supporting data for inferring the locations of target users. However, existing methods find such users slowly and with low accuracy. To address this problem, a user and blog discovery algorithm based on reference relationship strength and local text ratio (MRS-LTR) is proposed. The algorithm obtains a large amount of initial blog text in the target region, extracts seed users and seed texts, and constructs a reference graph. Based on an analysis of the relationship between reference relationship strength and geographical proximity, and of the correlation between locations mentioned in text and users’ real locations, three user recommendation indexes are designed and a candidate user recommendation method based on reference frequency and local text ratio is proposed. A candidate user location verification technique based on metadata and iterative probability is proposed, and the verified users and their texts are used to extend the user set and text set of the target region. Experiments on 4.4 million users and 112 million blog posts show that MRS-LTR outperforms Feedback, the best existing algorithm: precision and recall improve by 14.79% and 31.22%, respectively, and the discovery rate rises markedly, with about 8.7 times as many users found in the same amount of time. 4. Location information in the historical texts of social media users is an important basis for inferring their locations. However, due
to noisy content in social texts, ambiguous geographical references, and other factors, the accuracy of existing user location inference methods based on user-generated text is poor. The proposed algorithm, PaQL, performs data cleaning, word segmentation, and other preprocessing operations on users’ historical blog texts and extracts location features from the segmentation results. Only POIs with stronger location indication are retained to construct the feature vector. An inverse region frequency is designed for POIs to measure how well each POI discriminates between locations, and the likelihood of each POI being generated in each candidate region is calculated to evaluate the correlation between the POI and that region. On this basis, the inverse region frequency and the likelihood are combined to infer each user’s most likely geographical location with a query likelihood model. Location inference experiments were carried out on a provincial-level dataset (3,862k blog posts from 154k users) and a city-level dataset (3,086k blog posts from 103k users) from the Sina Weibo platform. The results show that, compared with three typical existing methods based only on user text (MNB-IGR, MNB-PART, and SeFG), PaQL improves provincial-level inference accuracy by 7.80%, 4.99%, and 1.41%, respectively, and city-level inference accuracy by 10.67%, 8.38%, and 3.72%, respectively. Its recall and F1 value are also better than those of the existing methods. 5. The locations of a social media user’s local social friends are an important reference for inferring the user’s location. However, because of the large amount of false data in social media and the difficulty of accurately describing social relationships, existing user location inference methods based on social relationships have relatively large errors. A new user location inference algorithm based on the tight structure of the relationship network (TiSReN) is proposed. An undirected weighted graph is constructed based on the
strength of social relationships, and the graph is collapsed according to the location proximity of neighbors. Edge weights are updated using second-order neighbor relationships, improving coverage while preserving positioning accuracy. The correlation between geographic homogeneity and user social relationships is analyzed, and the tight structure of the user social relationship graph, represented by communities, is mined: initial communities are divided according to the real locations of labeled users, the community membership of each target user is determined based on the modularity of the social relationship graph, and the target user’s location is then inferred. On GeoText and TwUS, two of the most commonly used datasets, comparative user location inference experiments were conducted against 12 typical existing methods such as MADCEL-W and GCN-LP. The results show that TiSReN is superior to all methods based on social relationships. Compared with GCN-LP, the best existing algorithm on GeoText, and MADCEL-W, the best existing algorithm on TwUS, metropolitan-level accuracy (Acc@161) increases by 7.6% and 12.4%, the median error decreases by 51.8% and 63.8%, and the mean error decreases by 10.8% and 35.6%, respectively. Moreover, TiSReN’s accuracy and median error are better than those of all comparison methods, including methods based on multi-source data. 6. When user-generated text and social relationship data are both available, using the two kinds of data together provides more location information and raises the likelihood of accurate user location inference. However, because the structural characteristics of social relationships are insufficiently mined and the data are insufficiently fused, the accuracy of existing user location inference methods based on multi-source data needs further improvement. To solve this problem, a social media user location inference algorithm (UGCC) based on cyclic
coupling is proposed. Based on users’ social relationships and the location proximity of their neighbors, the algorithm constructs a collapse-free social graph that enriches the social relationships while reducing noise. Using the structural information of the social graph, a general feature called social aggregation is designed to measure the relevance between social media users and geographical locations. Social relationship subgraphs are constructed from the locations of labeled users, and the probability that a target user is at each candidate location is measured by the user’s structural position in the subgraph. The location information in user text and the location information in the social structure are cyclically coupled through the social graph so that the two kinds of information reinforce each other and jointly infer the user’s location. Experiments on the public GeoText and TwUS datasets show that UGCC outperforms 11 typical existing methods such as ReLP and HGNN in location inference. The city-level location inference accuracy of UGCC is 40.8% and 50.1%, respectively. Compared with GCN-LP, the best existing algorithm on GeoText, and MADCEL-W, the best existing algorithm on TwUS, metropolitan-level accuracy (Acc@161) increases by 4.1% and 0.8%, and the median error decreases by 35.1% and 23.4%, respectively. Finally, the paper summarizes the full text and outlines future work. |
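
The abstract reports Acc@161, mean error, and median error without defining them. The sketch below assumes the customary definitions in the geolocation literature: the error is the great-circle distance between the predicted and true coordinates, and Acc@161 is the share of users whose error is at most 161 km (about 100 miles). The function names `haversine_km` and `evaluate` and the sample coordinates are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed spherical model)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def evaluate(predicted, actual, threshold_km=161.0):
    """Return Acc@threshold, mean error, and median error for paired (lat, lon) lists."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and of equal length")
    errors = [haversine_km(p[0], p[1], a[0], a[1]) for p, a in zip(predicted, actual)]
    return {
        "acc_at_161": sum(e <= threshold_km for e in errors) / len(errors),
        "mean_error_km": mean(errors),
        "median_error_km": median(errors),
    }


# Hypothetical example: one exact prediction and one that is far off.
predicted = [(40.0, -74.0), (40.0, -74.0)]
actual = [(40.0, -74.0), (34.0, -118.0)]
result = evaluate(predicted, actual)
```

Because the median ignores a small number of very large errors while the mean does not, a method can improve its median error far more than its mean error, which matches the pattern in the figures reported above.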