| Social networks play an important role in daily life. Users share content and interact with one another through social platforms, and these behaviors form a virtual social network in cyberspace. Given the topology of a social network, analyzing the social relationship paths between users reveals how information propagates and uncovers implicit relationships between users. However, because social networks contain a very large number of nodes and traditional sampling algorithms explore them randomly and blindly, these algorithms tend to search a large portion of the network when collecting data, which makes it difficult to collect usable social relationship paths. This thesis therefore studies the problem using a social network dataset from Twitter. The main work and contributions of this thesis are as follows: (1) A social network dataset was constructed that includes the topology of the social network, user attributes, and content data. Compared with open-source social network datasets, this dataset is richer because it includes user node attributes and content data. (2) A path search algorithm for social networks was proposed, based on the A* heuristic path planning algorithm. The heuristic function follows the homophily principle of social networks and uses the Doc2Vec model to establish a mapping between user similarity and social distance. Experimental results show that, compared with Dijkstra's algorithm and the Best-First Search algorithm, this algorithm effectively reduces the number of network nodes searched while maintaining a high level of path optimality. (3) An online sampling algorithm for social networks was proposed, based on the social network path search algorithm. This algorithm performs sampling during the path search process. Experiments were conducted to compare and analyze the degree distribution
properties, clustering coefficient, assortativity coefficient, and reciprocity coefficient of the sampled subgraphs obtained by the proposed algorithm and by other sampling algorithms, such as BFS sampling, snowball sampling, forest fire sampling, random walk (RW) sampling, and Metropolis-Hastings random walk (MHRW) sampling. The ability of these sampling algorithms to collect social relationship paths was also tested. The results show that, across different social distances, the proposed sampling algorithm collects valuable social relationship paths with the fewest samples. (4) A highly scalable social data collection system was designed and implemented. The system supports customization of collection resources, algorithms, and features, so it can be adjusted flexibly to different collection scenarios and make the best use of collection resources. In summary, this thesis proposes a social network sampling algorithm based on the A* path planning algorithm, which overcomes the inability of traditional sampling algorithms to collect effective social relationship paths. In addition, a highly scalable social data collection system is designed and implemented, with good collection performance and scalability. |
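
The idea behind contributions (2) and (3) can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes an unweighted follower graph stored as an adjacency dictionary, and it uses small hand-written vectors as stand-ins for Doc2Vec user embeddings. The function names, the `max_hops` parameter, and the linear mapping from cosine similarity to estimated social distance are all hypothetical choices made for illustration. The list of expanded nodes plays the role of the sample collected during the search.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimated_distance(u, v, max_hops=3.0):
    """Hypothetical heuristic: similar users are assumed to be socially close.

    The linear mapping and max_hops are illustrative assumptions, not the
    mapping fitted in the thesis. The estimate is not guaranteed admissible,
    so the returned path may not always be the shortest one.
    """
    sim = max(0.0, cosine_similarity(u, v))
    return max_hops * (1.0 - sim)


def a_star_social_path(graph, vectors, source, target):
    """A*-style search; returns (path or None, list of expanded nodes)."""
    start_h = estimated_distance(vectors[source], vectors[target])
    open_heap = [(start_h, 0, source)]  # entries are (f = g + h, g, node)
    best_cost = {source: 0}
    parent = {source: None}
    closed = set()
    expanded = []  # nodes whose neighbor lists were fetched (the "sample")
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node in closed:
            continue
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], expanded
        closed.add(node)
        expanded.append(node)
        for nxt in graph.get(node, ()):
            if nxt in closed:
                continue
            new_cost = cost + 1  # every follow edge counts as one hop
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                parent[nxt] = node
                h = estimated_distance(vectors[nxt], vectors[target])
                heapq.heappush(open_heap, (new_cost + h, new_cost, nxt))
    return None, expanded


# Toy data: user names and 2-D vectors are invented for this example.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}
vectors = {
    "alice": (0.0, 1.0),
    "bob": (0.6, 0.8),
    "carol": (0.1, 0.99),
    "dave": (0.9, 0.4),
    "erin": (-1.0, 0.0),
    "frank": (1.0, 0.0),
}

path, expanded = a_star_social_path(graph, vectors, "alice", "frank")
print(path)
print(expanded)
```

In this toy graph the search follows the users whose vectors are closest to the target and never fetches the neighbor list of `carol`, which is the kind of reduction in searched nodes the abstract describes. A blind traversal such as BFS would expand `carol` before reaching `frank`.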