| In the information age, the massive data resources accumulated on online social platforms contain a great deal of valuable information. Making full use of this information to improve urban governance is of great significance. This article discusses the public's evaluations and impressions of cities on social networks, in order to understand cities better and to promote refined city management. Taking the Zhihu platform as a case, it applies text segmentation, text clustering, and other natural language processing techniques to 14 cities and presents a complete solution for urban label analysis. First, a web crawler written in Python collects the original data, which is then pre-processed through tag acquisition, tag filtering, and tag normalization. These steps include data cleaning, text segmentation, removal of labels that do not conform to the rules, and synonym replacement, and they yield the original set of labels. Second, a vector space model is established and feature words are extracted. Four feature selection methods are introduced: TF-IDF, document frequency, information gain, and the χ² statistical test. The analysis of city labels is carried out from the following four aspects. (1) City keywords. TextRank and TF-IDF are combined to assign weights to the tags; the tags are sorted to obtain the key tags of each city, and a city tag cloud and a city-oriented network are drawn. (2) City topics. The LDA topic model is used to explore the latent abstract topics in public impressions of each city, and the resulting textual information is then used to find the four types of abstract topics contained in all texts. For example, the topic of Chongqing is composed of words such as hot pot, light rail, mountain city, Hongyadong, and summer, each with a different probability. (3) Organization of labels. A hierarchical label system is constructed from the flat, unordered label set. Word2vec is used to train word vectors on the text, and spectral clustering is used to partition the labels and to analyze the characteristics of the labels in each category. (4) City cluster analysis. The obtained tags are used to group the cities with three clustering methods, k-means, DBSCAN, and mean shift, to build city communities. In addition, the cities are clustered along three finer-grained dimensions, namely infrastructure, food culture, and urban groups, and the models are evaluated to analyze the results, which helps to indicate common directions of development among cities. The innovation of this article is an improved solution for label organization. In addition to training word vectors on the text itself, we introduce a standard corpus for word vector training. The word similarities obtained from this corpus are transferred and matched to the tags, and the key tags are then clustered. As long as the text in the corpus is rich enough and its language is sufficiently standardized, the relationships obtained between words are credible. We also propose an improved algorithm that combines a dictionary with spectral clustering. The domain dictionary is used to change the initial state of the cluster analysis, which makes the label division better founded and, at the same time, better indicates the inherent meaning of each label class. In sum, this article provides a reference for the complete analysis and processing of text published by users on social platforms. It offers concrete practices for the application scenarios of the related techniques and new ideas on how to build a label hierarchy. |
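The tag weighting described in item (1) can be illustrated with a minimal sketch. It covers only the TF-IDF half of the weighting; the TextRank component and the way the two scores are combined are not specified in the abstract and are left out. The city tag lists, the function names `tfidf_by_city` and `top_tags`, and the unsmoothed formula tf × log(N / df) are assumptions made for illustration, not the article's data or implementation.

```python
import math
from collections import Counter

# Hypothetical toy data: each city maps to tags that are assumed to be
# already segmented, filtered, and normalized.
city_tags = {
    "Chongqing": ["hot pot", "light rail", "mountain city", "hot pot", "summer"],
    "Chengdu": ["hot pot", "panda", "teahouse", "panda"],
    "Beijing": ["hutong", "roast duck", "summer"],
}


def tfidf_by_city(docs):
    """Return {city: {tag: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of cities whose tag list contains the tag.
    df = Counter(tag for tags in docs.values() for tag in set(tags))
    weights = {}
    for city, tags in docs.items():
        counts = Counter(tags)
        total = len(tags)
        weights[city] = {
            tag: (count / total) * math.log(n_docs / df[tag])
            for tag, count in counts.items()
        }
    return weights


def top_tags(weights, city, k=3):
    """Return the k highest-weighted tags of a city; ties break alphabetically."""
    ranked = sorted(weights[city].items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:k]]


weights = tfidf_by_city(city_tags)
for city in city_tags:
    print(city, top_tags(weights, city))
```

In this toy data, "hot pot" appears in two of the three cities, so its inverse document frequency is low and it ranks below tags that are unique to one city, even though it is the most frequent tag for Chongqing. This is the property that makes TF-IDF useful for picking out distinctive city keywords rather than merely common ones.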