| As a special category of news and information, campus comprehensive news mainly records the campus life of teachers and students. Given the large volume of news on web pages, this study examines how to use big-data techniques to quickly grasp the main contents or categories of comprehensive news, assess the effect of text clustering, visualize the clustering results, and reduce clustering time. It can offer research ideas to information seekers and researchers who want to understand the main contents or categories of news, so studying comprehensive news on the campus network is meaningful. This paper takes 11,456 collected campus comprehensive news articles as the research object and applies natural language processing and clustering techniques, using the k-means and agglomerative clustering algorithms to test the effect of different clustering methods. The main research contents are as follows. 1. Determining the number of categories k for clustering. To estimate the approximate number of categories for a large body of news whose categories are not known in advance, this paper adopts a word-frequency-based method and presents the result as a word cloud. Based on the word cloud, the campus comprehensive news can be roughly divided into 7 categories. 2. To improve the clustering of articles, this paper proposes an article topic-based method, which clusters the news data by extracting the topic words of each article. The results show that the article topic-based clustering method outperforms the original term frequency-inverse document frequency (TFIDF)-based clustering method on the clustering evaluation metric Davies-Bouldin Index (DBI). Then a clustering method based on the combination of term frequency-inverse document frequency and latent 
semantic analysis (TFIDF+LSA clustering) was proposed, and the experimental results showed that it raises the Silhouette Coefficient (SC) and the Calinski-Harabasz Index (CHI) of the clustering and lowers the Davies-Bouldin Index (DBI). 3. To examine the visualization effect of clustering, this paper uses the t-SNE technique. Three clustering methods, TFIDF-based clustering, article topic-based clustering, and TFIDF+LSA-based clustering, are compared visually. The results show that the TFIDF+LSA-based clustering method improves the visualization of clustering. The experimental results also show that the two methods proposed in this paper reduce clustering time and improve clustering speed across the different clustering algorithms. Setting the number of clusters to 7 is reasonable in view of the sum of squared errors (SSE) obtained when each of the three k-means clustering approaches is run with different numbers of categories. In conclusion, compared with the original TFIDF-based clustering method, the topic-based clustering method proposed in this paper lowers the DBI and reduces clustering time. The proposed TFIDF+LSA clustering method not only improves the clustering effect but also improves the visualization of clustering, reduces clustering time, and performs well on the sum of squared errors (SSE). Figures 25, Tables 7, References 55... |
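
The abstract compares clustering methods by their Davies-Bouldin Index (DBI) and sum of squared errors (SSE). As a minimal sketch of how these two quantities are commonly defined, the Python below computes both for a labelled set of points using only the standard library. It assumes Euclidean distance and distinct cluster centroids. The function names and the toy 2-D points are hypothetical illustrations, not the paper's code or data.

```python
import math
from collections import defaultdict


def centroids(points, labels):
    """Mean vector of each cluster, keyed by cluster label."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    return {lab: tuple(sum(c) / len(ps) for c in zip(*ps))
            for lab, ps in groups.items()}


def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    cents = centroids(points, labels)
    return sum(math.dist(p, cents[lab]) ** 2 for p, lab in zip(points, labels))


def davies_bouldin(points, labels):
    """Davies-Bouldin Index: lower means tighter, better-separated clusters."""
    cents = centroids(points, labels)
    # Scatter: average distance of a cluster's members to its own centroid.
    scatter = {}
    for lab, c in cents.items():
        members = [p for p, l in zip(points, labels) if l == lab]
        scatter[lab] = sum(math.dist(p, c) for p in members) / len(members)
    # For each cluster, keep the largest similarity ratio to any other cluster.
    # Assumes no two centroids coincide (otherwise this divides by zero).
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in cents if j != i)
        for i in cents
    ]
    return sum(worst) / len(worst)


# Toy 2-D points (illustrative only, not the paper's news data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes far-apart points in each cluster

print(sse(points, good), davies_bouldin(points, good))
print(sse(points, bad), davies_bouldin(points, bad))
```

On this toy data the well-grouped labelling gives an SSE of 4 and a DBI of 0.2, while the mixed labelling gives an SSE of 100 and a DBI of 5. This matches the reading used in the abstract, where lower values of both measures indicate better clustering.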