Analysis Of Network Public Opinion Data Based On Short Text Clustering

Posted on: 2020-09-10; Degree: Master; Type: Thesis
Country: China; Candidate: Q R Li
GTID: 2428330599451292; Subject: Computer Science and Technology
Internet public opinion refers to the opinions and remarks that the public publishes and expresses about a hot social event through Internet platforms. With the rapid development of "Internet +", social media has subtly changed the way people interact in society. More and more people communicate through social networking platforms such as Weibo, WeChat and forums, and short text data is pervasive on these platforms. Short text data carries a large amount of user information and at the same time conveys public information. A wide variety of short text data is flooding the network and, to a certain extent, shaping online public opinion. How to better handle short text data and discover the hidden topics in public opinion data has become an important research topic in online public opinion data analysis.

This thesis addresses the limitations of short text clustering and of the topic mining process for online public opinion, and reduces the influence of short-text feature sparsity on public opinion data analysis. To this end, the traditional K-means clustering algorithm is improved. The central idea is to use the Canopy algorithm as a first-stage preprocessing step. Each overlapping subset formed by the Canopy algorithm is called a cover set. Unlike the traditional K-means algorithm, the distance from each point to all centers is no longer computed; instead, only the distance from the point to the center of its cover set is calculated. As K-means iterates, the center of each set keeps changing until it converges. On this basis, the BK-means clustering algorithm is proposed. For public opinion data analysis, a BTM-based public opinion topic mining module is proposed. The TF-IDF weighting algorithm is improved to reduce the influence of word frequency on the weights, and the BK-means clustering algorithm is then used to mine the topic words.

The main issues and innovations of this thesis are as follows:

(1) Because of the challenges of short text clustering, such as sparse features, high dimensionality and noise interference, traditional clustering methods based on the vector space model are not ideal for short text data. The K-means clustering algorithm is improved by introducing the Canopy algorithm as a preprocessing stage for selecting the initial cluster centers, which optimizes that selection. On this basis, the BK-means clustering algorithm is proposed. The algorithm compares an object only with the centers of the cover sets in the same region. By reducing the number of comparisons, the running time of the whole clustering process is greatly reduced and the computational efficiency of the algorithm is improved. Experiments show that the BK-means-based short text clustering algorithm outperforms the traditional short text clustering algorithm, with improvements in both F-measure and purity.

(2) The public opinion topic mining module models the short text data with the BTM topic model and improves the TF-IDF weighting algorithm used for the text similarity measure. The tf value of the traditional TF-IDF algorithm is replaced using the logarithm function and the square root, which reduces the influence of word frequency on the weight, suits the characteristics of short public opinion texts, and represents short texts better. The BK-means clustering algorithm is then used to find topic words, which effectively improves the quality of keyword discovery.

In addition, the F value and the purity value are used as evaluation indexes for the clustering algorithm. Through comparison experiments on a crawled dataset, the proposed BK-means algorithm and the performance of BTM-based keyword mining are comprehensively evaluated. The experimental results show that the proposed BK-means algorithm can effectively alleviate the influence of data sparsity when processing short text data. The F value and purity value are significantly improved compared with the traditional K-means clustering algorithm. In the BTM-based public opinion topic mining module, the effect of word frequency on the weight is reduced to some extent by improving the TF-IDF algorithm. Finally, the keywords found with the BK-means algorithm are more relevant than those found by the traditional method.
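The Canopy-then-K-means idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's BK-means implementation: the function names (`canopies`, `canopy_kmeans`), the thresholds `t1` and `t2`, the Euclidean distance and the toy points are all assumptions made for illustration. It only shows the two ingredients the abstract names: canopy centers used as initial cluster centers, and each point compared only with the centers of the cover sets that contain it.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2):
    """Classic Canopy pass; assumes t1 > t2 >= 0.

    Returns a list of (center, member_indices). Canopies may overlap,
    because membership uses the loose threshold t1 while removal from
    the candidate list uses the tight threshold t2.
    """
    remaining = list(range(len(points)))
    result = []
    while remaining:
        center = points[remaining[0]]
        members = [i for i in range(len(points)) if dist(points[i], center) <= t1]
        result.append((center, members))
        # Points very close to this center can no longer start a canopy.
        remaining = [i for i in remaining if dist(points[i], center) > t2]
    return result


def canopy_kmeans(points, t1, t2, max_iter=100):
    """K-means seeded by canopy centers, with canopy-restricted comparisons."""
    cans = canopies(points, t1, t2)
    centers = [list(c) for c, _ in cans]
    # cover[i] lists the canopies (cover sets) that contain point i.
    cover = [[] for _ in points]
    for k, (_, members) in enumerate(cans):
        for i in members:
            cover[i].append(k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Each point is compared only with centers of its own cover sets.
        new_labels = [
            min(cover[i], key=lambda k: dist(points[i], centers[k]))
            for i in range(len(points))
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            assigned = [points[i] for i in range(len(points)) if labels[i] == k]
            if assigned:
                centers[k] = [sum(col) / len(assigned) for col in zip(*assigned)]
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = canopy_kmeans(points, t1=3.0, t2=1.5)
```

In this sketch the number of clusters is whatever the Canopy pass produces, so it depends on the choice of `t1` and `t2`; how the thesis sets these thresholds is not stated in the abstract.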
Keywords: Short text, BK-means algorithm, Biterm topic model, Topic keyword discovery
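The abstract states that the tf value in TF-IDF is damped with a logarithm and a square root but does not give the exact formula. The sketch below is therefore only an assumption-laden illustration: it shows two common sublinear damping choices, `1 + log(tf)` and `sqrt(tf)`, next to raw counts, with a plain `log(N / df)` inverse document frequency. The function names, the `mode` parameter and the toy documents are hypothetical.

```python
import math
from collections import Counter


def damped_tf(count, mode="log"):
    """Term-frequency damping; the exact form used in the thesis is not
    given in the abstract, so these variants are assumptions."""
    if count <= 0:
        return 0.0
    if mode == "log":
        return 1.0 + math.log(count)
    if mode == "sqrt":
        return math.sqrt(count)
    return float(count)  # "raw": traditional, undamped tf


def tfidf(docs, mode="log"):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: damped_tf(c, mode) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors


docs = [
    ["rain", "rain", "rain", "rain", "flood"],
    ["flood", "rescue"],
    ["traffic", "jam"],
]
raw = tfidf(docs, mode="raw")
log_w = tfidf(docs, mode="log")
sqrt_w = tfidf(docs, mode="sqrt")
```

With either damping, a word repeated four times in one short text receives a smaller weight than under raw counts, which is the effect the abstract describes as reducing the influence of word frequency on the weight.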