
Research Of Online Clustering Method Of Short Text In Social Media

Posted on: 2020-04-04  Degree: Master  Type: Thesis
Country: China  Candidate: Y Jiang  Full Text: PDF
GTID: 2428330596476020  Subject: Communication and Information System
Abstract/Summary:
Short text is a common form of content on the Internet, such as social media text, advertising keywords, opinion comments, web page titles, and search queries. Online clustering of social media short text refers to real-time, incremental clustering of short-text streams from social media. It facilitates the sorting and automatic summarization of massive news data, and it is important for public opinion analysis, disaster warning, and event detection. Social media short text is generated quickly in the form of a text stream, usually in huge volumes. At the same time, social media texts are characterized by irregular expressions, a large number of errors, and short content. Traditional text clustering methods construct features from the words themselves, which makes them unsuitable for high-noise, highly sparse social media short text, and they also lack a solution for clustering streaming text data. To this end, this thesis studies two aspects: short text similarity measurement and online clustering. The main contributions are summarized as follows:

(1) This thesis proposes a multi-attribute fusion similarity measure for social media short text. The method targets the short length and lack of information typical of social media text. The author uses part-of-speech recognition and named entity recognition to enrich and extend the traditional vector space model. To make up for the shortcomings of the vector space model in dealing with complex semantics, the author uses a topic model to develop a set of short-text topic vector inference techniques that identify the connections between related words in the text. In addition, the event elements are supplemented with other information available on the social media platform, such as entity, time, and geographic location information. Finally, the three methods are combined, and the combined method achieves higher accuracy on the text similarity evaluation task than traditional text similarity evaluation methods.

(2) This thesis proposes BatchLPA, an online clustering method based on label propagation. It addresses the disadvantages of the traditional streaming-data clustering method SinglePass, namely low recall and complicated parameter setting. Unlike SinglePass, BatchLPA no longer simply adds each new text to its most similar cluster; instead it retains the relationships between the new text and all clusters. The author then uses the label propagation algorithm, a simple and fast community partitioning method, to divide and aggregate the clusters and texts in the similarity network. This approach indirectly processes the historical data a second time at very small cost, reducing the loss of information. Experiments show that BatchLPA not only guarantees cluster quality but also produces a more reasonable number of clusters, and that its performance depends less on parameter settings.
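The abstract does not give the fusion formula used in contribution (1), so the following is only a minimal sketch of one way three similarity signals could be combined. It assumes a weighted linear combination of a lexical cosine score, a topic-vector cosine score, and a Jaccard overlap of event elements. The function names, the post dictionary layout (`tokens`, `topic`, `elements`), and the weights are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fused_similarity(x, y, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three similarity signals (weights are illustrative).

    Each post is a dict with hypothetical keys:
      tokens   - list of words (a vector space model input)
      topic    - list of floats (a topic distribution for the post)
      elements - set of event elements (entities, times, locations)
    """
    w_lex, w_topic, w_event = weights
    lexical = cosine(Counter(x["tokens"]), Counter(y["tokens"]))
    topic = cosine(dict(enumerate(x["topic"])), dict(enumerate(y["topic"])))
    event = jaccard(x["elements"], y["elements"])
    return w_lex * lexical + w_topic * topic + w_event * event


post_a = {"tokens": ["flood", "city"], "topic": [1.0, 0.0], "elements": {"Chengdu"}}
post_b = {"tokens": ["match", "goal"], "topic": [0.0, 1.0], "elements": {"Madrid"}}
same_score = fused_similarity(post_a, post_a)
diff_score = fused_similarity(post_a, post_b)
```

With these inputs, a post compared with itself scores close to 1 and two posts that share no words, topics, or event elements score 0. In practice the weights would need to be tuned on labeled data.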
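The label propagation step described in contribution (2) can be illustrated with a small sketch. This is not the thesis's BatchLPA implementation: it only shows generic label propagation on a weighted similarity graph whose nodes are existing clusters and newly arrived texts. The node names, the edge weights, and the deterministic tie-breaking rule are assumptions made for the example.

```python
from collections import defaultdict


def label_propagation(nodes, edges, max_iters=20):
    """Generic label propagation on an undirected weighted graph.

    nodes: iterable of node ids (existing clusters and new texts alike)
    edges: dict mapping (u, v) pairs to a similarity weight
    Returns a dict node -> label; nodes sharing a label form one cluster.
    """
    nodes = sorted(nodes)  # fixed update order keeps the sketch deterministic
    neighbors = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    labels = {n: n for n in nodes}  # every node starts in its own group
    for _ in range(max_iters):
        changed = False
        for n in nodes:
            if not neighbors[n]:
                continue  # isolated node keeps its own label
            # Sum edge weights per label among the neighbors.
            score = defaultdict(float)
            for m, weight in neighbors[n].items():
                score[labels[m]] += weight
            best = max(score.values())
            candidates = [lab for lab, s in score.items() if s == best]
            # Keep the current label on ties, else take the smallest candidate.
            if labels[n] not in candidates:
                labels[n] = min(candidates)
                changed = True
        if not changed:
            break
    return labels


# Hypothetical batch: two existing clusters (c1, c2) and four new texts.
batch_nodes = ["c1", "c2", "t1", "t2", "t3", "t4"]
batch_edges = {
    ("c1", "t1"): 0.9,
    ("c1", "t2"): 0.7,
    ("t1", "t2"): 0.8,
    ("c2", "t3"): 0.9,
}
result = label_propagation(batch_nodes, batch_edges)
```

Here `t2` is linked to both `c1` and `t1`, so all three end up in one group even though a SinglePass-style rule would only consider the single most similar cluster for each text. The unlinked text `t4` stays on its own, which corresponds to starting a new cluster.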
Keywords/Search Tags: social media, short text similarity, streaming data, online clustering