Microblog Hotspot Detection Based On Semantic Analysis And Two-step Clustering

Posted on: 2015-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: N Wu
GTID: 2308330482453359
Subject: Management Science and Engineering
Abstract/Summary:
In the Web 2.0 era, Internet and communication technologies have developed rapidly, and the ways people receive and publish information have changed greatly. Microblogs have a low barrier to entry: they are simple to operate and allow flexible content editing. The content that users publish and forward reflects their interests and discussions, and it can also help information supervision departments manage the spread of emergencies in a timely manner. To address the lack of semantic understanding and the limitations of the clustering algorithms in traditional hotspot detection methods, this thesis uses information gain and latent semantic analysis to construct a word-document matrix, and proposes a two-step clustering algorithm that uses an improved K-means for hotspot detection and an incremental clustering algorithm for hotspot updating. Finally, a topic heat evaluation model is built to calculate the heat of each resulting cluster.

Following the hotspot discovery process, the main content of this thesis is divided into the following three parts:

1. Data collection and cleaning. A study of hotspot detection identifies the main influencing factors as title, content, number of forwards, comments, author and publication time. Duplicate results are reduced to a single copy, leftover HTML tags are stripped, and null values, advertisements, other noise and stop words are removed.

2. Data processing and document representation. The text is represented from the perspective of semantic analysis. Information gain is used to select feature words, which retains more of the implicit information carried by low-frequency vocabulary. A vector space model is used to construct the word-document matrix, and its high dimensionality and noise are addressed by singular value decomposition in latent semantic analysis.

3. Hotspot detection. Two-step clustering is used to implement hotspot detection. By analyzing related portals and classifying topics manually, a range for the number of hotspots is determined and used as the range for the number of clusters in K-means. Incremental clustering is then used to update the topics with newly added data. The heat value of each microblog in a topic cluster is calculated, and these values are summed to obtain the topic heat. Topics are sorted in descending order of heat, and the topics at the top of this ranking are regarded as hotspots.
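The incremental update and heat ranking described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's actual model: the abstract does not give the similarity measure, the threshold, or the heat formula, so the cosine similarity, the threshold of 0.8, the running-mean centroid update, and the forward/comment weights of 0.6 and 0.4 are all hypothetical, as are the function names cosine, assign_incremental, post_heat and rank_topics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_incremental(clusters, vec, threshold=0.8):
    """Assign a new document vector to the most similar existing cluster,
    or open a new cluster if no centroid is similar enough.

    clusters: list of dicts with keys "centroid" (list of floats) and "size" (int),
    for example the centroids produced by an earlier K-means step.
    threshold: assumed similarity cutoff, not a value taken from the thesis.
    Returns the index of the cluster the vector was assigned to.
    """
    best_idx, best_sim = -1, -1.0
    for i, cluster in enumerate(clusters):
        sim = cosine(cluster["centroid"], vec)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx >= 0 and best_sim >= threshold:
        cluster = clusters[best_idx]
        n = cluster["size"]
        # Running mean: fold the new vector into the existing centroid.
        cluster["centroid"] = [(m * n + x) / (n + 1)
                               for m, x in zip(cluster["centroid"], vec)]
        cluster["size"] = n + 1
        return best_idx
    clusters.append({"centroid": list(vec), "size": 1})
    return len(clusters) - 1

def post_heat(post, w_forward=0.6, w_comment=0.4):
    """Heat of one microblog as a weighted sum of forwards and comments
    (the weights are illustrative assumptions)."""
    return w_forward * post["forwards"] + w_comment * post["comments"]

def rank_topics(posts, labels):
    """Sum per-post heat within each topic cluster and sort topics by heat, descending."""
    heat = {}
    for post, label in zip(posts, labels):
        heat[label] = heat.get(label, 0.0) + post_heat(post)
    return sorted(heat.items(), key=lambda item: item[1], reverse=True)
```

In use, assign_incremental would be called once per newly collected microblog vector (for instance, a vector in the reduced latent semantic space), and rank_topics would be run over all posts and their cluster labels to produce the descending heat ranking.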
Keywords/Search Tags: latent semantic analysis, two-step clustering, similarity strength, hotspot detection, heat assessment