Hot Topic Detection Strategy Of Micro-blog Based On Latent Semantic Analysis

Posted on: 2014-05-23 | Degree: Master | Type: Thesis
Country: China | Candidate: W W Ma | Full Text: PDF
GTID: 2268330392972470 | Subject: Computer application technology
Abstract/Summary:
As a form of social media, micro-blogging has become deeply embedded in people's daily lives and is now an important platform for publishing and disseminating information and for acquiring knowledge. Text is the main carrier of Internet information and contains a wealth of public opinion and ideological tendencies, so it has great value for application and research in public opinion analysis and topic detection. However, most messages in social media are quite short, and their incompleteness, massive volume, and singularity make hot topics difficult to detect. This thesis first analyzes the characteristics of Chinese micro-blog information, then draws on domestic and international research on topic detection and related technologies, and finally proposes a hot topic detection method suited to Chinese micro-blog text. The main contents of this thesis are as follows:

(1) Short texts inherently have sparse features and unbalanced sample distributions, so traditional feature weighting methods designed for long texts cannot be transplanted to short texts mechanically. To address this, a short-text feature weighting algorithm based on Integrated Category Frequency (ICF) is proposed. The algorithm introduces the concepts of inverse document frequency and relevance frequency, and integrates the distribution of samples in the positive and negative categories. Experimental results show that, compared with other feature weighting methods, both the micro-average and the macro-average of this method exceed 90%; it improves the ability to distinguish samples in the negative category and raises the precision and recall of short text categorization.

(2) Micro-blog text is analyzed with latent semantic analysis. The traditional vector space model is usually based on feature matching, but the many synonyms and polysemous words in network text make text similarity calculation inaccurate. This thesis applies singular value decomposition to the original term-document matrix and builds a new semantic space from the features corresponding to the largest singular values. The new matrix not only retains most of the useful information in the original matrix but also significantly reduces the dimensionality of the vector space.

(3) A hybrid clustering algorithm based on hierarchy and partitioning is designed. Hierarchical clustering is accurate but time-consuming; the partition-based K-means algorithm clusters quickly, but the randomness of its initial input parameters may lead to unstable results. After analyzing the advantages and disadvantages of these algorithms, a hybrid clustering algorithm combining the hierarchical and partition approaches is put forward. It first clusters the data set with an agglomerative algorithm to obtain the initial cluster centers and the number of clusters, then uses K-means for further refinement. Experimental results show that this algorithm improves the efficiency and accuracy of topic detection to a certain extent.

(4) Based on a proposed definition of a hot micro-blog, the ICF feature weighting algorithm, the hybrid clustering algorithm, and latent semantic analysis are combined into a strategy for micro-blog hot topic detection based on latent semantic analysis, which is then validated in practice. The results indicate that this strategy can address the high-dimensionality and synonymy problems, and that the topics derived from micro-blogs are much closer to the real hot topics.
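
To make steps (2) and (3) concrete, the following is a minimal Python sketch of that pipeline, not the thesis's implementation. It assumes NumPy, Euclidean distance, centroid-based merging with a hand-picked distance threshold, and a toy term-document matrix. The function names (`lsa_document_vectors`, `agglomerative_seeds`, `kmeans_refine`) are hypothetical. The abstract does not give the ICF formula, so raw term counts stand in for ICF weights here.

```python
import numpy as np

def lsa_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the k largest singular values; each row of the result is a document.
    return (np.diag(s[:k]) @ vt[:k]).T

def agglomerative_seeds(points, threshold):
    """Merge the two closest cluster centroids until none are within `threshold`.

    Returns the surviving centroids; their count is the number of clusters.
    The threshold is an assumed stopping rule, not one taken from the thesis.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        centers = np.array([points[c].mean(axis=0) for c in clusters])
        dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        if dists[i, j] > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return np.array([points[c].mean(axis=0) for c in clusters])

def kmeans_refine(points, centers, iterations=20):
    """Standard K-means updates, started from the given centers instead of random ones."""
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy term-document matrix (rows: terms, columns: documents).
# Documents 0-2 share one vocabulary, documents 3-5 another.
term_doc = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 0],
    [0, 0, 0, 1, 3, 1],
    [0, 0, 0, 0, 1, 3],
], dtype=float)

doc_vectors = lsa_document_vectors(term_doc, k=2)        # 6 documents, 2 latent dimensions
seeds = agglomerative_seeds(doc_vectors, threshold=1.5)  # initial centers and cluster count
labels, centers = kmeans_refine(doc_vectors, seeds)      # K-means refinement
print(labels)
```

Seeding K-means from the agglomerative result removes the random initialization that the abstract identifies as the source of instability. In this sketch the merge threshold takes the place of a fixed cluster count and would need tuning on real data.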
Keywords/Search Tags: Micro-blog, Feature Weight, Latent Semantic Analysis, Text Clustering, Hot Topic Detection