
Research on Key Technologies of Short Text Hot Topic Detection

Posted on: 2018-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: H L Gao
Full Text: PDF
GTID: 2348330536480834
Subject: Public Security Technology
Abstract/Summary:
In recent years, new Internet applications such as microblogs and WeChat have gradually become an important channel for information exchange between individuals and organizations. They generate more information than traditional news sites, blogs, and other online media. These short texts are rich in content and touch on all aspects of society, which makes them highly valuable for data mining. However, because short texts have sparse features and low information density, traditional text analysis methods are difficult to apply to them, which poses new challenges for online public hot topic analysis technology. At present, the mainstream approach to hot topic analysis is to cluster the texts and treat each resulting cluster as a hot topic. When this approach is applied to short texts, it faces three main problems: building a short text representation model, calculating short text similarity, and clustering short texts. This thesis focuses on the key technologies for solving these three problems.

First, this thesis proposes a short text semantic vector model. During short text modeling, the method uses distributed word vectors produced by Word2Vec to expand the semantic information of each word in the original short text, which addresses the problem of sparse short text features. To make mathematical operations between short texts effective, the thesis combines this with traditional dictionary techniques to build a semantic vector space for short texts. Each dimension of the space corresponds to a word in the dictionary, and the value on each dimension is the maximum cosine distance between the short text and the word vector of that dimension. This method gives a quantitative representation of the semantic extension of a short text.

Second, with the semantic vector model in place, the thesis studies short text similarity calculation. Based on the characteristics of the short text semantic vector, it uses the cosine distance formula as the similarity measure for short text semantic vectors. This similarity calculation preserves the original semantic information in the text while reducing computational complexity.

Finally, to solve the short text clustering problem, the thesis introduces the spectral clustering algorithm and improves it. Constructing the adjacency matrix is one of the key problems in spectral clustering. Here, the similarity calculation based on short text semantic vectors is used to build a semantic similarity matrix of the short texts, which serves as the adjacency matrix, and a spectral clustering process for short text hot topic detection is designed. Because the k-means algorithm is sensitive to the initial cluster centers and the value of K is not known in advance, the thesis borrows the density peak detection step of the density peak clustering algorithm to determine the initial cluster centers and the value of K, and obtains some good results in practice.
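The semantic vector and similarity steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word vectors are hand-made toy values standing in for Word2Vec output, the names `WORD_VECTORS`, `DICTIONARY`, `semantic_vector`, and `similarity_matrix` are hypothetical, and "cosine distance" is read here as cosine similarity (an assumption, since the abstract takes its maximum and uses it as a similarity measure).

```python
import numpy as np

# Toy word vectors standing in for Word2Vec output (hypothetical values).
WORD_VECTORS = {
    "rain": np.array([1.0, 0.0, 0.0]),
    "storm": np.array([0.9, 0.1, 0.0]),
    "match": np.array([0.0, 1.0, 0.0]),
    "goal": np.array([0.0, 0.9, 0.1]),
}

# Dictionary words: one dimension of the semantic vector space per word.
DICTIONARY = ["rain", "storm", "match", "goal"]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def semantic_vector(tokens):
    """Map a tokenized short text to its semantic vector.

    Each dimension holds the maximum cosine similarity between the
    dictionary word of that dimension and any known token of the text.
    """
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        # No token has a word vector: return an all-zero semantic vector.
        return np.zeros(len(DICTIONARY))
    return np.array(
        [max(cosine(WORD_VECTORS[w], t) for t in known) for w in DICTIONARY]
    )


def similarity_matrix(texts):
    """Pairwise cosine similarity between the semantic vectors of the texts."""
    vecs = np.array([semantic_vector(t) for t in texts])
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty vectors
    unit = vecs / norms
    return unit @ unit.T


texts = [["rain"], ["storm"], ["goal"]]
S = similarity_matrix(texts)
print(np.round(S, 3))
```

In this toy example the texts "rain" and "storm" share no word, yet their semantic vectors are close because the underlying word vectors are close. A matrix like `S` is the kind of semantic similarity matrix that could then serve as the adjacency matrix for spectral clustering.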
Keywords/Search Tags:Hot topic detection, Short text, Short text semantic vector, Spectral clustering