| With the continuous growth of Chinese language data and the steady development of Chinese information processing technology, hot topic discovery has become an important research issue in text mining. To improve the accuracy of hot topic discovery and the readability of each hot topic's central idea, this paper proposes a new method based on topic words. The research consists of three parts. (1) Topic words are extracted by combining the word2vec word-vector algorithm with the k-means clustering algorithm. Business text corpora usually consist of short texts whose information is relatively concentrated and which contain noise such as times, personal names, and place names. A preprocessing program built on the NLPIR segmentation tool therefore performs new-word learning, denoising, word segmentation, and document processing on the corpus; the words are then clustered by topic with word2vec and k-means, yielding multiple groups of topic words. Finally, within each topic, word weights are computed with the TF-IDF algorithm, the words are quick-sorted by weight, and the top-ranked words are selected as keywords. A word cloud generated from these keywords lets users see the topic words at a glance and summarize the main content of the topic. Experimental results show that the method finds similar words that reflect the topic, with higher accuracy than keyword extraction by LDA topic model clustering. (2) A hot topic discovery algorithm is described that combines LDA topic model clustering with k-means clustering. The method improves on the LDA algorithm: the document-topic Dirichlet distribution obtained from LDA topic model clustering is used to represent each document as a vector over multiple topic dimensions; the k-means algorithm then clusters these document vectors into 100 topics, the topics are sorted by the number of documents they contain, and the 20 topics with the most documents are taken as the 20 hot topics. The difference from the original algorithm is that k-means clustering is applied to the document vectors generated by the LDA probability model. Hot topic detection experiments are run on the text corpus with both the improved LDA document-vector clustering algorithm and the LDA topic model clustering algorithm before improvement, and the results are validated by comparing the Euclidean distance between the hot topics found by each method and their intersection. Experimental results show that the improved LDA document-vector clustering algorithm outperforms the LDA topic model before improvement, and that the method is feasible. (3) A method is proposed for visualizing a topic by finding the topic sentences that summarize it. First, the topic documents are split into sentences at periods and segmented into words, keeping only the topic words in the segmentation results. Next, the LDA topic model clusters the topic text, and the sentence-topic Dirichlet distribution from this clustering is used to generate a sentence vector over multiple topic dimensions for each sentence. The center point of the sentence vectors is then computed, along with the Euclidean distance from each sentence vector to that center. Finally, the sentences are sorted by their topic word weight and by the distance from their vector to the center, and the top-ranked sentences are taken as the topic's summary statements. Experimental results show that the distances for the topic sentences obtained by this method are much smaller than those for the topic documents, and that the method is feasible. |
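
The topic-sentence selection step in part (3) can be illustrated with a minimal sketch. It is not the paper's implementation: the sentence vectors and weights below are hypothetical toy values standing in for LDA sentence-topic distributions and topic word weights, the names `centroid` and `rank_sentences` are invented for illustration, and the way the two sort keys are combined (distance first, weight as tie-breaker) is an assumption, since the abstract does not specify it.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def rank_sentences(sentences, vectors, weights, top_n=1):
    """Rank sentences by Euclidean distance to the centroid of all vectors.

    Assumption: sentences closer to the centroid rank first, and a higher
    topic word weight breaks ties. The abstract names both criteria but
    does not say how they are combined.
    """
    center = centroid(vectors)
    scored = [
        (sentence, math.dist(vector, center), weight)
        for sentence, vector, weight in zip(sentences, vectors, weights)
    ]
    scored.sort(key=lambda item: (item[1], -item[2]))
    return scored[:top_n]


# Hypothetical toy data: three sentences, each with a 3-topic distribution.
sentences = ["s1", "s2", "s3"]
vectors = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
weights = [0.5, 0.7, 0.2]

for sentence, distance, weight in rank_sentences(sentences, vectors, weights, top_n=3):
    print(f"{sentence}: distance={distance:.4f}, weight={weight}")
```

In practice the vectors would come from a trained LDA model rather than being written by hand, and a weighted score combining both criteria may be preferable to a strict lexicographic sort.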