The Research And Implementation Of Business Text Hot Topic Discovery System Based On Topic Word

Posted on:2017-11-10

Degree:Master

Type:Thesis

Country:China

Candidate:Z H Zhang

Full Text:PDF

GTID:2348330518494769

Subject:Computer technology

Abstract/Summary:

With the continuous growth of Chinese-language data and the steady development of Chinese information processing technology, hot topic discovery has become an important research problem in text mining. To improve both the accuracy of hot topic discovery and the readability of each hot topic's central idea, this thesis proposes a new method based on topic words. The main research contents comprise the following three parts.

(1) Topic word extraction that combines the word2vec word vector algorithm with the k-means clustering algorithm. Business text corpora usually consist of short texts in which the information is relatively concentrated and which contain noise such as times, personal names, and place names. A preprocessing program built on the NLPIR segmentation tool performs new-word learning, denoising, and word segmentation on the corpus. The words are then clustered by topic using word2vec word vectors together with k-means, which yields several groups of topic words. Finally, within each topic, word weights are computed with the TF-IDF algorithm, the words are sorted by weight using quicksort, and the top-ranked words are selected as keywords. A word cloud is then generated from these keywords so that users can see the topic words at a glance and summarize the main content of the topic. Experimental results show that the method finds similar words that reflect the theme, and that it achieves higher accuracy than keyword extraction based on LDA topic model clustering.

(2) A hot topic discovery algorithm that combines LDA topic model clustering with k-means clustering. The method is an improvement on LDA: the document-topic distribution (a Dirichlet distribution) produced by the LDA topic model is used to represent each document as a vector over the topic dimensions. The k-means algorithm then clusters these document vectors into 100 topics, the topics are sorted by the number of documents each contains, and the 20 topics with the most documents are taken as the 20 hot topics. In other words, compared with the original algorithm, the improved algorithm applies k-means to document vectors generated from the LDA probability model. Hot topic detection experiments were run on the text corpus with both the improved LDA document vector clustering algorithm and the LDA topic model clustering algorithm before the improvement. The hot topics found by the two methods are compared, and the Euclidean distance between each method's hot topics and the intersection of the two results is used to validate the experiments. The results show that the improved LDA document vector clustering algorithm outperforms the LDA topic model before the improvement, and that the experimental method is feasible.

(3) A method for visualizing a topic by retrieving topic sentences that summarize it. First, the topic's documents are split into sentences at full stops and segmented into words, and only topic words are retained in the segmentation results. Next, the LDA topic model is applied to the topic's text, and the sentence-topic distribution (a Dirichlet distribution) it produces is used to represent each sentence as a vector over the topic dimensions. The center point of the sentence vectors is then computed, along with the Euclidean distance from each sentence vector to that center. Finally, the sentences are ranked by their topic word weight and by their distance to the center, and the top-ranked sentences are taken as the summary statements of the topic. Experimental results show that the topic sentences obtained by this method are much closer to the topic than the topic documents are, and that the method is feasible.
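
The clustering and ranking steps described in parts (2) and (3) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: it assumes the document (or sentence) vectors over topic dimensions have already been produced by an LDA model, uses a small pure-Python k-means with a deterministic farthest-first initialization chosen only to keep the example reproducible, and ranks sentences by distance to the center alone, leaving out the topic word weight. All function names (`kmeans`, `hot_topics`, `closest_to_center`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, max_iter=100):
    """Plain k-means; returns (labels, centers).

    Initialization is farthest-first (an assumption made for
    reproducibility, not something the abstract specifies).
    """
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(
            max(vectors, key=lambda v: min(euclidean(v, c) for c in centers))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = centroid(members)
    return labels, centers


def hot_topics(labels, top_n):
    """Cluster ids sorted by document count, largest first."""
    return [cluster for cluster, _ in Counter(labels).most_common(top_n)]


def closest_to_center(vectors):
    """Index of the vector nearest to the centroid of `vectors`."""
    center = centroid(vectors)
    return min(range(len(vectors)), key=lambda i: euclidean(vectors[i], center))


# Toy document-topic vectors (hypothetical; a real run would take them
# from an LDA model's per-document topic distribution).
doc_vectors = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]
labels, _ = kmeans(doc_vectors, k=3)
print("hot topic cluster ids:", hot_topics(labels, top_n=2))
```

In the thesis's setting the same ranking would use 100 clusters and keep the 20 largest; the toy example uses 3 and 2 only to stay readable. For part (3), `closest_to_center` would be applied to the sentence vectors of a single topic to pick a candidate summary sentence.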

Keywords/Search Tags:

Subject terms, LDA, vector, Euclidean distance, hot topic

Related items

1	The Research Of Pattern Recognition Arithmetic Base On Digital
2	The Application Research Of CTM Topic Model In Subject Subject Recognition And Subject Document Classification
3	A Study On Packed Detection And Exuviate Based On Weighted Euclidean Distance
4	Research On Topic Detection Based On Adaptive Gravity Vector
5	Research On Terms Co-occurrence Based Models And Algorithms For Text Mining
6	An Intrusion Detection Method Based On Euclidean Distance
7	Research On Adaptive Waveform Design Based On The Euclidean Distance Between PDFs
8	Improvement And Research Of Gesture Recognition Algorithm Based On Leap Motion
9	The Development And Research Of Electric Power Information Monitoring Management System
10	Research On Short Text Classification Based On Topic Model