
Research And Implementation Of Topic Hot Sorting In Article Clustering

Posted on: 2020-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: G F Zhang
GTID: 2428330596998357
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, people's daily reading has gradually shifted from newspapers and books to online reading. Businesses have also started to publish articles, news, notices and other important information online. Thousands of articles of various types appear on the Internet every day, and the Internet has become an important platform for people to express their thoughts, exchange feelings and learn about current events. It is therefore of great significance to find potential hot topics in massive amounts of text and present them to the public in an orderly manner. This thesis studies the common clustering techniques in natural language processing and improves the k-means clustering algorithm to address its defects. The improved k-means algorithm is used to cluster the texts, and keywords are extracted from the clustering results to form topics. Finally, a heat ranking strategy for topics is developed, its parameters are set flexibly according to the actual situation, and the constructed topics are ranked by heat. The main work of the thesis is as follows:

1. Based on an analysis of the advantages and disadvantages of the traditional k-means algorithm, an improved k-means clustering algorithm is proposed. Drawing on the idea of hierarchical clustering, the selection of the initial centroids is optimized so that their positions are more reasonable and empty clusters are removed promptly. In addition, a Gaussian kernel function is used as the distance measure to partition data in a high-dimensional space. The improved algorithm greatly improves clustering accuracy, remedies the defects of the traditional k-means algorithm, and avoids empty clusters in the clustering result.

2. The text is preprocessed: noise is removed, the jieba Chinese word segmentation tool is used to segment the text accurately, and stop words are filtered out. The Doc2Vec vector space model is used for text modeling and text vectorization. While the text is transformed into a vector representation, the word order information of the original text is preserved as much as possible, which improves the accuracy of the text vector model to a certain extent.

3. Following the existing definition of a topic, keywords are extracted from similar texts and analyzed, and effective words are selected to form the topic. The heat of each topic is then evaluated and ranked using flexibly assigned weights over the total number of texts in the topic, the latest release time of the texts, the duration of the topic, the most recent growth in texts, and the total number of forwards and comments.

In summary, the improved k-means algorithm is used to cluster the text, which makes it possible to summarize topics from the clustering results more efficiently and accurately. The topics are then sorted according to the heat ranking strategy, which is based on topic characteristics and user engagement, and this effectively highlights the hot topics that attract wide attention.
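The abstract mentions using a Gaussian kernel function as the distance measure and removing empty clusters, but does not give the formulas. The sketch below is one common reading, not the thesis's actual implementation: for a Gaussian kernel K(x, c) = exp(-||x - c||^2 / (2 * sigma^2)), the squared distance in the kernel-induced feature space is K(x, x) - 2K(x, c) + K(c, c) = 2 - 2K(x, c). The function names, the `sigma` parameter and the sample data are all assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))


def kernel_distance_sq(x, c, sigma=1.0):
    """Squared distance between x and c in the kernel-induced feature space.

    For a Gaussian kernel K(x, x) = K(c, c) = 1, so the distance
    K(x, x) - 2K(x, c) + K(c, c) simplifies to 2 - 2K(x, c).
    """
    return 2.0 - 2.0 * gaussian_kernel(x, c, sigma)


def assign_and_prune(points, centroids, sigma=1.0):
    """Assign each point to its nearest centroid, then drop empty clusters.

    This is a single assignment step only (hypothetical helper); a full
    k-means loop would also recompute the centroids and repeat.
    """
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda j: kernel_distance_sq(p, centroids[j], sigma),
        )
        clusters[nearest].append(p)
    # Keep only centroids that received at least one point.
    kept = [(c, members) for c, members in zip(centroids, clusters) if members]
    return [c for c, _ in kept], [members for _, members in kept]


# Made-up 2-D data: the third centroid is far from every point and ends up empty.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0), (100.0, 100.0)]
centroids, clusters = assign_and_prune(points, centroids)
```

With centroids kept in the input space, this distance is a monotone function of the Euclidean distance, so the nearest centroid is the same under both. The sketch only illustrates the formula and the pruning step; the thesis may apply the kernel differently, for example with centroids defined in the feature space.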
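The heat ranking described in the abstract combines several topic features with flexibly assigned weights. A minimal sketch of that idea is a weighted sum of min-max normalized features. The abstract does not give the scoring formula, so the class and field names, the weight values, the normalization and the sample topics below are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    name: str
    article_count: int         # total number of texts in the topic
    hours_since_latest: float  # age of the most recently published text
    duration_hours: float      # time between the first and the latest text
    recent_growth: int         # texts added in the most recent window
    forwards: int
    comments: int


# Hypothetical weights; the abstract only says they are set flexibly.
WEIGHTS = {
    "volume": 0.2,
    "recency": 0.2,
    "duration": 0.1,
    "growth": 0.3,
    "engagement": 0.2,
}


def min_max(values):
    """Scale values to [0, 1]; return all zeros if every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_topics(topics, weights=WEIGHTS):
    """Return (name, heat) pairs sorted from hottest to coldest."""
    features = {
        "volume": min_max([t.article_count for t in topics]),
        # Negated so that a more recent latest text scores higher.
        "recency": min_max([-t.hours_since_latest for t in topics]),
        "duration": min_max([t.duration_hours for t in topics]),
        "growth": min_max([t.recent_growth for t in topics]),
        "engagement": min_max([t.forwards + t.comments for t in topics]),
    }
    scored = []
    for i, topic in enumerate(topics):
        heat = sum(weights[key] * features[key][i] for key in weights)
        scored.append((topic.name, heat))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up sample topics for illustration only.
topics = [
    TopicStats("A", 120, 2.0, 48.0, 30, 900, 400),
    TopicStats("B", 300, 72.0, 240.0, 2, 500, 100),
    TopicStats("C", 40, 1.0, 10.0, 25, 200, 150),
]
ranking = rank_topics(topics)
```

Changing the entries of `WEIGHTS` shifts the ranking toward volume, freshness or engagement, which is one way to read the abstract's "flexibly allocated" weights.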
Keywords/Search Tags: k-means, kernel function, text clustering, topic sort