| Topic detection is an important method for processing Internet news data. Its main task is to automatically detect and organize latent topic information in news data, to collect and organize information scattered across the network, to help people find previously unknown topic events in large volumes of data, and to let them understand an event both as a whole and in detail, thereby mitigating the problem of information overload. Text clustering is an effective approach to the topic detection task. A topic detection model based on text clustering mainly consists of data acquisition, feature selection, text modeling, and a clustering strategy. This paper studies feature selection and text modeling for news topic detection. (1) Original news text contains many noise features, unsupervised feature selection methods have limited selection capability, and supervised feature selection methods cannot be applied directly to topic detection tasks. This paper therefore proposes a feature selection method based on multiple K-means clustering results (FS-MKCR). The method exploits the fact that K-means clustering results depend on the number of clusters and on the choice of initial center points. By applying supervised feature selection methods to K-means clustering results obtained under different initial conditions, noise features are filtered out and the optimal feature subset is obtained. The method thus applies supervised feature selection to the unsupervised learning task of news topic detection, which greatly reduces the influence of noise features and effectively improves the accuracy of topic detection. (2) To obtain feature words with strong topic recognition ability, the concepts of inter-class concentration and intra-class dispersion are introduced to improve the expected cross entropy, and the improved expected cross entropy is combined with FS-MKCR to construct a news topic feature extraction model. While considering the correlation between features and categories, this method also takes into account the distribution of feature items within and between categories, so that feature selection favors features with strong topic recognition ability. This further improves the ability of FS-MKCR to select news topic features. (3) Considering the role of semantic information in topic detection and the differing influence of features on different texts, the Word2vec model and TF-IDF are combined to build a representation of news texts. This method uses Word2vec to capture the semantic correlations implied in the documents, while TF-IDF retains the importance of each feature word for different news texts, yielding a more accurate text representation and further improving the accuracy of news topic detection. |
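
The combination of Word2vec and TF-IDF described in point (3) can be illustrated with a minimal sketch. The abstract does not state how the two are combined, so the sketch assumes one common choice: each document is represented by the TF-IDF-weighted average of its word vectors. The toy two-dimensional `embeddings` stand in for trained Word2vec vectors, the smoothed IDF formula is an assumption, and all function names (`tfidf_weights`, `doc_vector`, `cosine`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that no
    term receives a zero weight.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_counts.items()
        })
    return weights


def doc_vector(term_weights, embeddings, dim):
    """TF-IDF-weighted average of the word vectors of one document."""
    vec = [0.0] * dim
    total = 0.0
    for term, weight in term_weights.items():
        if term not in embeddings:
            continue  # skip words that have no embedding
        for i, value in enumerate(embeddings[term]):
            vec[i] += weight * value
        total += weight
    if total == 0.0:
        return vec  # no known words: return the zero vector
    return [v / total for v in vec]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus: two flood-related news items and one election item.
docs = [
    ["storm", "flood", "city"],
    ["flood", "rescue", "city"],
    ["election", "vote", "city"],
]

# Hypothetical 2-D embeddings standing in for trained Word2vec vectors.
embeddings = {
    "storm": [1.0, 0.0],
    "flood": [0.9, 0.1],
    "rescue": [0.8, 0.2],
    "election": [0.0, 1.0],
    "vote": [0.1, 0.9],
    "city": [0.5, 0.5],
}

weights = tfidf_weights(docs)
vectors = [doc_vector(w, embeddings, dim=2) for w in weights]
```

In this toy setup the two flood-related documents end up closer to each other (by `cosine`) than either is to the election document, and the word "city", which occurs in every document, contributes least to each vector. The resulting vectors could then be passed to a clustering step such as K-means.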