
Research On Text Clustering Algorithm Based On Sentence Vector And Convolutional Neural Network

Posted on: 2023-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Wang
GTID: 2568306848981379
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, network data and information resources grow day by day, producing large volumes of unorganized text data. Quickly and efficiently extracting useful information on related topics from this volume of text has become an important task in network public opinion analysis, and text clustering is one of the main methods for achieving it. Text clustering is a significant branch of text mining and is applied to information retrieval, personalized recommendation, text organization, and so on.

Before text is clustered, a text representation model must be selected to transform the raw text data into a form the computer can process. During this conversion, however, the feature dimension of the text becomes too high, and the order of words and the semantic relationships between them are not considered. After the text is transformed, the K-means clustering algorithm is used to verify whether the features extracted by the text representation model are reasonable and represent the actual content of the text. K-means requires the k value and the initial center points to be set manually in advance, which degrades the clustering result. This paper studies these problems.

(1) To address the high feature dimension and the neglect of document word order and semantics in text clustering, this paper proposes DM-CNN (Distributed Memory Convolutional Neural Networks), a text representation model for text clustering that combines sentence vectors with a convolutional neural network. First, the DM model converts the text in the data set into sentence vectors, taking the order and semantics of the words in each document into account. Then, a CNN extracts deep semantic features of the text, which addresses the high feature dimension and yields text feature vectors that can be used for clustering. Finally, the K-means algorithm performs the clustering. In experiments on the Sogou news data set, the proposed text representation model reaches an accuracy of 0.776 and an F-value of 0.780, an improvement over the other text representation models compared.

(2) In text clustering, the K-means algorithm requires the k value and the initial center points to be set manually in advance, so its accuracy is not high and the clustering result may converge to a local optimum. To address this, this paper proposes the CPKM (Canopy+K-means) algorithm. It introduces the Canopy+ algorithm on top of K-means to predict the k value and the initial cluster centers, so that the clustering result better matches the actual category information of the data set. Experiments on different data sets show that the CPKM algorithm outperforms K-means. On the Sina news data set, CPKM reaches an accuracy of 0.730 and an F-value of 0.720; on the Toutiao data set, its accuracy and F-value are 0.734 and 0.727 respectively.

In summary, this paper proposes the DM-CNN text representation model and the CPKM clustering algorithm. Experiments on different data sets, evaluated by accuracy and F-value, show that the DM-CNN model is effective for text representation and expresses text information more accurately, and that the CPKM algorithm achieves a better clustering effect on the data sets.
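The idea behind CPKM, letting a Canopy pass choose k and the initial centers before K-means runs, can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses the classic Canopy procedure with a loose threshold `t1` and a tight threshold `t2` rather than the Canopy+ variant, whose details the abstract does not give, and it runs on toy 2-D points rather than DM-CNN feature vectors. All function names, parameters, and threshold values below are hypothetical and chosen only for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def canopy_centers(points, t1, t2, seed=0):
    """Classic Canopy pass (assumed here; not the thesis's Canopy+).

    Returns one center per canopy. The number of canopies serves as k.
    t1 is the loose threshold, t2 the tight one, with t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = list(points)
    rng.shuffle(remaining)
    centers = []
    while remaining:
        pivot = remaining.pop()
        # Every candidate within t1 of the pivot joins this canopy.
        members = [pivot] + [p for p in remaining if euclidean(p, pivot) < t1]
        centers.append(mean(members))
        # Candidates within t2 can no longer start a canopy of their own.
        remaining = [p for p in remaining if euclidean(p, pivot) >= t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return labels, centers


def demo():
    """Three well-separated toy blobs of 20 points each."""
    rng = random.Random(42)
    blobs = [(0, 0), (10, 10), (0, 10)]
    points = [
        (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
        for cx, cy in blobs
        for _ in range(20)
    ]
    centers = canopy_centers(points, t1=6.0, t2=4.0)
    labels, centers = kmeans(points, centers)
    return points, labels, centers


if __name__ == "__main__":
    _, labels, centers = demo()
    print("estimated k:", len(centers))
    print("labels:", labels)
```

In this sketch the thresholds are picked by hand so that each toy blob forms exactly one canopy. On real text features the result depends strongly on how `t1` and `t2` are chosen.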
Keywords/Search Tags: Text Clustering, Sentence Vector Model, CNN, Text Representation, K-means