
Research On Text Clustering Based On Semi-supervised Learning

Posted on: 2021-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: H F Xu
Full Text: PDF
GTID: 2518306032465094
Subject: Computer application technology
Abstract/Summary:
With the continuous development of the Internet, people can obtain a large amount of information, most of which exists in the form of text. In the real world, there are usually a small number of labeled samples and a large number of unlabeled samples. Supervised learning generalizes poorly when trained on only a limited number of labeled samples, and unsupervised learning cannot efficiently use unlabeled samples. Semi-supervised learning can use a small number of labeled samples together with a large number of unlabeled samples to improve learning performance, so the study of semi-supervised text clustering is of great significance. Common text clustering algorithms cannot efficiently use unlabeled samples for clustering, the feature vector affects the clustering result, and a high feature dimension can lead to poor clustering.

In response to these problems, a text clustering algorithm based on semi-supervised learning is proposed. The algorithm first preprocesses the experimental corpus, trains a word2vec model on the corpus to learn the semantic relationships between words, and converts each text into a sparse initial vector representation. A feature extraction model based on a convolutional neural network then extracts features from this initial vector: the network is trained on a portion of the labeled samples, and the trained network is used to extract features from the text vectors. The model both extracts the important features and reduces the feature dimension. Finally, to address the dependence of the K-means algorithm on its initial cluster centers, a semi-supervised method uses a small number of labeled samples to determine the initial cluster centers. Because a labeled sample may be an isolated point, such points are excluded by calculating the Mahalanobis distance between sample points.

Experiments were conducted on the 20 Newsgroups text data set and on an artificial data set, using three clustering evaluation indexes: accuracy, NMI, and F-Measure. On 20 Newsgroups, the accuracy of the proposed algorithm exceeds 45%, the NMI value exceeds 35%, and the F-Measure exceeds 39%, which is better than the other text clustering algorithms compared. The artificial data set is drawn from 2012 Sohu news data. On this data set, the accuracy of the proposed algorithm exceeds 90%, the NMI value exceeds 63%, and the F-Measure exceeds 80%. The algorithm can thus address the problem of high-dimensional, sparse short-text features.
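The seeding step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The function names `seeded_centers` and `kmeans_from_centers`, the `threshold` value, the use of one covariance matrix shared across all labeled samples, and measuring the Mahalanobis distance from each labeled point to its class mean are all assumptions made for illustration. The inputs are assumed to be feature vectors that have already been extracted, with at least two dimensions.

```python
import numpy as np


def seeded_centers(X_lab, y_lab, threshold=3.0):
    """Return one initial center per class, ignoring labeled points that look isolated.

    threshold is a hypothetical cutoff on the Mahalanobis distance.
    """
    # Assumption: one covariance matrix estimated from all labeled samples.
    cov_inv = np.linalg.pinv(np.cov(X_lab, rowvar=False))
    centers = []
    for c in np.unique(y_lab):
        pts = X_lab[y_lab == c]
        diff = pts - pts.mean(axis=0)
        # Squared Mahalanobis distance of each labeled point to its class mean.
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        d = np.sqrt(np.maximum(d2, 0.0))
        keep = d <= threshold
        # Fall back to all points if the filter would remove every sample.
        centers.append(pts[keep].mean(axis=0) if keep.any() else pts.mean(axis=0))
    return np.vstack(centers)


def kmeans_from_centers(X, centers, n_iter=100):
    """Run plain Lloyd iterations of K-means starting from the given centers."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

With a feature matrix `X` and a small labeled subset `X_lab`, `y_lab`, one would call `kmeans_from_centers(X, seeded_centers(X_lab, y_lab))`. Because the centers come from labeled samples, cluster index k corresponds to the k-th sorted class label, which makes accuracy straightforward to compute afterwards.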
Keywords/Search Tags:Semi-supervised learning, Text clustering, Convolutional neural network, Feature extraction