Research On Text Clustering Based On Semi-supervised Learning

Posted on:2021-08-13

Degree:Master

Type:Thesis

Country:China

Candidate:H F Xu

GTID:2518306032465094

Subject:Computer application technology

Abstract/Summary:

With the continuous development of the Internet, people can obtain a large amount of information, most of which exists in the form of text. In the real world, there are usually a small number of labeled samples and a large number of unlabeled samples. Supervised learning generalizes poorly when trained on only a limited number of labeled samples, and unsupervised learning cannot efficiently use unlabeled samples. Semi-supervised learning can use a small number of labeled samples together with a large number of unlabeled samples to improve learning performance. Therefore, it is of great significance to study semi-supervised text clustering.

Common text clustering algorithms cannot efficiently use unlabeled samples for clustering, the feature vector affects the clustering result, and a high feature dimension can lead to poor clustering. In response to these problems, a text clustering algorithm based on semi-supervised learning is proposed. The algorithm first preprocesses the experimental corpus, trains a word2vec model on it to learn the semantic relationships between words, and converts the text into a sparse original vector form. A feature extraction model based on a convolutional neural network then extracts features from the original vectors: the network is trained on a portion of the labeled samples, and the trained network is used to extract features from the text vectors. The model not only extracts important features but also reduces the feature dimension. Finally, to address the dependence of the K-means algorithm on its initial cluster centers, a semi-supervised method uses a small number of labeled samples to determine the initial cluster centers. Because a labeled sample may be an isolated point, such points are excluded by calculating the Mahalanobis distance between sample points.

In this paper, experiments were conducted on the text data set 20 Newsgroups and on an artificial data set, and three clustering evaluation indexes were used: accuracy, NMI, and F-Measure. The experimental results on 20 Newsgroups show that the accuracy of the proposed algorithm reaches more than 45%, the NMI value reaches more than 35%, and the F-Measure reaches more than 39%, which is better than other text clustering algorithms. The experiment on the artificial data set uses 2012 Sohu news data. The experimental results show that the accuracy of the proposed algorithm exceeds 90%, the NMI value exceeds 63%, and the F-Measure exceeds 80%. The algorithm can thus address the problem of high-dimensional and sparse short text features.
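The K-means initialization step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so several details here are assumptions: the function names `seed_centers` and `kmeans` and the `max_dist` cutoff are hypothetical, the Mahalanobis distance is measured from each labeled sample to its own class mean (the abstract speaks only of the distance "between sample points"), and toy 2-D points stand in for the CNN-extracted text features.

```python
import numpy as np

def seed_centers(X_lab, y_lab, max_dist=2.0):
    """One initial center per class: the mean of that class's labeled
    samples after dropping samples whose Mahalanobis distance to the
    class mean exceeds max_dist (an assumed, tunable cutoff)."""
    centers = []
    for label in np.unique(y_lab):
        pts = X_lab[y_lab == label]
        mean = pts.mean(axis=0)
        if len(pts) < 3:
            # Too few samples to estimate a covariance; keep the plain mean.
            centers.append(mean)
            continue
        # Pseudo-inverse keeps the distance defined for singular covariances.
        cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(pts, rowvar=False)))
        diff = pts - mean
        sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        dist = np.sqrt(np.maximum(sq, 0.0))
        kept = pts[dist <= max_dist]
        centers.append(kept.mean(axis=0) if len(kept) else mean)
    return np.array(centers)

def kmeans(X, init_centers, n_iter=100):
    """Plain Lloyd's K-means started from the given centers (n_iter >= 1)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Euclidean distance from every sample to every center: shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two labeled classes on small grids, plus one isolated
# labeled point at (50, 50) that is wrongly tagged as class 0.
grid = np.array([[i, j] for i in (-1, 0, 1) for j in (-1, 0, 1)], dtype=float)
X_lab = np.vstack([grid, [[50.0, 50.0]], grid + [10.0, 0.0]])
y_lab = np.array([0] * 10 + [1] * 9)

# Samples to cluster: two tight groups around (0, 0) and (10, 0).
X_all = np.vstack([grid * 0.5, grid * 0.5 + [10.0, 0.0]])

init = seed_centers(X_lab, y_lab)
labels, centers = kmeans(X_all, init)
```

Without the filtering step, the class 0 seed would be pulled toward the isolated point; with it, both seeds start inside their clusters. Note that a sample's Mahalanobis distance under its own class covariance is bounded by roughly the square root of the class size, so a suitable `max_dist` depends on how many labeled samples each class has.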

Keywords/Search Tags:

Semi-supervised learning, Text clustering, Convolutional neural network, Feature extraction

Related items

1	Biomedical Entity Relation Extraction Based On Semi-supervised Learning And Deep Learning
2	A Study On Optimization Of Text Clustering Based On Convolutional Neural Network
3	Research On High Performance Chinese Text Classification Based On Machine Learning
4	Semi-supervised Learning On Text Data
5	Research On Key Problems In Text Mining
6	Semi-supervised Image Classification Based On Relationship Representation
7	Research Of Semi-supervised Face Recognition By Convolutional Neural Networks Based On Graph Clustering
8	Research On Semi-supervised Clustering And Classification Algorithm
9	Research On Text Clustering Algorithm Based On Deep Learning Feature Extraction
10	Research And Application Of Convolutional Neural Network In Collaborative Semi-Supervised Classification