
Significant Study Of Text Clustering Model Based On Machine Learning

Posted on: 2020-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: L R Sun
GTID: 2428330611999587
Subject: Applied Mathematics
Abstract/Summary:
In the era of big data, clustering analysis of text information is a necessary task. Clustering is one of the most representative methods in machine learning, and feature extraction has a strong influence on the accuracy of clustering results. Traditional clustering algorithms are mainly used for low-dimensional data and still have limitations when applied to text. How to improve clustering accuracy through feature engineering is therefore worth investigating.

From an application engineering perspective, this thesis proposes an improved feature extraction model that addresses the sparseness of a movie review text dataset, and combines it with two traditional clustering algorithms for experimental comparison. Building on the improved feature extraction model, it then proposes a chase clustering algorithm that effectively improves the accuracy of the clustering results. Because texts differ and different feature extraction models suit different text collections, the thesis selects two different types of text datasets for clustering and applies the improved clustering algorithm to movie review clustering.

The thesis presents a new S-T feature extraction model based on the skip-gram model and the TF-IDF model, and combines it with traditional clustering algorithms. In the S-T model, TF-IDF is used to learn the importance of each feature word. Through the unsupervised word2vec method, the hidden-layer neurons of the skip-gram model learn the textual context of each feature word. Negative sampling is used to improve performance, which changes how the weights in the neural network are updated. Maximum likelihood estimation turns training into an optimization problem that is solved by stochastic gradient ascent, which addresses the sparse features of the original text.

In comparative experiments, the significance evaluation index indicates that spectral clustering outperforms k-means clustering on the low-dimensional feature words of the ancient poetry dataset. On the high-dimensional, multi-feature movie review dataset, k-means clustering outperforms spectral clustering, and the S-T model raises the best average accuracy from the original 43.6% to 63.28%.

Because the Bayesian model can learn the number of clusters automatically from the text dataset, the thesis proposes a chase clustering model based on the Bayesian method. To account for the characteristics of feature items in text clustering, the chase clustering model modifies the conditional probability in the Bayesian formula. With one small added constraint, the chase center point, the data is clustered through self-learning. In comparative experiments, the dimensionality-reduced clustering plots and the saliency evaluation index show that the improved chase clustering algorithm clusters better than the combination of the S-T model with traditional clustering.
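The abstract does not state how the S-T model combines TF-IDF weights with skip-gram word vectors. One common way to do so, assumed here purely for illustration, is a TF-IDF-weighted average of word vectors per document. In the minimal sketch below, the function names `tfidf` and `doc_vector`, the smoothed IDF formula, and the hand-set two-dimensional `embeddings` (standing in for trained skip-gram vectors) are all hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency times a smoothed inverse document frequency
            # (the smoothing is an assumption, not the thesis's formula).
            w: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, c in counts.items()
        })
    return weights


def doc_vector(doc_weights, embeddings):
    """TF-IDF-weighted average of word vectors (assumed combination rule)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, w in doc_weights.items():
        vec = embeddings.get(word)
        if vec is None:  # skip words that have no embedding
            continue
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy corpus and toy embeddings; real vectors would come from skip-gram training.
docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie", "bad", "acting"],
    ["great", "plot"],
]
embeddings = {
    "great": [1.0, 0.0], "plot": [0.8, 0.2],
    "boring": [0.0, 1.0], "bad": [0.1, 0.9],
    "movie": [0.5, 0.5], "acting": [0.5, 0.5],
}
vectors = [doc_vector(w, embeddings) for w in tfidf(docs)]
```

The resulting dense `vectors` could then be passed to any clustering routine, such as k-means or spectral clustering, in place of the sparse bag-of-words representation.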
Keywords/Search Tags: text clustering, feature extraction, Bayesian method, clustering algorithm