
Text Clustering Based On Center Point Selection And Deep Representation Learning

Posted on: 2019-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: B Y Wang
Full Text: PDF
GTID: 2428330566970925
Subject: Mathematics
Abstract/Summary:
With the advent of the Big Data era, textual information is produced in abundance and is rich in value. How to use this information sensibly has become both an opportunity and a challenge. Text clustering, one of the important methods of text data mining, groups a large number of unorganized articles by similarity, so that articles with similar topics or content fall into the same cluster while dissimilar articles fall into different clusters. As an unsupervised machine learning method, text clustering requires neither a training process nor manual labeling of texts in advance, which gives it flexibility and a high degree of automation. It has become an important means of organizing, summarizing, and navigating textual information, and it attracts the attention of a growing number of researchers. This thesis studies and improves text-oriented clustering algorithms and proposes a more efficient clustering algorithm. Building on this, it further studies text clustering based on deep-learning text representations and proposes a new text clustering method. The main work of the thesis is as follows:

1. Among text clustering algorithms, K-means based on cosine similarity has long been one of the most important and most widely used, owing to its simplicity, fast convergence, and ability to process large data sets efficiently. To address the difficulty of selecting initial points for cosine-similarity K-means, the relationship between cosine similarity and Euclidean distance is discussed, and conversion formulas between the two are derived under the premise of normalized vectors. On this basis, a cosine distance whose meaning is closely tied to the Euclidean distance is introduced, so that improvements originally developed for Euclidean-distance K-means can be carried over to cosine-similarity K-means through the cosine distance. Then, based on this theory, the method for computing within-cluster center points in the cosine K-means algorithm and its extensions is derived, and the scheme for selecting initial cluster centers is further improved, yielding a new text clustering algorithm, MCSKM++. Experiments show that the algorithm improves clustering accuracy while reducing the number of iterations and shortening the running time.

2. The text data to be clustered carry no labels, so a deep representation model of the text cannot be trained on them directly. To address this, a text clustering method based on deep representation learning is proposed, which combines transfer-learning domain adaptation with parameter updates during the clustering iterations. First, the model uses source-domain data to pre-train a deep learning classification model, which serves as the initialization of the model parameters. Next, a domain discriminator is added to the model to classify each input sample by domain; when the discriminator can no longer tell which domain the data belong to, a feature space common to the two domains has been obtained and the domain adaptation problem is solved. Finally, the feature vectors produced by the model are clustered. The clustering iterations are optimized with the expectation-maximization (EM) algorithm, and the model parameters are adjusted continuously during these iterations so that they better fit the data characteristics of the target domain. The text clustering result is obtained when the objective function converges. Experiments show that the clustering accuracy of the algorithm is superior to that of similar algorithms.
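The idea in point 1 can be illustrated with a small sketch. It is not the thesis's MCSKM++ algorithm, whose details the abstract does not give; it is a generic cosine K-means with k-means++-style seeding, written under the assumption that "normalized vectors" means unit-length vectors. For unit vectors x and y, the squared Euclidean distance satisfies ||x - y||^2 = 2(1 - cos(x, y)), so sampling seeds in proportion to the cosine distance 1 - cos(x, y) is equivalent to the usual squared-Euclidean seeding rule. The function names (`normalize_rows`, `seed_centers`, `cosine_kmeans`) and the toy data are hypothetical, and the center update (re-normalized sum of cluster members) is a standard choice for cosine K-means rather than a formula taken from the thesis.

```python
import numpy as np


def normalize_rows(X):
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)


def cosine_distance(X, c):
    """Return 1 - cos(x, c) for unit-norm rows of X and a unit-norm vector c."""
    return 1.0 - X @ c


def seed_centers(X, k, rng):
    """k-means++-style seeding on unit vectors.

    Each new center is sampled with probability proportional to the cosine
    distance to the nearest center chosen so far. On unit vectors this is
    proportional to the squared Euclidean distance, since
    ||x - y||^2 = 2 * (1 - cos(x, y)).
    """
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d = cosine_distance(X, centers[0])
    for _ in range(1, k):
        d = np.clip(d, 0.0, None)  # remove tiny negative rounding errors
        total = d.sum()
        probs = d / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
        d = np.minimum(d, cosine_distance(X, centers[-1]))
    return np.stack(centers)


def cosine_kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X by cosine similarity; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(np.asarray(X, dtype=float))
    centers = seed_centers(X, k, rng)
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Assign each point to the center with the highest cosine similarity.
        new_labels = np.argmax(X @ centers.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centers will not change
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Re-normalized sum of members: the unit vector with the
                # highest total cosine similarity to the cluster.
                summed = members.sum(axis=0, keepdims=True)
                centers[j] = normalize_rows(summed)[0]
    return labels, centers


# Toy "documents" as 2-D term-weight vectors (made-up data).
docs = np.array([[1.0, 0.1], [2.0, 0.1], [0.9, 0.0],
                 [0.1, 1.0], [0.0, 3.0], [0.3, 2.0]])
labels, centers = cosine_kmeans(docs, k=2, seed=0)
print(labels)
```

Because the vectors are normalized first, documents that differ only in length (for example, `[1.0, 0.1]` and `[2.0, 0.1]` are nearly parallel) are treated as similar, which is the usual motivation for cosine similarity in text clustering.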
Keywords/Search Tags:Text Clustering, K-means Algorithm, Cosine Similarity, Deep Learning, Text Representation, Transfer Learning