
Research On English Text Clustering Method Based On Vector Space

Posted on: 2020-03-02; Degree: Master; Type: Thesis
Country: China; Candidate: P Q Yang
GTID: 2428330575965376; Subject: Engineering
Abstract/Summary:
With the rapid development of computer technology and the Internet, the volume of text data is growing ever faster, which poses great challenges for the classification of web text information. Given the diversity of the massive text data on today's networks, it is of great significance to explore the potential value of the data and to find usable information in it. This thesis studies the clustering of English texts. English text is very different from Chinese text: in the former, words are separated by spaces or punctuation marks, whereas in the latter, sentences are made up of words written consecutively. The first steps of English text processing are tokenization, stop-word removal, and stemming, which keep only the feature words that are meaningful in the text.

However, the preprocessed text still cannot be used directly for cluster analysis, because unstructured text data must first be converted into a structured form. This thesis adopts the Vector Space Model (VSM), which is based on algebraic theory and transforms preprocessed text into a set of features and weights, that is, into vector form. The VSM is easy to understand and yields a data form that a computer can process. However, the model has shortcomings. Each text in the collection is composed of a large number of features, so the text vectors are high-dimensional and sparse, which makes text similarity difficult to calculate. The model also treats words as independent of one another, which has a negative effect on text clustering. In view of these problems, the main research content of this thesis is as follows.

In the first part, an improved similarity calculation method is proposed to address the difficulty of calculating similarity between high-dimensional, sparse text vectors. It obtains accurate similarity values and largely overcomes the computational inaccuracy caused by these problems. At the same time, the Random Walk method and a Stacked Denoising Autoencoder are adopted to improve the similarity matrix's resistance to interference and its ability to separate weak cluster boundaries, and to obtain a deeper feature representation of the matrix, which makes the algorithm more robust. Finally, the K-Means algorithm is used for cluster analysis.

The second part extends the work to English short-text clustering. Short texts contain few words, are noisy, often do not follow language conventions, rely more heavily on the expressive power of individual words, and are more sensitive to the relationships between words. First, we analyze the characteristics of short texts, which have few words but strong expressive power. TF-IDF weakens the expressive power of feature words and increases the sparsity of the text vectors, so raw term-frequency counts are used as the vector representation of a short text, which preserves the text content simply and effectively and, to some extent, alleviates the problems caused by sparsity. Second, experimental evidence shows that some features introduce unreasonable co-occurrences in the singular value decomposition (SVD) algorithm, so this thesis proposes a Word Document Frequency method for feature filtering. Third, because the VSM treats the words in a text as independent of one another, SVD is used to mine the latent semantic relations between words, achieving denoising and dimension reduction while preserving the original content. Finally, because short texts differ greatly from one another, the K-Means method is sensitive to "noise data": outlying values distort the distribution of the data. An improved K-Medoids method is therefore used for cluster analysis; it chooses an object at the center of each cluster as the representative, which avoids the influence of bad data.

In summary, this thesis uses the simple and effective VSM to transform the original text into a vector space and, in view of the model's shortcomings, chooses appropriate solutions for different experimental data. Experimental results show that the proposed method achieves a good clustering effect.
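
The baseline pipeline described in the abstract (preprocessing, VSM weighting, similarity, and a medoid as cluster representative) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's improved similarity measure or improved K-Medoids method: the stop-word list, the function names, the plain `tf * log(N / df)` weighting, and the use of cosine similarity are all assumptions chosen for illustration, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "on", "to"}


def preprocess(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def medoid(vectors):
    """Index of the vector with the highest total similarity to all vectors."""
    return max(
        range(len(vectors)),
        key=lambda i: sum(cosine(vectors[i], v) for v in vectors),
    )


docs = [
    "The cat sat on the mat",
    "A cat sat on a mat",
    "Stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]), medoid(vecs))
```

Because a medoid is always one of the actual documents, a single outlying document cannot pull the cluster representative away from the data the way it can pull a K-Means centroid, which is the motivation the abstract gives for preferring K-Medoids on noisy short texts.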
Keywords/Search Tags: Vector Space Model, English text clustering, Improved Similarity Algorithm, Singular Value Decomposition