Design And Implementation Of Text Clustering Based On Vector Space Model

Posted on:2015-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Wu

GTID:2298330434950550

Subject:Electronic and communication engineering

Abstract/Summary:

Text clustering is an important branch of data mining. It can group the text information on the web effectively, which not only helps users find useful information in the vast amount of network information but also improves the quality of network services. The research in this thesis addresses the clustering of Chinese web text, so that similar texts on the web can be grouped together. Chinese text is written as a continuous sequence of characters and words, and unlike English text it does not use blank spaces as boundary marks. Before Chinese text can be clustered, each sentence therefore needs to be segmented into small vocabulary units. In addition, the parts of the text that are not key words need to be removed, retaining the important parts that represent the text content. However, text clustering algorithms cannot operate directly on the original Chinese text, because the content is written in human natural language: it is unstructured, and its semantics are hard for a computer to process. Structuring the text means translating it into a model that the computer can process. The appropriate text representation model is selected according to the characteristics of the text and the processing requirements. This thesis uses the vector space model (VSM), because the VSM represents a text as a set of feature terms and their weights, that is, as a feature vector, so that the clustering operation is transformed into vector operations in a vector space.
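The preprocessing and representation steps described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the documents are assumed to be already segmented, the stop-word list is invented for the example, raw term counts stand in for feature weights, and the helper names (`remove_stop_words`, `build_vocabulary`, `to_vector`, `cosine_similarity`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-segmented documents; a real pipeline would obtain
# these tokens from a Chinese word segmenter, which is not shown here.
docs = [
    ["文本", "聚类", "的", "研究"],
    ["文本", "聚类", "算法"],
    ["图像", "去噪", "的", "算法"],
]

# Illustrative stop-word list (an assumption, not taken from the thesis).
STOP_WORDS = {"的"}


def remove_stop_words(tokens):
    """Drop tokens that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Collect the distinct terms; each term becomes one vector dimension."""
    return sorted({t for tokens in token_lists for t in tokens})


def to_vector(tokens, vocabulary):
    """Represent a document as raw term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


filtered = [remove_stop_words(d) for d in docs]
vocabulary = build_vocabulary(filtered)
vectors = [to_vector(d, vocabulary) for d in filtered]
```

Once every text is a vector over the same vocabulary, comparing two texts reduces to a vector operation such as `cosine_similarity(vectors[0], vectors[1])`, which is the property the abstract gives as the reason for choosing the VSM.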
There are many ways to convert text into vectors. This thesis chooses the classic feature weighting method, term frequency-inverse document frequency (TF-IDF), for the structured processing of Chinese text, because TF-IDF reflects how important a feature is with respect to its distribution over the whole text collection. Although the vector transformation makes the text usable for computer processing, the texts in the collection are composed of a large number of features, so the vectors often have a high dimension, which degrades the clustering result. The individual text vectors may also lie in different vector spaces, making it difficult to calculate similarity. The texts therefore need to be mapped from the original feature space to another feature space of lower dimension; that is, the features should be optimized. Latent semantic analysis (LSA), by means of singular value decomposition (SVD), can not only map the non-orthogonal, high-dimensional feature vectors of the vector space model into a latent semantic space of only a few dimensions, but also keep the basic semantic features of the original space, thereby achieving both dimensionality reduction and noise reduction of the feature space. After the SVD, a clustering algorithm can be applied to the texts. Current clustering algorithms can be divided into four categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods. Among these, this thesis chooses the density-based clustering algorithm Ordering Points To Identify the Clustering Structure (OPTICS), because compared with other clustering methods it can find text clusters of different shapes, it can filter outliers, and it gives better clustering results on web text. Finally, the clustering results are processed with the single-parameter exponential smoothing method, which makes them more accurate.
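To make the TF-IDF weighting concrete, here is a small sketch. TF-IDF has several common variants and the abstract does not say which one the thesis uses; this example assumes term frequency normalised by document length and an unsmoothed logarithmic inverse document frequency. The function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(token_lists):
    """Return (vocabulary, weights); weights[i][j] is the TF-IDF weight
    of vocabulary[j] in document i.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(token_lists)
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        row = []
        for term in vocabulary:
            tf = counts[term] / total if total else 0.0
            idf = math.log(n_docs / df[term])
            row.append(tf * idf)
        weights.append(row)
    return vocabulary, weights


# Hypothetical segmented documents with stop words already removed.
docs = [
    ["文本", "聚类", "研究"],
    ["文本", "聚类", "算法"],
    ["图像", "去噪", "算法"],
]
vocabulary, weights = tf_idf(docs)
```

With this variant, a term that occurs in only one document (such as "研究") receives a larger weight than one shared by two documents (such as "文本"), and a term occurring in every document would receive weight zero. The rows of `weights` form the document-term matrix that a later dimensionality-reduction step such as SVD would take as input.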
The experiments show that this method is suitable for clustering analysis of web text.
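The single-parameter exponential smoothing mentioned in the abstract follows the standard recurrence s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_(t-1). The abstract does not state which sequence derived from the clustering results is smoothed or what value of the smoothing parameter is used, so the sketch below only shows the recurrence on a generic numeric sequence; the function name, the sample values and alpha = 0.5 are assumptions made for illustration.

```python
def exponential_smoothing(values, alpha):
    """Single-parameter (simple) exponential smoothing.

    The first smoothed value is the first observation; each later value is
    alpha * observation + (1 - alpha) * previous smoothed value.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical sequence; in the thesis this would be some quantity taken
# from the clustering output, which the abstract does not specify.
series = [4.0, 8.0, 6.0, 10.0]
result = exponential_smoothing(series, alpha=0.5)
```

A smaller alpha gives more weight to the history and produces a smoother sequence, while alpha = 1 returns the input unchanged.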

Keywords/Search Tags:

Chinese text clustering, Singular value decomposition, Vector space model, OPTICS

Related items

1	Research Of Chinese Text Classification Based On Improved Vector Space Model
2	Research On English Text Clustering Method Based On Vector Space
3	Research On Text Clustering Based On Text Dimension Reduction And Ant Colony Algorithm
4	Chinese Text Clustering Based On Latent Semantic And Its Applications
5	Research On Image Denoising Algorithm Based On Image Self-similarity And Singular Value Decomposition
6	Research Of Network Hotspot Content Classification Based On Improved Singular Value Decomposition And Cosine Theorem
7	The Chinese Text Categorization Research Based On Support Vector Machine And Clustering Algorithm
8	Research On Chinese Text Clustering Of Neural Network Of Support Vector Machine
9	Study On Two-stage Chinese Text Clustering Based On Self-organizing Of Map
10	Research On Text Clustering Algorithm Based On Latent Semantic Indexing