Research On Text Clustering Algorithm Based On Semantic Similarity

Posted on:2018-01-19

Degree:Master

Type:Thesis

Country:China

Candidate:C T Li

Full Text:PDF

GTID:2348330569986410

Subject:Computer Science and Technology

Abstract/Summary:

With the advent of the information age, people are submerged in a vast volume of information. The network brings abundant information resources, and how to extract the useful information from them has become a problem that needs to be studied and solved. About 80% of the information people receive daily is in the form of text, so text mining is attracting more and more attention. Because text clustering is widely applicable in daily life and work, text clustering methods have great research and application value. Commonly used text clustering methods mostly represent text with the vector space model. This representation suffers from high dimensionality and sparseness, and it ignores the semantic relations between words; as a result, clustering accuracy is low. To solve these problems, this thesis combines feature extraction with lexical semantics to compute text similarity and uses a density-based method to cluster the text set. The thesis also applies the bee colony algorithm to text clustering. The basic bee colony algorithm has two defects: first, the bees' initial positions are assigned randomly, which may be unreasonable, so the algorithm needs too many iterations and loses efficiency; second, the algorithm tends to converge to a local optimum in the later stage of execution. The improved algorithm introduces the maximum-minimum distance method in the initialization stage so that the initial values are reasonable and evenly distributed. The K-means algorithm is applied during the execution of the bee colony algorithm to update and refine the cluster centers obtained in each iteration. This not only speeds up the algorithm but also improves its results and makes it more robust. For the experiments, 500 texts in five categories were randomly selected from the Fudan University Chinese text corpus. The clustering results were evaluated by clustering accuracy, recall, and F-measure. Compared with the vector space model (VSM) based K-means algorithm and a semantically improved K-means short text clustering algorithm, all three measures improve, reaching about 80%. This achieves the goal of improving text clustering and demonstrates the reasonableness and validity of the algorithm.
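
The two refinements described in the abstract, maximum-minimum distance initialization and a K-means update of the cluster centers, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names (`max_min_init`, `kmeans_step`), the choice of the first point as the starting center, and the use of Euclidean distance on toy 2D points are all assumptions made for illustration. The thesis measures similarity with lexical semantics, and the bee colony search itself is omitted here.

```python
import math


def euclidean(a, b):
    # Assumption: plain Euclidean distance stands in for the thesis's
    # semantic similarity measure, which the abstract does not specify.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_init(points, k, dist=euclidean):
    """Pick k initial centers with the maximum-minimum distance heuristic."""
    centers = [points[0]]  # assumption: seed with the first point
    while len(centers) < k:
        # For every point, take its distance to the nearest chosen center,
        # then pick the point for which that distance is largest.
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def kmeans_step(points, centers, dist=euclidean):
    """One K-means update: assign points to the nearest center, then average."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    new_centers = []
    for members, old in zip(clusters, centers):
        if members:
            dim = len(old)
            mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            new_centers.append(mean)
        else:
            new_centers.append(old)  # keep the old center for an empty cluster
    return new_centers, clusters


# Toy data: three well-separated groups of two points each.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
centers = max_min_init(points, 3)
new_centers, clusters = kmeans_step(points, centers)
```

In a hybrid of the kind the abstract describes, a step like `kmeans_step` would be applied to the centers proposed in each bee colony iteration; how the thesis interleaves the two is not stated in the abstract.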

Keywords/Search Tags:

text clustering, semantic similarity, density based clustering, bee colony algorithm

Related items

1	Theory And Practice Of Hybrid Clustering Algorithm Based On Density And Ant Colony
2	Research On Text Clustering Based On Semantic Similarity
3	Research On Text Clustering Algorithm Based On Word Frequency And Semantic
4	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity
5	Study On The Chinese Text Clustering Algorithm Based On Semantic Similarity
6	Research On Hybrid Ant Colony Clustering Algorithm
7	Study Of Text Clustering Algorithm Based On Semantics
8	Study On Similarity-based Text Clustering Algorithm And Its Application
9	Manifold Density Peak Clustering Algorithm And Its Application Of Weibo Text Classification
10	Clustering Algorithm Research Of Short Text Based On Semantic Similarity