
Research On Text Clustering Algorithm Based On Semantic Similarity

Posted on: 2018-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: C T Li
GTID: 2348330569986410
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the information age, people are submerged in a vast volume of information. The network brings us information resources, and how to extract the useful information from them has become a problem that needs to be studied and solved. About 80% of the information people receive daily is in the form of text, so text mining is attracting more and more attention. Because text clustering is widely applicable in everyday life and work, text clustering methods have great research and application value.

Commonly used text clustering methods mostly represent text with the vector space model. This representation suffers from high dimensionality and sparseness, and it ignores the semantic relationships between words, so clustering accuracy is low. To address these problems, this thesis combines feature extraction with lexical semantics to compute text similarity and uses a density-based method to cluster the text set.

I also adopt the bee colony algorithm to cluster the texts. The basic bee colony algorithm has two defects. First, the bees' initial positions are assigned randomly, which may be unreasonable, causing the algorithm to iterate too many times and reducing its efficiency. Second, the algorithm tends to converge to a local optimum in the later stage of execution. The improved algorithm introduces the max-min distance method in the initial stage so that the initial values are reasonable and evenly distributed. K-means is applied during the execution of the bee colony algorithm to update and improve the cluster centers obtained in each iteration. This not only speeds up the algorithm but also makes it better and more robust.

For the experiments, 500 texts in five categories were randomly selected from the Fudan University Chinese text corpus. The clustering results were evaluated by clustering accuracy, recall, and F-measure. Compared with the VSM-based K-means algorithm and a semantically improved K-means short-text clustering algorithm, all of these measures improve, reaching about 80%. This achieves the goal of improving text clustering and demonstrates the soundness and validity of the algorithm.
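
The initialization and refinement ideas mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the number of clusters k is known, that texts are already dense numeric vectors, and that plain Euclidean distance stands in for the semantic similarity measure. The bee colony search itself is omitted. The names max_min_init and kmeans_step are hypothetical.

```python
import math

def euclidean(a, b):
    # Assumption: Euclidean distance stands in for the semantic distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def max_min_init(points, k, dist=euclidean):
    """Pick k initial centers with the max-min distance rule."""
    # Assumption: the first point seeds the selection.
    centers = [points[0]]
    # min_d[i] is the distance from points[i] to its nearest chosen center.
    min_d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Choose the point that is farthest from all current centers.
        idx = max(range(len(points)), key=lambda i: min_d[i])
        centers.append(points[idx])
        min_d = [min(d, dist(p, points[idx])) for d, p in zip(min_d, points)]
    return centers

def kmeans_step(points, centers, dist=euclidean):
    """One K-means refinement: assign each point, then recompute the means."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        groups[nearest].append(p)
    new_centers = []
    for old, group in zip(centers, groups):
        if group:
            new_centers.append(tuple(sum(c) / len(group) for c in zip(*group)))
        else:
            new_centers.append(old)  # keep an empty cluster's center unchanged
    return new_centers

# Toy data: three well-separated pairs of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0),
          (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
centers = max_min_init(points, 3)
refined = kmeans_step(points, centers)
print(centers)
print(refined)
```

On this toy data the max-min rule places one initial center in each of the three groups, and a single K-means step then moves each center to the mean of its group. In the setting the abstract describes, a step like kmeans_step would be applied to the centers produced by each bee colony iteration.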
Keywords/Search Tags:text clustering, semantic similarity, density based clustering, bee colony algorithm