Research And Implementation Of Text Clustering Based On AP Algorithm

Posted on: 2015-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Wei
Full Text: PDF
GTID: 2308330482452428
Subject: Computer application technology
Abstract/Summary:
The Internet produces large volumes of text documents, videos and images every day, and text is becoming an increasingly important form of information. The rapid growth in the amount of text leads to serious redundancy and structural complexity, which makes it hard to find useful information in large collections and motivates text clustering techniques. Text clustering groups text documents into clusters. It has been applied in many fields, such as analyzing microblogs and clustering news, so designing a text clustering process with high accuracy is meaningful.

This thesis implements the full text clustering process on the Hadoop platform and chooses AP (Affinity Propagation) as the clustering algorithm. The AP algorithm has many advantages over other clustering algorithms. The core contributions of our approach are three points:
(1) The thesis implements the pre-processing stage of text clustering on Hadoop, which improves the efficiency of text clustering.
(2) The thesis formulates a partition rule for the participating documents, which combines the known TF-IDF (Term Frequency-Inverse Document Frequency) information with a thesaurus.
(3) The thesis implements the AP algorithm with the MapReduce parallel programming framework and optimizes the algorithm during the process.
The full text clustering process reduces network overhead and improves data processing capability and the efficiency of the AP algorithm.

This thesis finds that parallel text clustering is well suited to large text collections. The study of the AP algorithm shows that good optimization strategies can improve its efficiency. In the future, we will focus on further optimization strategies.
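For readers unfamiliar with the pipeline the abstract describes, the following is a minimal single-machine sketch of TF-IDF weighting followed by Affinity Propagation message passing. It is not the thesis's implementation: it does not use Hadoop or MapReduce, it omits the thesaurus-based partition rule, and all function names (tfidf_vectors, cosine, similarity_matrix, affinity_propagation) are hypothetical. The smoothed IDF formula, cosine similarity, the median preference and the damping value of 0.5 are assumptions chosen for illustration, since the abstract does not state them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF is an assumption; the abstract gives no formula.
        vec = {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_matrix(vectors, preference=None):
    """Pairwise cosine similarities with the preference on the diagonal."""
    n = len(vectors)
    s = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    if preference is None:
        # Assumed default: the (upper) median of off-diagonal similarities.
        off_diag = sorted(s[i][j] for i in range(n) for j in range(n) if i != j)
        preference = off_diag[len(off_diag) // 2] if off_diag else 0.0
    for i in range(n):
        s[i][i] = preference
    return s


def affinity_propagation(s, damping=0.5, iterations=200):
    """Run damped AP message passing; return each point's exemplar index."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) <- s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            for k in range(n):
                best_other = max((a[i][j] + s[i][j] for j in range(n) if j != k),
                                 default=0.0)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # a(k,k) <- sum of positive r(i',k) over i' != k
        # a(i,k) <- min(0, r(k,k) + sum of positive r(i',k) over i' not in {i,k})
        for k in range(n):
            total = sum(max(0.0, r[j][k]) for j in range(n) if j != k)
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, r[k][k] + total - max(0.0, r[i][k]))
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    # Each point picks the candidate exemplar k that maximises a(i,k) + r(i,k).
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]


if __name__ == "__main__":
    docs = [
        "hadoop mapreduce cluster job",
        "mapreduce job on hadoop cluster",
        "stock market news today",
        "market news and stock prices",
    ]
    labels = affinity_propagation(similarity_matrix(tfidf_vectors(docs)))
    print(labels)  # documents sharing a label share an exemplar
```

The sketch stores the full n-by-n similarity, responsibility and availability matrices in memory, which is the cost a distributed implementation would need to spread across workers. The number of clusters is not passed in; it follows from the preference value placed on the diagonal.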
Keywords/Search Tags:text clustering, AP algorithm, Hadoop, Big data, MapReduce