
Design And Implementation Of Distributed Text Clustering System Based On K-means

Posted on: 2019-07-02 | Degree: Master | Type: Thesis
Country: China | Candidate: C Y Ma | Full Text: PDF
GTID: 2428330572950340 | Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, information resources are increasing day by day, and extracting valuable information from massive data has become a research hot spot. Text is one of the most important carriers of information. As the volume of enterprise data keeps growing, it is hard for staff to find relevant information quickly and accurately. Topic extraction and automatic classification of text data can provide a basis for text information retrieval and personalized recommendation. Traditional manual annotation cannot keep up with the rapid growth of data and is labor-intensive. Processing large amounts of data efficiently is becoming an urgent problem for enterprises, so a well-designed distributed text clustering system is crucial.

This paper first analyzes the basic theory of text clustering, and then designs and implements a distributed text clustering system based on text clustering algorithms, a parallel programming model, and text clustering technology. In view of the growing volume of enterprise text data and the difficulty of putting it to effective use, the system performs feature extraction and automatic categorization of text data quickly and efficiently.

The system uses the Spring MVC framework, with JSP implementing the presentation layer. The control layer is realized by the Spring MVC front-end controller, DispatcherServlet. The business logic layer consists mainly of a data source transmission module, a text preprocessing module, a text clustering analysis module, a cluster result processing module, and others. The text preprocessing module implements word segmentation, stop-word filtering, feature extraction, and text vector space generation as parallel processes, transforming unstructured text data into structured text vectors. In the text clustering analysis module, the K-Means clustering algorithm is applied to distributed clustering analysis. To address the randomness of the initial points in K-Means, the Canopy algorithm is introduced to optimize their selection.

Finally, scalability, precision, and speedup experiments are carried out on the parallel text clustering algorithm. The experiments show that the algorithm is highly efficient after parallelization. Apache JMeter is then used to test the performance of the system; the results show that the response time and the number of concurrent users meet the system's non-functional requirements.

In summary, this paper designs and implements a distributed text clustering system based on the Hadoop platform and the Spring MVC framework. The system reduces the hardware requirements of text processing. When enterprise staff manage large text collections, the system can extract text topics and classify and manage text information without prior manual labeling, effectively reducing the labor cost of classifying and managing text data at large data volumes. It also provides a basis for subsequent enterprise text information retrieval and personalized recommendation, and is therefore worth further investigation.
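
The idea of using Canopy to choose the initial points for K-Means, as described in the abstract, can be illustrated with a minimal single-process sketch. This is not the thesis's Hadoop implementation: the function names, the Euclidean distance, the tight threshold `t2`, and the toy two-dimensional vectors are all assumptions made for illustration, and a real text pipeline would operate on high-dimensional feature vectors and distribute the work across a cluster.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2, seed=0):
    """Pick canopy centers to use as K-Means seeds.

    Each chosen center removes every remaining candidate within the tight
    threshold t2. The full Canopy algorithm also uses a looser threshold
    T1 > T2 for overlapping canopy membership; only centers are needed here.
    """
    rng = random.Random(seed)
    candidates = list(points)
    rng.shuffle(candidates)
    centers = []
    while candidates:
        center = candidates.pop()
        centers.append(center)
        candidates = [p for p in candidates if euclidean(p, center) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Lloyd's K-Means iteration starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy "document vectors": two well-separated groups (hypothetical data).
docs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = canopy_centers(docs, t2=3.0)
centers, labels = kmeans(docs, seeds)
print(len(seeds), labels)
```

Because the canopy pass determines both the number of clusters and their starting positions from the data, K-Means no longer depends on a random choice of initial points. The result still depends on the chosen threshold, which has to be tuned for the data set.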
Keywords/Search Tags:Data Mining, Text Clustering, K-Means, Parallelization, Cluster Analysis