The Research On Medical Data Parallel Clustering Algorithm Employing MapReduce

Posted on:2017-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:J B Xiao

Full Text:PDF

GTID:2428330488979903

Subject:Computer technology

Abstract/Summary:

When facing massive statistical data, the traditional k-means clustering algorithm requires a large number of distance calculations and a large amount of memory. The resulting time and space complexity is too high to meet the requirements of massive data analysis. Moreover, with the rapid development of Internet technology, the medical field has accumulated a large amount of data, and how to use data mining techniques to extract the information people are interested in, accurately and quickly, has become a research hotspot. The focus of this paper is therefore to extract information from massive medical data more quickly while preserving accuracy. It proposes an improved parallel k-means clustering algorithm, based on the MapReduce distributed parallel computing framework, to perform clustering analysis on medical data and mine the relationship between diseases and drugs. The work is as follows:

First, this paper analyzes the defects of the traditional k-means clustering algorithm, chiefly that each iteration requires a large number of redundant distance calculations. It therefore puts forward a simplified model for the distance calculation between non-center points and the cluster centers. To handle the extreme points that arise during clustering, it proposes calculating the distance between the extreme points and the centers with the Manhattan distance instead of the Euclidean distance, which reduces the distance computation between them.

Second, the choice of centers has a great influence on the k-means clustering results. In the first iteration, k records are therefore selected from a database as the centers, with each record representing exactly one disease. In the remaining iterations, new centers are chosen by averaging all points, which ensures the accuracy of the final clustering results.

Finally, following the improvement strategies proposed in the paper, an improved parallel k-means clustering algorithm based on MapReduce (IMR-KCA) is presented. Using an open source implementation of the MapReduce parallel computing framework, this paper implements and compares IMR-KCA, the k-means clustering algorithm in Mahout (Mahout-KCA), and the normal parallel k-means clustering algorithm based on MapReduce (MR-KCA). The experimental results show that IMR-KCA is more reliable, efficient, and scalable than the other similar algorithms.
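
As a rough illustration of how a k-means iteration maps onto the MapReduce model, the following minimal Python sketch runs the two phases in a single process. It is a plain k-means, not the IMR-KCA algorithm of the thesis: the simplified distance model and the extreme-point handling are not reproduced, and the function names (map_phase, reduce_phase, kmeans) are hypothetical. The distance function is a parameter, so the Euclidean and Manhattan distances mentioned in the abstract can be swapped for comparison.

```python
from collections import defaultdict
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (no squares or square root)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def map_phase(points, centers, distance=euclidean):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce step: the new center of each cluster is the mean of its points."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # an empty cluster keeps its old center
    for idx, members in groups.items():
        dim = len(members[0])
        new_centers[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centers


def kmeans(points, centers, distance=euclidean, max_iter=20):
    """Repeat map + reduce until the centers stop changing (assumed stop rule)."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centers, distance), centers)
        if updated == centers:
            break
        centers = updated
    return centers


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
    print(kmeans(data, [(0, 0), (10, 10)], distance=manhattan))
```

In a real Hadoop job the map step would run on each data split and the framework would group the emitted pairs by center index before the reduce step. Here both steps run in one process purely to show the data flow.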

Keywords/Search Tags:

clustering algorithms, k-means, MapReduce, medical data, redundant distance

Related items

1	Research On Mapreduce Based Big Data K-means Clustering Algorithm
2	Research On Accelerating Of K-means Clustering Algorithm Using FPGA Based On MapReduce
3	Research On Parallelization Of K-means Clustering Algorithm Based On MapReduce
4	Parallel Clustering Algorithm Based On MapReduce
5	Research On Parallelization Of Clustering Algorithm Based On Mapreduce
6	Research On Parallelization Of Clustering Algorithm Based On MapReduce
7	Improved K-means Clustering Algorithm Based On MapReduce Framework
8	Research, Design And Application Of Clustering Algorithm Using Mapreduce
9	Research Of K-means Clustering Algorithm Based On MapReduce
10	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform