
Research on a Parallel Clustering Algorithm for Medical Data Employing MapReduce

Posted on: 2017-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: J B Xiao
Full Text: PDF
GTID: 2428330488979903
Subject: Computer technology
Abstract/Summary:
When facing massive statistical data, the traditional k-means clustering algorithm requires a large number of distance calculations and a large amount of memory, so its time and space complexity are too high to meet the requirements of massive data analysis. Moreover, with the rapid development of Internet technology, the medical field has accumulated a large amount of data, and how to use data mining techniques to extract the information people are interested in, both accurately and quickly, has become a hotspot of current research. The focus of this paper is therefore to extract information from massive medical data more rapidly while preserving accuracy. This paper proposes an improved parallel k-means clustering algorithm, based on the MapReduce distributed parallel computing framework, to perform clustering analysis on medical data and mine the relationship between diseases and drugs. The work is detailed as follows.

First, this paper analyzes the defects of the traditional k-means clustering algorithm, the main one being that each iteration requires a large number of redundant distance calculations. It therefore puts forward a simplified model to simplify the distance calculation between non-center points and the cluster centers. For the extreme points that arise during clustering, it proposes calculating the distance between the extreme points and the centers with the Manhattan distance instead of the Euclidean distance, which reduces the distance computation between them.

Second, the choice of centers in the k-means clustering algorithm has a great influence on the clustering results. In the first iteration, k records are therefore selected from the database as the centers, with each record representing exactly one disease. In the remaining iterations, new centers are chosen by averaging the points assigned to each cluster, which ensures the accuracy of the final clustering results.

Finally, based on the improvement strategies proposed above, this paper provides an improved parallel k-means clustering algorithm based on MapReduce (IMR-KCA). Building on an open-source implementation of the MapReduce parallel computing framework, it implements and compares IMR-KCA, the k-means clustering algorithm in Mahout (Mahout-KCA), and the ordinary parallel k-means clustering algorithm based on MapReduce (MR-KCA). The experimental results show that IMR-KCA is more reliable, efficient, and scalable than the other, similar algorithms.
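As an illustration only, the general shape of one k-means iteration expressed as a map step and a reduce step can be sketched in Python as below. This is a single-process sketch under stated assumptions, not the thesis's IMR-KCA implementation: it does not run on Hadoop, it does not include the redundant-distance simplification or the extreme-point handling, and the function names (`map_phase`, `reduce_phase`, `kmeans`) are hypothetical. The distance function is passed as a parameter, so the Manhattan and Euclidean distances mentioned in the abstract can be swapped in.

```python
from collections import defaultdict
from math import sqrt


def euclidean(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Manhattan distance: summed absolute differences (no squares or roots)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def map_phase(points, centers, distance):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce step: the new center of each cluster is the mean of its points."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no assigned points is kept as-is
    for idx, members in groups.items():
        dim = len(members[0])
        new_centers[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centers


def kmeans(points, centers, distance=euclidean, max_iter=100):
    """Repeat map and reduce until the centers stop changing (hypothetical driver)."""
    centers = list(centers)
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers, distance), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    # Toy data, invented for illustration; not medical records from the thesis.
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)], distance=manhattan))
```

In a real MapReduce job the map step would run in parallel over splits of the data and the framework would group the emitted pairs by center index before the reduce step; here that grouping is done with an in-memory dictionary.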
Keywords/Search Tags:clustering algorithms, k-means, MapReduce, medical data, redundant distance