With the advancement of medical informatization, the volume of medical data is growing day by day. In this context, traditional association rule mining algorithms take too long to run on medical big data. The advent of cloud computing platforms provides an effective solution to this problem. In this paper, the Eclat (Equivalence Class Transformation) association rule algorithm is studied and optimized, the R-Eclat algorithm is proposed, R-Eclat is parallelized using the Spark cloud computing framework, and the parallel algorithm is applied to medical big data. The main work is as follows:

1. Study and optimization of the Eclat algorithm. As the transaction sets in a database grow, the time and space complexity of the algorithm becomes a problem. Using the Apriori property of association rules, an optimization of the join step of the Eclat algorithm is proposed that eliminates some duplicate or infrequent itemsets, yielding an improved algorithm, R-Eclat. The effectiveness of R-Eclat is verified by comparing it with the original algorithm on different types of public data sets. R-Eclat runs faster than the original algorithm, with an efficiency gain of up to 20%, and the optimization is more pronounced on sparse data sets than on dense ones.

2. Parallelization of the R-Eclat algorithm based on Spark RDD. To address the limitations of the algorithm in a serial environment, a parallelization scheme built on Spark RDD operators is proposed. The scheme adds a triangular accumulation matrix to the step in which R-Eclat mines frequent itemsets by intersection, which optimizes the filtering of candidate frequent itemsets. The parallelized R-Eclat algorithm is then implemented on a Spark cluster built for this purpose. Comparison with the Spark-based YAFIM algorithm, together with experiments that vary the number of computing nodes in the cluster, shows that R-Eclat is somewhat more efficient than YAFIM and scales well with the number of compute nodes in a Spark cluster environment.

3. Application of the parallelized R-Eclat algorithm to a diabetes data set. Because the algorithm uses a triangular matrix as its accumulator, the attribute items of the data set are mapped to corresponding item numbers in a lookup table. The data set is split into subsets of different sizes, and the parallel algorithm is compared with the algorithm in the serial environment. The experimental results show that the efficiency gain is more pronounced as the data scale grows, and the association rules mined indicate that the glycated hemoglobin test can be used to determine whether diabetic patients need to be readmitted to hospital.
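To make the ideas above concrete, the following is a minimal serial sketch in Python of Eclat-style mining on vertical transaction-id sets, with a triangular count matrix used to skip intersections for infrequent item pairs. It is an illustration under stated assumptions, not the R-Eclat or Spark implementation described here: the function names `pair_counts` and `eclat` are hypothetical, items are assumed to be already mapped to integers 0..n_items-1 (in the spirit of the item number table), and the exact pruning rule used by R-Eclat may differ.

```python
from itertools import combinations


def pair_counts(transactions, n_items):
    """Build an upper-triangular matrix of 2-itemset supports.

    counts[i][j - i - 1] holds the support of {i, j} for i < j.
    Assumption: items are integers in 0..n_items-1.
    """
    counts = [[0] * (n_items - i - 1) for i in range(n_items)]
    for t in transactions:
        for i, j in combinations(sorted(set(t)), 2):
            counts[i][j - i - 1] += 1
    return counts


def eclat(transactions, n_items, min_support):
    """Return {itemset tuple: support} for all frequent itemsets."""
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in set(t):
            tidsets.setdefault(item, set()).add(tid)

    counts = pair_counts(transactions, n_items)
    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs, sorted by item, each already
        # frequent when joined with prefix.
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            next_candidates = []
            for other, other_tids in candidates[idx + 1:]:
                # Apriori property: if the pair {item, other} is infrequent,
                # no superset can be frequent, so skip the intersection.
                if counts[item][other - item - 1] < min_support:
                    continue
                common = tids & other_tids
                if len(common) >= min_support:
                    next_candidates.append((other, common))
            if next_candidates:
                extend(itemset, next_candidates)

    roots = sorted(
        ((item, tids) for item, tids in tidsets.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), roots)
    return frequent
```

For example, `eclat([[0, 1, 2], [0, 1], [0, 2], [1, 2], [0, 1, 2]], 3, 3)` reports the three single items and the three pairs as frequent, but not the triple, whose support is only 2. In a Spark setting the pair counting would presumably be done with a distributed accumulator rather than a local list of lists; that part is not shown.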