Breast Cancer Risk Prediction Based On Apache Spark

Posted on: 2020-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Diao
GTID: 2404330590495659
Subject: Electronic and communication engineering
Abstract/Summary:
Over the past decades, the IT industry, computer technology, and artificial intelligence have developed rapidly, and in recent years the Internet of Things has expanded quickly as well. Together these trends have produced continuous growth in the volume of information. In the medical field in particular, the explosive growth of medical data has created very large medical databases with potential practical value. As big data analysis technology, represented by deep learning, continues to develop and mature, there are signs that it is becoming deeply integrated with health care. Based on the Apache Spark big data platform, this thesis studies disease prediction for breast cancer and explores the application of big data analysis to breast cancer prediction.

First, we apply basic statistical and data mining techniques, namely propensity score matching, the chi-square test, Kaplan-Meier (KM) survival analysis, and Cox regression, to analyze clinical data. Survival curves were obtained by grouping patients by age and by whether surgery was performed. Age was found not to be a main factor in survival time (in months) and had little effect. The surgical approach played the leading role: patients whose primary and metastatic lesions were removed simultaneously had the longest survival.

Second, this thesis builds a model that predicts whether a patient is negative or positive, using the Spark platform and the random forest algorithm. Experiments show that, among the nuclear parameters of breast cancer cells, Perimeter, Texture, and Concave points have a greater impact on the pathogenesis of breast cancer and are more likely to lead to a positive result. The prediction accuracy of this model reaches 99.76%, which indicates that the method is accurate and reliable and has practical application value. The results provide a reference for breast cancer risk detection. A corresponding negative/positive prediction model based on the SVM algorithm achieves a prediction accuracy of 87.8%. Comparing the two shows that the random forest algorithm outperforms the SVM algorithm.

Finally, this thesis builds risk models based on the support vector machine (SVM) and random forest algorithms and compares the advantages and disadvantages of the two. The experimental results show that the SVM-based model reaches a prediction accuracy of 74.6%, and the random-forest-based model reaches 75.5%. The area under the prediction curve is 0.796 for random forest, larger than the 0.615 for SVM, so the random forest algorithm is more valuable in practical application.
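
The comparison above rests on two metrics: prediction accuracy and the area under a prediction curve. As a minimal sketch, assuming the curve in question is the ROC curve (the abstract does not say so explicitly), both can be computed in plain Python as follows. The function names, labels, and scores are hypothetical and invented purely for illustration; they are not the thesis's code or data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the pairwise (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption: 1 = positive, 0 = negative); not from the thesis.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]  # hypothetical model A
scores_b = [0.6, 0.3, 0.4, 0.5, 0.7, 0.2]  # hypothetical model B

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)

# Accuracy needs hard labels, so threshold the scores at 0.5.
preds_a = [1 if s >= 0.5 else 0 for s in scores_a]
acc_a = accuracy(labels, preds_a)

print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}, accuracy A = {acc_a:.3f}")
```

Accuracy depends on a single decision threshold, while the area under the curve summarizes ranking quality across all thresholds. This is why two models with similar accuracy, such as the 74.6% and 75.5% reported above, can still differ clearly in area under the curve.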
Keywords/Search Tags: Apache Spark, Random Forest, Support Vector Machine, Disease Prediction, Machine Learning, Big Data Analysis