Breast Cancer Risk Prediction Based On Apache Spark

Posted on:2020-05-22

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Diao

Full Text:PDF

GTID:2404330590495659

Subject:Electronic and communication engineering

Abstract/Summary:

Over the past decades, the IT industry, computer technology, and artificial intelligence have developed rapidly, and the Internet of Things has grown quickly in recent years. Together these have driven continuous growth in the volume of information. In the medical field in particular, the explosive growth of medical data has produced huge medical databases with potential practical value. As big data analysis technology, represented by deep learning, continues to develop and mature, there are signs that it has begun to integrate deeply with health care. Based on the Spark big data platform, this thesis carries out disease prediction research on breast cancer and explores the application of big data analysis to breast cancer prediction. First, we use basic data mining techniques, namely propensity score matching, the chi-square test, Kaplan-Meier (KM) survival analysis, and Cox regression, to analyze clinical data. Survival curves were obtained by grouping patients by age and by whether surgery was performed. Age was found to have little effect on the number of survival months and was not the main factor. The surgical method plays the leading role: patients whose primary and metastatic lesions were removed simultaneously had the longest survival. Second, this thesis builds a model that predicts whether a patient is negative or positive, using the Spark big data platform and the random forest algorithm. Experiments show that, among the nuclear parameters of breast cancer cells, Perimeter, Texture, and Concave points have a greater impact on the pathogenesis of breast cancer and are more likely to lead to a positive result. The prediction accuracy of this model reaches 99.76%, so the method is accurate and reliable and has practical application value. The experimental results provide a reference for breast cancer risk detection. Then, this thesis builds a model for predicting negative or positive patients based on the support vector machine (SVM) algorithm, which achieves 87.8% prediction accuracy. Comparing the prediction accuracy of the two algorithms shows that random forest outperforms SVM. Finally, this thesis builds risk prediction models based on SVM and on random forest, and compares the advantages and disadvantages of the two algorithms. The experimental results show that the SVM-based model reaches a prediction accuracy of 74.6% and the random-forest-based model reaches 75.5%. In addition, the area under the prediction curve (AUC) is 0.796 for random forest, larger than the 0.615 of SVM, so the random forest algorithm is more valuable in practical application.
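The abstract compares the two classifiers by prediction accuracy and by area under the curve. The sketch below shows how those two metrics can be computed from predicted scores, assuming the curve is a ROC curve and that labels are coded 0 (negative) and 1 (positive). The function names, the 0.5 threshold, and the toy scores are hypothetical illustrations; they are not the thesis's data, code, or Spark pipeline.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    if len(labels) != len(scores) or len(labels) == 0:
        raise ValueError("labels and scores must be non-empty and equal length")
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the Mann-Whitney formulation.

    Equals the probability that a randomly chosen positive sample gets a
    higher score than a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three positive and three negative patients,
# scored by two made-up models.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
model_b = [0.7, 0.3, 0.6, 0.8, 0.4, 0.2]

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(f"{name}: accuracy={accuracy(labels, scores):.3f} "
          f"auc={roc_auc(labels, scores):.3f}")
```

On this toy data both models classify 4 of 6 samples correctly at the 0.5 threshold, yet their AUC values differ (8/9 versus 5/9). This is one reason to report AUC alongside accuracy when two accuracies are close, as with the 74.6% and 75.5% figures above.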

Keywords/Search Tags:

Apache Spark, Random forest, Support Vector Machine, Disease prediction, machine learning, Big Data Analysis

Related items

1	Analysis Of Cancer Gene Data Base On Random Forest And Support Vector Machine
2	Epileptic Seizure Prediction Algorithm Based On EEG Signals
3	Research Of Prediction For Brucellosis Based On Machine Learning Method
4	Research On Risk Prediction Of Diabetes Based On Random Forest And Support Vector
5	Analysis Of The Effect Of Various Machine Learning Algorithms In Predicting Hospitalization Cost Of Respiratory Diseases
6	Research And Implementation Of Facial Skin Quality Evaluation Method Based On Machine Learning
7	Risk Prediction Of Liver Cirrhosis Complicated With Hepatic Encephalopathy Based On Cost-sensitive Random Forest And Support Vector Machine
8	The Research On The Prediction Of Cardiovascular Diseases Based On Random Forest And Support Vector Machine
9	Theoretical Prediction Of Drug Toxicity Based On Machine Learning Approaches
10	Research On The Image Classification Of Brain Glioma Based On Machine Learning