Using K-Mean And SVM To Build Hybrid Methodology To Classify Diseases

Posted on: 2018-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: AL-MUREISH NEZAR MOHAMMED GALI
GTID: 2428330545950589
Subject: Computer Science and Technology
Abstract/Summary:
With large amounts of data stored in files and databases, such as scientific knowledge, financial data, marketing data, demographic data, and medical data, it is increasingly important to develop powerful means for the analysis, interpretation, and extraction of interesting knowledge that can help in decision-making. Data mining provides theories, techniques, and tools for processing huge volumes of data. Medical data is a notable application area for data mining because it is large, high-dimensional, and noisy, and it raises data-provenance issues. The goal of data mining here is to obtain the best possible medical results from the medical data profile to help physicians and researchers.

In this thesis, a hybrid approach is proposed to classify three medical diseases: breast cancer, lung cancer, and heart disease. High-dimensional medical data increase the computational cost, which makes implementing a model or pattern classifier very difficult and sometimes impossible. Our approach, called Clustering Support Vector Machines (CSVM), uses clustering (K-means) and statistical filtering (ANOVA test) as preprocessing steps and the Support Vector Machines algorithm to classify diseases from medical data. In the clustering step, we grouped the data into three clusters to make the data more consistent and so get the best result from the proposed methodology. Clustering, in which similar values are organized into clusters, also let us separate outlier values from the other values. After clustering the similar features together, we used the ANOVA test to select only the significant features, with α (p-value) = 0.01.

In the classification step, the goal is to classify patients as having a specific disease or not. We used a Gaussian kernel function, with a setting of 100, to build the SVM classifier. We chose the nonlinear Gaussian radial basis function because it resembles the sigmoid kernel for certain parameters and requires fewer parameters than a polynomial kernel. The kernel function parameters, which control the trade-off between the complexity of the decision function and the minimization of the training error, can be determined by running a two-dimensional grid search, in which parameter values are generated over a predefined interval with a fixed step.

Finally, we compared the performance of our methodology, which applies clustering and the ANOVA test as preprocessing steps with the SVM classifier as a single method, with three classification algorithms: decision tree (ID3), naïve Bayes, and support vector machine. CSVM obtained the highest accuracy, 99% on heart disease, compared with the three other classifiers (decision tree (C4.5), naïve Bayes, and support vector machine), and it was also the least time-consuming in the testing process.
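
The two-dimensional grid search over the kernel parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not name the two parameters or their ranges, so the names `C` and `gamma`, the exponent intervals, the step of 2, and the stand-in `toy_score` function are all hypothetical. In practice the score would be something like the cross-validated accuracy of an SVM trained with the given parameters.

```python
import math
from itertools import product


def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def fixed_step_range(start, stop, step):
    """Values from start to stop (inclusive), spaced by a fixed step."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def grid_search(score, c_values, gamma_values):
    """Evaluate score(C, gamma) at every grid point; return (best_score, C, gamma)."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best


def toy_score(c, gamma):
    """Hypothetical stand-in for validation accuracy; peaks at C=2^4, gamma=2^-4."""
    return -(math.log2(c) - 4) ** 2 - (math.log2(gamma) + 4) ** 2


# Assumed exponent grids: C = 2^k for k in [-2, 8], gamma = 2^m for m in [-8, 0].
c_values = [2.0 ** k for k in fixed_step_range(-2, 8, 2)]
gamma_values = [2.0 ** m for m in fixed_step_range(-8, 0, 2)]

best_score, best_c, best_gamma = grid_search(toy_score, c_values, gamma_values)
```

Searching over exponents rather than raw values is a common choice because useful settings for such parameters often span several orders of magnitude; a linear interval with a fixed step would work with the same `grid_search` function.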
Keywords/Search Tags: Data Mining, Clustering Support Vector Machines (CSVM), Medical diseases, Clustering (K-means), ANOVA test, decision tree (ID3), naïve Bayes, support vector machine (SVM)