
A Research Of Patients’ Survivability Prediction With Multiple Primary Cancers Based On Feature Selection Techniques

Posted on: 2023-01-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Liu
Full Text: PDF
GTID: 1524307061452464
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Cancer is a serious disease with a very high fatality rate. Survival prediction models built with data mining can provide theoretical support for clinical treatment planning. To address the high dimensionality, small sample size, and class skew of multiple primary cancer (MPC) data, this dissertation designs five feature dimensionality reduction models, each suited to a different type of data.

First, for the high-dimensional, small-sample, class-skewed second primary lung cancer (SPLC) data, a new feature selection algorithm, KPFS, is proposed, based on an improved K-nearest neighbors (KNN) model and principal component analysis (PCA). The KNN model is improved by adding a class weight factor, which raises the quality of the training data. Feature selection is then completed with PCA, which mitigates class skew at the data processing level.

Second, because KPFS is sensitive to disturbance from singular data, an improved feature dimensionality reduction model, EBFS, based on scatter difference is proposed; it resolves the disturbance that a singular matrix causes. Information mining based on intra-class and inter-class sample data is transformed into information mining based on overall and inter-class data, which reduces the interference of large classes with small classes during information mining.

Third, because KPFS and EBFS offer no theoretical support for the contribution degree of features, an improved feature selection model, IECFS, based on an undirected graph is proposed. Its main advantage concerns skewed training samples: the original ECFS data processing, based on the scalar product between vectors and on the standard deviation, is replaced with vector-based processing that uses the intra-class and inter-class dispersion indices. Matrix derivation then transforms the correlation between features into the importance of a single feature, which provides theoretical support for computing feature confidence. The IECFS model is suitable for SPLC survival prediction.

Fourth, to address the skewness of MPC data, the data is adjusted from the perspective of data resampling, and an improved model, ICHI², based on the CHI² model is proposed. In addition to the frequency of feature items in different categories, the model adds information on the positive and negative correlation between features and classes, and assigns weights accordingly to improve the feature selection ability of the CHI² model. The new method reduces the original CHI² model's over-reliance on low-frequency feature items and lessens the influence of class skew and high-dimensional features on the classifier.

Fifth, because the feature subset produced by the ICHI² model is still high-dimensional, an improved model, IinfFS, based on the InfFS method is proposed. Besides mining the category information of the training set, IinfFS also considers the class distribution of the data and the effect of positive and negative correlation between features on classification. By introducing the concept of scatter, the scatter and the correlation between features are weighted to measure the correlation coefficient between features, and the adjacency matrix is constructed from these coefficients. Using inter-class scatter quantifies the correlation between features more accurately. The algorithm can also be converted into an unsupervised feature selection algorithm, UinfFS, to improve the efficiency of survival time prediction and provide more information for auxiliary diagnosis.

Finally, from the perspective of learners, an ensemble algorithm of multiple learners is proposed. The learners are combined by soft voting: the weight of each base learner is first determined in a loop, and the learners' output probabilities are then integrated to obtain the final fused result. The method is validated on SPLC, MPC, and gastric cancer datasets. The results show that it improves both survival rate prediction and survival time prediction, and that it is a generalizable learner ensemble algorithm.
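
The soft-voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the weight loop works, so everything below is an assumption for illustration: the function names (`soft_vote`, `predict`, `search_weights`), the small integer weight grid, and the use of validation accuracy as the selection criterion are hypothetical and are not taken from the dissertation.

```python
from itertools import product


def soft_vote(probas, weights):
    """Fuse per-learner class probabilities by a weighted average.

    probas:  list over learners; each item is a list of rows, one row of
             class probabilities per sample.
    weights: one non-negative weight per learner (not all zero).
    """
    total = sum(weights)
    n_samples = len(probas[0])
    n_classes = len(probas[0][0])
    fused = []
    for i in range(n_samples):
        row = [
            sum(w * p[i][c] for w, p in zip(weights, probas)) / total
            for c in range(n_classes)
        ]
        fused.append(row)
    return fused


def predict(fused):
    """Return the index of the most probable class for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


def search_weights(probas, y_true, grid=(0, 1, 2, 3)):
    """Loop over candidate weight tuples and keep the most accurate one.

    Assumption: an exhaustive loop over a small grid, scored by accuracy
    on held-out labels. The dissertation's actual loop may differ.
    """
    best_w, best_acc = None, -1.0
    for w in product(grid, repeat=len(probas)):
        if sum(w) == 0:  # skip the all-zero combination
            continue
        pred = predict(soft_vote(probas, w))
        acc = sum(p == y for p, y in zip(pred, y_true)) / len(y_true)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

In use, `probas` would hold the predicted class probabilities of each base learner on a validation set; `search_weights` returns one weight per learner, and `predict(soft_vote(test_probas, weights))` gives the fused class labels on new data.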
Keywords/Search Tags: Data Mining, Feature Selection, Feature Extraction, Skewed Data, High-dimensional and Low-shot Problem, Multiple Primary Cancers