| Background: Goodness of fit describes how well a statistical model fits the observed data. For logistic regression, the Hosmer-Lemeshow goodness-of-fit test compares the number of events expected under the fitted model to the number of observed events within deciles of predicted probabilities. This research evaluates two translations of the Hosmer-Lemeshow goodness-of-fit test from logistic regression to the Cox proportional hazards model: the Cook-Ridker (CR) and the D'Agostino-Nam (DAN) tests. These translations are compared to a test designed specifically for survival data, the Grønnesby and Borgan (GB) test. The GB test uses martingale residuals to compare the count of events to the semi-parametric estimates from the Cox proportional hazards model on a cumulative hazards scale. In contrast, the CR and DAN translations compare the non-parametric Kaplan-Meier estimate and the semi-parametric Cox proportional hazards estimate of survival at a fixed time. Methods: The sizes of these tests are investigated by simulating survival data and varying the baseline hazard function (exponential, Weibull and log-logistic), effect size, percentage of censoring, sample size, number of groups, and the choice of fixed time point (for the CR and DAN tests). Results: The sizes of the CR and DAN tests are near the nominal level in very few of the simulated scenarios. For most scenarios, the CR and DAN tests have a size that is either much higher or much lower than the nominal level. However, when half the maximum simulated time is used as the fixed time point, the sizes of the CR and DAN tests are at or closer to the nominal level in more scenarios than when the maximum time point is used. In addition, numerical issues can occur when the estimated survival probability is zero and when the estimated expected number of events is either close to zero or close to one. 
These results also expand on previous simulation studies showing that the size of the Grønnesby and Borgan test is notably above 0.05 for larger sample sizes (1000 or more). Conclusions: Although the CR and DAN translations of the Hosmer-Lemeshow goodness-of-fit test to Cox proportional hazards regression are conceptually intuitive, they appear to have an incorrect size, and numerical issues can occur. The Grønnesby and Borgan test should be used instead, since it has a more appropriate size when used with the correct number of groups. |