Example: Two-armed RCT

Broglio et al. (2014) presented an example from a hypothetical trial. We will use a similar setup for this example, and break up the pieces to make clear the argument choices for the package.

The setting is a two-arm randomized trial where patients are equally randomized to either a control or a treatment arm. The primary endpoint is overall survival (OS) measured from the date of randomization to the date of death from any cause or last follow-up. The expected OS rate at 12-months for the control arm is 30%. The minimum sample size is 100 and the maximum sample size is 300. For simplicity, it is assumed that there is no attrition. The maximum follow-up period for each subject is 12-months. (Note: this is slightly different from Broglio et al. (2014), as they assumed a follow-up period of 12-months on all subjects after accrual is complete.) Thus, if accrual is stopped early for predicted success or the trial continues accrual to the maximum sample size of 300 patients, the primary analysis of OS will be conducted after each subject has completed 12-months of follow-up.

From this information, we have:

  • Equal randomization: block = 2 and rand_ratio = c(1, 1) (default parameters)
  • Primary endpoint is at 12-months: end_of_study = 12
  • 12-month event rate for control arm: hazard_control = prop_to_haz(1 - 0.30, endtime = 12) (note that the input argument is the failure proportion, not the survival proportion)
  • No change points in hazard: cutpoint = 0 (default parameter)
  • Maximum sample size: N_total = 300
  • No attrition: prop_loss = 0

Sample size selection analyses are planned starting when 100 patients are enrolled and after every additional 25 patients are enrolled. Early stopping for futility is allowed starting with the 100 patient sample size selection analysis and Fn is 10%. Stopping accrual early for predicted success is only allowed starting with the 200 patient sample size selection analysis and Sn is 90%. It is expected that an average of 5 patients per month will be enrolled, with no change in speed for the duration of the trial.

From this information, we have:

  • Interim sample size looks: interim_look = seq(100, 275, 25)
  • Futility probability thresholds: Fn = rep(0.10, 8)
  • Predicted success probability thresholds: Sn = c(1, rep(0.9, 7)).
  • lambda = 5 and lambda_time = NULL (default parameter)

Note that the first value of Sn is 1. This is because the trial is not allowed to stop for predicted success at the first interim analysis of n = 100. The remaining elements of Sn are 0.9, corresponding to 90%.

The primary analysis is a two-sided log-rank test, with success declared at the α = 0.05 level.

From this information, we have:

  • Two-sided log-rank test used: alternative = "two.sided" and method = "logrank"
  • α = 0.05 level used to declare success: prob_ha = 0.95

Note that prob_ha is set as 1 − 0.05. This allows us to interchange between test, including Bayesian tests (method = "bayes"), which requires an analogous posterior probability threshold. The parameter h0 is ignored when using a log-rank test, as it is not meaningful to have success margins.

The operating characteristics will be determined using 500 simulated trials. At each interim analysis, we will use 100 imputations and assume independent weakly-informative Gamma(0.1, 0.1) prior distributions for the treatment and control arm event time hazard rate parameters. As this is computationally expensive overall, we will exploit the option to parallelize the simulations over multiple cores.

  • Number of simulated trials: N_trials = 500
  • Number of imputations from predictive distribution: N_impute = 100
  • Independent prior distribution for each hazard rate parameter: prior = c(0.1, 0.1)
  • Parallel computation: ncores = 8 (note: this is not currently possible on Windows machines)

Similar to above, the parameter N_mcmc is not required when using a log-rank test, meaning we do not need to enter a value for this argument. Since we do not allow for attrition, the data at the final analysis will be complete, and we can set imputed_final = FALSE. If attrition occurred and we planned to impute the final analysis dataset, we could change this to imputed_final = TRUE.

Initially, we want to determine the power to detect a significant treatment effect when the OS rate at 12-months for the treatment arm is 50%.

library(goldilocks)
#> Loading required package: survival
hc <- prop_to_haz(0.7, endtime = 12)
ht <- prop_to_haz(0.5, endtime = 12)

out_power <- sim_trials(
  hazard_treatment = ht,
  hazard_control = hc,
  cutpoint = 0,
  N_total = 300,
  lambda = 5,
  lambda_time = 0,
  interim_look = seq(100, 275, 25),
  end_of_study = 12,
  prior = c(0.1, 0.1),
  block = 2,
  rand_ratio = c(1, 1),
  prop_loss = 0,
  alternative = "two.sided",
  Fn = rep(0.10, 8),
  Sn = c(1, rep(0.9, 7)),
  prob_ha = 0.95,
  N_impute = 100,
  N_trials = 500,
  method = "logrank",
  ncores = 8)

The simulations take approximately 3 minutes to run on 2 GHz Quad-Core Intel i5 MacBook Pro.

It is straightforward to calculate the type I error under this design. The only change required is to set the hazard_treatment argument to the same as the hazard_control argument (i.e. the null case). We can make use of the update() function to avoid having to type everything else over again.

out_t1error <- update(out_power, hazard_treatment = hc)
summarise_sims(list(out_power$sims, out_t1error$sims))
#> # A tibble: 2 × 8
#>   scenario power stop_success stop_futility stop_max_N mean_N  sd_N
#>      <int> <dbl>        <dbl>         <dbl>      <dbl>  <dbl> <dbl>
#> 1        1 0.934        0.916         0.04      0.0440   171.  53.3
#> 2        2 0.068        0.056         0.846     0.098    210.  47.6
#> # ℹ 1 more variable: stop_and_fail <dbl>

The type I error under this design is slightly too large to be considered acceptable. This was to be expected, since we kept the P-value threshold as 0.05 despite having multiple interim looks. However, we note that only simulated N_trials = 500 trials, meaning if the type I error was truly 0.05, then values in the interval (0.05 + c(-1, 1) * 1.96 * sqrt(0.05 * (1 - 0.05) / 500)) would be consistent with this.

In practice, we need to use a more stringent threshold in order to control the overall type I error. This can be achieved by trial and error. For example, if we use P < 0.04 (applied using the argument prob_ha = 0.96), we find the operating characteristics are more acceptable.

out_power2 <- update(out_power, prob_ha = 0.96)
out_t1error2 <- update(out_power2, hazard_treatment = hc)
summarise_sims(list(out_power2$sims, out_t1error2$sims))
#> # A tibble: 2 × 8
#>   scenario power stop_success stop_futility stop_max_N mean_N  sd_N
#>      <int> <dbl>        <dbl>         <dbl>      <dbl>  <dbl> <dbl>
#> 1        1 0.92         0.89          0.044      0.066   175.  56.2
#> 2        2 0.048        0.042         0.868      0.09    203.  46.8
#> # ℹ 1 more variable: stop_and_fail <dbl>

Assuming the treatment arm has an OS rate of 50% at 12-months, the trial would be expected to stop early 89% of time, with an average sample size of 175. Overall, the power is 92%. Conversely, if the treatment arm OS rate is the same as the control arm, 87% of the trials stopped early for expected futility.

Once we have identified a suitable design, we would typically re-run the simulations using a larger number of simulations and, perhaps, imputations.

References

Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685–705.