The goldilocks package implements the Goldilocks
adaptive sample-size design of Broglio, Connor, and Berry (2014) for
time-to-event endpoints. This vignette outlines the technical details of
the design, including notation, the piecewise-exponential event-time
model, Gamma posterior updating, posterior predictive
probabilities, interim decision rules, final analysis options,
and simulation-based calibration. The package vignettes “Two-arm
randomized trials”, “Bayesian piecewise-exponential designs”, and
“Single-arm designs with a performance goal” provide more
application-focused examples in R.
Consider a trial with maximum sample size \(N_{\max}\), planned endpoint time \(\tau\), and interim sample-size selection analyses after
\[n_1 < n_2 < \cdots < n_L < N_{\max}\]
subjects have been enrolled. The package argument
N_total corresponds to \(N_{\max}\), end_of_study to
\(\tau\), and interim_look
to \((n_1,\ldots,n_L)\).
Let \(A_i \in \{0,1\}\) denote treatment assignment for subject \(i\), with \(A_i = 1\) for the experimental arm and \(A_i = 0\) for control. In a single-arm design, all \(A_i = 1\) and there is no concurrent control. The package assumes that randomization occurs at enrollment; in practice those times can differ, but the distinction is not represented in the simulation model. We therefore use enrollment time throughout. Let \(E_i\) denote enrollment time from first patient in, \(T_i^*\) the true event time from enrollment, \(C_i\) the administrative or loss-to-follow-up censoring time, and
\[T_i = \min(T_i^*, C_i), \qquad \delta_i = I(T_i^* \le C_i).\]
At the interim analysis held when \(n_\ell\) subjects have been enrolled, the analysis calendar time is \(E_{n_\ell}\). Subject \(i\)’s current observed follow-up is
\[u_{i\ell} = \max\{0, E_{n_\ell} - E_i\}.\]
For enrolled subjects, the interim data are masked to the information available at this calendar time. If an event has not yet occurred by \(u_{i\ell}\), the subject is treated as censored at \(u_{i\ell}\). Subjects with \(i > n_\ell\) are not yet enrolled and contribute only to the maximum-sample-size prediction.
At each interim analysis the design estimates two predictive probabilities:
These are compared with thresholds \(S_\ell\) and \(F_\ell\), corresponding to the package
arguments Sn and Fn.
The time-to-event model is piecewise exponential. Let
\[0 = s_0 < s_1 < \cdots < s_{J-1}\]
be the vector of cut-points supplied through cutpoints.
These define intervals
\[[s_0, s_1), [s_1, s_2), \ldots, [s_{J-1}, \infty).\]
For arm \(a \in \{0,1\}\), where \(a = 1\) denotes the experimental arm and \(a = 0\) denotes the control arm, interval \(j\) has constant hazard \(\lambda_{aj}\). The subject-level hazard is therefore
\[h_a(t) = \lambda_{aj}, \qquad s_{j-1} \le t < s_j,\]
where the last interval has no finite upper endpoint. With
cutpoints = 0, \(J = 1\)
and the model reduces to an ordinary exponential model.
The cumulative hazard for arm \(a\) at time \(t\) is
\[H_a(t) = \sum_{j=1}^{J} \lambda_{aj} \{ \min(t, s_j) - s_{j-1} \}_+,\]
where \(s_J = \infty\) and \(\{x\}_+ = \max(x,0)\). The corresponding survival and cumulative event probability are
\[S_a(t) = \exp\{-H_a(t)\}, \qquad p_a(t) = 1 - S_a(t).\]
The helper prop_to_haz() solves the inverse problem used
in simulation planning: given event probabilities at one or more time
points, it returns the piecewise hazards that imply those probabilities.
The helper ppwe() evaluates \(p_a(\tau)\), and haz_to_prop()
applies this transformation to posterior hazard draws.
For observed follow-up \((T_i, \delta_i, A_i)\), the piecewise-exponential likelihood can be written in terms of interval-specific event counts and exposure times. Define
\[d_{aj} = \sum_i I(A_i = a)\delta_i I(T_i \in [s_{j-1}, s_j)),\]
and
\[y_{aj} = \sum_i I(A_i = a) \{ \min(T_i, s_j) - s_{j-1} \}_+ I(T_i > s_{j-1}).\]
Up to factors not involving \(\lambda_{aj}\), the likelihood contribution for arm \(a\) is
\[L_a(\boldsymbol{\lambda}_a; \mathcal{D}) \propto \prod_{j=1}^{J} \lambda_{aj}^{d_{aj}} \exp(-\lambda_{aj}y_{aj}),\]
where \(\boldsymbol{\lambda}_a = (\lambda_{a1},\ldots,\lambda_{aJ})^\top\). This sufficient-statistic form is what makes the Gamma posterior update available in closed form.
For each arm \(a\) and interval
\(j\), goldilocks assumes
an independent Gamma prior
\[\lambda_{aj} \sim \operatorname{Gamma}(\alpha_0, \beta_0),\]
where \(\alpha_0\) is the shape and
\(\beta_0\) is the rate. This follows
the stats::rgamma() parameterization. The argument
prior = c(alpha0, beta0) sets these two hyperparameters,
with default prior = c(0.1, 0.1). The same prior is applied
to every arm and every piecewise interval. Separate priors by arm or
interval cannot currently be specified through the package
interface.
At an analysis, let \(d_{aj}\) be the number of observed events in arm \(a\), interval \(j\), and let \(y_{aj}\) be the total observed exposure time in that arm and interval. Gamma-exponential conjugacy gives
\[\lambda_{aj} \mid \mathcal{D} \sim \operatorname{Gamma}(\alpha_0 + d_{aj}, \beta_0 + y_{aj}).\]
The package obtains \((d_{aj}, y_{aj})\) by splitting each subject’s observed follow-up over the cut-point intervals. Posterior draws are generated independently for each arm and interval. In early interim analyses, later piecewise intervals may have no exposure. In that case, the package propagates exposure and event counts from the nearest non-empty interval within the same arm and emits a warning. This prevents undefined posterior draws, but it should be interpreted as a computational fallback rather than evidence about that later interval.
The posterior density factorizes as
\[\pi(\boldsymbol{\lambda} \mid \mathcal{D}) = \prod_a \prod_{j=1}^{J} \pi(\lambda_{aj} \mid d_{aj}, y_{aj}),\]
where each marginal factor is the Gamma distribution above. Posterior predictive calculations integrate over this density rather than conditioning on a single plug-in hazard estimate.
At an interim analysis, subjects can be separated into three sets:
The first set contributes observed events and exposure to the posterior. The second and third sets require prediction.
For an enrolled subject who is event-free through time \(u\), a future event time is drawn from the conditional piecewise-exponential distribution
\[\Pr(T \le t \mid T > u) = \frac{F(t) - F(u)}{1 - F(u)}, \qquad t > u,\]
where \(F(t) = 1 - S(t)\). Equivalently, if \(U \sim \operatorname{Uniform}(0,1)\), then
\[T = F^{-1}\{F(u) + U[1 - F(u)]\}.\]
This is implemented by pwe_impute(). For future subjects
in the maximum-sample-size calculation, pwe_sim() draws
unconditional event times from the same piecewise-exponential model. In
the package implementation, these future event-time imputations do not
require explicitly simulating future enrollment times at the interim
look; the number of future subjects is determined by \(N_{\max} - n_\ell\).
Let \(\mathcal{D}_{\ell}^{\mathrm{obs}}\) denote the data observed at look \(\ell\), and let \(\mathcal{D}^{\mathrm{mis}}\) denote unobserved event times and future censoring indicators. The posterior predictive density is
\[p(\mathcal{D}^{\mathrm{mis}} \mid \mathcal{D}_{\ell}^{\mathrm{obs}}) = \int p(\mathcal{D}^{\mathrm{mis}} \mid \boldsymbol{\lambda}) \pi(\boldsymbol{\lambda} \mid \mathcal{D}_{\ell}^{\mathrm{obs}}) d\boldsymbol{\lambda}.\]
This density is the mathematical target, not an object that the
package evaluates in closed form. In the implementation,
posterior() draws \(\boldsymbol{\lambda}\),
impute_data() draws completed outcomes conditional on those
hazards, and test_stop_success() repeats the
impute-and-analyze cycle to approximate the predictive probability by
Monte Carlo simulation.
At interim look \(\ell\), the package first estimates the posterior of the hazard parameters from the currently observable data. It then uses Monte Carlo integration to approximate \(P_{n_\ell}\) and \(P_{\max,\ell}\).
Let \(\psi(\mathcal{D})\) be the final success indicator for a completed analysis dataset:
\[\psi(\mathcal{D}) = I\{Q(\mathcal{D}) > c\},\]
where \(Q(\mathcal{D})\) is the
final analysis quantity and \(c\) is
the success threshold. In package notation, \(c =\) prob_ha. The final
analysis quantity \(Q(\mathcal{D})\)
depends on the analysis method:
| Design setting | method |
\(Q(\mathcal{D})\) | Supported alternatives |
|---|---|---|---|
| Two-arm randomized trial | logrank |
\(1-p(\mathcal{D})\), where \(p(\mathcal{D})\) is the traditional log-rank test P-value, with one-sided variants defined in Section 6.1 | "less", "greater",
"two.sided" |
| Two-arm randomized trial | cox |
\(1-p(\mathcal{D})\), where \(p(\mathcal{D})\) is the traditional Wald-test P-value, with one-sided variants defined in Section 6.1 | "less", "greater",
"two.sided" |
| Two-arm randomized trial | chisq |
\(1-p(\mathcal{D})\), where \(p(\mathcal{D})\) is the traditional chi-square test P-value | "two.sided" |
| Two-arm randomized trial | bayes |
\(\Pr(\Delta < h_0 \mid \mathcal{D})\) or \(\Pr(\Delta > h_0 \mid \mathcal{D})\) | "less", "greater" |
| Single-arm trial | bayes |
\(\Pr(p_1(\tau) < h_0 \mid \mathcal{D})\) or \(\Pr(p_1(\tau) > h_0 \mid \mathcal{D})\) | "less", "greater" |
For frequentist analyses, prob_ha is therefore a
transformed P-value threshold. For example, prob_ha = 0.975
corresponds to a one-sided \(\alpha =
0.025\) rule. The value should be chosen during design
calibration to control the desired type I error rate across relevant
null scenarios.
The current-sample-size predictive probability \(P_{n_\ell}\) is estimated by:
Formally,
\[P_{n_\ell} = \operatorname{E}\{ \psi(\mathcal{D}_{n_\ell}^{\mathrm{comp}}) \mid \mathcal{D}_{\ell}^{\mathrm{obs}} \},\]
where \(\mathcal{D}_{n_\ell}^{\mathrm{comp}}\) is the completed dataset formed from the \(n_\ell\) enrolled subjects after imputing their remaining follow-up. Equivalently,
\[P_{n_\ell} = \int \psi(\mathcal{D}_{n_\ell}^{\mathrm{obs}}, \mathcal{D}_{n_\ell}^{\mathrm{mis}}) p(\mathcal{D}_{n_\ell}^{\mathrm{mis}} \mid \mathcal{D}_{\ell}^{\mathrm{obs}}) d\mathcal{D}_{n_\ell}^{\mathrm{mis}}.\]
Repeating the four-step procedure above for Monte Carlo replicate
\(m = 1,\ldots,M\), where \(M =\) N_impute, gives
\[\widehat{P}_{n_\ell} = \frac{1}{M}\sum_{m=1}^{M} I\{\textrm{success in replicate } m\}.\]
If
\[\widehat{P}_{n_\ell} > S_\ell,\]
accrual is stopped for expected success. Enrolled subjects are still followed to the planned final analysis time.
The maximum-sample-size predictive probability \(P_{\max,\ell}\) is estimated similarly, except that the completed trial includes both currently enrolled subjects and future subjects required to reach \(N_{\max}\). For each replicate, event times are imputed for the future subjects, the completed dataset is analyzed, and success is recorded:
\[P_{\max,\ell} = \operatorname{E}\{ \psi(\mathcal{D}_{N_{\max}}^{\mathrm{comp}}) \mid \mathcal{D}_{\ell}^{\mathrm{obs}} \}.\]
\[\widehat{P}_{\max,\ell} = \frac{1}{M}\sum_{m=1}^{M} I\{\textrm{success at } N_{\max} \textrm{ in replicate } m\}.\]
If
\[\widehat{P}_{\max,\ell} < F_\ell,\]
the trial stops for futility. Otherwise, accrual continues to the next interim look.
Thus the interim action at look \(\ell\) can be represented as
\[A_\ell = \begin{cases} \textrm{stop accrual for expected success}, & \widehat{P}_{n_\ell} > S_\ell,\\ \textrm{stop for futility}, & \widehat{P}_{\max,\ell} < F_\ell,\\ \textrm{continue accrual}, & \textrm{otherwise}. \end{cases}\]
The first look satisfying either stopping condition defines the adaptive stopping look,
\[L^* = \inf\{\ell : \widehat{P}_{n_\ell} > S_\ell \textrm{ or } \widehat{P}_{\max,\ell} < F_\ell\},\]
with \(L^* = L + 1\) if no interim stopping condition is met and the design continues to \(N_{\max}\).
The final analysis is conducted after accrual has stopped and the relevant follow-up has completed for the enrolled cohort, subject to the handling of loss to follow-up described below. The final rule supplies the binary success indicator used inside the predictive probability calculations.
For method = "logrank", success is based on a log-rank
test. For method = "cox", success is based on the Wald test
from a Cox proportional hazards regression. For these methods,
goldilocks stores \(1-p\)
in post_prob_ha; this is not a posterior probability, but
it puts frequentist and Bayesian rules on a common “larger is stronger
evidence” scale. For example, a one-sided test at \(\alpha = 0.025\) corresponds to
prob_ha = 0.975.
For the log-rank option, let \(Z_{\mathrm{LR}}\) denote the signed log-rank statistic, with positive values corresponding to excess events in the control arm under the package convention. Let \(p_{\mathrm{LR}}\) denote the traditional two-sided log-rank test P-value. For the Cox option, let \(\widehat{\eta}\) be the estimated log hazard ratio for treatment versus control and let
\[Z_{\mathrm{Cox}} = \frac{\widehat{\eta}} {\operatorname{se}(\widehat{\eta})}.\]
Here, lower treatment hazard corresponds to \(Z_{\mathrm{Cox}} < 0\). Let \(p_{\mathrm{Cox}}\) denote the traditional two-sided Wald-test P-value from the Cox model. The package uses the following method-specific definitions of \(Q(\mathcal{D})\):
| Method | Alternative | \(Q(\mathcal{D})\) |
|---|---|---|
logrank |
"less" |
\(\Phi(Z_{\mathrm{LR}})\) |
logrank |
"greater" |
\(1 - \Phi(Z_{\mathrm{LR}})\) |
logrank |
"two.sided" |
\(1 - p_{\mathrm{LR}}\) |
cox |
"less" |
\(1 - \Phi(Z_{\mathrm{Cox}})\) |
cox |
"greater" |
\(\Phi(Z_{\mathrm{Cox}})\) |
cox |
"two.sided" |
\(1 - p_{\mathrm{Cox}}\) |
chisq |
"two.sided" |
\(1 - p_{\chi^2}\) |
Here \(p_{\chi^2}\) is the traditional chi-square test P-value. The one-sided directions differ between the log-rank and Cox rows because of the sign convention of the package’s log-rank statistic.
For method = "chisq", the final event indicator is
compared between arms using a chi-square test. This discards event-time
information and is available only for two-arm designs with
alternative = "two.sided".
For method = "bayes", posterior hazard draws are mapped
to cumulative event probabilities at \(\tau\). In a two-arm design the treatment
effect is
\[\Delta = p_1(\tau) - p_0(\tau),\]
where \(p_1(\tau)\) is the experimental-arm event probability and \(p_0(\tau)\) is the control-arm event probability. The effect is on the event scale, not the survival scale. For an adverse event, benefit usually means \(\Delta < 0\).
Because \(p_a(\tau) = 1 - \exp\{-H_a(\tau)\}\), posterior draws of \(\Delta\) are obtained by transforming posterior draws of \(\boldsymbol{\lambda}_1\) and \(\boldsymbol{\lambda}_0\):
\[\Delta^{(b)} = \left[1 - \exp\{-H_1^{(b)}(\tau)\}\right] - \left[1 - \exp\{-H_0^{(b)}(\tau)\}\right], \qquad b = 1,\ldots,B.\]
The Monte Carlo estimate of the posterior probability for
alternative = "less" is
\[\widehat{\Pr}(\Delta < h_0 \mid \mathcal{D}) = \frac{1}{B}\sum_{b=1}^{B} I(\Delta^{(b)} < h_0),\]
where \(B =\)
N_mcmc.
With alternative = "less", success is declared when
\[\Pr(\Delta < h_0 \mid \mathcal{D}) > \texttt{prob_ha}.\]
With alternative = "greater", success is declared
when
\[\Pr(\Delta > h_0 \mid \mathcal{D}) > \texttt{prob_ha}.\]
The Bayesian final test is one-sided in the package;
alternative = "two.sided" is not supported.
In a single-arm design there is no \(p_0(\tau)\). The estimand becomes \(p_1(\tau)\), and \(h_0\) is an external benchmark event
probability. In clinical-trial terminology this benchmark is often
called a performance goal (PG) or objective performance criterion (OPC).
Consequently, single-arm designs in goldilocks require
method = "bayes".
Interim predictions impute outcomes that are not yet known. At the
final analysis, imputed_final controls whether subjects
lost to follow-up are also imputed.
If imputed_final = TRUE, the final analysis mirrors the
interim predictive framework: each imputed completed dataset is
analyzed, yielding a value of the success probability scale \(Q(\mathcal{D})\) (posterior probability for
method = "bayes" or transformed P-value for frequentist
methods), and post_prob_ha is the average of those values
across imputations. If imputed_final = FALSE, the final
analysis uses observed right-censored data for methods that can handle
censoring (logrank, cox, and
bayes). For chisq, lost-to-follow-up subjects
are excluded when imputed_final = FALSE because the
chi-square test has no mechanism for right-censored observations.
The loss-to-follow-up mechanism in the simulator is non-informative. Designs where dropout may depend on prognosis should be assessed with sensitivity analyses outside the default data-generating mechanism.
A Goldilocks design is calibrated by simulation. A single simulated trial describes one possible path; the design is characterized by repeated simulation over clinically relevant scenarios.
For a candidate design, sim_trials() repeatedly calls
survival_adapt() and summarise_sims()
estimates:
Let \(R = 1,\ldots,R_{\max}\) index simulated trials under a scenario \(\theta\), where \(\theta\) denotes the data-generating parameters such as control hazard, treatment hazard, accrual rate, loss-to-follow-up rate, and follow-up duration. The trial-level random variables are:
| Symbol | Meaning |
|---|---|
| \(N_R\) | enrolled sample size in simulated trial \(R\) |
| \(E_R\) | indicator that trial \(R\) stopped accrual for expected success |
| \(F_R\) | indicator that trial \(R\) stopped for futility |
| \(Z_R\) | final success indicator in trial \(R\) |
Let \(\Theta_0\) denote the null parameter space, i.e. the set of data-generating scenarios in which the treatment does not satisfy the alternative hypothesis. For a two-arm superiority trial this includes scenarios with no beneficial treatment effect; in practice, it should be explored across plausible nuisance parameters such as control event rates and accrual rates. The main operating characteristics are
\[\operatorname{Power}(\theta) = \Pr_\theta(Z_R = 1, F_R = 0),\]
\[\operatorname{Type\ I\ error}(\theta_0) = \Pr_{\theta_0}(Z_R = 1, F_R = 0), \qquad \theta_0 \in \Theta_0,\]
\[\Pr_\theta(\textrm{stop for expected success}) = \Pr_\theta(E_R = 1),\]
\[\Pr_\theta(\textrm{stop for futility}) = \Pr_\theta(F_R = 1),\]
and
\[\operatorname{E}_\theta(N_R), \qquad \operatorname{Var}_\theta(N_R).\]
The stop_and_fail summary estimates
\[\Pr_\theta(E_R = 1, Z_R = 0),\]
which is the probability that accrual stops for expected success but the final analysis does not meet the success criterion.
The basic workflow is:
out <- sim_trials(
hazard_treatment = ht,
hazard_control = hc,
cutpoints = cutpoints,
N_total = N_total,
lambda = lambda,
lambda_time = lambda_time,
interim_look = interim_look,
end_of_study = end_of_study,
prior = prior,
Fn = Fn,
Sn = Sn,
prob_ha = prob_ha,
N_impute = N_impute,
N_mcmc = N_mcmc,
N_trials = N_trials,
method = method)
summarise_sims(out$sims)Broglio et al. emphasize that type I error for this class of adaptive design should be examined across the null space, not only at one convenient null scenario. For time-to-event endpoints, the relevant null space includes plausible control event rates and accrual rates. Accrual rate is especially important because rapid enrollment can leave little endpoint information available at interim looks, increasing the uncertainty in both \(\widehat{P}_{n_\ell}\) and \(\widehat{P}_{\max,\ell}\). The follow-up period after accrual stops also affects operating characteristics because it determines how much additional information is observed before the final analysis.
The Monte Carlo sizes N_impute, N_mcmc, and
N_trials should be chosen so that simulation error is small
relative to the decision being made. Small values are appropriate for
examples and package tests, but final design calibration usually
requires many more trials and imputations.
The thresholds \(S_\ell\) and \(F_\ell\) may be constant across looks or may vary by look. They interact with the final analysis threshold, number and timing of looks, endpoint delay, accrual rate, loss to follow-up, prior distribution, and maximum sample size.
If type I error is too high, possible remedies include increasing
prob_ha, increasing the expected-success thresholds,
reducing the number of looks, or altering follow-up requirements. If
power is too low, the maximum sample size, futility threshold,
expected-success threshold, or final analysis threshold may need
reconsideration. Each change should be rechecked under null and
alternative scenarios.
The quantity stop_and_fail is particularly useful when
tuning \(S_\ell\). It estimates how
often a trial stops accrual for expected success but does not meet the
final success criterion after follow-up is complete. If this is too
large, the expected-success threshold is usually too permissive for the
amount of uncertainty present at interim looks.
A group-sequential design usually indexes interim analyses by information, such as the number of observed events, and may stop immediately for efficacy when a boundary is crossed. A Goldilocks design indexes sample-size selection analyses by enrolled sample size and explicitly incorporates future follow-up of the currently enrolled cohort.
This distinction matters for delayed outcomes. A trial may enroll many subjects before accumulating enough events for an event-driven interim analysis. Goldilocks uses the partial information available during accrual to decide whether additional subjects are needed, while preserving a single preplanned final analysis after the enrolled subjects have completed follow-up.
The time-to-event example in Broglio et al. uses a Gamma-exponential
prediction model. goldilocks extends this to a
piecewise-exponential model, so the hazard may change at prespecified
cut-points. This can be useful when there is a clinically plausible
early-risk period, but each additional interval adds parameters and can
make interim posteriors more diffuse.
The package also supports single-arm Bayesian Goldilocks designs by replacing the concurrent control with an external benchmark \(h_0\), often referred to as a performance goal (PG) or objective performance criterion (OPC). This is convenient for early-phase, rare-disease, or proof-of-concept settings, but validity then depends on the benchmark being transportable to the enrolled population.
Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685-705. doi:10.1080/10543406.2014.888569.
U.S. Food and Drug Administration. Adaptive Design Clinical Trials for Drugs and Biologics Guidance for Industry. December 2019. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adaptive-design-clinical-trials-drugs-and-biologics-guidance-industry.
U.S. Food and Drug Administration. Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials. February 5, 2010. https://www.fda.gov/media/71512/download.