Broglio et al. (2014) describe an example diagnostic study from Julian et al. (2008). This was a single-arm diagnostic study of the GeneSearch Breast Lymph Node (BLN) Assay for patients with breast cancer undergoing sentinel lymph node (SLN) biopsy. The new BLN test yielded a result during the surgery, i.e. intra-operatively, which yielded a positive or negative test for metastasis. The gold standard reference test was histologic evaluation. Whilst histologic evaluation may not be known immediately, we assume it is available soon after, meaning both test results (BLN and reference) are known immediately.
The trial would be declared successful if the new test sensitivity (π1) and specificity (π0) satisfy the following posterior probability inequalities:
P[π1 > 0.70|Data] ≥ 0.985 and P[π0 > 0.90|Data] ≥ 0.985.
Based on this, we know that
The thresholds of 0.985 were reported as being chosen to control the overall type I error at <5%. In general, when there are interim looks at the data, a type I error penalty must be paid. The choice of thresholds can be determined using simulation (in the null model space) at multiple sequential thresholds, and identify which threshold(s) lead to type I error below the required value.
Their study did not specify what the assumed sensitivity, specificity, or true prevalence was. From the pilot (beta) study reported by Julian et al. (2008), we assume that the true values are:
There was no prevalence value reported from the pilot study, so we
assume prev_true <- 0.20
.
The prior distributions for prior_sens
,
prior_spec
, and prior_prev
, are all assumed to
be independent Beta(0.1, 0.1). This is
passed as:
The trial sample size selection analyses were planned starting when n = 200 patients were enrolled, and after every additional n = 50 patients, up to a maximum of n = 700. Hence, we can set:
We can simulate trials under this design using the
adaptDiag
package function multi_trial()
function as follows:
library(adaptDiag)
fit_power <- multi_trial(
sens_true = 0.824,
spec_true = 0.963,
prev_true = 0.20,
endpoint = "both",
sens_pg = 0.7,
spec_pg = 0.9,
prior_sens = c(0.1, 0.1),
prior_spec = c(0.1, 0.1),
prior_prev = c(0.1, 0.1),
succ_sens = 0.985,
succ_spec = 0.985,
n_at_looks = seq(200, 700, 50),
n_mc = 10000,
n_trials = 200,
ncores = 1L)
Here, we have only simulated 200 trials here, using a total of 1
core. In general, n_trials
should be larger, and
ncores
should be set as large as possible dependent on the
machine being used to perform the calculations. The Monte Carlo
integration performed for futility assessment is based on
n_mc = 10000
draws.
To determine the operating characteristics, we need to also specify the futility stopping rule. That is, we wish to calculate the posterior predictive probability of eventual trial success at the maximum sample size. Using conjugacy, the posterior predictive distribution of cases is Beta-Binomial (Broglio et al. 2014). For the GeneSearch BLN Assay study, the futility stopping rule was set at 5% – that is, if the posterior predictive probability of eventual study success is <0.05, the study will terminate early.
We add in an additional criteria that the study cannot stop for early success or futilty until a fixed number of reference positive cases are observed. We implement these 2 criteria and extract the operating characteristic as follows:
summarise_trials(fit_power, min_pos = 30, fut = 0.05)
#> n
#> decision 200 250 300 350 400 450 500 550 600 650 700
#> early win 64 25 27 20 18 8 3 9 5 4 0
#> late win 0 0 0 0 0 0 0 0 0 0 5
#> no stopping 0 0 0 0 0 0 0 0 0 0 3
#> stop for futility 5 1 0 1 0 0 0 0 2 0 0
#> power stop_futility n_avg sens spec mean_pos
#> 1 0.94 0.045 327 0.8391069 0.9640512 64.3
The printed output shows a table with columns listing the sample size looks, and rows listing the reasons for the trial stopping. Below this table we also see estimates of the study power, average sample size, proportion of trials stopping for futility, as well as posterior estimates of the sensitivity and specificity.
We also wish to evaluate the type I error. We can straightforwardly calculate this by assuming the true values are equal to the null.
fit_type1 <- update(fit_power,
sens_true = 0.7,
spec_true = 0.9)
summarise_trials(fit_type1, min_pos = 30, fut = 0.05)
#> n
#> decision 200 250 300 350 400 450 550 650 700
#> no stopping 0 0 0 0 0 0 0 0 1
#> stop for futility 158 16 10 8 3 2 1 1 0
#> power stop_futility n_avg sens spec mean_pos
#> 1 0 0.995 227 0.6917301 0.9012067 45.125
The operating characteristics show the type I error is well controlled at the 5% level. Note: in practice, one should run many simulations (at least 10,000) to get accurate estimates of the type I error.
The true prevalence was passed as prev_true = 0.20
,
commensurate with 20% of subjects in the trial testing positive with the
reference histology test. We do not know what the actual value of the
true prevalence was when the trial was originally designed. (In fact, we
do not know what the input true parameters for sensitivity and
specificity were either.) It is reasonably straightforward to search a
grid of input parameters, as follows.
Assume we are unsure of what the true prevalence might be. We believe
it will be 15% and 30%. We then wish to evaluate the operating
characteristics at a grid of plausible values. If we let , we can apply
the multi_trial()
function within a loop:
tab <- NULL
for (i in 1:length(prev_true_vec)) {
fit_power_i <- multi_trial(
sens_true = 0.824,
spec_true = 0.963,
prev_true = prev_true_vec[i],
endpoint = "both",
sens_pg = 0.7,
spec_pg = 0.9,
prior_sens = c(0.1, 0.1),
prior_spec = c(0.1, 0.1),
prior_prev = c(0.1, 0.1),
succ_sens = 0.985,
succ_spec = 0.985,
n_at_looks = seq(200, 700, 50),
n_mc = 1000,
n_trials = 100,
ncores = 1L)
out <- summarise_trials(fit_power_i, min_pos = 30, fut = 0.05)
tab <- rbind(tab, out)
}
We can then use this this to understand how the power changes with the the assumed true prevalence value, as
plot(prev_true_vec, tab$power,
xlab = "True prevalence",
ylab = "Power",
main = "Prevalence vs. power",
type = "b")
Note: to allow the vignette to compile quickly, we
have only used n_mc = 1000
and n_trials = 100
.
In practice, these values would both need substantially larger.
Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685–705.
Julian TB, Blumencranz P, Deck K, Whitworth P, Berry DA, Berry SM, Rosenberg A, et al. Novel intraoperative molecular test for sentinel lymph node metastases in patients with early-stage breast cancer. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 2008; 26(20): 3338–3345.