Example: GeneSearch Breast Lymph Node Assay

Problem

Broglio et al. (2014) describe an example diagnostic study from Julian et al. (2008). This was a single-arm diagnostic study of the GeneSearch Breast Lymph Node (BLN) Assay for patients with breast cancer undergoing sentinel lymph node (SLN) biopsy. The new BLN test yielded a result during the surgery, i.e. intra-operatively, which yielded a positive or negative test for metastasis. The gold standard reference test was histologic evaluation. Whilst histologic evaluation may not be known immediately, we assume it is available soon after, meaning both test results (BLN and reference) are known immediately.

The trial would be declared successful if the new test sensitivity (π₁) and specificity (π₀) satisfy the following posterior probability inequalities:

P[π₁ > 0.70|Data] ≥ 0.985 and P[π₀ > 0.90|Data] ≥ 0.985.

Based on this, we know that

sens_pg <- 0.7
spec_pg <- 0.9

succ_sens <- 0.985,
succ_spec <- 0.985,

endpoint <- "both"

The thresholds of 0.985 were reported as being chosen to control the overall type I error at <5%. In general, when there are interim looks at the data, a type I error penalty must be paid. The choice of thresholds can be determined using simulation (in the null model space) at multiple sequential thresholds, and identify which threshold(s) lead to type I error below the required value.

Their study did not specify what the assumed sensitivity, specificity, or true prevalence was. From the pilot (beta) study reported by Julian et al. (2008), we assume that the true values are:

sens_true <- 0.824
spec_true <- 0.963
prev_true <- 0.2

There was no prevalence value reported from the pilot study, so we assume prev_true <- 0.20.

The prior distributions for prior_sens, prior_spec, and prior_prev, are all assumed to be independent Beta(0.1, 0.1). This is passed as:

prior_sens <- c(0.1, 0.1)
prior_spec <- c(0.1, 0.1)
prior_prev <- c(0.1, 0.1)

The trial sample size selection analyses were planned starting when n = 200 patients were enrolled, and after every additional n = 50 patients, up to a maximum of n = 700. Hence, we can set:

n_at_looks <- seq.int(200, 700, by = 50)

Simulation

We can simulate trials under this design using the adaptDiag package function multi_trial() function as follows:

library(adaptDiag)

fit_power <- multi_trial(
  sens_true = 0.824,
  spec_true = 0.963,
  prev_true = 0.20,
  endpoint = "both",
  sens_pg = 0.7,
  spec_pg = 0.9,
  prior_sens = c(0.1, 0.1),
  prior_spec = c(0.1, 0.1),
  prior_prev = c(0.1, 0.1),
  succ_sens = 0.985,
  succ_spec = 0.985,
  n_at_looks = seq(200, 700, 50),
  n_mc = 10000,
  n_trials = 200,
  ncores = 1L)

Here, we have only simulated 200 trials here, using a total of 1 core. In general, n_trials should be larger, and ncores should be set as large as possible dependent on the machine being used to perform the calculations. The Monte Carlo integration performed for futility assessment is based on n_mc = 10000 draws.

To determine the operating characteristics, we need to also specify the futility stopping rule. That is, we wish to calculate the posterior predictive probability of eventual trial success at the maximum sample size. Using conjugacy, the posterior predictive distribution of cases is Beta-Binomial (Broglio et al. 2014). For the GeneSearch BLN Assay study, the futility stopping rule was set at 5% – that is, if the posterior predictive probability of eventual study success is <0.05, the study will terminate early.

We add in an additional criteria that the study cannot stop for early success or futilty until a fixed number of reference positive cases are observed. We implement these 2 criteria and extract the operating characteristic as follows:

summarise_trials(fit_power, min_pos = 30, fut = 0.05)
#>                    n
#> decision            200 250 300 350 400 450 500 550 600 650 700
#>   early win          61  26  23  21  15  15   6   4   5   2   0
#>   no stopping         0   0   0   0   0   0   0   0   0   0   2
#>   stop for futility   7   1   3   2   1   1   0   0   2   3   0
#>   power stop_futility  n_avg     sens     spec mean_pos
#> 1  0.89           0.1 319.25 0.836173 0.963831   63.525

The printed output shows a table with columns listing the sample size looks, and rows listing the reasons for the trial stopping. Below this table we also see estimates of the study power, average sample size, proportion of trials stopping for futility, as well as posterior estimates of the sensitivity and specificity.

We also wish to evaluate the type I error. We can straightforwardly calculate this by assuming the true values are equal to the null.

fit_type1 <- update(fit_power,
                    sens_true = 0.7,
                    spec_true = 0.9)

summarise_trials(fit_type1, min_pos = 30, fut = 0.05)
#>                    n
#> decision            200 250 300 350 450
#>   stop for futility 159  23   6   8   4
#>   power stop_futility  n_avg      sens      spec mean_pos
#> 1     0             1 219.75 0.6939306 0.8997817    44.17

The operating characteristics show the type I error is well controlled at the 5% level. Note: in practice, one should run many simulations (at least 10,000) to get accurate estimates of the type I error.

Grid search

The true prevalence was passed as prev_true = 0.20, commensurate with 20% of subjects in the trial testing positive with the reference histology test. We do not know what the actual value of the true prevalence was when the trial was originally designed. (In fact, we do not know what the input true parameters for sensitivity and specificity were either.) It is reasonably straightforward to search a grid of input parameters, as follows.

Assume we are unsure of what the true prevalence might be. We believe it will be 15% and 30%. We then wish to evaluate the operating characteristics at a grid of plausible values. If we let , we can apply the multi_trial() function within a loop:

tab <- NULL

for (i in 1:length(prev_true_vec)) {
  fit_power_i <- multi_trial(
  sens_true = 0.824,
  spec_true = 0.963,
  prev_true = prev_true_vec[i],
  endpoint = "both",
  sens_pg = 0.7,
  spec_pg = 0.9,
  prior_sens = c(0.1, 0.1),
  prior_spec = c(0.1, 0.1),
  prior_prev = c(0.1, 0.1),
  succ_sens = 0.985,
  succ_spec = 0.985,
  n_at_looks = seq(200, 700, 50),
  n_mc = 1000,
  n_trials = 100,
  ncores = 1L)
  
  out <- summarise_trials(fit_power_i, min_pos = 30, fut = 0.05)
  tab <- rbind(tab, out)
}

We can then use this this to understand how the power changes with the the assumed true prevalence value, as

plot(prev_true_vec, tab$power,
     xlab = "True prevalence",
     ylab = "Power",
     main = "Prevalence vs. power",
     type = "b")

Note: to allow the vignette to compile quickly, we have only used n_mc = 1000 and n_trials = 100. In practice, these values would both need substantially larger.

References

Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685–705.

Julian TB, Blumencranz P, Deck K, Whitworth P, Berry DA, Berry SM, Rosenberg A, et al. Novel intraoperative molecular test for sentinel lymph node metastases in patients with early-stage breast cancer. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 2008; 26(20): 3338–3345.