Evaluating diagnostic tests—should the same methods apply?
It is now accepted wisdom among regulators, purchasers, journal editors, and clinicians that the true effects of health care interventions can only be evaluated by well-designed randomized controlled trials. The study by Rutten et al,1 reports the results of a randomized controlled trial of N-terminal pro-brain natriuretic peptide in an emergency department in the Netherlands. Given that this is the third randomized controlled trial assessing the effectiveness and cost-effectiveness of a natriuretic peptide in an emergency department setting,2, 3 it is important to ask whether diagnostic tests should always be evaluated by randomized controlled trials, or whether other study types that generate test accuracy data are sufficient.
Several factors make it more difficult to conduct, interpret, and apply the results of trials of diagnostic tests. First, when we assess the effect of a diagnostic test in a randomized controlled trial, we are assessing not just the effect of the test, but also the entire package of management and treatment decisions that flow from the test results. For trials of interventions such as pharmaceuticals, it is generally reasonable to assume that the results of a trial in one population will be similar in another population; it is much less reasonable to assume this for a diagnostic test or diagnostic algorithm. The difference in outcomes observed when using or not using a test depends on the multiple clinical decisions that are possible after the test results, whether those results come from the old or the new diagnostic algorithm and whether they are positive or negative. The multiple possible consequences of the test results will be affected by context-specific factors, such as the availability and effectiveness of additional diagnostic tests and treatment options. To apply the results of a randomized controlled trial of a diagnostic test, it is necessary to assume that clinicians in the new clinical setting will use diagnostic and therapeutic options similar to those used in the trial setting. This assumption is often unreasonable, which makes it difficult to generalize the results of trials of diagnostic tests from one setting to another. In fact, the results of the trials themselves may affect the future behavior of clinicians, altering the observed effect of a diagnostic test over time.
In addition, randomized controlled trials of diagnostic tests are highly subject to type II errors. In a trial of a diagnostic test, all of the observed difference must be driven by the small proportion of patients in whom the diagnosis changes as a result of the new test (or combination of tests). For example, if the new diagnostic algorithm improves the sensitivity of detecting disease by 5% with no change in specificity, and the disease occurs in 20% of patients included in the trial, any differences in outcomes between the 2 arms of the trial will only be possible in the 1% of patients (5% of 20%) in whom the diagnosis changes as a result of the test (Figure 1). Therefore, the sample size needed to detect a difference between the 2 arms of the trial must be multiplied by the inverse of this proportion, that is, by 100. So where a therapeutic trial may need a sample size of 300 patients, the diagnostic trial for the same condition with a 1% discordance in diagnostic results would need 30,000 patients. It is, therefore, not surprising that the trial reported in this journal and previous trials in the area have not found a statistically significant difference in clinical outcomes. Novel trial designs can and should be used to improve the efficiency of these trials. In such designs, only patients with discordant results on the 2 tests under evaluation are randomized, either to the therapeutic pathway appropriate for the target condition or to the pathway for alternative diagnoses where the target condition is ruled out, and are then followed up over time. These studies can improve our knowledge about diagnostic tests while conserving scarce research resources.

Figure 1.
Only patients with discordant results between the new and the old diagnostic tests drive changes in outcomes in trials of diagnostic tests (white area).
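The sample size arithmetic in the example above can be written out as a short calculation. The sketch below follows the editorial's simple proportional scaling and is not a formal power calculation. The function names are illustrative, and it assumes that the gain in sensitivity (or specificity) equals the proportion of diseased (or nondiseased) patients whose diagnosis changes.

```python
def discordant_fraction(prevalence, sensitivity_gain, specificity_gain=0.0):
    """Proportion of all trial patients whose diagnosis changes with the new test.

    Assumption (not from a formal model): every change in diagnosis is captured
    by the absolute gain in sensitivity among diseased patients and the absolute
    gain in specificity among nondiseased patients.
    """
    return prevalence * sensitivity_gain + (1.0 - prevalence) * specificity_gain


def diagnostic_trial_sample_size(therapeutic_n, fraction):
    """Scale a therapeutic-trial sample size by the inverse of the discordant fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("discordant fraction must be in (0, 1]")
    return round(therapeutic_n / fraction)


# Figures from the example in the text: 20% prevalence, 5% gain in sensitivity,
# no change in specificity, and a therapeutic trial of 300 patients.
fraction = discordant_fraction(prevalence=0.20, sensitivity_gain=0.05)
print(f"discordant fraction: {fraction:.1%}")                               # 1.0%
print("diagnostic trial size:", diagnostic_trial_sample_size(300, fraction))  # 30000
```

Under the same assumptions, randomizing only the discordant patients removes the inflation factor, which is the efficiency argument made for the novel designs.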
So when are trials of diagnostic tests necessary? Diagnostic tests may differ in accuracy or in other domains such as invasiveness/harm, cost, and availability. Trials are seldom needed to evaluate these domains. A trial is not required to show that a stress echo is less invasive than a coronary artery angiogram, and that, all things being equal, the stress echo is preferable. Similarly, a trial is not required to evaluate the relative test performance (accuracy) of one test compared with another. All that is required is a simple cross-sectional study in which all patients get both tests and the reference standard (or in which only patients with discordant results receive the reference standard). Once the relative test performance has been ascertained, its implications for patient outcomes must be extrapolated, and this may require a trial. If the diagnostic test being considered has similar sensitivity but improved specificity, then trials are not necessary. The effect of the new test will be clear: fewer patients without the target condition will be inappropriately subjected to additional tests and/or treated.4 The difficult scenario is when the new test has increased sensitivity, because the value of the new test then relies on the benefits of treating the extra cases of disease that are detected. Increased sensitivity changes the spectrum of the disease being treated, usually in the direction of milder disease where there may be less net treatment benefit. If the treatment algorithm is without appreciable adverse effects, if the disease has a limited spectrum, or if the spectrum of disease detected by the new test has been treated effectively in the context of a trial, then additional trials are unnecessary (only decision analytic modeling is required), but this is usually not the case.
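The cross-sectional comparison described in this paragraph, in which every patient receives both tests and the reference standard, reduces to computing sensitivity and specificity for each test against the same reference. The following is a minimal sketch of that calculation. The record layout, the function names, and the eight toy records are invented for illustration and do not come from any study.

```python
from typing import Iterable, NamedTuple


class PairedResult(NamedTuple):
    """One patient who received both tests and the reference standard."""
    old_positive: bool
    new_positive: bool
    reference_positive: bool


def sensitivity_specificity(test_results, reference):
    """Return (sensitivity, specificity) of one test against the reference standard."""
    pairs = list(zip(test_results, reference))
    tp = sum(1 for t, r in pairs if t and r)
    fn = sum(1 for t, r in pairs if not t and r)
    tn = sum(1 for t, r in pairs if not t and not r)
    fp = sum(1 for t, r in pairs if t and not r)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one nondiseased patient")
    return tp / (tp + fn), tn / (tn + fp)


def compare_tests(records: Iterable[PairedResult]):
    """Accuracy of the old and the new test, measured in the same patients."""
    records = list(records)
    reference = [r.reference_positive for r in records]
    return {
        "old": sensitivity_specificity([r.old_positive for r in records], reference),
        "new": sensitivity_specificity([r.new_positive for r in records], reference),
    }


# Toy records, invented purely to exercise the functions.
toy_records = [
    PairedResult(True, True, True),
    PairedResult(False, True, True),
    PairedResult(True, True, True),
    PairedResult(False, False, True),
    PairedResult(False, False, False),
    PairedResult(True, False, False),
    PairedResult(False, False, False),
    PairedResult(False, False, False),
]
print(compare_tests(toy_records))
```

Because both tests are measured in the same patients, differences in sensitivity and specificity reflect the tests rather than differences between patient groups, which is why no randomization is needed for this step.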
How should diagnostic tests be evaluated when it is likely that they will change the spectrum of patients being tested? Natriuretic peptides are one such example because the new test is generally quicker and more easily accessible to clinicians and can be used to triage patients for further testing. If natriuretic peptides did not change the spectrum of patients, a simple analysis would show that using brain natriuretic peptide (BNP) to triage patients before echocardiogram dominates, in cost-effectiveness terms, sending all patients with suspected heart failure for echocardiogram: it allows a quicker time to treatment, means fewer patients require echocardiography, and reduces costs overall. However, a simple decision analytic model does not help in this case. The accessibility of BNP means that the spectrum of patients is likely to change with the introduction of testing, with patients who are at both lower and higher risk of heart failure being more likely to be tested than they are currently. Given the numbers of patients who present with the nonspecific symptoms of heart failure, the number of patients who are at relatively low risk but who may be suspected of having heart failure, and who would be tested as natriuretic peptides become more widely available, could be very large. We cannot assess from the current trials, such as the trial published here, how the spectrum of patients might change, because patients were randomized either to be tested with natriuretic peptides or not. This introduces considerable uncertainty for funders trying to determine the cost-effectiveness of testing. How might the spectrum of patients change over time? At what point do the costs of the extra tests outweigh the benefits?
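The dependence of the cost comparison on the spectrum of patients tested can be made concrete with a small calculation. The sketch below compares the testing cost of BNP triage followed by echocardiography against echocardiography for all currently referred patients, and then widens the tested population with additional low-risk patients. Every number in it (patient counts, accuracy, unit costs) is a placeholder chosen for illustration, not an estimate from any trial, and it counts testing costs only, not health outcomes.

```python
from dataclasses import dataclass, replace


@dataclass
class TriageScenario:
    """Hypothetical inputs; none of these values comes from a study."""
    n_tested: int              # patients who receive BNP
    n_with_heart_failure: int  # of those tested, how many have heart failure
    bnp_sensitivity: float
    bnp_specificity: float
    cost_bnp: float            # cost per BNP test
    cost_echo: float           # cost per echocardiogram


def bnp_triage_cost(s: TriageScenario) -> float:
    """BNP for everyone tested, echocardiogram only for BNP-positive patients."""
    without_hf = s.n_tested - s.n_with_heart_failure
    bnp_positive = (s.n_with_heart_failure * s.bnp_sensitivity
                    + without_hf * (1.0 - s.bnp_specificity))
    return s.n_tested * s.cost_bnp + bnp_positive * s.cost_echo


def echo_for_all_cost(n_referred: int, cost_echo: float) -> float:
    """Echocardiogram for every referred patient, with no triage."""
    return n_referred * cost_echo


# Placeholder scenario: the current spectrum of referred patients.
current = TriageScenario(
    n_tested=1000, n_with_heart_failure=300,
    bnp_sensitivity=0.9, bnp_specificity=0.7,
    cost_bnp=20.0, cost_echo=200.0,
)
baseline = echo_for_all_cost(current.n_tested, current.cost_echo)

# Widen the spectrum: extra low-risk patients are tested, none with heart failure.
for extra_low_risk in (0, 1000, 2000, 4000):
    widened = replace(current, n_tested=current.n_tested + extra_low_risk)
    print(extra_low_risk, bnp_triage_cost(widened), baseline)
```

With a fixed spectrum the triage strategy is cheaper whenever it avoids enough echocardiograms to pay for the BNP tests. As low-risk patients are added, each one contributes a BNP cost and a chance of a false-positive echocardiogram, so the advantage shrinks and can reverse. Where that happens depends entirely on inputs that the current trials do not provide.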
Given that trials of natriuretic peptides are likely to be underpowered and difficult to generalize, and that there is considerable uncertainty about how the introduction of testing will affect the spectrum of patients being tested, how should we evaluate these tests for their effectiveness and cost-effectiveness? There are now several dozen studies assessing the diagnostic accuracy of BNP and N-terminal pro-brain natriuretic peptide, including in emergency departments and primary care settings. Further diagnostic accuracy studies and trials are unlikely to be helpful. What we now need to understand is how the introduction of these tests will affect the test-ordering behavior of physicians in different clinical settings. Randomized clinical trials of different diagnostic strategies are unlikely to be the most transferable or cost-effective study design to assess these effects. We also need to understand the potential health costs and benefits that flow from using the natriuretic peptides. BNP is primarily a rule-out test; that is, the value of the test lies mostly in ruling out the diagnosis of heart failure in patients with symptoms and signs of the disease. Once heart failure has been ruled out, an alternative diagnosis needs to be determined, and so the major health benefits will derive from establishing and treating an alternative diagnosis earlier than would have occurred without BNP.
In conclusion, when evaluating new tests, consider the domains to be evaluated. Whether a trial is needed depends on the test performance properties. Trials are likely to be necessary when testing is likely to change the spectrum of patients being tested, but they will be most efficient when they use novel designs, such as randomizing only those patients with discordant results on the new and the old diagnostic algorithms. Evaluating diagnostic tests is not simple, and we cannot rely simply on randomized controlled trials for their evaluation.
References
- N-terminal pro-brain natriuretic peptide testing in the emergency department: beneficial effects on hospitalization, costs, and outcome. Am Heart J. 2008;156:71–77
- N-terminal pro–B-type natriuretic peptide testing improves the management of patients with suspected acute heart failure: primary results of the Canadian prospective randomized multicenter IMPROVE-CHF study. Circulation. 2007;115:3103–3110
- Use of B-type natriuretic peptide in the evaluation and management of acute dyspnea. N Engl J Med. 2004;350:647–654
- When is measuring sensitivity and specificity sufficient to evaluate a diagnostic test, and when do we need randomized trials? Ann Intern Med. 2006;144:850–855
PII: S0002-8703(08)00220-2
doi:10.1016/j.ahj.2008.03.014
© 2008 Mosby, Inc. All rights reserved.
