NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Institute for Quality and Efficiency in Health Care. General Methods [Internet]. Version 4.2. Cologne, Germany: Institute for Quality and Efficiency in Health Care (IQWiG); 2015 Apr 22.


Institute for Quality and Efficiency in Health Care (IQWiG): IQWiG Methods Resources [Internet].

3. Benefit assessment of medical interventions

3.1. Patient-relevant medical benefit and harm

3.1.1. Definition of patient-relevant medical benefit and harm

The term “benefit” refers to positive causal effects, and the term “harm” refers to negative causal effects of a medical intervention on patient-relevant outcomes (see below). In this context, “causal” means that it is sufficiently certain that the observed effects can be ascribed solely to the intervention to be tested [595]. The terms “benefit” and “harm” refer to a comparison with a placebo (or another type of sham intervention) or no treatment.

In the case of a comparison between the medical intervention to be assessed and a clearly defined alternative medical intervention, the following terms are used in the comparative assessment of beneficial or harmful aspects (the terms are always described from the point of view of the intervention to be assessed):

Beneficial aspects:

In the case of a greater benefit, the term “added benefit” is used.

In the case of lesser or comparable benefit, the terms “lesser” or “comparable” benefit are used.

Harmful aspects:

The terms “greater”, “comparable” and “lesser” harm are used.

The assessment of the evidence should preferably come to a clear conclusion: either there is proof of a(n) (added) benefit or harm of an intervention, or there is proof of a lack of a(n) (added) benefit or harm, or there is proof of neither, in which case it remains unclear whether the intervention results in a(n) (added) benefit or harm.

In addition, where a(n) (added) benefit or harm is not clearly proven, it may be meaningful to categorize further whether at least “indications”, or merely “hints”, of a(n) (added) benefit or harm are available (see Section 3.1.4).

As the benefit of an intervention should be related to the patient, this assessment is based on the results of studies investigating the effects of an intervention on patient-relevant outcomes. In this context, “patient-relevant” refers to how a patient feels, functions or survives [44]. Both the intended and unintended effects of the intervention are considered, in particular those that allow an assessment of its impact on the following patient-relevant outcomes in order to determine disease- and treatment-related changes:

  1. mortality
  2. morbidity (symptoms and complications)
  3. health-related quality of life

These outcomes are also named in SGB V as outcomes primarily to be considered, for example, in § 35 (1b) SGB V. As supplementary information, consideration can be given to the time and effort invested in relation to the disease and the intervention. This also applies to patient satisfaction, insofar as health-related aspects are represented here. However, a benefit or added benefit cannot be determined on the basis of these 2 aspects alone.

For all listed outcomes it may be necessary to make the assessment in relation to information on how other outcomes are affected by the intervention. In particularly serious or even life-threatening diseases, for example, it is usually not sufficient to demonstrate only an improvement in quality of life with the intervention to be assessed if, at the same time, it cannot be excluded with sufficient certainty that serious morbidity or even mortality is adversely affected to an extent that is no longer acceptable. This is in principle consistent with the ruling by Germany's highest courts that certain (beneficial) aspects must be assessed only if therapeutic effectiveness has been sufficiently proven [81]. Conversely, in many areas (particularly in palliative care) an impact on mortality cannot be adequately assessed without knowledge of accompanying (possibly adverse) effects on quality of life.

In accordance with §35b (1) Sentence 4 SGB V, the following outcomes related to patient benefit are to be given appropriate consideration: increase in life expectancy, improvement in health status and quality of life, as well as reduction in disease duration and adverse effects. These dimensions of benefit are represented by the outcomes listed above; for example, the improvement in health status and the reduction in disease duration are aspects of direct disease-related morbidity; the reduction in adverse effects is an aspect of therapy-related morbidity.

Those outcomes reliably and directly representing specific changes in health status are primarily considered. In this context, individual affected persons as well as organizations of patient representatives and/or consumers are especially involved in the topic-related definition of patient-relevant outcomes. In the assessment of quality of life, only instruments should be used that are suited for application in clinical trials and have been evaluated accordingly [174]. In addition, valid surrogate endpoints can be considered in the benefit assessment (see Section 3.1.2).

Both beneficial and harmful aspects can have different relevance for the persons affected; these aspects may become apparent through qualitative surveys or the Institute's consultations with affected persons and organizations of patient representatives and/or consumers in connection with the definition of patient-relevant outcomes (examples of corresponding methods are listed at the end of Section 3.1.4). In such a situation it may be meaningful to establish a hierarchy of outcomes. General conclusions on benefit and harm are then primarily based on proof regarding higher-weighted outcomes. Planned subgroup and sensitivity analyses are then primarily conducted for higher-weighted outcomes, whereas such analyses are not routinely conducted for the remaining ones.

Diagnostic tests can be of indirect benefit by being a precondition for therapeutic interventions through which an effect on the patient-relevant outcomes mentioned above can be achieved. A precondition for the benefit of such a test is therefore the existence of a treatment, given depending on the test result, with a proven benefit for patients.

Interventions can also have consequences for those indirectly affected, for example, relatives and carers. If appropriate, these consequences can also be considered within the framework of the Institute's reports.

The term “benefit assessment” refers to the whole process of the assessment of medical interventions with regard to their positive and negative causal effects compared with a clearly defined alternative treatment, a placebo (or a different type of sham intervention), or no treatment. In this context, beneficial and harmful aspects of an intervention are initially assessed on an outcome-specific basis and then presented. In addition, a combined evaluation of outcome-related beneficial and harmful aspects is possible (see Section 3.1.4) so that, for example, when the effects on all other outcomes have been analysed, the outcome-specific “lesser harm” from an intervention (in terms of a reduction in adverse effects) can lead to the balanced conclusion of an “added benefit”.

3.1.2. Surrogates of patient-relevant outcomes

Surrogate endpoints are frequently used in medical research as a substitute for patient-relevant outcomes, mostly to arrive at conclusions on patient-relevant (added) benefits earlier and more simply [15,194,444]. Most surrogate endpoints are, however, unreliable in this regard and can be misleading when used in a benefit assessment [102,219,227]. Surrogate endpoints are therefore normally considered in the Institute's benefit assessments only if they have been validated beforehand by means of appropriate statistical methods within a sufficiently restricted patient population and within comparable interventions (e.g. drugs with a comparable mode of action). A surrogate endpoint can be regarded as valid if the effect of an intervention on the patient-relevant outcome to be substituted is explained to a sufficient degree by the effect on the surrogate endpoint [28,586]. The necessity to evaluate surrogate endpoints may have particular relevance within the framework of the early benefit assessment of drugs (see Section 3.3.3), as regulatory approval procedures primarily investigate the efficacy of a drug, but not always its patient-relevant benefit or added benefit.

There is neither a standard procedure for surrogate endpoint validation nor a general best estimation method nor a generally accepted criterion which, if fulfilled, would demonstrate validity [380]. However, the current methodological literature frequently discusses correlation-based procedures for surrogate validation, with estimation of correlation measures at a study level and individual level [286]. The Institute's benefit assessments therefore give preference to validations on the basis of such procedures. These procedures usually require a meta-analysis of several randomized studies, in which both the effects on the surrogate endpoint and those on the patient-relevant outcome of interest are investigated [86,400]. Alternative methods [586] are only considered in justified exceptional cases.

For correlation-based procedures the following conditions are normally required to demonstrate validity: on the one hand, a high correlation between the surrogate and the patient-relevant outcome at the individual level, and on the other, a high correlation between effects on the surrogate and effects on the patient-relevant outcome at the study level [86,88]. Because the Institute's benefit assessments draw conclusions related to groups of patients, the assessment of the validity of a surrogate endpoint is primarily based on the degree of correlation at the level of treatment effects, i.e. the study level. In addition to the degree of correlation, the reliability of the results of the validation process is considered when assessing the validity of a surrogate endpoint; various criteria are drawn upon for this purpose [286]. For example, associations observed between a surrogate endpoint and the corresponding patient-relevant outcome for an intervention with a specific mode of action are not necessarily applicable to other interventions used to treat the same disease but with a different mode of action [193,219,227,380]. The studies on which the validation was based must therefore have been conducted in patient populations and with interventions that allow conclusions on the therapeutic indication investigated in the benefit assessment, as well as on the test and comparator interventions. To allow an assessment of transferability, validation studies including various disease entities or interventions should at least provide analyses of heterogeneity.
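As an illustration of the study-level step of such a correlation-based validation, the following sketch computes a weighted correlation between per-study treatment effects on the surrogate and on the patient-relevant outcome. The function name and the simple weighting scheme are assumptions for illustration only; actual validations use more elaborate meta-analytic models [286].

```python
import math

def study_level_correlation(surrogate_effects, outcome_effects, weights=None):
    """Weighted Pearson correlation between per-study treatment effects on a
    surrogate endpoint and on the patient-relevant outcome (study level).
    Illustrative sketch: equal weights are assumed if none are supplied."""
    n = len(surrogate_effects)
    if weights is None:
        weights = [1.0] * n
    w = sum(weights)
    mx = sum(wi * x for wi, x in zip(weights, surrogate_effects)) / w
    my = sum(wi * y for wi, y in zip(weights, outcome_effects)) / w
    cov = sum(wi * (x - mx) * (y - my)
              for wi, x, y in zip(weights, surrogate_effects, outcome_effects)) / w
    vx = sum(wi * (x - mx) ** 2 for wi, x in zip(weights, surrogate_effects)) / w
    vy = sum(wi * (y - my) ** 2 for wi, y in zip(weights, outcome_effects)) / w
    return cov / math.sqrt(vx * vy)
```

A correlation close to 1 at the study level would support, but not by itself establish, the validity of the surrogate; the reliability criteria described above must be assessed in addition.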

In the event that a surrogate endpoint cannot be validated conclusively (e.g. if correlation is not high enough), it is also possible to apply the “surrogate threshold effect (STE) concept” [85,286]. For this purpose, the effect on the surrogate resulting from the studies included in the benefit assessment is related to the STE [88,400].
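The STE comparison can be sketched as follows; the function name and the orientation convention (larger values favour the intervention, STE expressed on the same scale as the effect estimate) are assumptions for illustration, not the Institute's operationalization.

```python
def ste_check(ci_lower, ci_upper, ste):
    """Relate the observed effect on the surrogate (95% CI from the studies
    included in the benefit assessment) to the surrogate threshold effect.
    Illustrative convention: larger effect values favour the intervention."""
    if ci_lower >= ste:
        # the whole confidence interval lies beyond the threshold
        return "effect on surrogate exceeds the STE"
    if ci_upper < ste:
        # even the upper bound falls short of the threshold
        return "effect on surrogate clearly below the STE"
    return "inconclusive relative to the STE"
```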

For the Institute's benefit assessments, conclusions on patient-relevant outcomes can be drawn from the effects on the surrogate, depending on the verification of the validity of the surrogate or the evaluation of the STE. For the former, the decisive factors are the degree of correlation between the effects on the surrogate and on the patient-relevant outcome, and the reliability of the validation in the validation studies. In the evaluation of an STE, the decisive criterion is the size of the effect on the surrogate in the studies included in the benefit assessment compared with the STE. In the case of a statistically significant effect on the surrogate endpoint, all gradations of conclusions on the (added) benefit with regard to the corresponding patient-relevant outcome according to Section 3.1.4 are possible, depending on the constellation.

Surrogate endpoints that are not valid or for which no adequate validation procedure was conducted can nevertheless be presented in the Institute's reports. However, independent of the observed effects, such endpoints are not suited to providing proof of an (added) benefit of an intervention.

Depending on the proximity to a corresponding patient-relevant outcome, the literature uses various other terms to describe surrogate endpoints (e.g. intermediate endpoint). However, we dispense with such a distinction here, as the issue of the necessary validity remains unaffected by this. In addition it should be considered that an endpoint can at the same time represent a patient-relevant outcome and, beyond this, can also be regarded as a surrogate (i.e. a substitute) for a different patient-relevant outcome.

3.1.3. Assessment of the harm of medical interventions

The use of any type of medical intervention (drug, non-drug, surgical, diagnostic, preventive, etc.) carries per se the risk of adverse effects. In this context, the term “adverse effects” refers to all events and effects representing individually perceived or objectively detectable physical or mental harm that may to a greater or lesser extent cause a short- or long-term reduction in life expectancy, an increase in morbidity, or impairment in quality of life. It should be noted that if the term “adverse effects” is used, a causal relationship to the intervention is assumed, whereas the issue of causality still remains open with the term “adverse events” [109].

The term “harm” describes the occurrence of adverse effects when using a medical intervention. The description of harm is an essential and equal component in the benefit assessment of an intervention. It ensures the informed, population-related, but also individual weighing of benefit and harm [602]. A prerequisite for this is that the effect sizes of a medical intervention can be described by means of the data available, both for its desired as well as its adverse effects, and compared with therapy alternatives, for example.

However, in a systematic review, the analysis, assessment, and reporting of the harm of a medical intervention are often far more difficult than those of the (added) benefit. This applies in particular to unexpected adverse events [109]. Studies are typically designed to measure the effect of a medical intervention on a few predefined outcomes. In most cases, these are outcomes representing effectiveness, while adverse effects are concomitantly recorded as adverse events. The results for adverse events depend heavily on the underlying methods for data collection. For example, explicit queries on defined adverse events normally result in the determination of higher event rates than do general queries [41,304]. To detect unexpected adverse events in particular, general queries about the well-being of patients are however required. In addition, studies designed to specifically detect rare, serious adverse effects (including the description of a causal relationship to the medical intervention) are considerably underrepresented in medical research [48,164,303]. Moreover, reporting of adverse events in individual studies is of poor quality, which has also led to an amendment of the CONSORT statement for RCTs [302]. Finally, the systematic assessment of the adverse effects of an intervention is also made more difficult by the fact that the corresponding coding in bibliographic databases is insufficient, so that the specific search for relevant scientific literature often produces an incomplete picture [127].

The obstacles noted above often make the investigation of harm more difficult. In cases where complete clinical study reports are available for the assessment, at least sufficient data transparency exists for adverse events. However, it is still necessary to find a meaningful balance between the completeness of the evaluation of aspects of harm and the resources invested. Consequently, the evaluation and reporting must be limited to relevant adverse effects. In particular, adverse effects can be defined as relevant if they may

completely or almost completely offset the benefit of an intervention

substantially vary between 2 or more otherwise equivalent treatment options

occur predominantly with treatment options that may be particularly effective

have a dose-effect relationship

be regarded by patients as especially important

be accompanied by serious morbidity or even increased mortality, or be associated with substantial impairment in quality of life

The Institute observes the following principles when evaluating and reporting adverse effects: In the benefit assessment, the initial aim is to compile a selection of potentially relevant adverse effects that are essential in deciding for or against the use of the intervention to be assessed. In this context, the selection of adverse effects and events is made in accordance with the criteria outlined above. This compilation is made within the framework of the preliminary literature search for the particular research question, especially on the basis of data from controlled intervention studies in which the benefit of the intervention was specifically investigated. In addition, and if appropriate, the compilation is made on the basis of available epidemiological data (e.g. from cohort or case-control studies), as well as pharmacovigilance and regulatory data, etc. In individual cases, data obtained from animal trials and experiments to test pathophysiological constructs may be useful. The compilation of potentially relevant adverse effects described above forms the foundation for assessment of harm on the basis of the studies included in the benefit assessment. In this context, if possible and meaningful, pooled analyses (e.g. overall rates of serious adverse events) may also be drawn upon.

3.1.4. Outcome-related assessment

The benefit assessment and the estimation of the extent of the (un)certainty of results generally follow international EBM standards as developed, for example, by the GRADE group [23].

Medical interventions are compared with other interventions, sham interventions (e.g. placebo), or no intervention in respect of their effects on defined patient-relevant outcomes, and their (added) benefit and harm are described in summary. For this purpose, on the basis of the analysis of the available scientific data, a conclusion on the evidence base for the (added) benefit and harm is drawn separately for each predefined patient-relevant outcome, graded in 4 levels with regard to the certainty of the conclusion: the data provide either “proof” (highest certainty of conclusions), an “indication” (medium certainty of conclusions), or a “hint” (weakest certainty of conclusions) of the benefit or harm of an intervention, or none of these 3 situations applies. The latter is the case if no data are available or the available data do not allow any of the other 3 conclusions to be drawn.

Depending on the research question, the conclusions refer to the presence or lack of a(n) (added) benefit or harm. The prerequisite for conclusions on the lack of a(n) (added) benefit or harm is a well-founded definition of irrelevance ranges (see Section 8.3.6).

The certainty of results is an important criterion for the inference of conclusions on the evidence base. In principle, every result from an empirical study or systematic review of empirical studies is potentially uncertain, and the certainty of results must therefore be examined. In this context, one distinguishes between qualitative and quantitative certainty of results. The qualitative certainty of results is impaired by systematic errors (bias; see Section 8.3.11) such as information bias, selection bias and confounding. The quantitative certainty of results is influenced by random errors caused by sampling (statistical uncertainty).

The qualitative certainty of results is thus determined by the study design, from which evidence levels can be inferred (see Section 8.1.3). It is also determined by (outcome-related) measures for further prevention or minimization of potential bias, which must be assessed depending on the study design (see Section 8.1.4). Such measures include, for example, the blinded assessment of outcomes, an analysis based on all included patients (potentially supported by the application of adequate imputation methods for missing values), and, if appropriate, the use of valid measurement instruments.

The quantitative certainty of results is directly connected to the sample size (i.e. the number of patients investigated in a study or the number of [primary] studies included in a systematic review), as well as to the variability observed within and between studies. If the underlying data allow for this, the statistical uncertainty can be quantified and assessed as the standard error or confidence interval of parameter estimates (precision of the estimate).

The Institute uses the following 3 categories to grade the degree of qualitative certainty at the individual study level and outcome level:

high qualitative certainty of results: results on an outcome from a randomized study with a low risk of bias

moderate qualitative certainty of results: results on an outcome from a randomized study with a high risk of bias

low qualitative certainty of results: results on an outcome from a non-randomized comparative study

In the inference of the evidence base for an outcome, the number of available studies, their qualitative certainty of results, as well as the effects found in the studies are of crucial importance. If at least 2 studies are available, it is first distinguished whether, due to existing heterogeneity within a meta-analysis (see Section 8.3.8), a common effect estimate can be meaningfully formed or not. In the case of homogeneous results that can be meaningfully pooled, the common effect estimate must be statistically significant to infer proof, an indication or a hint according to the existing certainty of results. If the estimated results are too heterogeneous to meaningfully form a pooled common effect estimate, one distinguishes between effects that are “not in the same direction”, “moderately in the same direction” and “clearly in the same direction”. These are defined as follows:

Effects in the same direction are present if the prediction interval used to display heterogeneity in a meta-analysis with random effects (see Section 8.3.8) is presented and does not cover the zero effect. In other cases (the prediction interval is not presented or covers the zero effect), effects in the same direction are present in the following situation:

The effect estimates of 2 or more studies point in the same direction. For these “directed” studies, all of the following conditions must be fulfilled:

The overall weight of these studies is ≥ 80%.

At least 2 of these studies show statistically significant results.

At least 50% of the weight of these studies is based on statistically significant results.

In this context, the weights of these studies generally come from a meta-analysis with random effects (see Section 8.3.8). If no meta-analysis is meaningful, the relative sample size corresponds to the weight.

Whether effects in the same direction are moderately or clearly in the same direction is decided, where possible, on the basis of the location of the prediction interval. As the prediction interval is generally only presented if at least 4 studies are available (see Section 8.3.8), the classification into effects that are moderately or clearly in the same direction depends on the number of studies.

2 studies: Effects in the same direction are always clearly in the same direction.

3 studies:

All studies show statistically significant results. The effects in the same direction are clearly in the same direction.

Not all of the 3 studies show statistically significant results. The effects in the same direction are moderately in the same direction.

4 or more studies:

All studies show statistically significant results in the same direction of effects: The effects in the same direction are clearly in the same direction.

The prediction interval does not cover the zero effect: The effects in the same direction are clearly in the same direction.

The prediction interval covers the zero effect: The effects in the same direction are moderately in the same direction.
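The decision rules above can be sketched as a single classification function. The function, its argument conventions (positive effects favour the intervention, weights summing to 1, an optional prediction interval as a pair of bounds) and its simplifications are illustrative assumptions, not the Institute's implementation.

```python
def classify_direction(effects, significant, weights, prediction_interval=None):
    """Classify study results as 'not', 'moderately' or 'clearly' in the same
    direction, following the rules above. Illustrative sketch: effects > 0 are
    taken to favour the intervention; `weights` (summing to 1) come from a
    random-effects meta-analysis or, failing that, relative sample sizes."""
    def pi_excludes_zero(pi):
        return pi is not None and (pi[0] > 0 or pi[1] < 0)

    # Step 1: are the effects in the same direction at all?
    if pi_excludes_zero(prediction_interval):
        same_direction = True
    else:
        directed = [i for i, e in enumerate(effects) if e > 0]
        w_dir = sum(weights[i] for i in directed)
        sig = [i for i in directed if significant[i]]
        w_sig = sum(weights[i] for i in sig)
        same_direction = (len(directed) >= 2 and w_dir >= 0.8
                          and len(sig) >= 2 and w_sig >= 0.5 * w_dir)
    if not same_direction:
        return "not in the same direction"

    # Step 2: moderately vs clearly, depending on the number of studies.
    n = len(effects)
    if n == 2:
        return "clearly in the same direction"
    if n == 3:
        return ("clearly" if all(significant) else "moderately") + " in the same direction"
    # 4 or more studies: all results significant, or a prediction interval
    # excluding the zero effect, establishes clearly same-directed effects.
    if all(significant) or pi_excludes_zero(prediction_interval):
        return "clearly in the same direction"
    return "moderately in the same direction"
```

For example, three same-directed studies of which only two are statistically significant would, under these conventions, be classified as moderately in the same direction.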

If the available studies show the same qualitative certainty of results, or if only one study is available, these definitions specify the regular requirements for the evidence base needed to infer conclusions with different certainties of conclusions. As described above, the Institute distinguishes between 3 certainties of conclusions: “proof”, “indication” and “hint”.

A conclusion on proof generally requires that a meta-analysis of studies with a high qualitative certainty of results shows a corresponding statistically significant effect. If a meta-analysis cannot be conducted, at least 2 studies conducted independently of each other and showing a high qualitative certainty of results and a statistically significant effect should be present, the results of which are not called into question by further comparable studies with a high certainty of results (consistency of results). These 2 studies do not need to have an exactly identical design. Which deviations in design between studies are still acceptable depends on the research question. Accordingly, a meta-analysis of studies with a moderate qualitative certainty of results or a single study with a high qualitative certainty of results can generally provide only an indication, despite statistically significant effects.

On the basis of only one study, in exceptional cases proof can be inferred for a specific (sub)population with regard to an outcome. This requires the availability of a clinical study report according to the International Conference on Harmonization (ICH) guidelines and the fulfilment of the other requirements stipulated for proof. In addition, the study must fulfil the following specific requirements:

The study is a multi-centre study with at least 10 centres.

The effect estimate observed has a very small corresponding p-value (p < 0.001).

The result is consistent within the study. For the (sub)population of interest, analyses of different further subpopulations are available (particularly subpopulations of study centres), which in each case provide evaluable and sufficiently homogeneous effect estimates. This assessment of consistency is only possible for binary data if a certain minimum number of events has occurred.

The analyses for the subpopulations addressed above are available for all relevant outcomes, i.e. these analyses are not restricted to individual selected outcomes.

If only one study is available, which alone provides only an indication or a hint, additional indirect comparisons may change the evidence base. However, high methodological demands must be placed on indirect comparisons (see Section 8.3.9). In addition, in the case of a homogeneous data situation, adding indirect comparisons may increase the precision of the effect estimate, which plays an important role when determining the extent of added benefit (see Section 3.3.3).

A meta-analysis of studies with a low qualitative certainty of results or an individual study with a moderate qualitative certainty of results (both with a statistically significant effect) generally only provides a hint.
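The regular operationalization just described can be summarized, under stated assumptions, as a small mapping function. This is a simplification for illustration only: it assumes studies of equal certainty with homogeneous, consistent and statistically significant results, and the single-study/low-certainty cell is an assumption of this sketch rather than a rule stated above.

```python
def certainty_of_conclusion(qualitative_certainty, n_studies, significant_effect):
    """Simplified sketch of the regular operationalization: map the
    qualitative certainty of results ('high', 'moderate', 'low') and the
    evidence situation to 'proof', 'indication', 'hint' or None (no
    conclusion). Assumes homogeneous, consistent results."""
    if not significant_effect:
        return None
    several = n_studies >= 2  # meta-analysis or independent replication
    if qualitative_certainty == "high":
        return "proof" if several else "indication"
    if qualitative_certainty == "moderate":
        return "indication" if several else "hint"
    if qualitative_certainty == "low":
        return "hint" if several else None  # assumption: no conclusion
    return None
```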

An overview of the regular operationalization is shown in Table 2. In justified cases further factors influence these evaluations. The assessment of surrogate endpoints (see Section 3.1.2), the presence of serious deficiencies in study design or justified doubts about the transferability to the treatment situation in Germany may, for example, lead to a reduction in the certainty of conclusions. On the other hand, large effects or a clear direction of an existing risk of bias, for example, can justify an increase in certainty.

Table 2. Certainty of conclusions regularly inferred for different evidence situations if studies with the same qualitative certainty of results are available.


If several studies with different qualitative certainties of results are available, then first only the studies with the higher qualitative certainty of results are examined, and conclusions on the evidence base are inferred on this basis according to Table 2. In the inference of conclusions on the evidence base for the whole study pool, the following principles then apply:

The conclusions on the evidence base drawn from the higher-quality studies alone are not weakened by the addition of the other studies; at most, they are upgraded.

The confirmation (replication) of a statistically significant result of a study with a high qualitative certainty of results, which is required to infer proof, can be provided by one or more results of moderate (but not low) qualitative certainty of results. In this context the weight of the study with a high qualitative certainty of results should have an appropriate size (between 25 and 75%).

If the meta-analytical result for the higher-quality studies is not statistically significant or if no effects in the same direction are shown in these studies, then conclusions on the evidence base are to be inferred on the basis of results of the whole study pool, whereby the certainty of conclusions is determined by the minimum qualitative certainty of results of all studies included.

According to these definitions and principles, a corresponding conclusion on benefit is inferred for each outcome separately. Considerations on the assessment across outcomes are presented in the following section (see Section 3.1.5).

3.1.5. Summarizing assessment

These conclusions, drawn separately for each patient-relevant outcome in the inference of conclusions on the evidence base, are then summarized (as far as possible) in one evaluating conclusion in the form of a weighing of benefit and harm. If proof of a(n) (added) benefit and/or harm exists with regard to Outcomes 1 to 3 of Section 3.1.1, the Institute presents (insofar as is possible on the basis of the data available)

  1. the benefit
  2. the harm
  3. (if appropriate) the weighing of benefit and harm

In this context, characteristics related to age, gender, and personal circumstances are considered.

One option in the conjoint evaluation of benefit and harm is to compare the outcome-related beneficial and harmful aspects of an intervention.

In this context, the effects on all outcomes (qualitative or semi-quantitative as in the early benefit assessment according to §35a SGB V) are weighed against each other, with the aim of drawing a conclusion across outcomes with regard to the benefit or added benefit of an intervention.

A further option in the conjoint evaluation is to aggregate the various patient-relevant outcomes into a single measure or to reach an overall conclusion by weighting them. The conjoint evaluation of benefit and harm is specified depending on the topic of interest (see also Section 4.3.3).

3.2. Special aspects of the benefit assessment

3.2.1. Impact of unpublished study results on conclusions

An essential prerequisite for the validity of a benefit assessment is the complete availability of the results of the studies conducted on a topic. An assessment based on incomplete data or possibly even selectively compiled data may produce biased results [179,295] (see also Section 8.3.11).

The distortion of published evidence through publication bias and outcome reporting bias has been described comprehensively in the literature [160,390,522]. In order to minimize the consequences of such bias, the Institute has extended information retrieval beyond a search in bibliographic databases, for example, by screening trial registries. In addition, at the beginning of an assessment the Institute normally contacts the manufacturers of the drugs or medical devices to be assessed, and requests the transfer of complete information on studies investigating these interventions (see also Section 7.1.5).

This transfer of information by manufacturers can only solve the problem of bias caused by unpublished evidence if the transfer is itself not selective but complete. An incomplete transfer of information carries a risk of bias for the result of the benefit assessment. This risk should be considered by the Institute in the conclusions of a benefit assessment.

Table 3 below describes which constellations carry a risk of bias for assessment results, and which consequences arise for the conclusions of a benefit assessment.

Table 3. Scenarios for data transfer by third parties and consequences for the conclusions of a benefit assessment.


If the data transfer was complete and no evidence is available that a relevant amount of data is missing, bias seems improbable (Scenario 1). The inferences drawn from the assessment of data can therefore be adopted without limitation in the conclusions of the benefit assessment.

If the data transfer is incomplete, the consequences for the conclusions depend on whether additional search steps demonstrate that a relevant amount of data is missing. If this is not the case (Scenario 2), bias may still be possible, as data transfer may have been selective and further unpublished data may exist that were not identified by the search steps. In such cases the conclusions are therefore drawn with reservations. If it was demonstrated that a relevant amount of data is missing (Scenario 3), it can be assumed that the data transfer was selective. In this situation, further analysis of the available limited data and any conclusions inferred from them with regard to benefit or harm are probably seriously biased and therefore do not form a valid decision-making basis for the G-BA. Consequently, no proof (nor indication nor hint) of a benefit or harm of the intervention to be assessed can be determined in this situation, independently of whether the available data show an effect of the intervention or not.

If the manufacturer completely transfers data and additional literature searches demonstrate that a relevant amount of data from studies inaccessible to the manufacturer is missing (Scenario 4), then no selective data transfer by the manufacturer is evident. In this situation, bias caused by missing data is still possible. The conclusions are therefore drawn with reservation.

3.2.2. Dramatic effect

If the course of a disease is certainly or almost certainly predictable, and no treatment options are available to influence this course, then proof of a benefit of a medical intervention can also be provided by the observation of a reversal of the (more or less) deterministic course of the disease in well-documented case series of patients. If, for example, it is known that it is highly probable that a disease leads to death within a short time after diagnosis, and it is described in a case series that, after application of a specific intervention, most of those affected survive for a longer period of time, then this “dramatic effect” may be sufficient to provide proof of a benefit. An example of such an effect is the substitution of vital hormones in diseases with a failure of hormone production (e.g. insulin therapy in patients with diabetes mellitus type 1). An essential prerequisite for classification as a “dramatic effect” is sufficiently reliable documentation of the fateful course of the disease in the literature and of its diagnosis in the patients included in the study to be assessed. In this context, possible harms of the intervention should also be taken into account. Glasziou et al. [214] have attempted to operationalize the classification of an intervention as a “dramatic effect”. In a first approach they propose to regard an observed effect as not explicable solely by the impact of confounding factors if it was significant at a level of 1% and, expressed as the relative risk, exceeded the value of 10 [214]. This magnitude serves as orientation for the Institute and does not represent a rigid threshold. Glasziou et al. [214] made their recommendation on the basis of results of simulation studies, according to which an observed relative risk of 5 to 10 can no longer be plausibly explained only by confounding factors. 
This illustrates that a corresponding threshold also depends on the attendant circumstances (among other things, the quality of studies used to determine the existence of a dramatic effect). This dependence is also reflected in the recommendations of other working groups (e.g. the GRADE group) [342].
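As a rough numerical illustration of this heuristic (a simplified sketch with hypothetical function names, not the Institute's operationalization), the following checks whether an observed effect exceeds a relative risk of 10 and is significant at a two-sided 1% level, using a normal approximation on the log relative risk:

```python
import math

def rr_and_pvalue(events_t, n_t, events_c, n_c):
    """Relative risk and two-sided p-value via normal approximation on log(RR)."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
    z = abs(math.log(rr)) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided p-value
    return rr, p

def looks_dramatic(events_t, n_t, events_c, n_c, rr_threshold=10.0, alpha=0.01):
    """Heuristic in the spirit of Glasziou et al.: RR > 10, significant at 1%."""
    rr, p = rr_and_pvalue(events_t, n_t, events_c, n_c)
    return rr > rr_threshold and p < alpha

# e.g. 16/20 survivors under the intervention vs. 1/20 in the untreated course
print(looks_dramatic(16, 20, 1, 20))
```

As the text stresses, such a magnitude serves only as orientation; the surrounding circumstances (in particular study quality) remain decisive.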

If, in the run-up to the work on a specific research question, sufficient information is available indicating that a dramatic effect caused by the intervention to be assessed can be expected (e.g. because of a preliminary literature search), then information retrieval will also include a search for studies that show a higher uncertainty of results due to their design.

3.2.3. Study duration

Study duration is an essential criterion in the selection of studies relevant to the benefit assessment. In the assessment of a therapeutic intervention for acute diseases where the primary objective is, for example, to shorten disease duration and alleviate acute symptoms, it is not usually meaningful to call for long-term studies, unless late complications are to be expected. On the other hand, in the assessment of therapeutic interventions for chronic diseases, short-term studies are not usually suitable to achieve a complete benefit assessment of the intervention. This especially applies if treatment is required for several years, or even lifelong. In such cases, studies covering a treatment period of several years are particularly meaningful and desirable. As both benefits and harms can be distributed differently over time, in long-term interventions the meaningful comparison of the benefits and harms of an intervention is only feasible with sufficient certainty if studies of sufficient duration are available. However, individual aspects of the benefits and harms may well be investigated in short-term studies.

With regard to the selection criterion “minimum study duration”, the Institute primarily follows standards for demonstrating the effectiveness of an intervention. In the assessment of drugs, the Institute will in particular resort to information provided in guidelines specific to therapeutic indications, which are published by regulatory authorities (e.g. [176]). As the benefit assessment of an intervention also includes aspects of harm, the generally accepted standards in this respect are also relevant when determining the minimum study duration. Moreover, for long-term interventions as described above, the Institute will resort to the relevant guidelines for the criterion “long-term treatment” [282]. In individual cases, the Institute may deviate from this approach (and will justify this deviation), for example, if a topic requires longer follow-up, or if specific (sub)questions apply to a shorter period. Such deviations may also be indicated if short-term effects are a subject of the assessment (e.g. in the assessment of newly available/approved interventions and/or technologies where no appropriate treatment alternative exists).

3.2.4. Patient-reported outcomes

The patient-relevant dimensions of benefit outlined in Section 3.1.1 can also include patient-reported outcomes (PROs). In addition to health-related quality of life, PROs can also cover other dimensions of benefit, for example, disease symptoms. As in the assessment of quality of life, instruments are required that are suitable for use in clinical trials [174]. In the selection of evidence (especially study types) to be considered for the demonstration of an effect, the same principles as with other outcomes usually apply [198]. This means that also for PROs (including health-related quality of life, symptoms, and treatment satisfaction), RCTs are best suited to demonstrate an effect.

As PRO data are by their nature subjective, open studies in this area are of limited validity. The size of the observed effect is an important criterion in deciding whether an indication of a benefit of an intervention with regard to PROs can be inferred from open studies. Empirical evidence shows a high risk of bias for subjective outcomes in open studies [600]. This should be considered in their interpretation (see also Sections 8.1.4 and 8.3.4). However, situations are conceivable where blinding of physicians and patients is not possible. In such situations, other efforts are required, if possible, to minimize and assess bias (e.g. blinded documentation and assessment of outcomes). Further aspects of the quality assessment of studies investigating PROs are outlined in [198].

3.2.5. Benefits and harms in small populations

In small populations (e.g. patients with rare diseases or special subgroups of patients with common diseases), there is no convincing argument to deviate in principle from the hierarchy of evidence levels. In this connection, it is problematic that no internationally standardized definition exists of what is to be understood by a “rare” disease [598]. Independently of this, patients with rare diseases also have the right to the most reliable information possible on treatment options [171]. Non-randomized studies require larger sample sizes than randomized ones because of the need for adjustment for confounding factors. However, due to the rarity of a disease it may sometimes be impossible to include enough patients to provide a study with sufficient statistical power. A meta-analytical summary of smaller studies may be particularly meaningful in such cases. Smaller samples generally result in lower precision of an effect estimate, accompanied by wider confidence intervals. Depending on the relevance and size of the assumed effect of an intervention, the availability of treatment alternatives, and the frequency and severity of potential therapy-related harms, for small sample sizes it may be meaningful to accept a higher significance level than 5% (e.g. 10%) to demonstrate statistical significance, thus accepting greater quantitative uncertainty. Similar recommendations have been made for other problematic constellations [173]. Such an approach must, however, be specified a priori and well justified. Likewise, for small sample sizes it may more often be necessary to substitute a patient-relevant outcome that occurs too rarely with a surrogate endpoint. However, such surrogates must also be valid for small sample sizes [175].
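As a minimal numerical sketch of what relaxing the significance level means in this setting (illustrative data and a hypothetical helper, using the standard normal approximation on the log relative risk; not a procedure prescribed by the Institute), the following shows a small trial whose 95% confidence interval for the relative risk includes 1, while the 90% interval corresponding to a 10% level excludes it:

```python
import math

def rr_ci(events_t, n_t, events_c, n_c, z):
    """CI for the relative risk via the normal approximation on log(RR)."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
    return math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

# Illustrative small trial: 6/15 events under test vs. 11/15 under control.
lo95, hi95 = rr_ci(6, 15, 11, 15, z=1.959964)  # two-sided 95% CI (alpha = 5%)
lo90, hi90 = rr_ci(6, 15, 11, 15, z=1.644854)  # two-sided 90% CI (alpha = 10%)

# The 90% interval is narrower; here it excludes the zero effect (RR = 1)
# although the 95% interval does not.
print(round(hi95, 3), round(hi90, 3))
```

The widening of the accepted significance level thus trades a smaller sample requirement against greater quantitative uncertainty, which is why the text demands a priori specification and justification.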

In the case of extremely rare diseases or very specific disease constellations, the demand for (parallel) comparative studies may be inappropriate [598]. Nevertheless, in such cases it is also possible at least to document and assess the course of disease in such patients appropriately, including the expected course without applying the intervention to be assessed (e.g. using historical patient data) [82]. The fact that a situation is being assessed involving an extremely rare disease or a very specific disease constellation is specified and explicitly highlighted in the report plan.

3.3. Benefit assessment of drugs

One main objective of the benefit assessment reports on drugs is to support the G-BA's decisions on directives concerning the reimbursement of drugs by the SHI. For this purpose, it is necessary to describe whether a drug's benefit has been demonstrated (or whether, when compared with a drug or non-drug alternative, a higher benefit [added benefit] has been demonstrated).

The G-BA's decisions on directives do not usually concern particular cases, but rather the general case. Consequently, the Institute's reports do not usually refer to decisions on particular cases.

Because of the objective of the Institute's benefit assessments, these assessments only include studies with an evidence level principally suited to demonstrate a benefit of an intervention. Thus, studies that can only generate hypotheses are generally not relevant for the benefit assessment. The question as to whether a study can demonstrate a benefit mainly depends on the certainty of results of the data analysed.

3.3.1. Relevance of the drug approval status

The commissioning of the Institute by the G-BA to assess the benefit of drugs usually takes place within the framework of the approval status of the drug to be investigated (therapeutic indication, dosage, contra-indications, concomitant treatment, etc.). For this reason, the Institute's recommendations to the G-BA, which are formulated in the conclusions of the benefit assessment report, usually refer to the use of the assessed drug within the framework of the current approval status.

It is clarified on a project-by-project basis how to deal with studies (and the evidence inferred from them) that were not conducted in accordance with the use of a drug as outlined in the approval documents. In principle, it is conceivable that studies in which a drug was used outside the scope of the approval status described in the Summary of Product Characteristics (“off-label use”) over- or underestimate a drug's benefit and/or harm. This may lead to a misjudgement of the benefit and/or harm in patients treated within the framework of the drug's approval status. However, if it is sufficiently plausible or has even been demonstrated that the results obtained in these studies are applicable to patients treated according to the drug's approval status, these results can be considered in the benefit assessment.

Therefore, for studies excluded from the assessment only because they were off-label studies (or because it was unclear whether they fulfilled the requirements of the approval status), each case is assessed to establish to what extent the study results are applicable to patients treated according to the approval requirements.

Results from off-label studies are regarded as “applicable” if it is sufficiently plausible or has been demonstrated that the effect estimates for patient-relevant outcomes are not greatly affected by the relevant characteristic of the drug approval status (e.g. pretreatment required). As a rule, the equivalence of effects should be proven with appropriate scientific studies. These studies should be targeted towards demonstrating the equivalence of the effect between the groups with and without the characteristic. Results applicable to patients treated according to a drug's approval status can be considered in the conclusion of the assessment.

Results from studies are regarded as “not applicable” if their applicability has not been demonstrated and if plausible reasons against the transferability of results exist. As a rule, study results are regarded as “not applicable” if, for example, the age range or disease severity treated lay outside the approved range or severity, if off-label combinations including other active ingredients were used, or if studies were conducted in patients with contraindications for the intervention investigated. The results of these studies are not presented in the reports, as they cannot be considered in the assessment.

If results from off-label studies are regarded as applicable, this is specified in the report plan. As a rule the results of studies showing the following characteristics are discussed, independently of the applicability of study results to the use specified in the approval of the drug:

They refer to patients with the disease specified in the commission.

They refer to patients treated with the drug to be assessed.

They are of particular relevance due to factors such as sample size, study duration, or outcomes investigated.

3.3.2. Studies on the benefit assessment of drugs

The results of the Institute's benefit assessment of drugs may have an impact on patient health care in Germany. For this reason, high standards are required regarding the certainty of results of studies included in the benefit assessment.

The certainty of results is defined as the certainty with which an effect (or the lack of an effect) can be inferred from a study. This refers to both “positive” aspects (benefit) as well as “negative” aspects (harm). The certainty of results of an individual study is essentially influenced by 3 components:

the study design

the internal validity (which is design-specific and determined by the specific way the study was conducted)

the size of an expected or observed effect

In the benefit assessment of drugs, not only individual studies are assessed, but the results of these studies are incorporated into a systematic review. The certainty of results of a systematic review is in turn based on the certainty of results of the studies included. In addition, it is determined in particular by the following factor:

the consistency of the results of several studies

The study design has considerable influence on the certainty of results insofar as a causal association between intervention and effect cannot usually be shown with prospective or retrospective observational studies, whereas controlled intervention studies are in principle suited for this purpose [226]. This particularly applies if other factors influencing results are completely or almost completely eliminated. For this reason, an RCT represents the gold standard in the assessment of drug and non-drug interventions [422].

In the assessment of drugs, RCTs are usually possible and practically feasible. As a rule, the Institute therefore considers RCTs in the benefit assessment of drugs and only uses non-randomized intervention studies or observational studies in justified exceptional cases. Reasons for exception are, on the one hand, the non-feasibility of an RCT (e.g. if the therapist and/or patient have a strong preference for a specific therapy alternative) or, on the other, the fact that other study types may also provide sufficient certainty of results for the research question posed. For diseases that would be fatal within a short period of time without intervention, several consistent case reports may provide sufficient certainty of results that a particular intervention prevents this otherwise inevitable course [358] (dramatic effect, see also Section 3.2.2). The special obligation to justify a non-randomized design when testing drugs can also be found within the framework of drug approval legislation in the directives on the testing of medicinal products (Directive 2001/83/EC, Section 6.2.5 [332]).

In the preparation of the report plan (see also Section 2.1.1), the Institute therefore determines beforehand which study types can be regarded as feasible for the research question posed and as providing sufficient certainty of results (with high internal validity). Studies not complying with these minimum quality standards (see also Section 8.1.4) are not given primary consideration in the assessment process.

Sections 3.1.4 and 8.1 present information on the assessment of the internal validity of studies, as well as on further factors influencing certainty of results, such as the consistency of the results of several studies and the relevance of the size of the effect to be expected.

In addition to characterizing the certainty of results of the studies considered, it is necessary to describe whether – and if yes, to what extent – the study results are transferable to local settings (e.g. population, health care sector), or what local study characteristics had (or could have had) an effect on the results or their interpretation. From this perspective, studies are especially relevant in which the actual German health care setting is represented as far as possible. However, the criteria for certainty of results outlined above must not be ignored. Finally, the transferability of study results (generalizability or external validity) must be assessed in a separate process initially independent of the study design and quality.

3.3.3. Benefit assessment of drugs according to §35a SGB V

A benefit assessment of a drug according to §35a SGB V is based on a dossier of the pharmaceutical company in which the company provides the following information:

  1. approved therapeutic indications
  2. medical benefit
  3. added medical benefit compared with an appropriate comparator therapy
  4. number of patients and patient groups for whom a therapeutically relevant added benefit exists
  5. cost of treatment for the SHI
  6. requirements for quality-assured usage of the drug

The requirements for form and content of the dossier are outlined in dossier templates, which are a component of the G-BA's Code of Procedure [211]. In the dossier, specifying the validity of the evidence, the pharmaceutical company must describe the likelihood and the extent of added benefit of the drug to be assessed compared with an appropriate comparator therapy. The information provided must be related both to the number of patients and to the extent of added benefit. The costs for the drug to be assessed and the appropriate comparator therapy must be declared (based on the pharmacy sales price and taking the Summary of Product Characteristics and package information leaflet into account).

The probability of the added benefit describes the certainty of conclusions on the added benefit. In the dossier, the extent of added benefit should be described according to the categories of the Regulation for Early Benefit Assessment of New Pharmaceuticals (ANV) (major, considerable, minor, non-quantifiable added benefit; no added benefit proven; benefit of the drug to be assessed smaller than benefit of the appropriate comparator therapy) [80].

In the benefit assessment the validity and completeness of the information in the dossier are examined. It is also examined whether the comparator therapy selected by the pharmaceutical company can be regarded as appropriate in terms of §35a SGB V and the ANV. In addition, the Institute assesses the effects described in the documents presented, taking the certainty of results into account. In this assessment, the qualitative and quantitative certainty of results within the evidence presented, as well as the size of observed effects and their consistency, are appraised. The benefit and cost assessments are conducted on the basis of the standards of evidence-based medicine described in this methods paper and those of health economic standards, respectively. As a result of the assessment, the Institute presents its own conclusions, which may confirm or deviate from those arrived at by the pharmaceutical company (providing a justification in the event of deviation).

The operationalization for determining the extent of added benefit comprises 3 steps:

  1. In the first step the probability of the existence of an effect is examined for each outcome separately (qualitative conclusion). For this purpose, the criteria for inferring conclusions on the evidence base are applied (see Section 3.1.4). Depending on the quality of the evidence, the probability is classified as a hint, an indication or proof.
  2. In the second step, for those outcomes where at least a hint of the existence of an effect was determined in the first step, the extent of the effect size is determined for each outcome separately (quantitative conclusion). The following quantitative conclusions are possible: major, considerable, minor, and non-quantifiable.
  3. In the third and last step, the overall conclusion on the added benefit according to the 6 specified categories is determined on the basis of all outcomes, taking into account the probability and extent at outcome level within the overall picture. These 6 categories are as follows: major, considerable, minor, and non-quantifiable added benefit; no added benefit proven; the benefit of the drug under assessment is less than the benefit of the appropriate comparator therapy.

The quality of the outcome, as well as the effect size, are essential in determining the extent at outcome level in the second step. The rationale for this operationalization is presented in Appendix A “Rationale of the methodological approach for determining the extent of added benefit”. The basic approach aims to derive thresholds for confidence intervals for relative effect measures depending on the effects to be achieved, which in turn depend on the quality of the outcomes and the extent categories.

It will not always be possible to quantify the extent at outcome level. For instance, if a statistically significant effect on a sufficiently valid surrogate is present, but no reliable estimate of this effect on a patient-relevant outcome is possible, then the (patient-relevant) effect cannot be quantified. In such and similar situations, an effect of a non-quantifiable extent is concluded, with a corresponding explanation.

In the case of a quantifiable effect, the further approach depends on the scale of the outcome. A distinction is made between the following scales:

binary (analyses of 2×2 tables)

time to event (survival time analyses)

continuous or quasi-continuous, in each case with available responder analyses (analyses of mean values and standard deviations)

other (e.g. analyses of nominal data)

In the following text, the approach for binary outcomes is described first. The approaches for the other scales are then derived from it.

For the effect measure “relative risk”, numerator and denominator are always chosen in such a way that the effect (if present) is realized as a value < 1, i.e. the lower the value, the stronger the effect.

A. Binary outcomes

To determine the extent of the effect in the case of binary outcomes, the two-sided 95% confidence interval for the relative risk is used; if appropriate, this is calculated by the Institute itself. If several studies are pooled quantitatively, the meta-analytical result for the relative risk is used.

Depending on the quality of the outcome, the confidence interval must lie completely below a certain threshold for the extent to be regarded as minor, considerable or major. It is thus decisive that the upper limit of the confidence interval is smaller than the respective threshold.

The following 3 categories for the quality of the outcome are formed:

all-cause mortality

serious (or severe) symptoms (or late complications) and adverse events, as well as health-related quality of life

non-serious (or non-severe) symptoms (or late complications) and adverse events

The thresholds are specified separately for each category. The more serious the event, the larger the thresholds (i.e. the closer they lie to 1). The greater the extent category, the smaller the thresholds (i.e. the further they lie away from 1). For the 3 extent categories (minor, considerable, major), Table 4 below shows the thresholds to be undercut for each of the 3 categories of quality of the outcomes.

Table 4. Thresholds for determining the extent of an effect.

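The classification logic of the second step can be sketched as follows. The threshold values and the function name below are hypothetical placeholders, since the actual values are specified in Table 4 and are not reproduced in this text; the sketch only illustrates the rule that the upper limit of the two-sided 95% confidence interval for the relative risk must undercut the relevant threshold:

```python
# Hypothetical placeholder thresholds per outcome-quality category; the
# actual values are given in Table 4 of the methods paper.
THRESHOLDS = {
    "mortality":   {"minor": 0.95, "considerable": 0.90, "major": 0.85},
    "serious":     {"minor": 0.90, "considerable": 0.80, "major": 0.70},
    "non_serious": {"minor": 0.80, "considerable": 0.70, "major": 0.60},
}

def extent_of_effect(ci_upper, category):
    """Classify the extent at outcome level: the upper limit of the two-sided
    95% CI for the RR (oriented so that effects are < 1) must undercut the
    threshold of the respective extent category."""
    t = THRESHOLDS[category]
    if ci_upper < t["major"]:
        return "major"
    if ci_upper < t["considerable"]:
        return "considerable"
    if ci_upper < t["minor"]:
        return "minor"
    return "no extent determinable"

print(extent_of_effect(0.88, "mortality"))
```

Because only the upper confidence limit enters the rule, a large point estimate with a wide confidence interval can still fail to reach any extent category.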

The relative risk can generally be calculated in 2 ways, depending on whether the risk refers to events or counter-events (e.g. survival vs. death, response vs. non-response). This is irrelevant for the statement on significance specified in Step 1 of the approach (conventional, non-shifted hypotheses), as the p-value of a single study is invariant under this choice and plays only a subordinate role in meta-analysis. However, this does not apply to the distance of the confidence interval limits from the zero effect. For each binary outcome it must therefore be decided, on the basis of content-related criteria (taking into account the type of outcome and the underlying disease), whether the risk of the event or of the counter-event is to be assessed when determining the extent of effect.
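A small worked example of this asymmetry (illustrative numbers, hypothetical helper name): the relative risk of the counter-event is not simply the reciprocal of the relative risk of the event, so the two choices lie at different distances from the zero effect and may undercut the thresholds differently:

```python
def relative_risk(a, n1, b, n2):
    """Relative risk of an outcome occurring in group 1 vs. group 2."""
    return (a / n1) / (b / n2)

# Same trial either way: 30/100 events in the test group vs. 50/100 in control.
rr_event = relative_risk(30, 100, 50, 100)                # risk of the event
rr_counter = relative_risk(100 - 30, 100, 100 - 50, 100)  # risk of the counter-event

# The significance statement is unaffected by the choice, but the effect
# estimates are not reciprocal: 1/rr_counter differs from rr_event.
print(round(rr_event, 3), round(1 / rr_counter, 3))
```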

B. Time to event

The two-sided 95% confidence interval for the hazard ratio is required to determine the extent of the effect in the case of outcomes representing a “time to event”. If several studies are pooled quantitatively, the meta-analytical result for the hazard ratio is used. If the confidence interval for the hazard ratio is not available, it is approximated on the basis of the available information, if possible [553]. The same limits as for the relative risk are set for determining the extent (see Table 4).
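One common way to reconstruct such a confidence interval, when only the hazard ratio and its two-sided p-value are reported, is to recover the standard error of the log hazard ratio from the p-value. The following sketch uses this approximation; whether it matches the specific method in [553] is an assumption, and the function name is hypothetical:

```python
import math
from statistics import NormalDist

def hr_ci_from_p(hr, p_two_sided, level=0.95):
    """Approximate a CI for a hazard ratio from its point estimate and
    two-sided p-value: SE(log HR) = |log HR| / z_(1 - p/2)."""
    z_p = NormalDist().inv_cdf(1 - p_two_sided / 2)
    se = abs(math.log(hr)) / z_p
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(math.log(hr) - z * se), math.exp(math.log(hr) + z * se)

# Illustrative report: HR = 0.70, two-sided p = 0.04.
lo, hi = hr_ci_from_p(0.70, 0.04)
print(round(lo, 3), round(hi, 3))
```

Consistently with the reported p-value below 5%, the reconstructed 95% interval excludes the zero effect (HR = 1).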

If a hazard ratio is neither available nor calculable, or if the available hazard ratio cannot be interpreted meaningfully (e.g. due to relevant violation of the proportional hazard assumption), it should be examined whether a relative risk (referring to a meaningful time point) can be calculated. It should also be examined whether this operationalization is adequate in the case of transient outcomes for which the outcome “time to event” was chosen. If appropriate, the calculation of a relative risk at a time point is also indicated here.

C. Continuous or quasi-continuous outcomes, in each case with available responder analyses

Responder analyses are used to determine the extent of added benefit in the case of continuous or quasi-continuous outcomes. For this purpose, a validated or established response criterion or cut-off value is required. The relative risks are then calculated directly from the responder analyses (2×2 tables), and the extent of the effect is determined according to Table 4.
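The step from continuous scores to a relative risk via a responder analysis can be sketched as follows (hypothetical helper and illustrative data; the cut-off is assumed to be a validated or established response criterion):

```python
def responder_rr(scores_test, scores_control, cutoff):
    """Dichotomize (quasi-)continuous scores with a response criterion and
    compute the relative risk of NON-response from the resulting 2x2 table,
    oriented so that an effect (if present) is realized as a value < 1."""
    resp_t = sum(s >= cutoff for s in scores_test)
    resp_c = sum(s >= cutoff for s in scores_control)
    n_t, n_c = len(scores_test), len(scores_control)
    return ((n_t - resp_t) / n_t) / ((n_c - resp_c) / n_c)

# Illustrative symptom scores (higher = better), response criterion: score >= 10.
rr = responder_rr([12, 15, 9, 14, 11, 16], [8, 9, 11, 7, 10, 6], cutoff=10)
print(round(rr, 3))
```

The resulting relative risk is then compared against the thresholds of Table 4 in exactly the same way as for genuinely binary outcomes.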

D. Other outcomes

In the case of other outcomes where no responder analyses with inferable relative risks are available either, it should be examined in the individual case whether relative risks can be approximated [116] to set the corresponding thresholds for determining the extent. Otherwise the extent is to be classified as “non-quantifiable”.

For the third step of the operationalization of the overall conclusion on the extent of added benefit, when all outcomes are examined together, a strict formalization is not possible, as no sufficient abstraction is currently known for the value judgements to be made in this regard. In its benefit assessment the Institute will compare the conclusions on probability and on the extent of the effects and provide a justified proposal for an overall conclusion.

3.4. Non-drug therapeutic interventions

Even if the regulatory preconditions for the market access of drugs and non-drug therapeutic interventions differ, there is nevertheless no reason to apply a fundamentally different standard concerning the certainty of results in the assessment of the benefits and harms of an intervention. For example, the G-BA's Code of Procedure [211] envisages, as far as possible, the preferential consideration of RCTs, independently of the type (drug/non-drug) of the medical intervention to be assessed. For medical devices, this is weakened by the conformity evaluation in the current standard DIN EN ISO 14155 (Section A.6.1 [138]), in which RCTs are not presented as the design of choice; however, the choice of design must be justified.

Compared with studies on drug interventions, studies on non-drug interventions are often associated with specific challenges and difficulties [389]. For example, the blinding of the staff performing the intervention will often be impossible, and the blinding of patients will either be difficult or also impossible. In addition, it can be assumed that therapists' and patients' preferences for certain treatment options will make the feasibility of studies in these areas particularly problematic. In addition, it may be necessary especially in the assessment of complex interventions to consider the possibility of contamination effects. It may also be necessary to consider the distinction between the effects caused by the procedure or (medical) device to be assessed on the one hand, and those caused by the expertise and skills of those applying the intervention on the other. Moreover, depending on the time of assessment, learning effects need to be taken into account.

To give consideration to the aspects outlined above, studies of particularly good quality are required to achieve sufficient certainty of results. Paradoxically, the opposite has tended to be the case in the past; i.e. sound randomized studies are often lacking, particularly in the area of non-drug interventions (e.g. in surgery [389]). To enable any conclusions at all to be drawn on the relevance of a specific non-drug therapeutic intervention, it may therefore also be necessary to consider non-randomized studies in the assessment. Nonetheless, quality standards also apply to these studies, in particular regarding measures taken to ensure structural equality. However, due to their inherently lower certainty of results, such studies will usually at best be able to provide hints of a(n) (added) benefit or harm of an intervention. The inclusion of studies with lower evidence levels is consistent with the corresponding regulation in the G-BA's Code of Procedure [211], which, however, emphasizes a specific obligation to provide a justification. In this regulation it is noted: “However, in order to protect patients, recognition of a method's medical benefit on the basis of documents with lower evidence levels requires all the more justification the greater the deviation from evidence level 1 (in each case, the medical necessity of the method must also be considered). For this purpose, the method's potential benefit for patients is in particular to be weighed against the risks associated with the demonstration of effectiveness based on studies of lower evidential value” [211]. This means that the non-availability of studies of the highest evidence level alone cannot generally be viewed as sufficient justification for a benefit assessment based on studies with lower evidence levels.

In the assessment of non-drug therapeutic interventions, it may also be necessary to consider the marketability or CE marking (according to the German Medical Devices Act) and the approval status of drugs (according to the German Pharmaceutical Act), insofar as the test interventions or comparator interventions comprise the use of medical devices or drugs (see Section 3.3.1). The corresponding consequences must subsequently be specified in the report plan (see Section 2.1.1).

3.5. Diagnostic tests

Diagnostic tests are characterized by the fact that their health-related benefit (or harm) is in essence only realized if the tests are followed by therapeutic or preventive procedures. The mere acquisition of diagnostic information (without medical consequences) as a rule has no benefit from the perspective of social law.

This applies in the same way both to diagnostic information referring to the current state of health and to prognostic information (or markers) referring to a future state of health. In the following text, procedures to determine diagnostic or prognostic information are therefore jointly regarded as diagnostic tests.

In general, the evaluation process for diagnostic tests can be categorized into different hierarchy phases or levels, analogously to the evaluation of drugs [204,329]. Phase 4 prospective, controlled diagnostic studies according to Köbberling et al. [329], or Level 5 studies according to Fryback and Thornbury [204] have an (ideally random) allocation of patients to a strategy with or without application of the diagnostic test to be assessed or to a group with or without disclosure of the (diagnostic) test results. These studies can be seen as corresponding to Phase 3 (drug) approval trials (“efficacy trials”). Accordingly, they are allocated to the highest evidence level (see, for example, the G-BA's Code of Procedure [211]). The US Food and Drug Administration also recommends such studies for specific indications in the approval of drugs and biological products developed in connection with diagnostic imaging techniques [197]. Examples show that they can be conducted with comparatively moderate effort [16,568].

The Institute follows this logic and primarily conducts benefit assessments of diagnostic tests on the basis of studies designed as described above that investigate patient-relevant outcomes. The main features of the assessment comply with the explanations presented in Sections 3.1 and 3.4. In this context, patient-relevant outcomes refer to the same benefit categories as in the assessment of therapeutic interventions, namely mortality, morbidity, and health-related quality of life. The impact of diagnostic tests on these outcomes can be achieved by the avoidance of high(er) risk interventions or by the (more) targeted use of interventions. If the collection of diagnostic or prognostic information itself is associated with a high(er) risk, a lower-risk diagnostic test may have patient-relevant advantages, namely, if (in the case of comparable test quality) the conduct of the test itself causes lower mortality and morbidity rates, or fewer restrictions in quality of life.

Conclusions on the benefit of diagnostic tests are ideally based on randomized studies, which can be conducted in various ways [50,51,188,360,378,484]. In a study with a strategy design including 2 (or more) patient groups, in each case different strategies are applied, which in each case consist of a diagnostic measure and a therapeutic consequence. A high informative value is also ascribed to randomized studies in which all patients initially undergo the conventional and the diagnostic test under investigation; subsequently, only those patients are randomized in whom the latter test produced a different result, and thereby a different therapeutic consequence, than the former test (discordance design). Studies in which the interaction between the diagnostic or prognostic information and the therapeutic benefit is investigated also have a high evidence level and should as a matter of priority be used for the benefit assessment of diagnostic tests (interaction design [484,541]). Many diagnostic or prognostic characteristics – especially genetic markers – can also be determined retrospectively in prospective comparative studies and examined with regard to a potential interaction (so-called “prospective-retrospective” design [516]). The validity of such “prospective-retrospective” designs depends especially on whether the analyses were planned prospectively (in particular also the specification of threshold values). Moreover, in all studies with an interaction design it is important that the treatments used correspond to the current standard, that the information (e.g. tissue samples) on the characteristic of interest is completely available for all study participants or at least for a representative sample, and that if several characteristics are analysed the problem of multiple testing for significance is adequately accounted for (see also Section 8.3.2 [485]).

Overall, what is decisive is less the extent to which diagnostic or prognostic information can determine a current or future state of health than whether this information is of predictive relevance, namely, whether it can predict the greater (or lesser) benefit of the subsequent treatment [188,517]. For this (necessarily linked) assessment of the diagnostic and therapeutic intervention, it is important to note that a benefit can normally only arise if both interventions fulfil their goal: if either the predictive discriminative capacity of the diagnostic intervention is insufficient or the therapeutic intervention is ineffective, a study will not be able to show a benefit of the diagnostic intervention.

Besides a strategy and interaction design, a third main form of RCTs on diagnostic questions is available with the enrichment design [379,541]. In this design, solely on the basis of the diagnostic test under investigation, only part of the patient population is randomized (and thus included); for example, only test-positive patients, who then receive 1 of 2 treatment options. In comparison with an interaction design, such a design lacks the investigation of a potential treatment effect in the remaining patients (e.g. in the test-negative ones). Robust conclusions can thus only be drawn from such designs if, on the basis of other information, it can be excluded that an effect observed in the randomized patient group could also have existed in the non-randomized group.

The comments above primarily refer to diagnostic tests that direct more patients towards a certain therapeutic consequence by increasing the test quality (i.e. sensitivity, specificity, or both). In these cases, as a rule it is necessary to examine the impact of the diagnostic test on patient-relevant outcomes by covering the whole diagnostic and therapeutic chain. However, it is possible that the diagnostic test under investigation is only intended to replace a different, already established diagnostic test, without identifying or excluding additional patients. If the new test shows direct patient-relevant advantages, for example, because it is less invasive or requires no radiation, it will not always be necessary to re-examine the whole diagnostic-therapeutic chain, as the therapeutic consequences arising from the new test do not differ from those of the previous test [42,51,394]. To demonstrate benefit, in these cases test quality studies could be sufficient in which it is shown that the test results of the previous test (= reference standard) and of the test under investigation (= index test) are identical in a sufficiently high proportion of patients (one-sided equivalence question).
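The one-sided equivalence question can be illustrated numerically. The following sketch uses invented counts and a simple normal-approximation lower confidence bound for the proportion of concordant test results; the equivalence margin, sample size, and the exact statistical procedure would in practice be pre-specified in the study protocol:

```python
from statistics import NormalDist


def lower_bound_agreement(concordant: int, n: int, alpha: float = 0.025) -> float:
    """One-sided lower confidence bound (normal approximation) for the
    proportion of patients in whom index test and reference standard agree."""
    p = concordant / n
    z = NormalDist().inv_cdf(1 - alpha)
    return p - z * (p * (1 - p) / n) ** 0.5


# Hypothetical data: the index test agrees with the reference standard
# in 480 of 500 patients; pre-specified equivalence margin: 0.90.
margin = 0.90
lb = lower_bound_agreement(480, 500)
print(f"observed agreement: {480 / 500:.3f}, lower bound: {lb:.3f}")
print("equivalence shown" if lb > margin else "equivalence not shown")
```

Equivalence is concluded only if the entire confidence bound, not merely the observed proportion, lies above the margin; this reflects the one-sided nature of the question.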

On the other hand, for a comparison of 2 or more diagnostic tests with regard to certain test quality characteristics, the studies with the highest certainty of results, and thus those primarily considered in the Institute's reports, are studies with a random allocation of the sequence in which the tests (conducted independently of each other and preferably interpreted blinded) are performed in the same patients, or with a random allocation of the tests to different patients.

If a study is to provide informative data on the benefit, diagnostic quality or prognostic value of a diagnostic test, it is essential to compare it with the previous diagnostic approach [542]. Only in this way can the added value of the diagnostic or prognostic information be reliably determined. For studies on test accuracy this means that, besides sensitivity and specificity of the new and previous method, it is of particular interest to what extent the diagnostic measures produce different results per patient. In contrast, in studies on prognostic markers multifactorial regression models often play a key role, so that Section 8.3.7 should be taken into account. When selecting non-randomized designs for diagnostic methods, the ranking of different study designs presented in Section 8.1.3 should as a rule be used.
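The point about per-patient differences can be made concrete: two tests may have identical marginal sensitivity and specificity and nevertheless disagree in many individual patients, which is exactly what a comparison of summary accuracy measures alone would miss. A minimal sketch with invented paired results (all data hypothetical):

```python
def accuracy(results, truth):
    """Sensitivity and specificity of one test against a reference truth."""
    tp = sum(r and t for r, t in zip(results, truth))
    tn = sum(not r and not t for r, t in zip(results, truth))
    sens = tp / sum(truth)
    spec = tn / (len(truth) - sum(truth))
    return sens, spec


def discordance(a, b):
    """Proportion of patients in whom the two tests disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical paired results for 8 patients (True = test positive).
truth = [True, True, True, True, False, False, False, False]
new   = [True, True, True, False, False, False, False, True]
old   = [True, True, False, True, False, False, True, False]

print(accuracy(new, truth))   # sensitivity 0.75, specificity 0.75
print(accuracy(old, truth))   # identical marginal accuracy
print(discordance(new, old))  # yet the tests disagree in half the patients
```

Here both tests misclassify two of eight patients, but different ones, so per-patient comparison reveals a disagreement that the summary measures conceal.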

In the assessment of the certainty of results of studies on diagnostic accuracy, the Institute primarily follows the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) criteria [592,593], which may, however, be adapted for the specific project. The STARD (Standards for Reporting of Diagnostic Accuracy) criteria [52,53] are applied in order to decide case by case on the inclusion or exclusion of studies not published in full text (see also Sections 8.1.4 and 8.3.11). Despite some good individual proposals, there are no generally accepted quality criteria for the methodological assessment of prognosis studies [11,251,252,515]. For studies on prognostic markers, only general publication standards exist [579]; however, specific publication standards are available for prognostic markers in oncology [14,393].

Level 3 and 4 studies according to Fryback and Thornbury [204] investigate the effect of the (diagnostic) test to be assessed on considerations regarding (differential) diagnosis and/or subsequent therapeutic (or other management) decisions; i.e. it is investigated whether the result of a diagnostic test actually leads to any changes in decisions. However, such studies or study concepts have the major disadvantage that they are not sharply defined and are therefore of a rather theoretical nature. A principal (quality) characteristic of these studies is that the physicians involved were, according to a clear plan, questioned about the probability of the existence of the disease (and about their further diagnostic and/or therapeutic approach) before the conduct of the diagnostic test to be assessed or the disclosure of its results. This is done in order to determine the change in attitude caused by the test result. In contrast, retrospective appraisals and theoretical estimates are susceptible to bias [204,239]. The relevance of such ultimately uncontrolled studies within the framework of the benefit assessment of diagnostic (or prognostic) tests must be regarded as largely unclear. Information on management changes alone therefore cannot be drawn upon as evidence of a benefit, as long as no information on the patient-relevant consequences of such changes is available.

It is also conceivable that a new diagnostic test is incorporated in an already existing diagnostic strategy; for example, if a new test precedes (triage test) or follows (add-on test) an established test in order to reduce the frequency of application of the established test or new test, respectively [50]. However, against the background of the subsequent therapeutic (or other types of) consequences, it should be considered that through such a combination of tests, the patient populations ensuing from the respective combined test results differ from those ensuing from the individual test results. This difference could in turn influence subsequent therapeutic (or other types of) consequences and their effectiveness. If such an influence cannot be excluded with sufficient certainty – as already described above – comparative studies on diagnostic strategies including and excluding the new test may be required [197,367].

Several individual diagnostic tests or pieces of information are in part summarized into an overall test via algorithms, scores, or similar approaches. In the assessment of such combined tests the same principles should be applied as those applied for individual tests. In particular, the validation and clinical evaluation of each new test must be performed independently of the test development (e.g. specification of a threshold, weighting of scores, or algorithm of the analysis) [531].

Biomarkers used within the framework of “personalized” or better “stratified” medicine should also be evaluated with the methods described here [268,541]. This applies both to biomarkers determined before the decision on the start of a treatment (or of a treatment alternative) and to those determined during treatment in order to decide on the continuation, discontinuation, switching, or adaptation of treatment [520,567]. Here too, it is essential to distinguish between the prognostic and predictive value of a characteristic.

Prognostic markers provide information on the future state of health and normally refer to the course of disease under treatment and not to the natural course of disease without treatment. The fact that a biomarker has prognostic relevance does not mean that it also has predictive relevance (and vice versa).

Finally, in the assessment of diagnostic tests, it may also be necessary to consider the result of the conformity assessment procedure for CE marking and the approval status of drugs used in diagnostics (see Section 3.3.1). The corresponding consequences must subsequently be specified in the report plan (see Section 2.1.1).

3.6. Early diagnosis and screening

Screening programmes are composed of different modules, which can be examined either in part or as a whole [120,513]. The assessment of a screening test generally follows internationally accepted standards and criteria, for example, those of the UK National Screening Committee (UK NSC [564]), the US Preventive Services Task Force (US PSTF [247,437,490]), or the New Zealand National Health Committee (NHC) [406].

According to the criteria outlined above, the Institute primarily assesses the benefit of screening tests by means of prospective comparative intervention studies on the whole screening chain, which include the (ideally random) allocation of participants to a strategy with or without application of the screening test (or to different screening strategies) and which investigate patient-relevant outcomes. In this context, the main features of the assessment comply with the explanations outlined in Sections 3.1 to 3.4.

If such studies are not available, or are of insufficient quantity or quality, an assessment of the single components of the screening chain can be performed. In this context, the accuracy of the diagnostic test is assessed by means of generally applied test quality criteria, determined in studies showing sufficient certainty of results (usually Phase 3 according to Köbberling et al. [329]; see Section 3.5), and it is reviewed to what extent it is proven that the consequences resulting from the test outcomes are associated with a benefit. In the case of therapeutic consequences (which are mostly assumed), proof can be inferred from randomized intervention studies in which an early (earlier) intervention was compared with a late(r) one. The benefit of an early (earlier) vs. a late(r) intervention may also be assessed by means of intervention studies in which the interaction between the earliness of the start of the intervention and the intervention's effect can be investigated. This can be performed either directly within a study or indirectly by comparing studies with different starting points for the intervention, but with otherwise comparable study designs. Here too, the main features of the assessment comply with the explanations outlined in Sections 3.1 to 3.4.

3.7. Prevention

Prevention is directed at avoiding, reducing the probability of, or delaying health impairment [581]. Whereas primary prevention comprises all measures employed before the occurrence of detectable biological impairment in order to avoid the triggering of contributory causes, secondary prevention comprises measures to detect clinically asymptomatic early stages of diseases, as well as their successful early therapy (see also Section 3.6). Primary and secondary prevention measures are characterized by the fact that, in contrast to curative measures, whole population groups are often the focus of the intervention. Tertiary prevention in the narrowest sense describes specific interventions to avoid permanent (especially social) functional deficits occurring after the onset of disease [254]. This is not the focus of this section, but is addressed in the sections on the benefit assessment of drug and non-drug interventions (see Sections 3.3 and 3.4).

The Institute also primarily performs benefit assessments of prevention programmes (other than screening programmes) by means of prospective, comparative intervention studies that have an (ideally random) allocation of participants to a strategy with or without application of the prevention measure, and that investigate patient-relevant outcomes. Alternatively, due to potential “contamination” between the intervention and control group, studies in which clusters were allocated to the study arms may also be eligible [554].

In individual cases, it needs to be assessed to what extent the consideration of other study designs is meaningful [308]. For example, mass-media campaigns are often evaluated within the framework of “interrupted time-series analyses” (e.g. in [572]), and the use of this study design is also advocated for community intervention research [43]. In the quality assessment of these studies, the Institute uses for orientation the criteria developed by the Cochrane Effective Practice and Organisation of Care Review Group [106].

For the benefit on the population level, not only the effectiveness of the programme is decisive, but also the participation rate. In addition, the question is relevant as to which persons are reached by prevention programmes; research indicates that population groups with an increased risk of disease participate less often in such programmes [343]. Special focus is therefore placed on both of these aspects in the Institute's assessments.

3.8. Assessment of potential

In contrast to benefit assessments, assessments of potential aim to investigate whether new examination or treatment methods potentially show a benefit. In this context, “potential” means that firstly, the evidence available so far indicates that a potential benefit may exist, and secondly, that on the basis of this evidence a study can be planned that allows an assessment of the benefit of the method on a sufficiently reliable evidence level; see §14 (3, 4) of the G-BA's Code of Procedure [211].

An assessment of potential according to §137e (7) SGB V is based on an application for which the G-BA has defined the form and required content. Those entitled to apply are manufacturers of a medical device on which the technical application of a new examination or treatment method is largely based, as well as companies that otherwise, as providers of a new method, have an economic interest in providing their service at the expense of the health insurance funds. The application must contain informative documents, especially referring to the current evidence on, and the expected benefit of, the new examination or treatment method (see §20 (2) No. 5 of the G-BA's Code of Procedure [211]). Optionally, a proposal on the key points of a testing study can be submitted. An application for a method can refer to one or several therapeutic indications.

Within the framework of the assessment of potential, the Institute evaluates the plausibility of the information provided by the applicant. This evaluation especially refers to the meaningfulness of the medical question(s) presented in the application, the quality of the literature searches conducted by the applicant (see Section 7.2), the assessment of the certainty of results of the relevant studies, and the correctness of the results presented in the application. The assessment leads to a conclusion on the potential of the examination or treatment method applied for. If, from the Institute's point of view, a potential is determined, the testing study proposed by the applicant is evaluated; if the application contains no such proposal, or an unsuitable one, the Institute specifies the key points of a possible testing study.

Due to the particular aim, considerably lower requirements for the evidence are imposed in assessments of potential compared with benefit assessments. Ultimately, the aim of testing is first to generate an adequate data basis for a future benefit assessment. Accordingly, a potential can be justified, in particular also on the basis of non-randomized studies. Moreover, further methodological principles of benefit assessments are not used or only used to a limited extent in assessments of potential, as described in the following text.

In contrast to benefit assessments, due to lower requirements for the evidence, in assessments of potential an extended assessment of the qualitative certainty of results of non-randomized studies is performed. In this context, besides the levels mentioned in Section 3.1.4 for randomized studies (high or moderate certainty of results) the following grades are used:

low qualitative certainty of results: result of a higher quality non-randomized comparative study with adequate control for confounders (e.g. quasi-randomized controlled studies, non-randomized controlled studies with active allocation of the intervention following a preplanned rule, prospective comparative cohort studies with passive allocation of the intervention),

very low qualitative certainty of results: result of a higher quality non-randomized comparative study (see point above), but without adequate control for confounders or result of another non-randomized comparative study (e.g. retrospective comparative cohort studies, historically controlled studies, case-control studies),

minimum qualitative certainty of results: result of a non-comparative study (e.g. one-arm cohort studies, observational studies or case series, cross-sectional studies or other non-comparative studies).

An important aspect of the certainty of results is the control for confounders, which can in particular be achieved through the use of multifactorial statistical methods, as described in Section 8.3.7. Further factors are also taken into account in the assessment of the certainty of results (see Section 8.1.4).

High-quality non-randomized studies may also show a considerable risk of bias. When deriving the potential of an intervention from such studies, it must therefore be evaluated whether the available studies show differences regarding the intervention of interest in a magnitude suggesting that a benefit can be demonstrated in suitable future studies, and that these differences cannot be solely explained by the average expected influence of bias. A potential thus particularly arises if studies of a low certainty of results show at least small effects, if studies of a very low certainty of results show at least medium effects, and if studies with a minimum certainty of results show at least large effects. For the relative risk, values of 0.8 and 0.5 can serve as rough thresholds between small, medium and large effects [150,434]. Deviating from the procedure in benefit assessments (see Section 3.1.2), in assessments of potential, surrogate endpoints are also considered for which no sufficient validity has yet been shown. However, surrogate endpoints should be established and plausible so as to be able to justify a potential.
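The interplay between certainty of results and effect size described above can be sketched as a simple decision rule. The relative-risk thresholds of 0.8 and 0.5 are taken from the text; the mapping itself is an illustrative simplification for a relative risk below 1 (favouring the intervention), not the Institute's formal procedure:

```python
def effect_size(rr: float) -> str:
    """Classify a relative risk (< 1 favours the intervention) using the
    rough thresholds of 0.8 and 0.5 between small, medium and large effects."""
    if rr < 0.5:
        return "large"
    if rr < 0.8:
        return "medium"
    return "small"


# Minimum effect size required per level of certainty of results,
# following the rule stated in the text.
REQUIRED = {"low": "small", "very low": "medium", "minimum": "large"}
RANK = {"small": 0, "medium": 1, "large": 2}


def suggests_potential(rr: float, certainty: str) -> bool:
    """Does the observed effect meet the minimum for this certainty level?"""
    return RANK[effect_size(rr)] >= RANK[REQUIRED[certainty]]


print(suggests_potential(0.7, "low"))      # medium effect, low certainty: True
print(suggests_potential(0.7, "minimum"))  # medium effect, minimum certainty: False
```

The rule encodes the trade-off stated above: the weaker the study design, the larger the observed effect must be before it is plausible that a benefit could be demonstrated in a suitable future study.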

If the potential of diagnostic methods is to be evaluated, data on test accuracy are also considered. In this context, the certainty of results of the underlying studies must be examined (see Sections 3.5 and 8.3.11). In a second step, an evaluation of the plausibility of the diagnostic method is performed with regard to the effects postulated by the applicant in respect of patient-relevant outcomes, that is, possible direct effects of the method, as well as therapeutic consequences via which the diagnostic method could influence patient-relevant outcomes.

Copyright © 2015 by the Institute for Quality and Efficiency in Healthcare (IQWiG).
Bookshelf ID: NBK385786

