HEPEX Science and Challenges: Verification of Ensemble Forecasts (2/4)

Contributed by James Brown and Julie Demargne

What are the different attributes of forecast quality?

There are many aspects or “attributes” of forecast quality which, although measured differently by particular verification metrics, share common origins in the joint probability distribution of the forecasts and observations (see below). Attributes of forecast quality include (Wilks, 2006; Jolliffe and Stephenson, 2003):

  • Bias of the single-valued estimate from the forecast (or first order bias, overall bias, unconditional bias), to measure its agreement with the observed outcome on average.
  • Accuracy of the forecast, measured by the difference between the forecast and the observation (the forecast error).
  • Correlation of the single-valued estimate from the forecast, to describe its linear relationship with observations.
  • Skill, to estimate whether the verified forecast is of higher or lower quality than a given reference forecast. Skill requires the selection of one verification metric and one reference forecast, which could be climatology, persistence or a baseline forecast. This is particularly important when comparing forecast systems across different hydroclimatic regimes or when establishing improvements in forecast systems.
  • Reliability (or Type-I conditional bias), to describe how well the forecasts agree with the observations within one or more subsamples of the verification data, where each subsample is defined by the forecasts. It is relative to the conditional distribution of the observations given the forecasts. For example, a flood ensemble forecast system is reliable, or conditionally unbiased in its forecast probabilities, if flooding is observed 20% of the time when it is forecast with probability 0.2 (the evaluation being repeated for all forecast probabilities). Note that conditioning on the observed variable instead gives the “Type-II conditional bias.”
  • Resolution to describe the ability of the forecast to sort a set of observed events into subsets with different frequency distributions. It is also relative to the conditional distribution of the observations given the forecasts. A flood ensemble forecast system has resolution if small changes in the forecast probabilities are associated with different observed outcomes, whether or not the forecast probabilities are reliable.
  • Discrimination to describe whether the forecast system can discriminate between events and non-events. It is relative to the conditional distribution of the forecasts given the observations. It helps answer questions like: if the observations are in the flood level category, what did the forecasts predict? An ensemble forecast system is discriminatory with respect to a given flood threshold if it consistently forecasts the (observed) flood occurrence with a probability higher than chance (i.e., climatology) and consistently forecasts the (observed) non-flood event with a probability lower than chance.
  • Sharpness, an attribute of the forecasts alone, which measures the tendency to predict with extreme probabilities (close to 0 or 1). A high degree of sharpness is desirable only when the other attributes also improve upon climatology, which is itself an unsharp forecast. For example, without reliability, sharp forecasts are misleading.
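
As a rough illustration of the first four attributes (bias, accuracy, correlation and skill), the sketch below computes one simple measure of each for single-valued forecasts. The data are made up for illustration, the variable names are hypothetical, and the choice of sample climatology (the mean observation) as the reference forecast is an assumption; other references such as persistence would be equally valid.

```python
import numpy as np

# Hypothetical paired samples: single-valued forecasts (e.g., ensemble means)
# and the matching observations, in arbitrary flow units.
forecast = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0])
observed = np.array([10.0, 16.0, 8.0, 25.0, 13.0, 12.0])

error = forecast - observed

# Bias: average agreement with the observed outcome (mean error).
mean_error = error.mean()

# Accuracy: typical size of the forecast error.
mean_abs_error = np.abs(error).mean()
rmse = np.sqrt((error ** 2).mean())

# Correlation: strength of the linear relationship with the observations.
correlation = np.corrcoef(forecast, observed)[0, 1]

# Skill: quality relative to a reference forecast. The reference here is the
# sample climatology (mean observation), which is an assumption of this sketch.
reference = np.full_like(observed, observed.mean())
mse_forecast = ((forecast - observed) ** 2).mean()
mse_reference = ((reference - observed) ** 2).mean()
mse_skill_score = 1.0 - mse_forecast / mse_reference

print(f"mean error      = {mean_error:.3f}")
print(f"MAE             = {mean_abs_error:.3f}")
print(f"RMSE            = {rmse:.3f}")
print(f"correlation     = {correlation:.3f}")
print(f"MSE skill score = {mse_skill_score:.3f}")
```

A skill score of 1 would indicate a perfect forecast, 0 a forecast no better than the reference, and negative values a forecast worse than the reference.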

Mathematically (see e.g., Bradley et al. 2003 and 2004), assessing forecast quality consists of examining the joint probability distribution function (pdf) of the forecasts, Y, and observations, X, denoted $f_{XY}(x, y)$.


The joint distribution can be factored into (Murphy and Winkler, 1987):

  • $f_{X|Y}(x \mid y) \cdot f_Y(y)$, known as the “calibration-refinement” factorization,
  • $f_{Y|X}(y \mid x) \cdot f_X(x)$, known as the “likelihood-base rate” factorization.
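
The two factorizations can be checked numerically on a small discrete example. The sketch below assumes a binary observation (flood or no flood) and a forecast system that only issues three probability values; the joint probabilities are invented for illustration and do not come from any real forecast system.

```python
import numpy as np

# Hypothetical joint probabilities f_XY(x, y).
# Rows: observation x = 0 (no flood), x = 1 (flood).
# Columns: forecast probability y = 0.1, 0.5, 0.9.
joint = np.array([
    [0.54, 0.15, 0.01],  # x = 0
    [0.06, 0.15, 0.09],  # x = 1
])

f_y = joint.sum(axis=0)  # marginal distribution of the forecasts
f_x = joint.sum(axis=1)  # marginal distribution of the observations (base rate)

# Calibration-refinement factorization: f_X|Y(x | y) * f_Y(y).
f_x_given_y = joint / f_y            # each column sums to 1
# Likelihood-base rate factorization: f_Y|X(y | x) * f_X(x).
f_y_given_x = joint / f_x[:, None]   # each row sums to 1

# Both factorizations recover the same joint distribution.
assert np.allclose(f_x_given_y * f_y, joint)
assert np.allclose(f_y_given_x * f_x[:, None], joint)

# Observed flood frequency for each forecast probability value.
print(f_x_given_y[1])
```

In this made-up example the observed flood frequency given each forecast value matches the forecast value itself (0.1, 0.5 and 0.9), so the forecasts would be reliable in the sense described above.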

Differences between the marginal distributions $f_X(x)$ and $f_Y(y)$ describe the unconditional biases in the forecast probabilities. The comparison of the conditional pdf $f_{X|Y}(x \mid y)$ with $f_Y(y)$ describes the (conditional) reliability of the forecast probabilities.

The forecast resolution concerns only the sensitivity of the conditional pdf $f_{X|Y}(x \mid y)$ to $f_Y(y)$, without being affected by the consistency between $f_{X|Y}(x \mid y)$ and $f_Y(y)$.

Note that, for a given level of reliability, forecasts that contain less uncertainty, i.e., “sharp forecasts”, may be preferred over “unsharp” ones, since they contribute less uncertainty to decision making (Gneiting et al., 2007).

The comparison of the conditional pdf $f_{Y|X}(y \mid x)$ with $f_X(x)$ describes the ability of the forecasts to discriminate between different observed outcomes.
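
In practice these conditional distributions are estimated from a sample of forecast-observation pairs. The sketch below, using invented data and hypothetical variable names, conditions first on the forecast (to look at reliability) and then on the observation (to look at discrimination) for a single flood threshold.

```python
import numpy as np

# Hypothetical verification pairs: forecast probability of flooding and
# whether flooding was observed (1) or not (0).
prob = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8])
obs = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1])

# Conditioning on the forecast: observed frequency of flooding for each
# forecast probability value. Reliable forecasts give frequencies close to
# the forecast probabilities themselves.
observed_freq = {float(p): float(obs[prob == p].mean()) for p in np.unique(prob)}
for p, freq in observed_freq.items():
    print(f"forecast probability {p:.1f}: observed frequency {freq:.2f}")

# Conditioning on the observation: mean forecast probability when flooding
# did and did not occur. Discriminatory forecasts sit above the base rate
# for events and below it for non-events.
base_rate = obs.mean()
mean_prob_event = prob[obs == 1].mean()
mean_prob_non_event = prob[obs == 0].mean()
print(f"base rate                  = {base_rate:.2f}")
print(f"mean forecast | flood      = {mean_prob_event:.2f}")
print(f"mean forecast | no flood   = {mean_prob_non_event:.2f}")
```

With so few pairs the estimates are very noisy; a real verification study would need a much larger sample, particularly for rare events such as floods.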

In general, several forecast attributes are important for a forecasting system to be useful for end users. However, for a particular application of the forecasts, some attributes of forecast quality may be more important than others. Data visualization of forecast and observed quantities will also help identify possible weaknesses in the forecast (as well as data issues).

What are the most commonly used verification metrics?

A wide range of verification metrics has emerged in the atmospheric sciences and, more recently, in other areas, such as hydrology (e.g., Brown et al. 2010, Liu et al. 2011, Zappa et al. 2012). The Joint Working Group on Forecast Verification Research from the World Weather Research Programme (WWRP) and the Working Group on Numerical Experimentation maintains a reference website describing standard and newly-developed verification metrics, as well as freely available verification tools and packages.

The following table includes standard verification metrics used in operational hydrometeorological forecasting. These metrics can be applied to:

  • single-valued forecasts, which could be either deterministic forecasts or best single-valued estimates from ensemble forecasts,
  • probabilistic forecasts, probabilities being derived from ensembles or from probability distribution functions.

Some of the metrics refer to forecasts of discrete events (e.g., defined as the variable exceeding a threshold). Both single-valued and probabilistic forecasts can define the exceedance of one or multiple discrete thresholds. For dichotomous forecasts that concern only one discrete event (e.g., the occurrence of a flood), one can define the contingency table, which lists the numbers of hits, misses, false alarms, and correct negatives (for a given probability level in the case of probabilistic forecasts). A number of metrics (e.g., Critical Success Index, Probability Of Detection) can be derived from the contingency table. For verifying probabilistic forecasts for one discrete event for all probabilities, the Brier Score measures the mean square error of the forecast probabilities where the observations are either 0 (no occurrence) or 1 (occurrence).
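
A minimal sketch of these calculations is given below. The forecast probabilities, observations and the probability level of 0.5 used to turn probabilities into yes/no forecasts are all invented for illustration; the metric definitions follow their usual contingency-table forms.

```python
import numpy as np

# Hypothetical probability forecasts of a flood and the observed outcomes.
prob = np.array([0.9, 0.7, 0.1, 0.3, 0.8, 0.2, 0.6, 0.1])
obs = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Probability level at which a flood is considered "forecast"
# (an assumption of this sketch; any level between 0 and 1 could be used).
threshold = 0.5
warn = prob >= threshold
event = obs == 1

# Contingency table for the dichotomous forecast.
hits = int(np.sum(warn & event))
misses = int(np.sum(~warn & event))
false_alarms = int(np.sum(warn & ~event))
correct_negatives = int(np.sum(~warn & ~event))

# Metrics derived from the contingency table.
pod = hits / (hits + misses)                               # Probability Of Detection
pofd = false_alarms / (false_alarms + correct_negatives)   # Probability Of False Detection
csi = hits / (hits + misses + false_alarms)                # Critical Success Index
frequency_bias = (hits + false_alarms) / (hits + misses)   # Frequency Bias

# Brier Score: mean square error of the forecast probabilities,
# evaluated over all probabilities rather than at one probability level.
brier_score = np.mean((prob - obs) ** 2)

print(hits, misses, false_alarms, correct_negatives)
print(pod, pofd, csi, frequency_bias, brier_score)
```

Repeating the contingency-table step for a range of probability levels gives the points needed for a Relative Operating Characteristic diagram.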

Table of standard verification metrics commonly used in operational hydrometeorological forecasting (see here for further details).

| Quality attribute | Metric name | Type of forecast | Discrete events? |
|---|---|---|---|
| Error | Mean Absolute Error | Single-valued | No |
| Error | Mean Square Error | Single-valued | No |
| Error | Root Mean Square Error | Single-valued | No |
| Error | Mean Continuous Rank Probability Score (CRPS) | Probabilistic | No |
| Error | Brier Score | Probabilistic | Yes |
| Error | Critical Success Index (or Threat Score) | Both | Yes |
| Bias | Relative Mean Error (or Relative Bias) | Single-valued | No |
| Bias | Frequency Bias | Both | Yes |
| Correlation | Pearson Correlation Coefficient | Single-valued | No |
| Correlation | Spearman Rank Correlation | Single-valued | No |
| Skill | Mean Absolute Error Skill Score | Single-valued | No |
| Skill | Mean Square Error Skill Score | Single-valued | No |
| Skill | Mean Continuous Rank Probability Skill Score | Probabilistic | No |
| Skill | Brier Skill Score | Probabilistic | Yes |
| Skill | Equitable Threat Score (or Gilbert Skill Score) | Both | Yes |
| Reliability (conditioned on forecast) | Mean CRPS Reliability | Probabilistic | No |
| Reliability (conditioned on forecast) | Brier Score Reliability | Probabilistic | Yes |
| Reliability (conditioned on forecast) | Reliability Diagram | Probabilistic | Yes |
| Reliability (conditioned on forecast) | Rank Histogram | Probabilistic | Yes |
| Reliability (conditioned on forecast) | Success Ratio | Both | Yes |
| Resolution (conditioned on forecast) | Mean CRPS Resolution | Probabilistic | No |
| Resolution (conditioned on forecast) | Brier Score Resolution | Probabilistic | Yes |
| Discrimination (conditioned on observation) | Relative Operating Characteristic Score | Both | Yes |
| Discrimination (conditioned on observation) | Relative Operating Characteristic Diagram | Both | Yes |
| Discrimination (conditioned on observation) | Probability Of Detection (or Hit Rate) | Both | Yes |
| Discrimination (conditioned on observation) | Probability Of False Detection (or False Alarm Rate) | Both | Yes |
| Sharpness | Forecast Frequency Histogram | Probabilistic | Yes |
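
The table lists reliability and resolution components of the Brier Score. One common way to obtain them is the standard three-term decomposition, Brier Score = reliability − resolution + uncertainty, sketched below. The function name `brier_decomposition` and the data are hypothetical, and the sketch assumes the forecasts take only a few distinct probability values so that each value can serve as its own bin; with continuous probabilities, binning would be needed and the identity would hold only approximately.

```python
import numpy as np

def brier_decomposition(prob, obs):
    """Return (reliability, resolution, uncertainty) of the Brier Score.

    Forecasts are grouped by their exact probability value, which assumes a
    small set of distinct forecast probabilities.
    """
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = prob.size
    base_rate = obs.mean()

    reliability = 0.0
    resolution = 0.0
    for p in np.unique(prob):
        in_bin = prob == p
        weight = in_bin.sum() / n          # relative frequency of this forecast value
        obs_freq = obs[in_bin].mean()      # observed frequency given this forecast
        reliability += weight * (p - obs_freq) ** 2
        resolution += weight * (obs_freq - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

# Hypothetical forecast probabilities and binary observations.
prob = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
obs = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

rel, res, unc = brier_decomposition(prob, obs)
brier_score = np.mean((np.asarray(prob) - np.asarray(obs)) ** 2)

# Brier Skill Score with sample climatology as the reference forecast,
# whose Brier Score equals the uncertainty term.
brier_skill_score = 1.0 - brier_score / unc

print(f"reliability = {rel:.3f}, resolution = {res:.3f}, uncertainty = {unc:.3f}")
print(f"Brier Score = {brier_score:.3f}, Brier Skill Score = {brier_skill_score:.3f}")
```

A smaller reliability term and a larger resolution term both lower (improve) the Brier Score, while the uncertainty term depends only on the observations.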

The complete list of references can be found here.

This post is a contribution to the new HEPEX Science and Implementation Plan.
