HEPEX Science and Challenges: Verification of Ensemble Forecasts (2/4)

Contributed by James Brown and Julie Demargne

What are the different attributes of forecast quality?

Forecast quality has many aspects, or “attributes”. Particular verification metrics measure them differently, but they all originate in the joint probability distribution of the forecasts and observations (see below). Attributes of forecast quality include (Wilks, 2006; Jolliffe and Stephenson, 2003):

Bias of the single-valued estimate from the forecast (or first order bias, overall bias, unconditional bias), to measure its agreement with the observed outcome on average.

Accuracy of the forecast, measured by the difference between the forecast and the observation (the forecast error).

Correlation of the single-valued estimate from the forecast, to describe its linear relationship with observations.

Skill to estimate if the verified forecast is of a higher or lower quality than a given reference forecast. Skill requires the selection of one verification metric and one reference forecast, which could be climatology, persistence or a baseline forecast. This is particularly important when comparing forecast systems across different hydroclimatic regimes or to establish improvements in forecast systems.

Reliability, or Type-I conditional bias, to describe the agreement between the forecasts and the corresponding observations within one or more subsamples of the verification data. It relates to the conditional distribution of the observations given the forecasts. For example, a flood ensemble forecast system is reliable, or conditionally unbiased in its forecast probabilities, if flooding is observed 20% of the time when it is forecast with probability 0.2 (the evaluation being repeated for all forecast probabilities). Note that conditioning on the observed variable instead yields the “Type-II conditional bias.”

Resolution to describe the ability of the forecast to sort a set of observed events into subsets with different frequency distributions. It is also relative to the conditional distribution of the observations given the forecasts. A flood ensemble forecast system has resolution if small changes in the forecast probabilities are associated with different observed outcomes, whether or not the forecast probabilities are reliable.

Discrimination to describe whether the forecast system can discriminate between events and non-events. It is relative to the conditional distribution of the forecasts given the observations. It helps answer questions like: if the observations are in the flood level category, what did the forecasts predict? An ensemble forecast system is discriminatory with respect to a given flood threshold if it consistently forecasts the (observed) flood occurrence with a probability higher than chance (i.e., climatology) and consistently forecasts the (observed) non-flood event with a probability lower than chance.

Sharpness is an attribute of the forecasts alone and measures their tendency to predict extreme probabilities (0 or 1). A high degree of sharpness is desirable only when the other attributes also improve upon climatology, which is itself an unsharp forecast. For example, without reliability, sharp forecasts are misleading.
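The first four attributes above (bias, accuracy, correlation and skill) can be illustrated for single-valued forecasts with a minimal sketch. The function name `single_valued_scores`, the sample values and the choice of the in-sample observed mean as the climatological reference are all assumptions made for illustration; an operational verification would normally use a much longer record and an independently estimated climatology.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def single_valued_scores(forecasts, observations, reference):
    """Bias, accuracy, correlation and skill for paired single-valued forecasts."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    mean_error = mean(errors)                # unconditional (first-order) bias
    mae = mean([abs(e) for e in errors])     # accuracy: mean absolute error
    mse = mean([e * e for e in errors])      # accuracy: mean square error

    # Correlation: Pearson coefficient between forecasts and observations.
    f_bar, o_bar = mean(forecasts), mean(observations)
    cov = sum((f - f_bar) * (o - o_bar) for f, o in zip(forecasts, observations))
    var_f = sum((f - f_bar) ** 2 for f in forecasts)
    var_o = sum((o - o_bar) ** 2 for o in observations)
    correlation = cov / math.sqrt(var_f * var_o)

    # Skill: fractional improvement in MSE over the reference forecast.
    mse_ref = mean([(r - o) ** 2 for r, o in zip(reference, observations)])
    mse_skill = 1.0 - mse / mse_ref

    return {
        "mean_error": mean_error,
        "mae": mae,
        "mse": mse,
        "correlation": correlation,
        "mse_skill": mse_skill,
    }


# Hypothetical paired sample (e.g., streamflow in arbitrary units).
observations = [10.0, 20.0, 30.0, 40.0]
forecasts = [12.0, 18.0, 33.0, 41.0]
# Assumption: climatology is approximated by the in-sample observed mean.
climatology = [mean(observations)] * len(observations)

scores = single_valued_scores(forecasts, observations, climatology)
print(scores)
```

A skill score of 1 indicates a perfect forecast, 0 indicates no improvement over the reference, and negative values indicate a forecast worse than the reference.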

Mathematically (see, e.g., Bradley et al., 2003 and 2004), assessing forecast quality consists of examining the joint probability distribution function (pdf) of the forecasts, Y, and observations, X, denoted f_XY( x, y ).

[Figure: prob-distributions-verification. Copyright © 1998-2010 Charles Annis, P.E.]

The joint distribution can be factored into (Murphy and Winkler, 1987):

f_X|Y( x | y ) ∙ f_Y ( y ), known as the “calibration-refinement” factorization,
f_Y|X ( y | x ) ∙ f_X ( x ), known as the “likelihood-base rate” factorization.

Differences between the marginal distributions f_X( x ) and f_Y( y ) describe the unconditional biases in the forecast probabilities. The comparison of the conditional pdf f_X|Y( x | y ) with f_Y( y ) describes the (conditional) reliability of the forecast probabilities.

Forecast resolution concerns only the sensitivity of the conditional pdf f_X|Y( x | y ) to f_Y( y ); it is unaffected by the consistency between f_X|Y( x | y ) and f_Y( y ), that is, by the reliability.

Note that, for a given level of reliability, forecasts that contain less uncertainty, i.e., “sharp forecasts”, may be preferred over “unsharp” ones, since they contribute less uncertainty to decision making (Gneiting et al., 2007).

The comparison of the conditional pdf f_Y|X( y | x ) with f_X( x ) describes the ability of the forecasts to discriminate between different observed outcomes.
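For a discrete case, the two factorizations can be checked numerically. The sketch below is illustrative only: the ten forecast-observation pairs are invented, the forecast takes just two probability values, and the dictionary names are hypothetical. It estimates the joint distribution from counts, derives both conditional distributions, and confirms that each factorization recovers the joint distribution.

```python
from collections import Counter

# Hypothetical paired sample: (forecast probability y, observed event x),
# where x = 1 means the event (e.g., flooding) occurred.
pairs = [(0.2, 0), (0.2, 0), (0.2, 0), (0.2, 0), (0.2, 1),
         (0.8, 1), (0.8, 1), (0.8, 1), (0.8, 0), (0.8, 1)]
n = len(pairs)

# Empirical joint distribution f_XY, keyed by (y, x).
joint = {key: count / n for key, count in Counter(pairs).items()}

# Marginal distributions of the forecasts, f_Y, and observations, f_X.
f_y = {y: c / n for y, c in Counter(y for y, _ in pairs).items()}
f_x = {x: c / n for x, c in Counter(x for _, x in pairs).items()}

# Calibration-refinement: f_X|Y(x | y) * f_Y(y).
f_x_given_y = {(y, x): p / f_y[y] for (y, x), p in joint.items()}
# Likelihood-base rate: f_Y|X(y | x) * f_X(x).
f_y_given_x = {(y, x): p / f_x[x] for (y, x), p in joint.items()}

# Both factorizations reproduce the joint distribution.
for (y, x), p in joint.items():
    assert abs(f_x_given_y[(y, x)] * f_y[y] - p) < 1e-12
    assert abs(f_y_given_x[(y, x)] * f_x[x] - p) < 1e-12

# Reliability view: observed event frequency given each forecast probability.
for y in sorted(f_y):
    print(f"P(event | forecast={y}) = {f_x_given_y[(y, 1)]:.2f}")

# Discrimination view: how often the high probability was issued,
# given that the event did or did not occur.
for x in sorted(f_x):
    print(f"P(forecast=0.8 | observed={x}) = {f_y_given_x[(0.8, x)]:.2f}")
```

In this invented sample the observed frequencies match the forecast probabilities (reliable forecasts), and the high probability is issued far more often before events than before non-events (discriminatory forecasts).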

In general, several forecast attributes are important for a forecasting system to be useful for end users. However, for a particular application of the forecasts, some attributes of forecast quality may be more important than others. Data visualization of forecast and observed quantities will also help identify possible weaknesses in the forecast (as well as data issues).

What are the most commonly used verification metrics?

A wide range of verification metrics has emerged in the atmospheric sciences and, more recently, in other areas, such as hydrology (e.g., Brown et al. 2010, Liu et al. 2011, Zappa et al. 2012). The Joint Working Group on Forecast Verification Research from the World Weather Research Programme (WWRP) and the Working Group on Numerical Experimentation maintains a reference website describing standard and newly-developed verification metrics, as well as freely available verification tools and packages.

The following table includes standard verification metrics used in operational hydrometeorological forecasting. These metrics can be applied to:

single-valued forecasts, which could be either deterministic forecasts or best single-valued estimates from ensemble forecasts,

probabilistic forecasts, probabilities being derived from ensembles or from probability distribution functions.

Some of the metrics refer to forecasts of discrete events (e.g., defined as the variable exceeding a threshold). Both single-valued and probabilistic forecasts can define the exceedance of one or multiple discrete thresholds. For dichotomous forecasts that concern only one discrete event (e.g., the occurrence of a flood), one can define the contingency table, which lists the numbers of hits, misses, false alarms, and correct negatives (for a given probability level in the case of probabilistic forecasts). A number of metrics (e.g., Critical Success Index, Probability Of Detection) can be derived from the contingency table. For verifying probabilistic forecasts for one discrete event for all probabilities, the Brier Score measures the mean square error of the forecast probabilities where the observations are either 0 (no occurrence) or 1 (occurrence).
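As a rough illustration of the paragraph above, the following sketch builds a contingency table from probabilistic forecasts at one probability level and derives several scores from it, then computes the Brier Score over all probabilities. The function names, the sample data and the 0.5 probability level are assumptions chosen for the example, and the sketch assumes that none of the denominators is zero.

```python
def contingency_table(forecast_events, observed_events):
    """Count hits, misses, false alarms and correct negatives."""
    hits = misses = false_alarms = correct_negatives = 0
    for forecast, observed in zip(forecast_events, observed_events):
        if forecast and observed:
            hits += 1
        elif observed:
            misses += 1
        elif forecast:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def categorical_scores(hits, misses, false_alarms, correct_negatives):
    """Scores derived from the contingency table (denominators assumed non-zero)."""
    return {
        "probability_of_detection": hits / (hits + misses),
        "probability_of_false_detection":
            false_alarms / (false_alarms + correct_negatives),
        "critical_success_index": hits / (hits + misses + false_alarms),
        "frequency_bias": (hits + false_alarms) / (hits + misses),
        "success_ratio": hits / (hits + false_alarms),
    }


def brier_score(probabilities, observed_events):
    """Mean square error of the forecast probabilities against 0/1 observations."""
    n = len(probabilities)
    return sum((p - o) ** 2 for p, o in zip(probabilities, observed_events)) / n


# Hypothetical forecast probabilities of flooding and observed outcomes (1 = flood).
probabilities = [0.9, 0.7, 0.6, 0.3, 0.2, 0.1, 0.8, 0.4]
observed = [1, 1, 0, 1, 0, 0, 1, 0]

# Assumption: a flood is "forecast" when the probability is at least 0.5.
probability_level = 0.5
forecast_events = [p >= probability_level for p in probabilities]

table = contingency_table(forecast_events, observed)
scores = categorical_scores(*table)
bs = brier_score(probabilities, observed)

print("hits, misses, false alarms, correct negatives:", table)
print(scores)
print("Brier Score:", round(bs, 4))
```

Repeating the contingency-table step for several probability levels gives the pairs of Probability Of Detection and Probability Of False Detection that trace out a Relative Operating Characteristic curve.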

Table of standard verification metrics commonly used in operational hydrometeorological forecasting (see here for further details).

*Quality attribute*	*Metric name*	*Type of forecast*	*Discrete events?*
Error	Mean Absolute Error	Single-valued	No
Error	Mean Square Error	Single-valued	No
Error	Root Mean Square Error	Single-valued	No
Error	Mean Continuous Ranked Probability Score (CRPS)	Probabilistic	No
Error	Brier Score	Probabilistic	Yes
Error	Critical Success Index (or Threat Score)	Both	Yes
Bias	Relative Mean Error (or Relative Bias)	Single-valued	No
Bias	Frequency Bias	Both	Yes
Correlation	Pearson Correlation Coefficient	Single-valued	No
Correlation	Spearman Rank Correlation	Single-valued	No
Skill	Mean Absolute Error Skill Score	Single-valued	No
Skill	Mean Square Error Skill Score	Single-valued	No
Skill	Mean Continuous Ranked Probability Skill Score	Probabilistic	No
Skill	Brier Skill Score	Probabilistic	Yes
Skill	Equitable Threat Score (or Gilbert Skill Score)	Both	Yes
Reliability (conditioned on forecast)	Mean CRPS Reliability	Probabilistic	No
Reliability (conditioned on forecast)	Brier Score Reliability	Probabilistic	Yes
Reliability (conditioned on forecast)	Reliability Diagram	Probabilistic	Yes
Reliability (conditioned on forecast)	Rank Histogram	Probabilistic	Yes
Reliability (conditioned on forecast)	Success Ratio	Both	Yes
Resolution (conditioned on forecast)	Mean CRPS Resolution	Probabilistic	No
Resolution (conditioned on forecast)	Brier Score Resolution	Probabilistic	Yes
Discrimination (conditioned on observation)	Relative Operating Characteristic Score	Both	Yes
Discrimination (conditioned on observation)	Relative Operating Characteristic Diagram	Both	Yes
Discrimination (conditioned on observation)	Probability Of Detection (or Hit Rate)	Both	Yes
Discrimination (conditioned on observation)	Probability Of False Detection (or False Alarm Rate)	Both	Yes
Sharpness	Forecast Frequency Histogram	Probabilistic	Yes
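The Brier Score Reliability and Brier Score Resolution entries in the table come from the standard decomposition of the Brier Score into reliability, resolution and uncertainty terms (Brier Score = reliability - resolution + uncertainty). The sketch below is a minimal illustration under stated assumptions: the data are invented, the function names are hypothetical, and the forecasts are grouped by their exact probability value, in which case the decomposition holds exactly. With binned probabilities it would hold only approximately.

```python
from collections import defaultdict


def brier_score(probabilities, observed_events):
    """Mean square error of forecast probabilities against 0/1 observations."""
    n = len(probabilities)
    return sum((p - o) ** 2 for p, o in zip(probabilities, observed_events)) / n


def brier_decomposition(probabilities, observed_events):
    """Return (reliability, resolution, uncertainty) of the Brier Score."""
    n = len(probabilities)
    base_rate = sum(observed_events) / n  # climatological event frequency

    # Group the observations by the distinct forecast probability issued.
    groups = defaultdict(list)
    for p, o in zip(probabilities, observed_events):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        observed_frequency = sum(obs) / len(obs)
        # Reliability: forecast probability vs. observed frequency (smaller is better).
        reliability += len(obs) * (p - observed_frequency) ** 2
        # Resolution: observed frequency vs. climatology (larger is better).
        resolution += len(obs) * (observed_frequency - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical sample: probability 0.2 issued five times (two events observed),
# probability 0.8 issued five times (four events observed).
probabilities = [0.2] * 5 + [0.8] * 5
observed = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

rel, res, unc = brier_decomposition(probabilities, observed)
bs = brier_score(probabilities, observed)
# Brier Skill Score with sample climatology as the reference forecast.
bss = 1.0 - bs / unc

print(f"reliability={rel:.3f} resolution={res:.3f} uncertainty={unc:.3f}")
print(f"Brier Score={bs:.3f} Brier Skill Score={bss:.3f}")
```

In this sample the 0.2 forecasts are followed by the event 40% of the time, so the reliability term is non-zero, while the difference in observed frequency between the two forecast groups gives the forecasts some resolution.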

The complete list of references can be found here.

This post is a contribution to the new HEPEX Science and Implementation Plan.

See also in “HEPEX Science and Challenges: Verification of Ensemble Forecasts”:

HEPEX Science and Challenges: Verification of Ensemble Forecasts (1/4)

Coming up next:

HEPEX Science and Challenges: Verification of Ensemble Forecasts (3/4)
