HEPEX Science and Challenges: Verification of Ensemble Forecasts (4/4)

Contributed by James Brown and Julie Demargne

What are the challenges and research needs in ensemble forecast verification?

Scientific and operational verification challenges include:

  • to ensure that hindcasting and verification are an integral and routine component of hydrometeorological and hydrologic forecasting; while common in the atmospheric sciences, verification remains uncommon or cursory in hydrology;
  • to compare methods developed by the atmospheric and hydrologic modelling communities, including methods that link single-valued forecast verification and probabilistic forecast verification;
  • to consider jointly (but distinctly) the problem of improving the forecasting system, for which we need to evaluate the different sources of skill and uncertainty, and the problem of evaluating whether a forecasting system is useful, for which we need to know how a forecast is used to improve a decision-making process; this includes the selection of key verification metrics and summary scores that could effectively help forecasters and end users in their decision making, as well as techniques for verifying real-time forecasts (before the corresponding observation occurs);
  • to propose methods which are appropriate for multivariate forecasts (e.g., forecasts issued for more than one location and forecasts providing values for more than one time step) and methods to analyze forecast quality on multiple space and time scales;
  • to propose methods to characterize attributes of multivariate forecasts, such as timing error, peak error and shape error in hydrologic forecasts, and develop products for timing versus amplitude uncertainty information that are meaningful to forecasters and end users;
  • to define an optimal set of benchmarks to 1) demonstrate the value of a hydrologic ensemble forecasting system compared to an existing (deterministic or probabilistic) forecasting system, 2) assess whether a hydrologic ensemble forecasting system is useful for decision-making purposes, and 3) analyze the different sources of forecast uncertainty and their interactions to help improve a forecasting system;
  • to propose methods for verifying rare events and for specifying the sampling uncertainty of verification scores;
  • to understand how to account for statistical dependencies in hydrometeorological and hydrologic variables and to design verification measures that are sensitive to the correct representation of statistical dependencies in multivariate forecasts;
  • to propose methods which take into account observational error (both measurement and representativeness errors).
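As a concrete illustration of some of the points above, the sketch below (synthetic data and illustrative choices throughout, not taken from the post) scores probability forecasts of a binary event against a sample-climatology benchmark and uses a simple bootstrap to quantify the sampling uncertainty of the resulting skill score:

```python
import random

random.seed(42)

def brier_score(probs, obs):
    # Mean squared error of probability forecasts for a binary (0/1) event.
    return sum((p - o) ** 2 for p, o in zip(probs, obs)) / len(obs)

# Synthetic sample: hypothetical flood-event forecasts (illustration only).
n = 500
obs = [1.0 if random.random() < 0.2 else 0.0 for _ in range(n)]
probs = [min(1.0, max(0.0, 0.2 + 0.5 * (o - 0.2) + random.gauss(0.0, 0.15)))
         for o in obs]

# Brier skill score relative to the sample climatology benchmark.
base_rate = sum(obs) / n
bss = 1.0 - brier_score(probs, obs) / brier_score([base_rate] * n, obs)

# Bootstrap the skill score to quantify its sampling uncertainty.
boot = []
for _ in range(1000):
    idx = [random.randrange(n) for _ in range(n)]
    o = [obs[i] for i in idx]
    p = [probs[i] for i in idx]
    r = sum(o) / n
    boot.append(1.0 - brier_score(p, o) / brier_score([r] * n, o))
boot.sort()
lo, hi = boot[24], boot[974]  # ~95% percentile interval
print(f"BSS = {bss:.3f}, ~95% interval [{lo:.3f}, {hi:.3f}]")
```

For serially correlated forecast/observation pairs, a block bootstrap would be more defensible than the plain resampling shown here.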

Any others?

This post is a contribution to the new HEPEX Science and Implementation Plan.

  1. On the question of “any others”… I’m not sure if it’s a science challenge as much as it is a practical (operational?) one, but forecast archives are just as dirty as observational databases. There are spikes and typos in the archiving process… but there are also real model malfunctions that may result in strange products actually being issued to the public. These aren’t terribly common (once every 5 years?) but they do happen.

    The question then is what to do with these? Should you try to figure out the “intent” of the forecaster in order to censor or replace strange forecasts? Or should you take the perspective of the user? And are we talking about a user who plugs the forecasts directly into his own decision-making model, or would the user apply a “whiff test” and know to be sceptical?

    Addressing these problems requires investigation and often the metadata is insufficient to figure out why a strange forecast happened (e.g. did the model get 65536 mm of rainfall?).

    In summary, what QA/QC should be done on forecast archives and what, if anything, should be done with poor quality (not necessarily poor skill) forecasts?

  2. It is an interesting point: like observations, forecast archives also have errors! This also makes me think of the verification of “subjective probability” forecasts. We usually publish on quantitative verification of model outputs, but how can we verify forecasts (model outputs) that were modified by forecasters? Should we use the same procedures and statistical scores? How do such verification results differ from the verification of raw model outputs? In meteorology, Murphy and Daan indicated in a paper published in 1984 (see below) that there was evidence that feedback given to forecasters on the quality of their forecasts “(…) were instrumental in enabling the forecasters to formulate more reliable and skilful forecasts (…)”. It would be interesting to know more about similar studies in hydrologic forecasting.

    * Murphy A. H., Daan H. (1984) Impacts of feedback and experience on the quality of subjective probability forecasts: Comparison of results from the first and second years of the Zierikzee experiment. Monthly Weather Review, 112: 413–423.

    1. This is starting to get back to part 1 of this four-part series, about how there’s forecast verification and there’s forecast services evaluation… You make the point that humans can make the forecasts better (and verification can make the humans better).

      But there’s also the subtler point of how you evaluate the value of the forecaster on the services side. There are all the intangibles of customer relations, etc., but I’m thinking more of irregular products. Flood warnings (like tornado warnings, I suppose) are issued as the opportunity comes, not on a fixed schedule (making it hard to write verification code with arrays and loops: for i = 1 to issue date, for j = 1 to lead time). Furthermore, flood warnings are narratives rather than strictly numerical. How do you archive warnings in a way that facilitates evaluation (e.g. with tagging/coding)?

      They joke that you should “forecast the location, time and intensity, but never all three at once”. Surely you need all three for verification, but if you don’t have all three in a human-generated text product, is there still value, and how can you measure it? In Australia, the narrative text product is all users ever see, so to them verification of the numerical model output is at least one step away from what they experience.

      1. I am not sure if the evaluation of model output forecasts that were modified by forecasters is an issue of the “forecast service evaluation” aspect (at least not as I understood it from the first post by Julie and James). I was wondering if it would not rather be like evaluating “raw” forecasts (without statistical post-processing) against post-processed ones. If we imagine the forecaster as a “human post-processor”, it would probably fit like this. Having said that, the issue raised about how to evaluate services (and show users that these are also important aspects of the quality of a forecasting system) remains an interesting challenge.

        1. Yes, I’d tend to think of these modifications as manual adjustments that are attempting to reduce biases or improve skill by accounting for something that is either missing or poorly represented in the forecasts. These mods are extremely diverse in nature (if Andy’s reading, he may care to list some) and could involve the model inputs or states, among other things. For those modifications concerned with states, you could think of them as “manual DA” – one reason for the push to get automated DA implemented operationally (it makes the job of reproducing operational practice via hindcasting much easier, in theory).

          Evaluating services/delivery is an interesting topic though – perhaps more locally variable and difficult to generalize, although there is some good practice emerging w/r to presenting info. in ensemble forecasts etc.

          1. I like this: “– one reason for the push to get automated DA implemented operationally (it makes the job of reproducing operational practice via hindcasting much easier, in theory)”, and I would say that it is probably the same for automated post-processing. Interesting point!

  3. Absolutely, data QC is an issue. In general, I think it makes sense to correct any major blunders upfront, because these can have a disproportionate impact on forecast quality, depending on sample size and choice of metric, and they shouldn’t really be “averaged” in a quality metric. Verification metrics provide measures of average performance over a sample, yet blunders are often quite unique.

    Also, in my experience, lack of access to long and consistent operational archives (or archived real-time modifications) is the first barrier, before issues of data quality. For example, in regulated rivers, it’s often very difficult to reconstruct operational practice through hindcasting, because diversions or releases etc. are frequently applied in real-time, but are not archived. So, to summarize my immediate thoughts:

    1. Basic QC should be a prerequisite to verification.
    2. Most forecast quality metrics (averages) only make sense when the forecast quality is reasonably stable over the sample, and certainly free from any huge QC issues. Of course, there are many other sources of variability in practice, ranging from operational modifications on short time-scales all the way to climate change and floodplain development. In an ideal world, we’d be able to control for the different sources of (poor) forecast quality through appropriate conditional verification.
    3. Archives of operational forecasts and any real-time modifications (such as diversions) are very important, in order to reproduce operational practice and ensure hindcasting/verification can provide meaningful guidance.



  4. I agree that the definition of an optimal set of benchmarks is an important issue which is discussed a lot throughout the hydrological literature (Nash-Sutcliffe is a benchmark, after all). I am concentrating on the ‘numerical’ part of benchmarking in what follows. I am not so sure it is possible to define the ‘right’ benchmark, simply because systems and research papers have very different aims. I have analysed 120 papers on the topic of HEPEX forecasting and the following benchmarks have been used:
    Climatology: 29%
    Persistence: 8%
    Visual: 28%
    Existing Operational System: 6%
    Simplified Model: 8%
    Random Guess: 2%
    None: <1% (but greater than 0)
    Unclear: <1 % (but greater than 0)
    Compare different methods in paper: 19%
    (papers using multiple benchmarks were counted once under each benchmark)

    Simplified models are defined here as all methods that are more complex than climatology and persistence (this could be zero-rain forcing or Ensemble Streamflow Prediction-type approaches). Random Guess covers, for example, ROC-area-type approaches in which the forecast is compared to the “flip of a coin” (which is interesting, as many forecast events do not have a 50% chance).

    Catchment size, time step and hydro-climatology do not guide the choice of benchmark. However, there is a relationship to lead time: all seasonal forecast systems use climatology. There is also a relationship to research clusters, meaning that certain research groups prefer certain benchmarks (which is natural). On the downside, in many publications the way climatology is computed is not properly defined (how is it really derived?).

    In principle there is nothing wrong with using any of the benchmarks above: visual comparison may be fully appropriate for case studies, and climatology is appropriate in seasonal forecasting. However, I think it would be nice if we could come up with at least some guidance on the topic. I believe that the use of ‘none’, ‘unclear’ and ‘random guess’ should be discouraged. If you compare against climatology, then you can also easily compare against persistence, and hence there should at least be a short sentence justifying the choice. Although it might be the pinnacle to beat an existing system, one should still look at traditional benchmarks such as persistence and climatology (what if your existing system is not very good?). Comparing different methods/approaches in your paper does not mean you should not also compare against another benchmark (what if all your methods are no good?). I do like simplified “modelling approaches”, but they are more complex to set up and maybe a step too far for many studies.

    1. It is interesting to discuss benchmarks. You say “I am not so sure it is possible to define the ‘right’ benchmark, simply because systems and research papers have very different aims”. Also, the blog post indicates that one of the challenges in verification is “to define an optimal set of benchmarks”. In the end, it may be that there is no “one overall benchmark” (and this may also be true for hydrologic simulations that focus on getting the best Nash-Sutcliffe: “why is the average flow the right benchmark to test my model against?”). Maybe there is a range of possible benchmarks (which might be open to new proposals alongside the currently most used ones). And users’ purposes may have a say here too.
      Also, I find it interesting that the “visual benchmark” takes second place in the list you mentioned (nice compilation, by the way), because visual benchmarks are basically “impossible” in probabilistic forecasting. How do you visually verify attributes like “reliability”? At the same time, “visual inspection” is something that assures most of us of being on the right track, and that helps forecasters to be confident about the products they issued the day after a forecast. Also, this surely counts when “adjusting” subjective probability forecasts (as discussed above) and, consequently, improving the system’s performance.
      Climatology is certainly defined (or understood) differently by different people. I agree that it is important to explicitly indicate how one defines it. Persistence is probably easier to define (at least if you consider “the last observed discharge or water level”), but it can then be more or less difficult to outperform depending on catchment size (i.e., the characteristic time to respond to rain events), or very dependent on the data assimilation procedure (if any) used. It may be easy to fix if you are evaluating one catchment, or several of similar size, but what if you’re handling a set of catchments of different sizes/response times? Maybe fixing a way to calculate benchmarks (let’s say “guidelining” it) may be counter-productive. Forecasters may ask themselves: “Should my system be better than what I want (need) it to be, or should it be better than that guideline fixed by the community?”

      1. The reason why there is no universal benchmark is that it is development-specific (have a look at Nathalie Voisin’s HEPEX webex talk http://www.youtube.com/watch?v=_FGAy0TsMko (at 27 min 40 s), where she highlights user-specific benchmarks very neatly!). However, I believe there is at least a case for defining best practice, which I think should be done as part of this SIP topic (feeding into your last point).

        I am not so surprised by the amount of visual evaluation. It is not uncommon to publish case studies, given that there are often very few extreme cases available (low frequency of floods and droughts). Of course one should look at false alarm rates etc., but I share the excitement of checking whether my/a system worked after a major event. I think providing a visually based narrative, rather than calculating meaningless statistics, is something to be encouraged.

        1. No doubt that visual inspection should be encouraged, I think, at least for a better understanding of how the system behaves in severe high- or low-flow situations, for instance. However, if you have an observed hydrograph in front of you, together with all the ensemble members (a spaghetti plot), a confidence interval or a box plot, then no matter what visualisation you use, how can you say that your system is of good quality? Should there be any guidance for it? I think the “visually based narrative” after a major event, as you mention, is a good idea. Maybe this is a topic closely related to what the first post on the verification SIP topic (1/4) calls “real-time verification”. I think these are issues that could be explored further in HEPEX.

  5. Interesting statistics on the baselines. Yes, while unconditional climatology is well-defined, there are many types of conditional climatology that could be chosen, such as a seasonal climatology or a climatology from the same day of the year across all historical years on record. Naturally, a conditional climatology provides a stronger baseline, but an unconditional climatology can be useful too.
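To make the distinction concrete, a “same day of the year” conditional climatology might be computed as in this sketch (the archive, window length and values are hypothetical):

```python
from datetime import date
from statistics import mean

def conditional_climatology(history, target_doy, window=15):
    # "Same day of year" climatology: average of all historical values
    # whose day-of-year lies within +/- `window` days of the target,
    # wrapping around the end of the year.
    def doy_dist(d):
        diff = abs(d.timetuple().tm_yday - target_doy)
        return min(diff, 365 - diff)
    return mean(v for d, v in history if doy_dist(d) <= window)

# Hypothetical daily-flow archive: (date, flow in m^3/s).
history = [(date(2000, 1, 5), 10.0), (date(2000, 6, 20), 80.0),
           (date(2001, 1, 10), 14.0), (date(2001, 6, 25), 90.0)]

unconditional = mean(v for _, v in history)  # weak baseline: all days pooled
june_clim = conditional_climatology(history, date(2001, 6, 22).timetuple().tm_yday)
```

Here the conditional climatology for late June (85.0) differs sharply from the pooled value (48.5), which is exactly why it makes a stronger baseline.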

    It can be helpful to evaluate the skill of two separate systems relative to some form of climatology, rather than each other, as that provides some “absolute” context for both systems (e.g. when comparing two relatively unskilful systems, the results can be very noisy and rather meaningless). I see persistence is mentioned too. This has a clear definition in single-valued forecasting, but not in ensemble/probability forecasting. What does probabilistic persistence look like?

    Naturally, there are no universal baselines as forecasting is problem-specific. For example, an old version of a HEPS could provide a useful baseline, highlighting the value added by improvements. Other useful baselines can be very narrowly defined to a particular problem. For example, in the context of post-processing, I find it useful to separate the skill gleaned from improvements in the unconditional versus conditional quantiles/moments of the probability forecasts; a climatological correction is very simple, and if a sophisticated technique is mainly contributing an improved climatology, it isn’t contributing much! For that, one would derive a baseline by applying a climatological correction to the raw forecasts. While it’s difficult to generalize, I think it should be possible to define useful or optimal benchmarks in a problem-focused way. Also, as Florian alludes to, it’s critically important that any benchmarks are very clearly defined to avoid ambiguity.
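A climatological correction of the kind mentioned can be as simple as removing the long-run mean error of the raw forecasts; a minimal sketch with made-up numbers:

```python
from statistics import mean

def climatological_correction(raw_hindcasts, observations):
    # Unconditional (climatological) bias correction: shift every forecast
    # by the long-run mean error of the raw hindcasts. A sophisticated
    # post-processor should beat this cheap baseline to demonstrate that
    # it adds conditional skill, not just an improved climatology.
    bias = mean(raw_hindcasts) - mean(observations)
    return lambda forecast: forecast - bias

# Hypothetical hindcasts that run ~2 units too high on average.
correct = climatological_correction([12.0, 11.0, 13.0], [10.0, 9.0, 11.0])
```

Applying `correct` to new raw forecasts then yields the baseline against which the full post-processor is judged.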

    On the question of visual inspection, I suppose that’s a rather general and ambiguous category (is this really a benchmark?), but I do find visual inspection incredibly useful. Not so much in terms of inspecting individual ensemble forecasts (after all, an observation could fall anywhere within the ensemble spread, or even outside it, and that wouldn’t imply that the forecasting system, or even that specific forecast, is poor), but it is very useful to explore raw data. For example, it’s very useful to explore scatter plots in single-valued forecasting. There are direct analogies in ensemble forecasting. These provide very valuable info. for properly interpreting statistical measures. These “data plots” could be conditional in nature. For example, one could search an archive for certain verification pairs whose forecasts share properties with a live operational forecast. That’s what is meant by “real-time verification”. This sort of conditional data plotting may be useful in an operational context, alongside statistical measures.

    One final point, on the question of ROC and “coin flips”. The diagonal in a ROC plot is essentially a climatological or “unskilful” forecast. One could issue a forecast of a discrete event (e.g. flooding) with any given probability, but if the forecasting system is unskilful, the POD and PoFD will be approximately equal (or the PoFD even higher, which is analogous to a negative correlation!). The POD must be larger than the PoFD to imply skill. For example, one could always forecast flooding with probability 1.0. In that case, the POD and PoFD would both be 1 and the system would lack skill w/r to flooding.
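That behaviour is easy to check numerically; the sketch below (illustrative values only) counts hits and false alarms for a warning issued whenever the forecast probability reaches a threshold:

```python
def pod_pofd(probs, obs, threshold):
    # Treat "forecast probability >= threshold" as a warning and tally
    # hits, misses, false alarms and correct negatives for the event.
    hits = sum(1 for p, o in zip(probs, obs) if p >= threshold and o)
    misses = sum(1 for p, o in zip(probs, obs) if p < threshold and o)
    false_alarms = sum(1 for p, o in zip(probs, obs) if p >= threshold and not o)
    correct_neg = sum(1 for p, o in zip(probs, obs) if p < threshold and not o)
    pod = hits / (hits + misses) if (hits + misses) else 0.0
    pofd = false_alarms / (false_alarms + correct_neg) if (false_alarms + correct_neg) else 0.0
    return pod, pofd

# Tiny illustrative sample: probability forecasts vs. observed events (1/0).
probs = [0.9, 0.7, 0.3, 0.1]
obs = [1, 1, 0, 0]

always_warn = pod_pofd(probs, obs, 0.0)  # -> (1.0, 1.0): the unskilful ROC corner
```

Sweeping the threshold from 1 down to 0 traces the ROC curve; only thresholds where POD exceeds PoFD place the system above the diagonal.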
