Hydrometeorological forecast verification: the detail matters!

Contributed by Seonaid Anderson, CEH. 

Spending time and effort developing a novel hydrological ensemble prediction system, demonstrating its value, comparing with existing systems, justifying all that effort… there are many reasons why we verify our ensemble prediction systems. Before we dive into the number crunching, a few obvious questions need to be asked.

  • What period to use for evaluation? This is often as long as possible, while maintaining relevance to the current application and subject to the usual computing constraints.
  • What verification metrics should we use? There are so many now described in the literature! One could choose a selection that verifies a range of forecast characteristics, the most-cited metrics, the older ones that are well accepted, as many as possible so as not to miss any (or risk upsetting a potential reviewer by leaving out their favourite), or simply the ones we are personally familiar with.
  • What details need to be considered for the system to be useful? Of course, there are many other details to be agreed to ensure that the verification will provide useful and meaningful results, appropriate to the questions being asked about the underlying system.
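To make the metric question concrete: one of the most widely used scores for probabilistic yes/no forecasts, the Brier score, takes only a few lines to compute. This is a minimal sketch, not the verification code behind the paper; the function name and toy values are illustrative.

```python
def brier_score(prob_forecasts, observed_events):
    """Brier score for probabilistic yes/no forecasts: the mean squared
    difference between forecast probability and outcome (0 or 1).
    Lower is better; 0 is a perfect forecast."""
    n = len(prob_forecasts)
    return sum((p - o) ** 2 for p, o in zip(prob_forecasts, observed_events)) / n

# e.g. fraction of ensemble members exceeding a flow threshold,
# against whether the threshold was actually exceeded
probs = [0.75, 0.25, 0.5, 0.0]
obs = [1, 0, 1, 0]
print(brier_score(probs, obs))  # 0.09375
```

Even this simplest of scores illustrates the sample-size worry above: with only a handful of flood events, a single forecast can move the score substantially.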


In our work on joint hydrometeorological ensemble verification – Towards operational joint river flow and precipitation ensemble verification: considerations and strategies given limited ensemble records – recently published in the Journal of Hydrology, we pause to think before diving into a full system verification, and before turning the handle on a vast number of calculations, plots and statistics.

With the aim of developing verification results which are useful and meaningful to our audience – here operational forecasters at the Flood Forecasting Centre (for England & Wales) and Scottish Flood Forecasting Service – we consider the detail of how one should verify the river flow forecasts and the associated input rainfall forecasts.

  • How can we focus on the flood-producing events of interest?
  • What about sample size and robustness of the statistics?
  • So what does this mean for today’s forecast?

Of course, it will require many years of research for these questions to be fully answered, if that is indeed ever possible! In the meantime, we can reflect on what has already been achieved by the hydrological and meteorological verification communities, and how this can be best applied in our operational context (and continue writing proposals to research the underlying unanswered questions).

Several key areas were identified in our study to design a verification framework useful to those “on the forecasting bench”. In particular we note the following.

We care about catchments

In the meteorological literature, precipitation is verified at point locations or using gridded (radar- or raingauge-derived) precipitation maps. This is not much use for hydrological forecasting: even to evaluate a distributed hydrological model, we need to understand the performance of precipitation in units of hydrological catchments and river basins. Our verification against flow observations is for catchments with river gauging stations at their outlets.
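In practice, "precipitation in units of catchments" means reducing a gridded field to a single catchment-average value. A minimal sketch, assuming a gridded field and a boolean catchment mask (the function name and toy grid are illustrative, not the G2G implementation):

```python
def catchment_average(precip_grid, catchment_mask):
    """Average a gridded precipitation field (rows of values, e.g. mm/h)
    over the cells flagged True in the catchment mask."""
    values = [p for prow, mrow in zip(precip_grid, catchment_mask)
              for p, inside in zip(prow, mrow) if inside]
    return sum(values) / len(values)

# Toy 3x3 grid; the catchment covers the top-left 2x2 block of cells
grid = [[1.0, 2.0, 0.0],
        [3.0, 6.0, 0.0],
        [0.0, 0.0, 0.0]]
mask = [[True, True, False],
        [True, True, False],
        [False, False, False]]
print(catchment_average(grid, mask))  # 3.0
```

A real implementation would weight cells by their area within the catchment boundary, but the averaging idea is the same.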

The selection of time periods is not obvious

Our area-wide hydrological model (the Grid-to-Grid Model, or G2G), configured on a 1 km grid across Britain, forecasts instantaneous river flow every 15 minutes, with 15-minute rainfall accumulations used as input. However, catchment response timescales are generally longer, varying from around 1 hour to several days. Additionally, forecasts are issued over a given forecast horizon: for example, over the next day (what are the chances of this river flooding in the next 24 hours?) or the next few days (is this river likely to flood over the weekend?). Considering these issues in detail, we decided to evaluate daily and hourly precipitation accumulations, and river flows rising over a given river flow threshold within a 24-hour period.
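The "rising over a threshold within a 24-hour period" idea can be sketched as a rolling window over the 15-minute flow series. This is a simplified illustration under the assumption of a regular 15-minute series (so 24 hours is 96 steps); the names and toy numbers are our own, not from the paper.

```python
def exceeds_in_window(flows, threshold, window=96):
    """For each time step, return True if the flow exceeds `threshold`
    at any point within the next `window` steps (96 x 15 min = 24 h)."""
    hits = [f > threshold for f in flows]
    return [any(hits[i:i + window]) for i in range(len(flows))]

# Tiny example using a 2-step window instead of the full 96
flows = [1.0, 2.0, 5.0, 1.0]
print(exceeds_in_window(flows, threshold=4.0, window=2))
# [False, True, True, False]
```

The same event definition can then be applied to observations and to every ensemble member, giving the yes/no pairs that threshold-based scores need.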

The selection of appropriate thresholds opens an even larger can of worms!

For river flow we need to evaluate ensembles with regard to thresholds used operationally when issuing flood guidance: that is, using river flows equal to a particular return period (as a spatially consistent indicator of flood severity) at each catchment. For rainfall, the question is much more open. Only considering rainfall over a particular amount (say 10 mm in a given hour) would overly restrict the resulting sample size, and might mean flood-producing precipitation was missed from the analysis. Even low precipitation amounts could result in flooding if persistent and falling on already saturated soil. To consider all possible flood-producing precipitation, we adapt the spatial percentile approach used in the meteorological community to catchment-average precipitation values. Although not perfect (in many cases non-flood-producing precipitation is evaluated), at least the extreme precipitation values are evaluated when they occur!
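As a rough sketch of how a percentile-based threshold differs from a fixed amount: instead of "10 mm in an hour" everywhere, each catchment gets a threshold drawn from its own precipitation record. The choice below to take the percentile over wet (non-zero) values only, and the interpolation scheme, are assumptions for illustration, not the paper's exact recipe.

```python
def percentile_threshold(precip_record, q=0.95):
    """Threshold set at the q-th quantile of a catchment's own
    precipitation record, here using wet (non-zero) values only."""
    wet = sorted(v for v in precip_record if v > 0)
    # Linear interpolation between the two nearest order statistics
    rank = q * (len(wet) - 1)
    lo, frac = int(rank), rank - int(rank)
    hi = min(lo + 1, len(wet) - 1)
    return wet[lo] + frac * (wet[hi] - wet[lo])

# Dry steps plus wet catchment-average amounts of 1..21 mm
record = [0.0] * 10 + list(range(1, 22))
print(percentile_threshold(record, q=0.95))  # 20.0
```

A drizzly upland catchment and a dry lowland one thus end up with different thresholds, each tuned to its own climatology.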

To produce a meaningful hydrometeorological forecast verification, relevant to and useful for the target audience (whoever that may be), it is essential to give detailed thought to these kinds of questions and considerations.



  1. This is a nice blog, Seonaid, and thanks for the contribution! Do you provide statistics of forecasting skill using discharge observations and/or pseudo-observations (model reality)? How do you communicate the forecasts in regions where the G2G model does not perform well?

    1. Thanks Ilias. In this work we calculate all forecast statistics using observations (of either river flow or precipitation), to assess how the ensembles predict the river flow and precipitation values that occur in reality (not model reality!). How do we communicate the forecasts in regions where the G2G model does not perform well? Good question. We are currently working on ways of presenting verification statistics in a meaningful and useful way, in this project and others. It is of course important to be honest about the model performance, and to present the information at an appropriate level. The questions of time periods, spatial scales, and thresholds also come up again here: for all locations there is likely to be a time period and spatial scale below which a model is no longer accurate, and also a time period and spatial scale over which a model does provide useful information. A topic for further research and discussion!
