How can the Brier score know my inner thoughts???

Contributed by Anders Persson, a HEPEX guest columnist for 2014

Inspired by Tom Pagano’s interview with Beth Ebert on verification (22 August), I would like to support Beth’s enthusiasm by pointing out some more exciting features of verification statistics:

1. The problem of interpreting the statistics: what “looks good” might be bad, and what “looks bad” might be good. One morning in 1995, the ECMWF Director pressed the alarm bell. The most recently verified five-day forecast had scored minus 20% in anomaly correlation, against a normal value of around plus 70%. The other models were also worse than normal, but none fell below 30%. But there was no major problem, just a matter of “double penalty”. The ECMWF forecast was penalised once for having a storm east of Scotland, where there was none, and then again for not having one west of Scotland, where there was one. The other models were punished only once, for not having the storm at all.

2. The problem with the verifying observations: it is not always obvious what the “truth” is. At SMHI, verification of the gale warnings started in the 1920s. The scores showed a steady improvement until the late 1950s, when they dropped. The reason was that off-shore lighthouses along some coastal stretches had been closed, and the remaining in-shore ones were slow to respond to off-shore gale-force winds. We forecasters were thoroughly educated in semantics: how to formulate the shipping forecast so that, on the one hand, the sailors were warned and, on the other hand, we did not have to issue a formal warning in cases of off-shore gales and thus worsen the statistics.

3. The challenge with probability forecasting. The non-intuitive, and therefore fascinating, nature of probabilities also applies to their verification. As an example, let me introduce perhaps the most common probability verification score, the Brier score (BS), named after its inventor Glenn W. Brier (1913-1998).

Mathematically, the BS is quite simple. It is the average of the squared differences between the forecast probability and the event, counted as 1 if it occurred and 0 if it did not. The lower the BS, the better.

If the forecast probability is P and the verifying observation is O, then, for a single forecast, the contribution to the Brier score is BS1 = (P-O)². If the event occurs, O=1; if it does not, O=0 (figure 1).

The total BS is the average of all BS1 contributions over a longer period. Its value depends on the climate of the region, i.e., on how often the forecasters have reasons to issue different probabilities.
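The definition above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `brier_score` and the four forecasts and outcomes are invented for the example, not taken from any real verification record.

```python
def brier_score(probabilities, outcomes):
    """Average of the squared differences between forecast
    probabilities (0..1) and outcomes (1 = occurred, 0 = did not)."""
    if len(probabilities) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    if not probabilities:
        raise ValueError("need at least one forecast")
    contributions = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(contributions) / len(contributions)


# Hypothetical forecasts and verifying observations, for illustration only.
forecasts = [0.2, 0.8, 0.5, 0.1]
observed = [0, 1, 1, 0]

score = brier_score(forecasts, observed)
print(round(score, 3))
```

A perfect categorical forecast scores 0, and a single 50% forecast always contributes 0.25, whatever happens.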

However, the Brier Score is a so-called “proper score”, which means that it will punish you if you make a forecast that you do not quite believe in. Strange: how can lifeless, simple mathematics know about my inner thoughts? Is it some “Big Brother” I have not been aware of?

Hedging the bets

Assume you are issuing a probability forecast for storms in the mountains, flooding on the plains or cold outbreaks. You are an experienced forecaster with reliable probability forecasts, i.e., when you, on 10 occasions, have said “the probability is 20%” the event will, on average, occur twice; when you have said 80%, it will on average occur eight times.
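Reliability in this sense can be checked by grouping past forecasts by the probability issued and comparing with how often the event actually occurred. The sketch below assumes a made-up record that matches the description above (ten 20% forecasts with two events, ten 80% forecasts with eight); the name `reliability_table` is hypothetical.

```python
from collections import defaultdict


def reliability_table(probabilities, outcomes):
    """Map each issued probability to (number of forecasts,
    observed frequency of the event)."""
    counts = defaultdict(int)
    events = defaultdict(int)
    for p, o in zip(probabilities, outcomes):
        counts[p] += 1
        events[p] += o
    return {p: (counts[p], events[p] / counts[p]) for p in sorted(counts)}


# Hypothetical record, invented to mirror the example in the text.
issued = [0.2] * 10 + [0.8] * 10
occurred = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]

table = reliability_table(issued, occurred)
for p, (n, freq) in table.items():
    print(f"issued {p:.0%}: {n} forecasts, event frequency {freq:.0%}")
```

For a reliable forecaster the observed frequency in each row stays close to the probability issued, given enough cases.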

One day you estimate the probability at 50%. With a 50% forecast (P=0.5), BS1 will be 0.25 irrespective of whether the event occurs or not, since BS1 = (1-0.5)² = (0-0.5)² = 0.25.

But for various reasons, you are not happy to issue that 50% value. You might, for example, suspect that 50%, or 50-50, will be interpreted as “we do not know”. So you issue a 40% probability forecast instead.

BS1 will take two different values depending on whether the event occurs or not: (1-0.4)² = 0.36 if it does and (0-0.4)² = 0.16 if it does not. The expected value is a weighted average of the two. Here comes the crucial point: it might seem logical to use the weights 0.4 and 0.6, since you issued a 40% forecast, but the weights you have reason to trust are those of your true 50% opinion, 0.5 and 0.5. With them the expected BS1 is 0.5×0.36 + 0.5×0.16 = 0.26, slightly worse than the 0.25 you would have scored with the honest 50% forecast (figure 2).
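The comparison can be written out as a short calculation. It rests on one assumption: the event really does occur with the 50% frequency you believe in, so the two possible BS1-values of the issued forecast are weighted with your true opinion. The function name `expected_bs1` and its parameters are introduced here only for illustration.

```python
def expected_bs1(issued, belief):
    """Expected BS1 when the forecast `issued` is published but the
    event actually occurs with probability `belief` (an assumption)."""
    bs1_if_event = (issued - 1) ** 2      # O = 1
    bs1_if_no_event = (issued - 0) ** 2   # O = 0
    return belief * bs1_if_event + (1 - belief) * bs1_if_no_event


honest = expected_bs1(issued=0.5, belief=0.5)
hedged = expected_bs1(issued=0.4, belief=0.5)
print(round(honest, 2), round(hedged, 2))  # 0.25 0.26
```

The penalty of 0.01 looks small, but it is paid on average every time the hedge is repeated.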

You can apply the same reasoning to a case when you would like to “sex up” your 50% forecast to a 70% probability (figure 3).

Again, depending on whether the event occurs or not, you will get two BS1-values: (1-0.7)² = 0.09 if it does and (0-0.7)² = 0.49 if it does not. Weighting them with 0.7 and 0.3 might seem logical given your 70% forecast, but the weights you have reason to trust are those of your true 50% opinion, 0.5 and 0.5. With them the expected BS1 is 0.5×0.09 + 0.5×0.49 = 0.29, so on average you will do worse than the 0.25 of the honest 50% forecast.
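The same reasoning can be repeated for every probability you might be tempted to issue. The sketch below sweeps the issued value from 0% to 100% in steps of 10%, again assuming that your honest opinion of 50% is the frequency with which the event occurs; the variable names are illustrative.

```python
belief = 0.5  # the forecaster's honest opinion, assumed to be correct

# Candidate probabilities to issue: 0.0, 0.1, ..., 1.0
candidates = [i / 10 for i in range(11)]

# Expected BS1 for each candidate, weighted with the honest opinion.
expected = {
    p: belief * (1 - p) ** 2 + (1 - belief) * (0 - p) ** 2
    for p in candidates
}

best = min(expected, key=expected.get)

for p in candidates:
    marker = "  <- lowest" if p == best else ""
    print(f"issued {p:.0%}: expected BS1 {expected[p]:.2f}{marker}")
```

Under these assumptions the lowest expected score is found at the honest 50%, and the penalty grows the further the issued value strays from it, in either direction.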

Do I dare to change an automatic probability forecast?

If you are an experienced forecaster with a record of reliable probability forecasts, there is nothing to fear. You should not hesitate to modify a probability forecast provided by a reliable probabilistic system, such as the ECMWF ensemble system, if you have good reasons.

If the ensemble suggests a probability of 30%, you can very well increase it to 50% or even 70% (or decrease it to 10%) if you have additional information (for example, from newly arrived observations and/or deterministic forecasts).

It is only when you, for political or psychological reasons, want to change the forecast against your “better judgement” that the “proper” Brier Score will punish you. More mathematical arguments for this “moral” attitude are found in figure 4.
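One standard way to make the argument precise is the following short derivation. The symbol Q is introduced here, as an assumption for illustration, for the probability you honestly believe in, while P is the probability you actually issue. If the event occurs with probability Q, the expected contribution of a single forecast is

$$
E[\mathrm{BS1}] = Q\,(1-P)^2 + (1-Q)\,(0-P)^2 = P^2 - 2PQ + Q = (P-Q)^2 + Q\,(1-Q).
$$

The second term, Q(1-Q), does not depend on what you issue. The first term, (P-Q)², is zero only when P=Q, so any departure from your honest opinion adds a penalty equal to the squared size of the departure: 0.01 for issuing 40% and 0.04 for issuing 70% when you believe 50%.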

The peril of relying on deterministic numerical output

Finally, figure 5 illustrates a location with a 50% climatological probability of rain (182-183 days/year). A well-tuned, state-of-the-art deterministic NWP model predicts rain and no rain equally often but, of course, not perfectly. If the forecasters blindly follow this model and issue categorical “rain” or “no rain” forecasts, they will score badly. The same will happen if they make categorical interpretations of probability information.
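A rough calculation shows how badly. It rests on two assumptions that are not from the text: the model is right 70% of the time, and it is equally often right when it predicts rain as when it predicts no rain.

```python
accuracy = 0.7  # hypothetical hit rate of the deterministic model

# Categorical forecasts issue P=1 or P=0, so BS1 is 0 when the
# forecast is right and 1 when it is wrong.
bs_categorical = accuracy * 0.0 + (1 - accuracy) * 1.0

# Probabilistic reading of the same model: issue 70% when it predicts
# rain and 30% when it predicts no rain. Either way the issued
# probability is 0.3 away from the outcome when the model is right
# and 0.7 away when it is wrong.
bs_probabilistic = accuracy * (1 - accuracy) ** 2 + (1 - accuracy) * accuracy ** 2

# Always issuing the climatological 50% gives BS1 = 0.25 every day.
bs_climatology = 0.25

print(f"categorical:   {bs_categorical:.2f}")
print(f"climatology:   {bs_climatology:.2f}")
print(f"probabilistic: {bs_probabilistic:.2f}")
```

With these numbers the categorical forecasts (0.30) score worse than always saying 50% (0.25), while the same model read as a probability forecast (0.21) beats both.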