The Current Status of Seasonal and Inter-Annual Climate Forecasts

When confronted with any form of seasonal prediction the first question usually asked by potential users is "How good are your forecasts?". Often this is followed by the deterministically-based query "How often is your forecast correct?". These are two questions the climatological community finds difficult to answer in a manner that satisfies most users.

Amongst the numerous issues confronting climate scientists when attempting to validate their models and verify their rainfall forecasts is the problem of data homogeneity when the model provides spot information at scales of a few hundred kilometres. These spot values are typically averaged to produce a single figure for each model cell, an average that must then be compared with observations from a set of points scattered somewhat randomly around the cell, drawn from records with inevitable missed observations and reflecting topography not resolved by the model. Undoubtedly the noise introduced by this process affects validation and verification results. The difficulties are compounded when what must be compared is not a series of single deterministic predictions but a sequence of probabilistic ones. A further important consideration is that there are relatively few realisations of seasonal predictions with which to work. As far as is known to the author no empirical technique employs a data series exceeding 100 years in length, and most are based on much shorter series, some perhaps no more than 30 years long. With often only one prediction per year in the series, these are seriously limited data sets in statistical terms for assessing deterministic systems; for probabilistic systems they are undoubtedly inadequate. Numerical modelling studies, because of data problems and computing costs, have fewer realisations still. One of the most extensive currently available series is that from the PROVOST experiment, which covered 15 years.8 The succeeding DEMETER project is planned to cover 40 years,9 bountiful by any current measure of numerical seasonal prediction experimentation, inadequate by any statistical measure.
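
To make the comparison concrete, the sketch below illustrates one simple way a model's cell value might be verified against an incomplete set of station records scattered within that cell; the station values, the missing-data handling and the straight averaging are assumptions for illustration only, not a description of any centre's actual procedure.

```python
import numpy as np

# Hypothetical seasonal rainfall totals (mm) from three stations scattered
# within a single model grid cell; NaN marks seasons with missed observations.
station_obs = np.array([
    [210.0, 180.0, np.nan, 250.0],   # station 1, four seasons
    [195.0, np.nan, 160.0, 240.0],   # station 2
    [230.0, 200.0, 175.0, np.nan],   # station 3
])

# The single figure produced for the same cell and seasons by averaging the
# model's spot values (invented numbers).
model_cell = np.array([200.0, 170.0, 190.0, 255.0])

# Crude verification series: average whatever stations reported in each season
# and difference the result against the model's cell value.
cell_obs = np.nanmean(station_obs, axis=0)
difference = model_cell - cell_obs
print("cell-average observations:", np.round(cell_obs, 1))
print("model minus observations: ", np.round(difference, 1))
```

The scatter of the stations, the gaps in their records and the unresolved topography all feed noise into a series such as `difference`, which is part of why verification statistics built on such comparisons must be interpreted cautiously.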

Even accepting the caveats above, it remains difficult to provide specific information on the quality of forecasts. Probably the major difficulty is that the diagnostics used are often based on those developed for short-range predictions and for informing modellers. Diagnostics that specifically inform users are thin on the ground, although some exist. Probably the most common diagnostic used in seasonal prediction is the correlation between predicted and observed values, an approach particularly used for sea surface temperatures, including those used to monitor ENSO, and for predictions from empirical models. There is an argument, often discussed by meteorologists, that the minimum correlation required for 'skill' is (±)0.6, but many seasonal forecast systems fail to reach this level, although in some favoured areas far higher values have been obtained. Correlation, however, has many drawbacks as a diagnostic, and, while useful as a first indicator, it does not answer the user's question "What decisions can I base on these forecasts?".
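
As a simple illustration of this first-pass diagnostic, the sketch below computes the correlation between a short, invented series of predicted and observed seasonal anomalies and applies the informal 0.6 criterion mentioned above; the numbers and the threshold test are illustrative assumptions, not any centre's verification procedure.

```python
import numpy as np

# Hypothetical predicted and observed seasonal rainfall anomalies (mm),
# one value per year -- invented purely for illustration.
predicted = np.array([12.0, -5.0, 30.0, -18.0, 4.0, 22.0, -9.0, 15.0, -2.0, 8.0])
observed = np.array([10.0, -8.0, 25.0, -12.0, 1.0, 30.0, -15.0, 9.0, 3.0, 5.0])

# Pearson correlation between the forecast and observed series.
r = np.corrcoef(predicted, observed)[0, 1]

# The informal criterion discussed in the text: a correlation of at least
# (+/-)0.6 is often quoted as the minimum worth describing as 'skill'.
print(f"correlation = {r:.2f}; meets the 0.6 criterion: {abs(r) >= 0.6}")
```

Even a correlation comfortably above such a threshold says nothing directly about what decisions a user could base on the forecasts, which is the limitation raised above.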

Numerous other diagnostics are available, but most forecast centres tend to have in-house preferences that often differ from those of other centres, and it is therefore difficult to cross-compare the quality of forecasts between centres. Several standardisation activities are now underway in various areas of WMO, with one in particular being designed to assist intercomparison of operational predictions.10 Three diagnostics are being used, the first of which, the Mean Square Skill Score (MSSS), is an extension of the Mean Square Error (MSE) commonly used by meteorologists. MSE, the mean of the squared differences between forecasts and observations, provides an indication of improvement in forecasts as its value falls over time, the ideal being zero. Beyond that its interpretation is difficult, particularly from the perspective of applications. In MSSS the MSE of the forecasts is compared with the MSE of a standard simple forecast in such a way that a value of zero means the two are identical by this measure. A positive MSSS indicates that the forecast is an improvement on the standard; the improvement is measured in a linear sense so that perfect forecasts (those with an MSE of zero) give an MSSS of 100%. The standard forecast normally used is climatology or persistence, and MSSS at least gives the forecast user an indication of the extent to which the forecast improves on the chosen standard. It applies only to deterministic predictions.
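
A minimal sketch of the relationship between MSE and MSSS follows, taking climatology (always forecasting the long-term observed mean) as the standard forecast. The anomaly values are invented, and the skill-score form used, MSSS = 1 - MSE_forecast / MSE_standard, is simply the usual construction consistent with the properties described above (zero when the forecast matches the standard, 100% for a perfect forecast); treat it as an illustrative assumption rather than the precise WMO specification.

```python
import numpy as np

def mse(forecast, observed):
    """Mean square error: the mean of the squared forecast-minus-observation differences."""
    return np.mean((forecast - observed) ** 2)

# Hypothetical seasonal-mean temperature anomalies (deg C), one value per year.
observed = np.array([0.3, -0.5, 1.2, -0.8, 0.1, 0.9, -0.4, 0.6])
forecast = np.array([0.1, -0.2, 0.9, -0.5, 0.3, 0.7, -0.6, 0.4])

# Standard simple forecast: climatology, i.e. always predicting the observed mean.
climatology = np.full_like(observed, observed.mean())

mse_forecast = mse(forecast, observed)
mse_standard = mse(climatology, observed)

# MSSS is zero when the forecast is no better than the standard (by this
# measure) and reaches 100% for a perfect forecast with zero MSE.
msss = 1.0 - mse_forecast / mse_standard
print(f"MSE (forecast) = {mse_forecast:.3f}")
print(f"MSE (climatology) = {mse_standard:.3f}")
print(f"MSSS = {msss:.1%}")
```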

The second diagnostic, Relative Operating Characteristics (ROC) (Mason and Graham, 2002), is of more use to applications, and has the advantage of being applicable to both deterministic and probabilistic predictions. ROC works through events that can be defined in any way consistent with the capabilities of the prediction system and preferably of interest to applications. For example, a specific event might be 'above-average rainfall', or 'temperatures in the lowest 20% of historic events', or any other definable binary outcome. Then for a deterministic prediction the hit rate (HR) is the proportion of observed events that were correctly predicted. Similarly the false alarm rate (FAR) is the proportion of occasions on which the event did not happen but was nevertheless predicted to occur. The extension to probability forecasts is achieved by calculating HR and FAR for a sequence of cases in which the event is taken to be predicted whenever the forecast probability equals or exceeds a specified value. Often this is done in 10% steps, and when the resulting values of HR and FAR are plotted against each other the ROC curve emerges. Provided the event is chosen to reflect the interest of applications, the combination of HR and FAR is readily interpreted in terms of user actions, although it is immediately clear from the curve that attempts to maximise HR will also lead to increases in FAR.
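
The sketch below shows how the HR/FAR pairs behind a ROC curve can be obtained from a set of probability forecasts of a binary event, stepping the decision threshold in 10% intervals as described above; the forecast probabilities and outcomes are invented for illustration only.

```python
import numpy as np

# Hypothetical probability forecasts of an event (say, above-average rainfall)
# and whether the event was subsequently observed (1) or not (0).
probs = np.array([0.90, 0.70, 0.20, 0.60, 0.10, 0.80, 0.30, 0.40, 0.75, 0.05, 0.55, 0.15])
occurred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0])

events = occurred == 1
non_events = occurred == 0

# Take the event as 'predicted' whenever the forecast probability equals or
# exceeds the threshold, stepping the threshold in 10% intervals.
for threshold in np.arange(1, 10) / 10:
    predicted = probs >= threshold
    hit_rate = predicted[events].mean()              # observed events that were forecast
    false_alarm_rate = predicted[non_events].mean()  # non-events on which the event was forecast
    print(f"threshold {threshold:.1f}: HR = {hit_rate:.2f}, FAR = {false_alarm_rate:.2f}")

# Plotting HR against FAR across the thresholds traces out the ROC curve; lowering
# the threshold raises HR but raises FAR along with it.
```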

The third diagnostic, reliability, is inherent in ROC and is applicable to probability forecasts only. Stated directly, a forecast system is said to be reliable if, across all occasions on which an event is predicted to occur with, say, 40% probability, the event does indeed occur on 40% of those occasions. The reliability curve is normally plotted with pairs of values at 10% probability intervals in order to give an overall view of reliability.
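
As an illustration of how a reliability curve is assembled, the sketch below groups synthetic probability forecasts into 10% bins and compares each issued probability with the frequency at which the event actually occurred; the data are generated to be roughly reliable by construction, purely to show the mechanics.

```python
import numpy as np

# Synthetic probability forecasts of a binary event, issued in 10% steps, and
# outcomes generated so that the forecasts are (roughly) reliable by construction.
rng = np.random.default_rng(seed=1)
probs = rng.integers(0, 11, size=500) / 10            # forecast probabilities 0.0, 0.1, ..., 1.0
occurred = (rng.uniform(size=500) < probs).astype(int)

# For each 10% probability bin, compare the issued probability with the observed
# frequency of the event; plotting the pairs gives the reliability curve.
for p in np.arange(0, 11) / 10:
    in_bin = probs == p
    if in_bin.any():
        observed_freq = occurred[in_bin].mean()
        print(f"forecast {p:.0%}: observed frequency {observed_freq:.0%} over {in_bin.sum()} cases")
```

A reliable system produces pairs lying close to the diagonal of the diagram; real forecast systems typically depart from it, and those departures are what this diagnostic exposes.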

Given the range of diagnostics in use, and the fact that the standardisation activities have not yet had time to deliver results, it is difficult to make an overarching statement on the quality of the predictions. Nevertheless a number of generalisations may be made. Probably the first, from the meteorological perspective (the applications perspective is treated below), is that, whilst variable both spatially and between seasons, skill over and above climatology does exist across much of the globe. For example, early results indicating that skill over Europe was marginal and amongst the lowest on the planet have been overturned by the PROVOST project, in which some remarkably skilful predictions were achieved (Graham et al., 2000). Nevertheless high levels of skill are not available over Europe in all years. Projects such as PROVOST have permitted consistent assessment of skill on a global scale.

The variety of models targeted at predicting ENSO-related tropical sea surface temperature variations in the Pacific Ocean all have measurable skill out to perhaps nine months to a year, in some cases perhaps a little longer, and attempts have been made to predict these variations at even longer range. A number of estimates from numerical models for Pacific sea surface temperature predictions outside the tropical belt and for predictions across the other ocean basins have been made, and there are indications of some skill, although this varies by region. Empirical prediction models designed for the Atlantic and Indian Ocean basins also show some skill, but overall the highest skill levels are clearly achieved for the tropical Pacific. The creation of an array of moored data buoys across this basin, an outcome of the TOGA project, has underpinned this ability. Some moored buoys are present across the other two basins but these arrays are not yet suitable for use in operational prediction. A new global array of drifting buoys, ARGO, is expected to provide useful information for seasonal prediction in the near future.

Regarding the atmosphere, predictions of spatially relatively homogeneous variables, such as temperature, are more skilful than those for more heterogeneous variables such as rainfall. Nevertheless rainfall can be predicted with a high level of skill in some regions, predictions for the Nordeste of Brazil having a correlation approaching 0.9. It is likely that the high skill for this region results from strongly linear influences from both the Pacific and Atlantic oceans, although rainfall predictions for some tropical Pacific Rim areas approach similar levels of skill. The highest levels of skill in general are found in those regions where variations are strongly linked to sea surface temperature variations, and these tend to be at tropical latitudes, particularly close to the Pacific Ocean. In general skill tends to fall off as distance from the equator increases, although regions such as parts of North America are favoured with relatively high levels of skill for their latitude because of the specific manner in which the global atmosphere responds to changes in the tropical Pacific. Results from the PROVOST experiment appear to indicate that predictability for Europe is mainly restricted to El Niño and La Niña periods.

Skill is also dependent on two further considerations. The highest skills are in general found for the shortest-range predictions; as the forecast range and/or the lead time increases, skill tends to decline. The effect has been well demonstrated for sea surface temperature predictions in the Pacific, but is less fully documented for temperature and rainfall predictions. Some empirical methods have been assessed for forecast lead periods exceeding a year, and apparently retain some skill over limited regions, but few statistics are available as yet for numerical models beyond four months. DEMETER is one project that will provide extensive statistics out to six months.

Seasonality is the second consideration, and again statistics are limited, but it is clear that skill varies seasonally in a manner that is difficult to generalise. In the middle northern latitudes skill in general is highest in winter and spring and lowest in autumn. Elsewhere skill for both temperature and rainfall predictions tends to be at a maximum when linear links with sea surface temperatures are strongest. Predictions for Pacific Ocean sea surface temperatures tend to have their lowest skill when projected through the April/May period, but skill recovers quickly in June. Known as the predictability barrier, this effect inhibits predictions of the build-up of El Niño events until the later part of the year.

Whilst projects such as DEMETER will provide substantial information concerning the quality of predictions from numerical models there is as yet no equivalent project for detailing the efficacy of empirical models. It is anticipated that the standardisation exercises will facilitate intercomparison of skill levels achievable through the various modelling approaches. At present it is probably fair to state that, in general, the empirical and numerical approaches provide forecasts of comparable skill, although there certainly are regions and/or seasons where one approach demonstrably outperforms the other. Numerical approaches have many potential advantages over empirical methods and are likely to become increasingly predominant. Empirical methods, on the other hand, will continue to play a fundamental role in exploration, in providing performance benchmarks for the numerical models, and in offering opportunities for developing prediction systems available to all.
