## Future Directions

What changes may bring the future to climate time series analysis? First we outline (Sections 9.1, 9.2 and 9.3) more short-term objectives of "normal science" (Kuhn 1970), extensions of previous material (Chapters 1, 2, 3, 4, 5, 6, 7 and 8). Then we take a chance (Sections 9.4 and 9.5) and look on paradigm changes in climate data analysis that may be effected by virtue of strongly increased computing power (and storage capacity). Whether this technological achievement comes in the form of grid computing (Allen 1999; Allen et al. 2000; Stainforth et al. 2007) or quantum computing (Nielsen and Chuang 2000; DiCarlo et al. 2009; Lanyon et al. 2009)—the assumption here is the availability of machines that are faster by a factor of ten to the power of, say, twelve, by a mid-term period of, say, less than a few decades.

### 9.1 Timescale modelling

Climate time series consist not only of measured values of a climate variable, but also of observed time values. Often the latter are not evenly spaced and also influenced by dating uncertainties. Conventional time series analysis largely ignored uneven and uncertain timescales, climate time series analysis has to take them into account.

The process that generated the times, {tX(i)} for univariate and also {tY(j)} for bivariate series, depends on the climate archive. We have studied linear and piecewise linear processes for speleothem or sedimentary archives (Section 4.1.7) and nonparametric models for ice cores (Section 8.6.1). Such types of models are the basis for including uncertain timescales in the error determination by means of bootstrap resampling ({tX(i)} and also {tY(j)}). In bivariate and higher dimensional estimation problems, also the joint distributions of the timescale processes are

M. Mudelsee, Climate Time Series Analysis, Atmospheric and 383

Oceanographic Sciences Library 42, DOI 10.1007/978-90-481-9482-7_9, © Springer Science+Business Media B.V. 2010

important. See the example of the Vostok ice core (Section 8.6.1) with the coupled timescales for the ice and the gas.

Climate archive modelling should be enhanced in the future to provide accurate descriptions of uncertain timescales. Archive models should evidently include the physics of the accumulation of the archive. One may even think of physiological models describing the performance of humans in layer counting of regular sequences such as varves (Table 1.3). A second ingredient of climate archive modelling are statistical constraints, for example, a strictly monotonically increasing age-depth curve in a speleothem archive or an absolutely dated fixpoint in a marine sediment core. An exemplary paper (Parrenin et al. 2007) of climate archive modelling studies the accumulation and flow in an ice sheet, into which a core is drilled. The Bayesian approach may be suitable for combining the inputs from physics and statistical constraints (Buck and Millard 2004).

### 9.2 Novel estimation problems

Chapters 2, 3, 4, 5 and 6 presented stochastic processes and estimation algorithms for inferring the fundamental properties of univariate climate processes in the climate equation (Eq. 1.2): trend, variability, persistence, spectrum and extremes. Chapters 7 and 8 studied bivariate processes: correlation and the regression relation between two univari-ate processes. We believe to have covered with these chapters the vast majority of application fields for the climate sciences.

However, in science there is always room for asking more questions, that means in a quantitative approach, for attempting to estimate different climate parameters in the uni- or bivariate setting.

An obvious example of such a novel estimation problem is SSA, mentioned in the background material of Chapter 1. This decomposition method has been formulated so far only for evenly spaced, discrete time series. Interpolation to equidistance is obsolete because it biases the objectives of the decomposition (estimates of trend, variability, etc.). SSA formulations applicable to unevenly spaced records should therefore be developed.

Other novel estimation approaches are expected to come from the array of nonlinear dynamical systems theory (Section 1.6). This field has a focus more on application data from controlled measurements or computer experiments and less on unevenly spaced, short paleoclimatic time series. A breakthrough, also with respect to SSA, may come from techniques of reconstructing the phase space at irregular points.

### 9.3 Higher dimensions

Climate is a complex, high-dimensional system, comprising many variables. Therefore it makes sense to study not only univariate processes (Part II), X, or bivariate processes (Part III), X and Y, but also trivari-ate processes, X and Y and Z, and so forth. A simple estimation problem for such high-dimensional processes is the multivariate regression, mentioned occasionally in previous chapters (Sections 4.2 and 8.7),

Y(i) = 0Q + 0iX(i) + 02Z(i) + ■ ■ ■ + Sy(i) ■ Ynoise(i). (9.1)

The higher number of dimensions may also result from describing the climate evolution in the spatial domain (e.g., X is temperature in the northern, Y in the southern hemisphere). There is a variety of high-dimensional, spatial estimation problems: multivariate regression, PCA and many more (von Storch and Zwiers 1999: Part V therein).

As regards the bootstrap method, there is no principle obstacle to perform resampling in higher dimensions. An important point is that resampling the marginal distributions, of X and Y and Z separately, is not sufficient; the joint distribution of (X, Y, Z), including dependences among variables, has to be resampled to preserve the original covari-ance structure. This requires adaptions of the block bootstrap (MBB) approach. A further point, which may considerably exacerbate the estimation as well as the bootstrap implementation, is unequal observation times. The sets

need not be identical. Depending on the estimation problem and the properties of the joint climate data generating process (e.g., persistence times), the algorithm for determining 0Q,01, 02, and so forth, has to be adapted. This is a step into new territory. An example from the bivariate setting is the "synchrony correlation coefficient" (Section 7.5.2). A final point of complication from the move into higher dimensions is dependence among the timescale variables. Since this type of complication can occur already in two-dimensional problems (Section 8.6.1), we expect it in higher dimensions as well. This challenge must be met by means of timescale modelling (Section 9.1).

### 9.4 Climate models

Computer models render the climate system in the form of mathematical equations. The currently most sophisticated types, AOGCMs (Fig. 1.9), require the most powerful computers. Nevertheless, the rendered spatial and temporal scales are bounded by finite resolutions and finite domain sizes. Also the number of simulated climate processes is limited.

The problem of a finite spatial resolution is currently tackled by means of using an AOGCM (grid size several tens to a few hundred kilometres) for the global domain and nesting into it a regional model or RM (grid size reduced by a factor ~ 20) for a sub-domain of interest (say, Europe). The AOGCM "forces" the RM (Meehl et al. 2007; Christensen et al. 2007), that means, prescribes the conditions at the boundaries of the sub-domain. Sub-grid processes, not resolved even by the RM (e.g., cloud processes) and therefore not explicitly renderable by the AOGCM-RM combination, can be implicitly included by employing inferred parametric relations (e.g., between cloud formation and temperature). The AOGCM-RM combination includes many variables, (X, Y, Z,...)' = X, from the climate at grid points, and many parameters, (0o, 02,...)' = 0, from the parameterizations (Stensrud 2007) and other model equations. For convenience of presentation, we consider the climate variable vector, X, and the climate model parameter vector, 0.

Our premise of a future "quantum boost" by a factor ~ 1012 can make regionalization dispensable and let more realistic AOGCMs (grid size several tens to a few hundred metres) become calculable with computing times reduced from, say, a year to less than a month. Regarding the sophistication of a climate model, the increased computing power can also be utilized for including processes from the fields of biology and economy (greenhouse gas emissions (Moss et al. 2010) and "climate engineering" measures). Indeed, a finer spatial grid does require more processes to be explicitly included. Regarding the temporal scale, the boost should allow to simulate much larger spans (transient paleoclimate runs) by the means of AOGCMs and their successors.

There exists, however, another field where to invest computing power, namely the uncertainty determination of climate model results. We sketch this area in light of the methodology presented in this book, statistical estimation and bootstrap resampling.

Physics describes climate dynamics by means of nonlinear coupled differential equations,

where the dot denotes time derivative, f is a function, and R represents uncoupled, external forcing variables (e.g., solar activity). Time discretization yields

where AT is a time step, in an AOGCM typically in the order of minutes to hours. From an initial climate state, X(1), the climate evolution is derived. This sample from the climate model "archive" is

The climate evolution can also be observed, yielding a multivariate time series sample,

The observations are, of course, strongly limited in the number of climate variables, geographic locations and time resolutions. There have been few observations made of, say, temperature in 1000 m height above sea-level at 130°W, 30°S for the time interval from 1850 to 2010 and a spacing of d(i) = AT = 30 minutes.

### 9.4.1 Fitting climate models to observations

Let us view climate modelling as an estimation problem. The task is to estimate the model parameters, 6, given observations, {xo(i)}™=1. This set shall include the "missing observations." The task requires to run the model and produce {x(i)}™=1. The less distant the model output is to the observations, the better the fit.

Let us introduce a cost function to measure the distance,

g may be a form of a generalized least-squares cost function that takes into account predictor uncertainty and the degrees of freedom; Section 9.4.3 considers the design of g in more detail. The parameter estimate minimizes the cost function,

The parameter vector is included in the right-hand side of the equation because the model output, {x(i)}™=1, depends on it.

The outlined procedure is with current computing power not feasible for a full estimation of AOGCM parameters. It has been performed for a simple climate model containing only three variables (Hargreaves and Annan 2002) and an Earth system model of intermediate complexity (Paul and Schafer-Neth 2005). The concept of fitting climate models to data is also denoted as data assimilation or state estimation (Wunsch 2006).

Subsequent to the estimation, we should like to know the parameter uncertainties for the fitted climate model. This knowledge may be achieved by means of bootstrap methods, producing the replications,

The observation resample, xo(i), can be obtained via the surrogate data bootstrap (Section 3.3.3), taking into account the errors of the observation devices, the distributional shapes (which may be Gaussian or not), the covariances (which may be rather small) and the "internal climate variability" (which may have to be estimated by means of separate model experiments). The model output resample, x*(i), incorporates a new (trial) set of parameters, 6*. However, it should also be based on a random initial state, x*(1), because the initial conditions are not exactly known. x*(1) may be taken randomly from a set of time series values, {xunforced(j)}j=i, of a climate model run without forcing components (stationarity). This "ensemble technique" is already currently being applied to quantify the uncertainty component owing to imperfectly known initial conditions (Randall et al. 2007; van der Linden and Mitchell 2009). Also the forcing variable, R(i), may have to be described stochastically for being included in the surrogate data approach.

The replications, {6 }B=1, serve in the usual manner (Section 3.4) for constructing CIs. Of particular interest should be the joint PDF of the climate model parameter estimators, which may be described by means of confidence regions in the parameter hyperspace (Smith et al. 2009; Tebaldi and Sanso 2009). Realistic climate model error and CI determination do not require a handful of runs (current ensemble technique) but rather B runs, with B in the usual order of 2000 or even higher (because of the dimensionality).

### 9.4.2 Forecasting with climate models

Models are employed to forecast future climate, x(n + 1), at time t(n + 1). (Indeed, forecasts are made for many time steps to cover the typical range from the present to the year 2100.) This is achieved in our vision by a run of the model employing the estimated, optimal parameters, 0. That run has to use also a guess of the future forcing, R(n + 1).

Of crucial importance, scientifically and socioeconomically, is to determine the size of the forecasting error. The bootstrap methodology, utilized for that purpose in the bivariate setting (Section 8.5), should be helpful also in the high-dimensional setting.

The recommendation is to produce forecast resamples, x*(n + 1), from which to calculate standard errors, CIs, confidence bands (over a time span), and so forth.

How are the x*(n +1) produced to reflect the full range of the various sources of uncertainty?

■ The parameterization uncertainty can be taken into account by re-

sampling from the set of replications, {0 }jB=1. This preserves the covariance structure of the parameter estimates.

■ The initial-condition uncertainty can be taken into account by means of the ensemble technique.

■ The forcing uncertainty may be difficult to include in a quantitative manner. This step does likely necessitate the usage of separate forcing models.

### 9.4.3 Design of the cost function

Designing the cost function (Eq. 9.7) is important for achieving small standard errors and narrow CIs for the climate forecasts and the model parameter estimates. It is rather difficult to demonstrate theoretically the optimality of a certain cost function. One should perform Monte Carlo simulations to find "empirically" a suitable function. The following points may guide the design endeavour.

■ A least-squares technique is mandatory. It seems impossible to write down a likelihood function (for maximization) owing to the size of the body of the climate model equations. One may wish to make the sum of squares more robust with respect to "outliers." On the other hand, one may give the "outliers" instead more weight in situations where the focus is on modelling the climate extremes.

■ GLS, employing the covariance matrices (variability, persistence) of the many climate variables, is a possible technique to reduce the estimation standard errors. The normalization (variability) produces dimensionless SSQG terms for each variable, which can be processed further (e.g., summed up).

■ A problem is multicollinearity (correlated predictors), stemming from spatial dependence among the climate variables (neighboured grid points). This may indicate to reduce the number of variables in the cost function by means of spatial binning. PCA techniques should help evaluating geographically meaningful bins (regions).

■ Errors in the observations (Sx, SY, Sz,...) should lead researchers to consider techniques like WLSXY estimation (Section 8.1.2) to reduce estimation bias.

■ Further weighting could be performed "in the time domain" to enforce, for example, the most recent years to be more accurately simulated.

■ The degrees of freedom, v, of the observation-model combination can be taken into account (a simple division by v).

■ One may put bounds to the 0 hyperspace to exclude estimation results that are inconsistent with physics (hard bounds) or prior knowledge (soft bounds). Bayesian formulas may help here.

The envisaged availability of "quantum computing power" does not release us from the task of constructing efficient methods to search through the hyperspace, to locate the minimum of the cost function: gradient techniques, Brent's search, hybrid procedures or Bayesian approaches (Monte Carlo Markov Chain, see Hargreaves and Annan (2002) and Leith and Chandler (2010)).

9.4.4 Climate model bias

Climate model bias regards, generally speaking, a function of the climate variable vector, n = h (X). (9.10)

The function, h, can be used to make n an index variable or extract a geographic region. For example, we may wish to study time-dependent, annual-mean, regional-mean, land-surface precipitation in central Europe, n(j )= n-1 n-1 £ £ Xfc (i), (9.11)

k € region T(i) € year j where Xk(i) is precipitation at grid point k and time T(i), ni is the number of time values within year j and nk is the number of model grid points within central Europe.

Let us now view the modelled sequence as an estimate obtained by means of a climate model, i(j). Next we consider the true sequence. Since the truth is hidden, we take instead an observed sequence, no(j). This leads, in analogy to Eq. (3.2), to the climate model bias, bias^(j) = E [i(j)] - no(j). (9.12)

In the example of precipitation in central Europe, there are indications from a range of AOGCM-RM combinations that bias^(j) > 0 for the time interval from 1950 to the very recent past (Jacob D 2009, personal communication), that is, the climate models systematically overestimate precipitation. Similar overestimations were found for the region of Scandinavia (Goodess et al. 2009).

In the context of climate forecasting (Section 9.4.2), better predictions may therefore include a climate model bias correction. For example, if the model bias is simply a constant, bias^, then n'(jfuture) = n(jfuture) - bias^, (9.13)

where jfuture indicates future (unobserved) time and the prime denotes bias correction. Evidently, the time-dependence of the bias and also its form (additive, multiplicative) should be analysed in such situations. Further developments may employ more complex stochastic models of the climate model bias (Jun et al. 2008).

### 9.5 Optimal estimation

Increased computer power would also allow to perform optimal estimation. We have sketched this concept in previous parts of this book (Sections 6.2.7 and 7.5.3.1). Not only climatology, other science branches as well may benefit from optimal estimation.

Central to the investigation in natural sciences, such as climatology, is to infer the truth from the data. This calls for the statistical language. In quantitative climatology, the investigative questions can be translated into a parameter, 0, which needs to be estimated using the data. The investigation cycles through loops: question, estimation, refined question based on the estimation result, new estimation, and so forth.

An estimator, 0, is a recipe how to guess 0 using the data. Since the sample size is less than infinity and the sampled climate system contains unknown influences (noise), we cannot expect that 0 equals 0. However, we can calculate the size of that error, the uncertainty. This leads to the measures se^, bias^, RMSEg- and the confidence interval, CIg-1-2a, which is thought to include 0 with probability 1 — 2a. Without having the information contained in such measures, it is difficult to assess how close 0 is to 0: estimates without error bars are useless.

For simple estimation problems (e.g., mean estimation) and simple noise properties (e.g., Gaussian distributional shape), the error measures can be analytically derived via the PDF of an estimator. However, climate is more complex—as regards the noise as well as the estimation problem. This book advocates therefore the bootstrap resampling approach, which allows to analyse complex problems for realistic (i.e., complex) properties such as non-Gaussian shape or serial dependence.

For the most part of this book, we have assumed the uncertainty to have its origin in the complex climate system and the measured variables (proxy, measurement and dating errors). We have occasionally considered (Sections 4.1.7.4, 4.4 and 8.3.4) another error source, a mis-specified model. Statistical science refers to this error source as model uncertainty; see Chatfield (1995), Draper (1995), Candolo et al. (2003) and Chatfield (2004: Section 13.5 therein). By fitting a range of candi date models it is possible to infer the range of feasible estimation outcomes. For example, one may compare the estimated 100-year return level, HQioo, from a Weibull fit with the estimated HQioo from a GEV fit to observed runoff data, and look whether the difference of the results is comparable to the statistical standard errors. Note that model uncertainty may regard also the assumed noise model (e.g., short versus long memory). A method to reduce model uncertainty is to employ graphical and computational tests of model suitability. As a method to quantify model uncertainty, we may study not only the range of the estimation outcomes but impose a weighting according to the probability that a particular model is correct. The "model probability" may be based in a Bayesian approach on a prior consultation of experts (Smith et al. 2009). In the example of HQ100, there is hope that the hydrologists would put more weight on the GEV model than on the Weibull. It is principally possible to add model uncertainty as a new dimension to the hyperspace of climate estimation (Fig. 9.1).

Climate is a paradigm of a complex system that requires for its analysis the bootstrap. In addition, climate opens the new problem dimensions of unequally spaced series and timescale errors. This book has presented various bootstrap algorithms to adapt closely to the estimation problem imposed by the data: ARB, MBB, SB, surrogate data, timescale-ARB, timescale-MBB, pairwise-ARB, pairwise-MBB and pair-wise-MBBres. It also described algorithms to support bootstrap resampling and CI construction: block length selection, calibration, the CI types normal, Student's t, percentile and BCa.

The critical question is: What is the best method for inferring the truth from the data? What is the optimal estimation method, and how are the most accurate CIs constructed?

Future, strongly increased computing power allows to approach that question by means of Monte Carlo experiments. We outline this optimal estimation approach (Fig. 9.1). We reiterate that optimal estimation is not limited to the field of climate sciences.

The hyperspace of climate estimation has many, but not infinite dimensions. It consists of the three subspaces Monte Carlo design, method and measure.

The Monte Carlo design (Fig. 9.1) describes the data generating process. The design is used to generate artificial data, to which the method is applied. The design should, in some sense, cover the estimation problem (data and estimation) to be carried out. One group of dimensions is occupied by the type of estimation model and the parameters. For example, one may be interested in a linear regression model with the two parameters intercept and slope (Chapters 4 and 8). To restate, the

A CIa accuracy

Measure

Method

### Monte Carlo design size Spacing

Figure 9.1. Hyperspace of climate parameter estimation. The Monte Carlo experiment prescribes the stochastic model, parameters and other properties (shape, sample size, spacing, persistence, etc.) in a way that the problem at hand (data and estimation) is covered. The method regards estimation and CI construction. The optimal estimation is determined by using a measure.

A CIa accuracy

Measure

Method

### Monte Carlo design size Spacing

Figure 9.1. Hyperspace of climate parameter estimation. The Monte Carlo experiment prescribes the stochastic model, parameters and other properties (shape, sample size, spacing, persistence, etc.) in a way that the problem at hand (data and estimation) is covered. The method regards estimation and CI construction. The optimal estimation is determined by using a measure.

Monte Carlo parameters (e.g., prescribed intercept and slope) should be close to the estimated parameters (estimated intercept and slope). The other group of dimensions in the Monte Carlo subspace describe the sample size (prescribed n, which should be close to the size of the sample at hand), the spacing (again, similar to the spacing of the sample) and the noise properties (also similar). An option is to invest three dimensions to model the persistence of the noise as an ARFIMA(p, q) process (which contains the simpler types such as AR(1)) and one or two to model the shape (skewness, kurtosis). Heteroscedasticity may also be modelled. The ARFIMA process contains the preferred parsimonious, embedding-problem free AR(1) process (p = 1,5 = 0,q = 0). Some dimensions have integer values (e.g., the ARFIMA parameter p), some have real values (e.g., the slope parameter). Timescale errors may also be modelled (additional dimensions).

The method subspace (Fig. 9.1) describes the estimation and CI construction. The ticks along the estimator dimension are named least squares, maximum likelihood, and so forth. CI construction requires more dimensions: one for distinguishing between classical and bootstrap CIs, and several for detailing the bootstrap methodology (block length selection for MBB, calibration, subsampling, etc.) and calculating the interval bounds from the replications. Consider, for example, the brute-force block length selector (Berkowitz and Kilian 2000): one dimension with integer values between 1 and n - 1.

The measure subspace (Fig. 9.1) describes how to detect the optimal estimation method for the Monte Carlo experiment: CI accuracy and width, RMSE, bias, robustness, and so forth. It should make sense to consider also joint measures (e.g., CI accuracy and robustness).

The hyperspace of climate parameter estimation is large. Present computing power limits our ability to explore it and find the optimal method for solving a (climate) estimation problem. This book has examined many important estimation problems (regression, spectrum, extremes and correlation) but visited only parts of the hyperspace by means of Monte Carlo experiments. For example, in linear regression (Chapter 4), we have studied

■ spacing: even and uneven (timescale errors);

■ shape: Gaussian and lognormal;

■ persistence: AR(1), AR(2) and ARFIMA(0,0.25,0);

■ estimator: least squares only;

■ resampling: ARB, MBB, subsampling, timescale-ARB, timescale-MBB and pairwise-MBB;

■ CI type: classical and bootstrap BCa;

■ calibration loop: none; and

■ measure: RMSE, CI accuracy and CI length.

We have found "acceptable" results (mainly judged via CI accuracy) from the bootstrap method applied to Monte Carlo samples generated from designed processes that are considered as close to the climate processes. These positive results have given us confidence that the results (estimate with CI) from analysing the observed, real climate time series are valid. However, we have to concede that there may exist more accurate methods, resulting in particular from (computing-intensive) CI calibration. This may be of relevance especially for small sample sizes.

The envisaged large increase in computing power may bring the following idea of optimal climate estimation into existence. Given a time series, {t(i),x(i)}™=1, some prior information (e.g., measurement standard errors, age-depth curve) and a set of questions (parameters to be estimated), the first task is simple: perform an initial estimation on basis of existing knowledge and experience with such types of estimation problems. The second task requires the computing power: explore the hyperspace (Fig. 9.1) to find the suitable method, that is, the mode of estimation and CI construction that optimizes a selected measure for prescribed values close to the initial estimates. Also here, intelligent exploration methods (gradient, Brent, etc.) are useful. The third task is to apply the optimal estimation method to the climate time series.

## Post a comment