## Bootstrap Confidence Intervals

In statistical analysis of climate time series, our aim (Chapter 1) is to estimate parameters of Xtrend(T), Xout(T), S(T) and Xnoise(T). Denote in general such a parameter as 9. An estimator, 0, is a recipe how to calculate 9 from a set of data. The data, discretely sampled time series {t(i), x(i)}™=1, are influenced by measurement and proxy errors of x(i), outliers, dating errors of t(i) and climatic noise. Therefore, 0 cannot be expected to equal 9. The accuracy of 0, how close it comes to 9, is described by statistical terms such as standard error, bias, mean squared error and confidence interval (CI). These are introduced in Section 3.1.

With the exploration of new archives or innovations in proxy, measurement and dating techniques, new 90 values, denoted as estimates, become available and eventually join or replace previous estimates. A telling example from geochronology is where 9 is the time before present when the Earth's magnetic field changed from reversed polarity during the Matuyama epoch to normal polarity during the Brunhes epoch, at the beginning of the late Pleistocene. Estimates published over the past decades include 690 ka (Cox 1969) and 730 ka (Mankinen and Dal-rymple 1979), both based on K/Ar dating; and 790 ka (Johnson 1982) and 780 ka (Shackleton et al. 1990), both based on astronomical tuning. The currently accepted value is 779 ka with a standard error of 2 ka (Singer and Pringle 1996), written as 779 ± 2ka, based on 40Ar/39Ar dating (a high-precision variant of K/Ar dating). An example with a much greater uncertainty regards the case where 9 is the radiative forcing (change in net vertical irradiance at the tropopause) of changes in atmospheric concentrations of mineral dust, where even the sign of 9 is uncertain (Penner et al. 2001; Forster et al. 2007). It is evident that the

M. Mudelsee, Climate Time Series Analysis, Atmospheric and 65

Oceanographic Sciences Library 42, DOI 10.1007/978-90-481-9482-7_3, © Springer Science+Business Media B.V. 2010

growth of climatological knowledge depends critically on estimates of 0 that are accompanied by error bars or other measures of their accuracy.

Bootstrap resampling (Sections 3.2 and 3.3) is an approach to construct error bars and CIs. The idea is to draw random resamples from the data and calculate error bars and CIs from repeated estimations on the resamples. For climate time series, the bootstrap is potentially superior to the classical approach, which relies partly on unrealistic assumptions regarding distributional shape, persistence and spacing (Chapter 1). However, the bootstrap, developed originally for data without serial dependence, has to be adapted before applying it to time series. Two classes of adaptions exist for taking persistence into account. First, nonparametric bootstrap methods resample sequences, or blocks, of the data. They preserve the dependence structure over the length of a block. Second, the parametric bootstrap adopts a dependence model. As such, the AR(1) model (Chapter 2) is our favorite.

It turns out that both bootstrap resampling types have the potential to yield acceptably accurate CIs for estimated climate parameters. A problem for the block bootstrap arises from uneven time spacing. Another difficult point is to find optimal block lengths. This could make the parametric bootstrap superior within the context of this book, especially for small data sizes (less than, say, 50). The block bootstrap, however, is important when the deviations from AR(1) persistence seem to be strong. Various CI types are investigated. We prefer a version (so-called BCa interval) that automatically corrects for estimation bias and scale effects. Computing-intensive calibration techniques can further increase the accuracy.

### 3.1 Error bars and confidence intervals

Let 0 be the parameter of interest of the climatic process {X(T)} and 0 be the estimator. Extension to a set of parameters is straightforward. Any meaningful construction lets the estimator be a function of the process, 0 = g ({X(T)}). That means, 0 is a random variable with statistical properties. The standard deviation of 0, denoted as standard error, is biasg- > 0 (biasg- < 0) means a systematic overestimation (underestimation). se^ and biasg- are illustrated in Fig. 3.1. Desirable estimators have small se^ and small bias^. In many estimations, a trade-off problem

The bias of 00 is Figure 3.1. Standard error (seg), bias (biasg) and equi-tailed confidence interval (CIg 1_2a = [^l; fuj) for a Gaussian distributed estimator, 0. The true parameter value is 0; the confidence level is 1 — 2a = 90%.

between seg- and biasg- occurs. A convenient measure is the root mean squared error,

The coefficient of variation is

While 0 is a best guess of d or a point estimate, a CI is an interval estimate that informs how good a guess is (Fig. 3.1). The CI for d is

where 0 < 1 — 2a < 1 is a prescribed value, denoted as confidence level. The practical examples in his book consider 90% (a = 0.05) or 95% (a = 0.025) CIs, which are reasonable choices for climatological problems. 0 is the lower, O the upper endpoint of the CI. 0 and O are random variables and have statistical properties such as standard error or bias. The properties of interest for CIs are the coverages,

Exact CIs have coverages, y, equal to the nominal value 1 — 2a. Construction of exact CIs requires knowledge of the distribution of 0, which can be achieved only for simple problems. In more complex situations, only approximate CIs can be constructed (Section 3.1.3). As regards the division of the nominal coverage between the CI endpoints, this book adopts a practical approach and considers only equi-tailed CIs, where nominally Yi = Yu = a. As a second CI property besides coverage, we consider interval length, 0u — °i, which is ideally small.

Preceding paragraphs considered estimators on the process level. In practice, on the sample level, we plug in the data {t(i), x(i)}™=1 for {T(i),X(i)}n=1. Following the usual convention, we denote also the estimator on the sample level as An example is the autocorrelation estimator (Eq. 2.4).

3.1.1 Theoretical example: mean estimation of Gaussian white noise

which is called a Gaussian purely random process or Gaussian white noise. There is no serial dependence, and the times T(i) are not of interest. Consider as estimator 0 of the mean, the sample mean, written on process level as n

Let also a be unknown and estimated by the sample standard deviation, 0 = Sn-1, given in the next example (Eq. 3.19). The properties of X readily follow as seX = a ■ n-1/2, (3.11)

An exact CI of level 1 — 2a can be constructed by means of the Student's t distribution of X (von Storch and Zwiers 1999):

X + tn— i(a) • Sn-i • n-1/2;X + tn-1(1 — a) • Sn-1 • n-1/2

tv (ft) is the percentage point at ft of the t distribution function with v degrees of freedom (Section 3.9).

On the sample level, we write the estimated sample mean, n x(

the estimated standard error,

and the confidence interval,

CIx,i-2a = x + tn-i(a) • Sn-1 • n 1/2; x + tn-1(1 — a) • Sn-1 • n 1/2

The performance of the CI in Eq. (3.18) for Gaussian white noise is analysed by means of a Monte Carlo simulation experiment. The CI performs excellent in coverage (Table 3.1), as expected from its exactness. The second CI property, length, decreases with data size. It can be further compared with CI lengths for other location measures.

3.1.2 Theoretical example: standard deviation estimation of Gaussian white noise

Consider the Gaussian white-noise process (Eq. 3.9) with unknown mean, and as estimator of a the sample standard deviation, written on process level as

Table 3.1. Monte Carlo experiment, mean estimation of a Gaussian purely random process. nsim = 4,750,000 random samples of {X(i)}™=1 were generated after Eq. (3.9) with ^ = 1.0, a = 2.0 and various n values. An exact confidence interval CIx,1—2a was constructed for each simulation after Eq. (3.18) with a = 0.025. Average CI length, empirical RMSEx and empirical coverage were determined subsequently. The entries are rounded.

Table 3.1. Monte Carlo experiment, mean estimation of a Gaussian purely random process. nsim = 4,750,000 random samples of {X(i)}™=1 were generated after Eq. (3.9) with ^ = 1.0, a = 2.0 and various n values. An exact confidence interval CIx,1—2a was constructed for each simulation after Eq. (3.18) with a = 0.025. Average CI length, empirical RMSEx and empirical coverage were determined subsequently. The entries are rounded.

 n RMSE| Nominalh ( CI length }c Nominald Yx Nominal 10 0.6327 0.6325 2.7832 2.7832 0.9499 0.9500 20 0.4474 0.4472 1.8476 1.8476 0.9498 0.9500 50 0.2828 0.2828 1.1310 1.1310 0.9501 0.9500 100 0.2000 0.2000 0.7916 0.7917 0.9499 0.9500 200 0.1415 0.1414 0.5570 0.5571 0.9499 0.9500 500 0.0894 0.0894 0.3513 0.3513 0.9500 0.9500 1000 0.0633 0.0632 0.2482 0.2482 0.9499 0.9500

c Average value over nsim simulations.

2 • tn— 1 (1 - a) • a • c • n 1/2, where c is given by Eq. (3.24). e Empirical coverage, given by the number of simulations where CIx,1—2a contains M, divided by nsim. Standard error of Yx is (Efron and Tibshirani 1993) nominally [2a(1 - 2a)/nsim]1/2 = 0.0001.

c Average value over nsim simulations.

2 • tn— 1 (1 - a) • a • c • n 1/2, where c is given by Eq. (3.24). e Empirical coverage, given by the number of simulations where CIx,1—2a contains M, divided by nsim. Standard error of Yx is (Efron and Tibshirani 1993) nominally [2a(1 - 2a)/nsim]1/2 = 0.0001.

The properties of Sn-1 are as follows:

c = [2/(n - 1)]1/2 ■ r(n/2) / r((n - 1)/2). (3.24)

On the sample level, we write

Table 3.2. Monte Carlo experiment, standard deviation estimation of a Gaussian purely random process. nsim = 4,750,000 random samples of {X(i)}™=1 were generated after Eq. (3.9) with ^ = 1.0, a = 2.0 and various n values. An exact confidence interval CISn_1i1-2a was constructed for each simulation after Eq. (3.26) with a = 0.025. Average CI length, empirical RMSESn-1 and empirical coverage were determined subsequently.

Table 3.2. Monte Carlo experiment, standard deviation estimation of a Gaussian purely random process. nsim = 4,750,000 random samples of {X(i)}™=1 were generated after Eq. (3.9) with ^ = 1.0, a = 2.0 and various n values. An exact confidence interval CISn_1i1-2a was constructed for each simulation after Eq. (3.26) with a = 0.025. Average CI length, empirical RMSESn-1 and empirical coverage were determined subsequently.

 n RMSES , Norninalh { CI length }c Nominald Nominal 10 0.4677 0.4677 2.2133 2.2133 0.9500 0.9500 20 0.3232 0.3233 1.3818 1.3819 0.9500 0.9500 50 0.2018 0.2018 0.8174 0.8174 0.9499 0.9500 100 0.1421 0.1420 0.5659 0.5659 0.9499 0.9500 200 0.1002 0.1002 0.3960 0.3960 0.9500 0.9500 500 0.0633 0.0633 0.2489 0.2489 0.9500 0.9500 1000 0.0447 0.0447 0.1757 0.1757 0.9501 0.9500

a Empirical RMSEsn-1, given by [E^ST (sn-1 - ct)2 /nsim] 1 . b ct • [2(1 - c)]112.

c Average value over nsim simulations.

d [(xn-1(1 - «))-112 - (xn-1 («))-112] • ct • c • (n - 1)1/2.

e Empirical coverage, given by the number of simulations where CIs 1,1-2a contains ct, divided by nsim. Standard error of Ys 1 is nominally [2a(1 - 2a)/nsim]1/2 = 0.0001.

a Empirical RMSEsn-1, given by [E^ST (sn-1 - ct)2 /nsim] 1 . b ct • [2(1 - c)]112.

c Average value over nsim simulations.

d [(xn-1(1 - «))-112 - (xn-1 («))-112] • ct • c • (n - 1)1/2.

e Empirical coverage, given by the number of simulations where CIs 1,1-2a contains ct, divided by nsim. Standard error of Ys 1 is nominally [2a(1 - 2a)/nsim]1/2 = 0.0001.

and use the chi-squared distribution of (von Storch and Zwiers

1999) to find

where x2^) is the percentage point at fl of the chi-squared distribution function with v degrees of freedom (Section 3.9).

The performance of the CI in Eq. (3.26) for Gaussian white noise is analysed by means of a Monte Carlo simulation experiment. The CI performs excellent in coverage (Table 3.2), as expected from its exactness. The CI property length can be compared with CI lengths for other measures of spread or variation.

### 3.1.3 Real world

The two theoretical examples (Sections 3.1.1 and 3.1.2) presented convenient settings. X(i) was normally distributed and persistence was absent, for which reasons the spacing was not relevant. The simple estimators /2 and <r could then be applied for mean and standard deviation estimation, which allowed to deduce their distributions as Student's t and

Table 3.3. Monte Carlo experiment, mean and median estimation of a lognormal purely random process. nsim = 4,750,000 random samples of {X(i)}™=1 were generated after X(i) = exp a2)(i)] , i = 1,... ,n, with ^ = 1.0, a = 1.0 and various n values. The density function is skewed (Fig. 3.2). Analysed as estimators of the centre of location of the distribution were the sample mean (Eq. 3.16) and the sample median, m (see background material, Section 3.8). CIx,1—2a was constructed after Eq. (3.18) with a = 0.025.

Table 3.3. Monte Carlo experiment, mean and median estimation of a lognormal purely random process. nsim = 4,750,000 random samples of {X(i)}™=1 were generated after X(i) = exp a2)(i)] , i = 1,... ,n, with ^ = 1.0, a = 1.0 and various n values. The density function is skewed (Fig. 3.2). Analysed as estimators of the centre of location of the distribution were the sample mean (Eq. 3.16) and the sample median, m (see background material, Section 3.8). CIx,1—2a was constructed after Eq. (3.18) with a = 0.025.

 n RMSEm RMSEx Yi Nominal C b 10 1.1647 1.8575 0.8392 0.9500 -0.1108 20 0.7893 1.3140 0.8670 0.9500 -0.0830 50 0.4884 0.8309 0.8991 0.9500 -0.0509 100 0.3430 0.5880 0.9170 0.9500 -0.0330 200 0.2418 0.4155 0.9296 0.9500 -0.0204 500 0.1526 0.2627 0.9399 0.9500 -0.0101 1000 0.1078 0.1858 0.9442 0.9500 0.0058

a Standard error of Yx is nominally 0.0001.

b Empirical coverage error of CIx,i—2a, given by Yx minus nominal value.

a Standard error of Yx is nominally 0.0001.

b Empirical coverage error of CIx,i—2a, given by Yx minus nominal value.

chi-squared, respectively. Finally, exact CIs were obtained using the percentage points of the distributions of the estimators.

In the real climatological world, however, such simple assumptions regarding distributional shape, persistence and spacing cannot be expected to be fulfilled (Chapter 1). In the practical setting, further questions than just after mean and standard deviation are asked, leading to more complex parameters, 0. The major part of the rest of this book is devoted to such problems. Also the estimators of those parameters have commonly more complex distributions, f (°).

Example 3 (Table 3.3) goes a small step from the theoretical in the direction of the real world. This case illustrates the effects of violations of the distributional assumption. Example 3 assumes that X(i) are Gaussian distributed, although the prescribed true distribution is lognormal. This leads to a Student's t CI with an empirical coverage that deviates from the nominal value by several standard errors (Table 3.3). The difference is the coverage error (see next paragraph), its absolute value decreases with the data size. This CI is not exact but only approximate. Table 3.4 summarizes theoretical and practical settings.

Coverage error, C, is defined by means of a single-sided CI endpoint (Efron and Tibshirani 1993), for example,

If C decreases with sample size as O (n-1/2), that means, if C is composed of terms of powers of 1/n that are greater than or equal to 1/2, f(x) Figure 3.2. Lognormal density function from Example 3 (Table 3.3), with ^ = 1.0 and a = 1.0. The expression for f (x) is given by Eq. (3.61).

Figure 3.2. Lognormal density function from Example 3 (Table 3.3), with ^ = 1.0 and a = 1.0. The expression for f (x) is given by Eq. (3.61).

then the CI is called first-order accurate; if C is of O (n-1), then the CI is called second-order accurate; and so forth. The same CI accuracy applies also to two-sided CIs. Desirable approximate CIs have a high-order accuracy. Coverage accuracy is the major criterion employed in this book for assessing the quality of a CI. As a second property we consider interval length, 0u — 0\, which is ideally small. Related to CI accuracy is CI correctness (Efron and Tibshirani 1993: Section 22.2 therein), which refers to the difference between an exact CI endpoint (which has C = 0) and an approximate CI endpoint, expanded in terms of powers of n.

For practical situations it is conceivable that different estimators, 01 and 02, of the same parameter, 0, exist. Consider for example parameter estimation of the AR(p) model, for which Priestley (1981: Section 5.4.1 therein) gives four sets of estimators, namely exact likelihood, least squares, approximate least squares and Yule-Walker. Each estimator has its own properties such as standard error, bias, RMSE, CI length or CI coverage accuracy.

An important attribute of an estimator is robustness, which means that the d properties depend only weakly on made assumptions (shape, persistence and spacing). Robust estimators perform better (e.g., have smaller RMSE or higher coverage accuracy) than non-robust in non-ideal situations. Example 3 shows that the sample median as an estimator of the centre of location of a distribution is more robust (with regard to RMSE^) than the mean. In essence, because of the complexity of the setting in the real world and the dependence on the situation and the aims of the analysis, there is no general rule how to construct best an estimator. It has something of an art, which is not meant negatively. In this light, the growth of climatological knowledge does not only depend on more and better data but also on improved methods to analyse them.

Table 3.4 shows also how real-world climatological estimation problems may be tackled. The classical approach comes from theory and aims to extend the applicability by introducing countermeasures. Regarding distributional shape, a measure may be to estimate the shape of the noise data (Section 1.6). Then one looks up and applies the estimator for the parameter 0 that performs for this particular shape best in terms of a user-specified property, say RMSE. The CI follows from the estimator's distribution. The problem is that only for simple shapes and parameters, knowledge is available that would allow this procedure. (In this regard, the lognormal without is clearly simpler than the lognormal with shift parameter (Section 3.8).) Transformations of the data, such that the noise part has a simple shape, can also be tried, but then the problem is that the systematic part of the model (Eq. 1.2) can take intractable forms, see Atkinson and Cox (1988) on this dilemma. (The double-logarithmic transformation described in Section 2.6 was in the converse direction. It produced a simpler systematic part and a more complex noise part.)

Regarding persistence, the effective data size, n', can be used instead of n for CI calculation. The problem here is that n' depends on the persistence model and on which estimator is used (Chapter 2). One may take n'M (Eq. 2.7), n'a2 (Eq. 2.36) or n'p (Eq. 2.38) for the AR(1) process and hope that deviations to the problem at hand are small. Regarding spacing, it is fair to say that the classical approach mostly ignores unevenness because its influence on n' and the distribution of d can in the general case not be deduced. As a result, the classical approach often contents itself with approximate normality, that is, with f (0) approaching normal shape as n ^ to. For many theoretical estimations, approximate normality can be proven. However, the point is that in practice n is limited and it is mostly unknown how accurate the normal approximation of the CI is.

### 3.2 Bootstrap principle

Table 3.4 lists also the bootstrap approach to solve practical estimation problems. These tasks include constructing CIs for estimators more complex than the mean, and this in the presence of non-normal distribu-

 Distributional shape Persistence Spacing Estimator, 9 Distribution of 9, m Confidence interval, CIg,l-2a Setting Theoretical13 Known, normal No (yes) Not relevant (even) Tractable Deducible Exact Example 1 Example 2 Normal Normal No No Not relevant Not relevant p (Eq. 3.10) a (Eq. 3.19) t distribution X2 distribution Exact Exact Practical Non-normal Yes Uneven More complex than pi or a Often not deducible Exact only if f{9) deducible Example 3 Lognormal No Not relevant p (Eq. 3.10), m Ignored Student's t approximation Approach Classical Find shape, apply suitable 9 or transform x Effective data size Ignore Assume normality Approximate, based on assumptions Bootstrap Not very relevant0 Block bootstrap or parametric Not relevant"1 Not very relevant0 Approximate, based on fewer assumptions

a Indicated are the main lines of settings and approaches. Exceptions exist; for example, theory deals also with uneven spacing (Parzen 1984). b Theory must necessarily impose restrictions to shape, persistence and spacing to obtain tractable problems. c Distributional properties can influence bootstrap CI accuracy (see text), but this is a minor effect.a d Restriction: Parametric persistence models more complex than AR(1) are not considered for uneven spacing because of the embedding problem (Section 2.3).

a Indicated are the main lines of settings and approaches. Exceptions exist; for example, theory deals also with uneven spacing (Parzen 1984). b Theory must necessarily impose restrictions to shape, persistence and spacing to obtain tractable problems. c Distributional properties can influence bootstrap CI accuracy (see text), but this is a minor effect.a d Restriction: Parametric persistence models more complex than AR(1) are not considered for uneven spacing because of the embedding problem (Section 2.3).

tions, persistence and uneven spacing. The main idea of the bootstrap is to use the data to mimic the unknown distribution function, which is now replaced by the empirical distribution function (Eq. 3.43). Mimicking the data generating process is achieved by drawing random samples from the data set. The simplest form is the ordinary bootstrap, that means, drawing one by one with replacement. Preserving the persistence properties of time series data requires adaptions of the ordinary bootstrap, which are explained in Section 3.3. Re-applying the estimation procedure to the new random samples, called resamples, yields new estimates, called replications. Section 3.4 explains CI construction using the replications. Figure 3.3 shows the bootstrap principle and the workflow. It gives also a simple bootstrap CI variant (bootstrap normal CI).

The bootstrap means that numerical simulation replaces theoretical derivation of the distribution of an estimator. This can be an improvement, especially if the complexity of the problem defies obtaining an exact theoretical result. However, also the bootstrap is not free of assumptions. The main requirement is that the properties distributional shape and persistence are preserved by the bootstrap resampling. There is also "simulation noise," but this can be made arbitrarily small by using a large number of resamples, B. Assumptions made at CI construction add to the fact that in complex situations, bootstrap CIs, like classical CIs, are not exact but approximate. In complex cases, for small sample size, non-smooth functionals such as the median and without underlying theory, even the bootstrap may fail to yield acceptable results (LePage and Billard 1992). However, bootstrap CIs seem to be more flexible and require less strict assumptions than classical CIs (Table 3.4). A word on usage of "simulation:" henceforth we reserve this for Monte Carlo experiments, where statistical methods are tested by means of artificial data from models with pre-defined properties. The bootstrap procedure, on the other hand, is referred to as "resampling."

### 3.3 Bootstrap resampling

The ordinary bootstrap, resampling one by one with replacement, is a nonparametric method because it can virtually be applied to data from any continuous PDF without involvement of distributional parameters. By resampling one by one, the serial dependence in {X(i)}™=1 is lost. For the analysis of time series, the ordinary bootstrap has therefore to be adapted to take serial dependence into account. This can be done nonparametrically, by resampling block by block of data. Alternatively,