CIθ̂,1−2α = [θ̂*(α1); θ̂*(α2)], where

α1 = F(ẑ0 + [ẑ0 + z(α)] / {1 − â · [ẑ0 + z(α)]}),
α2 = F(ẑ0 + [ẑ0 + z(1 − α)] / {1 − â · [ẑ0 + z(1 − α)]}).

F(·) is the standard normal distribution function (Eq. 3.49). The bias correction, ẑ0, is computed as

ẑ0 = F⁻¹(#{θ̂*b < θ̂}/B),

where #{θ̂*b < θ̂} means the number of replications with θ̂*b < θ̂, and F⁻¹(·) is the inverse function of F(·). The acceleration, â, is computed (Efron and Tibshirani 1993) as

â = Σ_{j=1}^{n} [θ̄(·) − θ̂(j)]³ / (6 {Σ_{j=1}^{n} [θ̄(·) − θ̂(j)]²}^{3/2}),

where θ̂(j) is the jth jackknife value of θ̂. Consider the original sample with the jth point removed, that is, {t(i), x(i)}, i = 1, ..., n, i ≠ j. The jackknife value θ̂(j) is then the value of θ̂ calculated using this sample of reduced size. The average, θ̄(·), is given by Σ_{j=1}^{n} θ̂(j)/n.

ẑ0 corrects for the median estimation bias; for example, if exactly half of the replications have θ̂*b < θ̂, then ẑ0 = 0. The acceleration, â, takes into account scale effects, which arise when the standard error of θ̂ itself depends on the true parameter value, θ.
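The BCa construction can be sketched in code. The following Python sketch assumes independent data (for serially dependent series, the replications would come from ARB or block resampling instead of the ordinary bootstrap used here); the function and variable names are illustrative, not from the text.

```python
import numpy as np
from statistics import NormalDist

def bca_ci(x, estimator, b=1999, alpha=0.025, seed=None):
    """BCa confidence interval sketch (after Efron and Tibshirani 1993).
    Assumes independent data; resampling is the ordinary bootstrap."""
    rng = np.random.default_rng(seed)
    nd = NormalDist()
    n = len(x)
    theta_hat = estimator(x)

    # Bootstrap replications theta*_b, b = 1, ..., B
    reps = np.array([estimator(rng.choice(x, size=n, replace=True))
                     for _ in range(b)])

    # Bias correction: z0 = F^-1(#{theta*_b < theta_hat} / B)
    z0 = nd.inv_cdf(np.count_nonzero(reps < theta_hat) / b)

    # Acceleration from the jackknife values theta_(j)
    jack = np.array([estimator(np.delete(x, j)) for j in range(n)])
    d = jack.mean() - jack
    a = (d**3).sum() / (6.0 * ((d**2).sum())**1.5)

    # BCa-adjusted percentiles of the bootstrap distribution
    def adj(z_alpha):
        return nd.cdf(z0 + (z0 + z_alpha) / (1.0 - a * (z0 + z_alpha)))
    a1 = adj(nd.inv_cdf(alpha))
    a2 = adj(nd.inv_cdf(1.0 - alpha))
    return np.quantile(reps, [a1, a2])
```

With ẑ0 = 0 and â = 0, the adjusted levels reduce to α and 1 − α, that is, to the percentile CI.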

3.5 Examples

In the first, theoretical example, we compare classical and bootstrap CIs in terms of coverage accuracy (Table 3.5). The mean of AR(1) processes with uneven spacing was estimated for two distributional shapes, normal and lognormal. The classical CI employed the effective data size for mean estimation; the bootstrap CI used the ARB algorithm and the BCa method.

The classical CI performed better for the normal than for the lognormal shape. This is because the normal assumption made at CI construction is violated in the case of the lognormal shape. With increasing data size, the lognormal approaches the normal distribution (Johnson et al. 1994: Chapter 14 therein) and the difference in performance decreases. However, this difference is still significant for n = 1000 in the example.

The bootstrap CI also performed better for the normal than for the lognormal shape. This may be because persistence time estimation (τ̂) and persistence time bias correction (τ̂′) are less accurate for non-normally distributed data.

Table 3.5. Monte Carlo experiment, mean estimation of AR(1) noise processes with uneven spacing, normal and lognormal shape. nsim = 47,500 random samples were generated from the Gaussian AR(1) process, {X(i)}ⁿᵢ₌₁, after Eq. (2.9) with τ = 1. The samples from the lognormal AR(1) process were generated by taking exp[X(i)]. The start was set to t(1) = 1; the time spacing, d(i), was drawn from a gamma distribution (Eq. 2.48) with order parameter 16, that means, a distribution with a coefficient of variation equal to 16^(−1/2) = 0.25, and subsequently scaled to d̄ = 1. Two CI types for the estimated mean were constructed, classical and bootstrap. The classical CI employed n′μ calculated from Eq. (2.7) with ā = exp(−d̄/τ̂′) plugged in for a, and the t distribution (Eq. 3.18). The bootstrap CI used the ARB (Algorithm 3.5) and the BCa method (Section 3.4.4) with B = 1999 and α = 0.025.

        Empirical coverage γx̄ᵃ
        Normal shape            Lognormal shape
 n      Classical  Bootstrap   Classical  Bootstrap   Nominal
 10     0.918      0.863       0.835      0.789       0.950
 20     0.929      0.903       0.845      0.845       0.950
 50     0.938      0.929       0.876      0.888       0.950
 100    0.943      0.941       0.897      0.909       0.950
 200    0.942      0.943       0.914      0.922       0.950
 500    0.947      0.948       0.926      0.930       0.950
 1000   0.947      0.949       0.933      0.937       0.950

ᵃ Standard error of γx̄ is nominally 0.001.


For small sample sizes (n ≤ 50 (normal distribution) or n ≤ 20 (lognormal distribution)), the classical CI performed better than the bootstrap CI. This advantage is likely in part owing to the fact that a formula for the effective data size for mean estimation is known; it may disappear for more complex estimators, where no formula for the effective data size exists. For larger sample sizes (n ≥ 100 (normal distribution) or n ≥ 50 (lognormal distribution)), the bootstrap CI is as good as the classical CI (normal shape) or better (lognormal shape).

In the second, practical example, Fig. 3.4 shows the transition from a glacial (MIS 6) to the last interglacial (MIS 5) in the Vostok CO2 record. The mean CO2 concentration was estimated for the time intervals from 140 to 177 ka (glacial) and from 115 to 130 ka (interglacial). Student's t CIs (Section 3.4.2) were constructed using nonparametric stationary bootstrap resampling, a variant of the MBB, where the block length is

Figure 3.4. Determination of mean CO2 levels in the Vostok record (Fig. 1.3b) during a glacial and an interglacial. The interval from 140 to 177 ka represents the glacial (MIS 6), the interval from 115 to 130 ka the interglacial (marine isotope substage 5.5). The 95% bootstrap CIs for the estimated means are shown as shaded bars.


not constant (Section 3.8). The number of resamples was B = 2000. The average block length was adjusted to NINT(4 · τ̂/d̄).

The mean glacial CO2 level was determined as 192.8 ppmv with 95% CI [188.3 ppmv; 197.3 ppmv]; the mean interglacial CO2 level was 271.9 ppmv with 95% CI [268.8 ppmv; 275.0 ppmv]. Because of the reduced data sizes in the intervals (glacial, n = 13; interglacial, n = 24), the accuracies of the CIs may also be reduced. The enormous glacial-interglacial amplitude in CO2 documents the importance of this greenhouse gas for late Pleistocene climate changes, the ice age. The relation between CO2 and temperature changes is analysed in Chapters 7 and 8.

3.6 Bootstrap hypothesis tests

By the analysis of climate time series, {t(i), x(i)}ⁿᵢ₌₁, we make, generally speaking, a statistical inference about properties of the climate system. One type of inference is estimation of a climate parameter, θ. In addition to a point estimate, θ̂, an interval estimate, CIθ̂,1−2α, helps to assess how accurate θ̂ is. The bootstrap is used to construct CIs in complex situations regarding the data properties shape, persistence and spacing. The second type of inference is testing a hypothesis, a statement about the climate system, using the data sample. Again, this can be a difficult task (shape, persistence, spacing), and again, the bootstrap can be a powerful tool in such a situation. Hypothesis tests are also called significance tests or statistical tests.

A hypothesis test involves the following procedure. A null hypothesis (or short: null), H0, is formulated. H0 is tested against an alternative hypothesis, H1. The hypotheses H0 and H1 are mutually exclusive. H0 is a simple null hypothesis if it completely specifies the data generating process. An example would be "X(i) is a Gaussian white-noise process with zero mean and unit standard deviation." H0 is a composite null hypothesis if some parameter of X(i) is unspecified, for example, "Gaussian white-noise process with zero mean." Next, a test statistic, U, is calculated. Any meaningful construction lets U be a function of the data process, U = g({T(i), X(i)}ⁿᵢ₌₁). On the sample level, u = g({t(i), x(i)}ⁿᵢ₌₁). In the example H0: "Gaussian white-noise process with μ = 0," one could take U = X̄ = Σⁿᵢ₌₁ X(i)/n, the sample mean. U is a random variable with a distribution function, F0(u), where the index "0" indicates that U is computed "under H0," that is, as if H0 were true. F0(u) is the null distribution. In the example, F0(u) would be Student's t distribution function (Section 3.9). If in the example the alternative were H1: "μ > 0," then a large, positive u value would speak against H0 and for H1. Using F0(u) and plugging in the data {t(i), x(i)}ⁿᵢ₌₁, the one-sided significance probability or one-sided P-value results as

P = 1 − F0(u).

The P-value is the probability that under H0 a value of the test statistic greater than or equal to the observed value, u, is observed. If P is small, then H0 is rejected and H1 accepted; otherwise, H0 is accepted and H1 rejected. The two-sided P-value is

P = 2 · min[F0(u), 1 − F0(u)].

In the example, a two-sided test would be indicated for H1: "Gaussian white-noise process with μ ≠ 0." Besides the P-value, a second result of a statistical test is the power. In the one-sided test example:

power = prob(H0 is rejected | H1 is true).

A type-2 error is accepting H0 although it is a false statement and H1 is true. The probability of a type-2 error is β = 1 − power. A type-1 error is rejecting H0 against H1 although H0 is true. P, the significance probability, is therefore also denoted as type-1-error probability or false-alarm probability; u is also denoted as false-alarm level.
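For the simple null of the text ("Gaussian white noise with zero mean and unit standard deviation"), the null distribution of U = X̄ is exactly normal with standard deviation 1/√n, so both P-values can be computed directly. A minimal sketch (the function name is illustrative):

```python
from statistics import NormalDist

def p_values(x):
    """One- and two-sided P-values for the simple null H0:
    'Gaussian white noise with mu = 0 and sigma = 1', using the
    test statistic u = sample mean, whose null distribution is
    F0 = N(0, 1/n)."""
    n = len(x)
    u = sum(x) / n
    f0 = NormalDist(mu=0.0, sigma=1.0 / n**0.5)
    p_one = 1.0 - f0.cdf(u)                        # H1: mu > 0
    p_two = 2.0 * min(f0.cdf(u), 1.0 - f0.cdf(u))  # H1: mu != 0
    return p_one, p_two
```

For a composite null (σ unspecified), F0 would instead be Student's t distribution of the studentized mean.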

Although H0 can be a composite null, it is usually more explicit than H1. In climatological practice, the selection of H1 should be guided by prior climatological knowledge. H1 also determines whether a test should be designed as one- or two-sided. For example, if H0 were "no temperature change in a climate model experiment studying the effects of doubled CO2 concentrations, ΔT = 0," then a one-sided test against H1: "ΔT > 0" would be appropriate because physics would not let one expect a temperature decrease. Because H1 is normally rather general, it is difficult to quantify the test power. Therefore, more emphasis is put on accurate P-values. Various test statistics, U1, U2, ..., may be appropriate for testing H0 against H1. The statistic of choice has, for a given data set, a small type-1-error probability (small P-value) as the first quality criterion. The second quality criterion is a small type-2-error probability (large power), preferably calculated for some realistic, explicit alternative. We can say that a test does not intend to prove that a hypothesis is true, but rather that it tries to reject a null hypothesis. A null hypothesis becomes more "reliable" after it has been tested successfully against various realistic alternatives using various data samples, see Popper (1935). It is important that H0 and H1 are established independently of the data to prevent circular reasoning, see von Storch and Zwiers (1999: Section 6.4 therein). As a final general remark, it is more informative to give P-values than to report merely whether they are below certain specified significance levels, say P < 0.1, 0.05 or 0.01.


Figure 3.5. Hypothesis test and confidence interval. The parametric null hypothesis H0: "θ ≤ 0" cannot be rejected against H1: "θ > 0" with a P-value of α.

When H0 concerns a particular parameter value (U = θ̂), a CI can be used to derive the P-value (Efron and Tibshirani 1993: Section 15.4 therein). Suppose that a test observes u = θ̂ < 0. Then select α such that the upper CI bound equals zero. Nominally, prob(θ > 0) = α (Fig. 3.5). This gives a P-value of α for the test of H0: "θ ≤ 0" against H1: "θ > 0." An example from a bivariate setting with data {x(i), y(i)}ⁿᵢ₌₁ would be the comparison of the means μX and μY. If the CI at level 1 − 2α for the absolute value of the difference of the mean estimates, |μ̂X − μ̂Y|, does contain zero, then H0: "μX = μY" cannot be rejected against H1: "μX ≠ μY" in this two-sided test. A criticism of this CI method of hypothesis testing would be that the CIs are not necessarily constructed as if H0 were true. There might be scale changes, and F0(u) may depend on H0. However, the BCa CI provides a correction for this effect (Efron and Tibshirani 1993: p. 216 therein). Another option would be to construct a test statistic, U, such that F0(u) is the same for all H0. Such a statistic is called a pivot.

Davison and Hinkley (1997: Chapter 4 therein) explain the construction of hypothesis tests by approximating F0(u) with F̂0(u) obtained from bootstrap resampling or the bootstrap surrogate data approach (Section 3.3.3). The permutation test, developed in the 1930s (Edgington 1986), is the bootstrap test with the difference that the random samples are drawn without replacement. This book puts more emphasis on bootstrap CIs than on bootstrap hypothesis tests because CIs contain more quantitative information. We subscribe to Efron and Tibshirani's (1993: p. 218 therein) view that "hypothesis tests tend to be overused and confidence intervals underused in statistical applications."
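The permutation idea can be sketched for the comparison of two sample means; the without-replacement draws are what distinguish it from a bootstrap test. The sketch assumes exchangeable (serially independent) observations; the function name and the add-one P-value rule are illustrative choices, not from the text.

```python
import numpy as np

def perm_test_diff_means(x, y, n_perm=1999, seed=None):
    """Two-sided permutation test of H0: equal means. The random
    samples are permutations of the pooled data, i.e. drawn
    without replacement (the difference to a bootstrap test)."""
    rng = np.random.default_rng(seed)
    u_obs = abs(np.mean(x) - np.mean(y))
    pooled = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)             # without replacement
        u = abs(perm[:n].mean() - perm[n:].mean())
        if u >= u_obs:
            count += 1
    return (count + 1) / (n_perm + 1)              # avoids P = 0
```

Replacing `rng.permutation(pooled)` by `rng.choice(pooled, size=len(pooled), replace=True)` would turn this into the corresponding bootstrap test.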

An illustrative example is the case where θ is the anthropogenic signal proportion in the increase of the global temperature over the past 150 years. Specifically, θ can be defined as ΔTwith − ΔTwithout, where ΔTwith is the temperature change calculated using an AOGCM and taking human activities such as fossil fuel consumption into account, and ΔTwithout is the temperature change without the effects of human activities ("control run"). Hasselmann (1993) and Hegerl et al. (1996) developed the "fingerprint" approach to derive a powerful test statistic from the high-dimensional, gridded AOGCM output, and showed that H0: "θ = 0" can be rejected against H1: "θ > 0." One task was to quantify the natural temperature variability in the temporal and spatial domains, in order to derive the null distribution. This is difficult because the observed variability contains both natural and anthropogenic portions. It was solved using AOGCM experiments without simulated anthropogenic forcings and a surrogate data approach (Section 3.3.3), that means, several control runs with perturbed initial conditions. It is evident that an estimate, θ̂, with confidence interval, CIθ̂,1−2α, for

the anthropogenic signal proportion would mean a step further towards quantification.

3.7 Notation

Table 3.6 summarizes the notation.

Table 3.6. Notation.

X(T)  Climate variable, continuous time, process level
Xtrend(T)  Trend component, continuous time, process level
Xout(T)  Outlier component, continuous time, process level
S(T)  Variability, continuous time
Xnoise(T)  Noise component, continuous time, process level
T  Continuous time
X(i)  Climate variable, discrete time, process level
Xtrend(i)  Trend component, discrete time, process level
Xout(i)  Outlier component, discrete time, process level
S(i)  Variability, discrete time
Xnoise(i)  Noise component, discrete time, process level
T(i)  Discrete time
i  Index
j  Index
EN(μ, σ²)(i)  Gaussian noise process with mean μ and standard deviation σ, discrete time
x(i)  Climate variable, discrete time, sample level
t(i)  Discrete time, sample level
{t(i), x(i)}ⁿᵢ₌₁  Data or sample, discrete time series
d(i)  Time spacing, sample level
d̄  Average time spacing, sample level
n  Data size
θ  (Climate) parameter
θ̂  Estimator of (climate) parameter, process and sample levels, estimate
θ̂1, θ̂2  Other estimators
PDF  Probability density function
f(θ̂)  PDF of θ̂
F(·)  Probability distribution function
F⁻¹(·)  Inverse probability distribution function
Femp(·)  Empirical distribution function
E(·)  Expectation operator
VAR(·)  Variance operator
g(·)  Function
Γ(·)  Gamma function
NINT(·)  Nearest integer function
seθ̂  Standard error of θ̂
biasθ̂  Bias of θ̂
RMSEθ̂  Root mean squared error of θ̂
CVθ̂  Coefficient of variation of θ̂
CI  Confidence interval
CIθ̂,1−2α  Confidence interval for θ̂ of level 1 − 2α
θ̂l  Lower bound of CI for θ̂
θ̂u  Upper bound of CI for θ̂
γl  Coverage, below lower CI bound
γu  Coverage, above upper CI bound
γ  Coverage of CI
μ  Mean
μ̂  Mean estimator
X̄  Sample mean, process level
x̄  Sample mean, sample level
γx̄  Coverage of CIx̄,1−2α
σ  Standard deviation
σ̂  Standard deviation estimator
Sn−1  Sample standard deviation, process level
sn−1  Sample standard deviation, sample level
γsn−1  Coverage of CIsn−1,1−2α
z(β) = zβ  Percentage point at β of the standard normal distribution
tν(β)  Percentage point at β of the t distribution function with ν degrees of freedom
χ²ν(β)  Percentage point at β of the chi-squared distribution function with ν degrees of freedom
β  Probability
nsim  Number of (Monte Carlo) simulations
c  Constant
c  Counter
C  Coverage error
O(·)  Order of (·)
⟨·⟩  Average
AR(1)  Autoregressive process of order 1
AR(p)  Autoregressive process of order p
MA(q)  Moving average process of order q
ARMA(p, q)  Mixed autoregressive moving average process
n′  Effective data size
n′μ  Effective data size for mean estimation
n′σ²  Effective data size for variance estimation
n′ρ  Effective data size for correlation estimation
a  AR(1) autocorrelation parameter (even spacing)
â  AR(1) autocorrelation parameter (even spacing) estimator
â′  AR(1) autocorrelation parameter (even spacing) estimator, bias-corrected
τ  AR(1) persistence time (uneven spacing)
τ̂  AR(1) persistence time (uneven spacing) estimator
τ̂′  AR(1) persistence time (uneven spacing) estimator, bias-corrected
ā  AR(1) equivalent autocorrelation parameter (uneven spacing)
ā̂  AR(1) equivalent autocorrelation parameter (uneven spacing) estimator
ā̂′  AR(1) equivalent autocorrelation parameter (uneven spacing) estimator, bias-corrected
t*, t*(i)  Bootstrap version of discrete time, sample level
t*b(i)  Indexed bootstrap version of discrete time, sample level
b = 1, ..., B  Index
B  Number of bootstrap resamples
x*, x*(i)  Bootstrap version of climate variable, discrete time, sample level
x*b(i)  Indexed bootstrap version of climate variable, discrete time, sample level
d*(i)  Bootstrap version of time spacing, sample level
{t*(i), x*(i)}ⁿᵢ₌₁  Bootstrap resample
θ̂*  Bootstrap replication
θ̂*b  Indexed bootstrap replication
MBB  Moving block bootstrap
ARB  Autoregressive bootstrap
NBB  Non-overlapping block bootstrap
CBB  Circular block bootstrap
SB  Stationary bootstrap
MaBB  Matched-block bootstrap
TaBB  Tapered block bootstrap
l  Block length
lopt  Optimal block length
lsearch  Block length search value
Y(i)  Variable (lopt selector after Bühlmann and Künsch (1999))
IF(X(i))  Estimated influence function
R(h)  Function (lopt selector after Bühlmann and Künsch (1999))
R̂(h)  Autocovariance estimator (Chapter 2)
ρ̂(h)  Autocorrelation estimator (Chapter 2)
h  Lag
b0, b1, b2, b3, b4, b  Parameters (lopt selector after Bühlmann and Künsch (1999))
wSC(·)  Split-cosine window
wTH(·)  Tukey-Hanning window
z  Auxiliary variable
CVd  Coefficient of variation of the spacing
β  Percentage of drawable blocks (adaption of MBB to uneven spacing)
x̂trend(i)  Estimated trend component, discrete time, sample level
x̂out(i)  Estimated outlier component, discrete time, sample level
Ŝ(i)  Estimated variability, discrete time
r(i)  Residual of climate equation, discrete time (Eq. 1.2)
r*(i)  Bootstrap version of residual of climate equation, discrete time (Eq. 1.2)
ε(i)  White-noise residual, discrete time
ε̃(i)  Centred white-noise residual, discrete time
ε̃*(i)  Bootstrap version of centred white-noise residual, discrete time
a′(i)  Abbreviation (ARB algorithm)
BCa CI  Bias-corrected and accelerated CI
ABC CI  Approximate BCa CI
ŝe*  Estimated bootstrap standard error
θ̂*(α)  Percentage point at α of the empirical distribution of θ̂*
α1, α2  Other α values
ẑ0  Bias correction
â  Acceleration
#{·}  Number of cases
θ̂(j)  Jackknife value of θ̂
H0  Null hypothesis
H1  Alternative hypothesis
U  Test statistic, process level
u  Test statistic, sample level (u is also denoted as false-alarm level)
U1, U2  Other test statistics, process level
F0(u)  Null distribution
F̂0(u)  Estimated null distribution
P  P-value, probability of a type-1 error or false-alarm probability
β  Probability of a type-2 error
M  Median
M̂  Sample median, process level
m̂  Sample median, sample level
X′(i)  Size-sorted X(i)
ε  Small value

Table 3.6. Notation (continued).

θ̂*l,b(λ)  Indexed lower bootstrap CI bound over a grid of confidence levels

λ  Variable, determines confidence level

T(λ)  Empirical probability (bootstrap calibration)

y, p0, p1, p2, p3, p4, q0, q1, q2, q3, q4  Parameters (z(β) approximation)

u,v,w Parameters (error function approximation)

b, 5 Parameters (lognormal distribution)

p, q  Parameters (geometric distribution)

Z Set of whole numbers

S Set of numbers

AOGCM Atmosphere-Ocean General Circulation Model

MIS Marine isotope stage (sometimes also loosely used for marine isotope substage)

AT Modelled temperature change

ATwith Modelled temperature change, with fossil fuel consumption

ATwithout Modelled temperature change, without fossil fuel consumption

3.8 Background material

We use the RMSE instead of the mean squared error (given by RMSE²θ̂). The RMSE, with the same units as the data, is a handy parameter.

Standard deviation estimation for Gaussian white noise seems to have raised more interest in previous decades than today, as the discussion from 1968 in the journal The American Statistician illustrates (Cureton 1968a,b; Bolch 1968; Markowitz 1968a,b; Jarrett 1968). For example, the choice σ̂ = c · Sn−1, with c given by Eq. (3.24), yields minimal RMSEσ̂ among all σ estimators for Gaussian white noise (Goodman 1953). Or, σ̂ = c⁻¹ · Sn−1 yields biasσ̂ = 0 for Gaussian white noise, see for example Holtzman (1950). Today, it appears for practical purposes rather arbitrary whether or not to scale Sn−1, or whether to use n − 1 or n. The resulting differences are likely much smaller than the effects of violations of the Gaussian assumption.

The median of a distribution is defined via F(M) = 0.5. (F(·) is the distribution function, see Eq. (3.49).) The sample median as estimator of M is, on the process level,

M̂ = X′((n + 1)/2)  for odd n,
M̂ = [X′(n/2) + X′(n/2 + 1)]/2  for even n,

where X′(i) are the size-sorted X(i). On the sample level, m̂ results from using x(i).
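A direct implementation of this definition (illustrative sketch):

```python
def sample_median(x):
    """Sample median via size-sorting: the middle size-sorted value
    for odd n, the average of the two middle values for even n."""
    xs = sorted(x)                      # x'(1) <= ... <= x'(n)
    n = len(xs)
    if n % 2 == 1:
        return xs[n // 2]                           # x'((n+1)/2)
    return 0.5 * (xs[n // 2 - 1] + xs[n // 2])      # mean of x'(n/2), x'(n/2+1)
```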

A robust estimation procedure "performs well not only under ideal conditions, the model assumptions that have been postulated, but also under departures from the ideal" (Bickel 1988). In the context of this book, the assumptions regard distributional shape, persistence and spacing; the performance regards an estimator and its properties such as RMSE or CI coverage accuracy. Under ideal conditions, robust estimation procedures can be less efficient (have higher seθ̂) than nonrobust procedures. For example, for Gaussianity and n → ∞, se_m̂ → (π/2)^(1/2) · se_x̄ (Chu 1955). Robust estimators can require sorting operations, which often makes it difficult to deduce their distribution. The term "robust" was coined by Box (1953); relevant papers on robust location estimation include Huber (1964) and Hampel (1985); for more details see Tukey (1977) or Huber (1981). Unfortunately, today's usage of "robust" in the climate research literature is rather arbitrary.
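The efficiency loss of the median under ideal (Gaussian) conditions can be checked with a small Monte Carlo experiment; the sample size and nsim below are illustrative choices.

```python
import numpy as np

# Monte Carlo check of se_m -> (pi/2)^(1/2) * se_xbar for Gaussian
# white noise (Chu 1955): compare the spread of sample medians with
# the spread of sample means over many simulated samples.
rng = np.random.default_rng(0)
n, nsim = 1001, 2000
samples = rng.normal(size=(nsim, n))
se_mean = samples.mean(axis=1).std(ddof=1)
se_median = np.median(samples, axis=1).std(ddof=1)
ratio = se_median / se_mean   # close to (pi/2)**0.5 for large n
```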

The empirical distribution function of a sample {x(i)}ⁿᵢ₌₁ is given by

Femp(x) = #{x(i) ≤ x}/n.

Femp(x) is the sample analogue of the theoretical distribution function, for example, Eq. (3.49).
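In code, Femp is a step function that can be evaluated at any x (illustrative sketch):

```python
def f_emp(sample):
    """Empirical distribution function: F_emp(x) = #{x(i) <= x} / n,
    returned as a callable step function."""
    n = len(sample)
    def f(x):
        return sum(1 for xi in sample if xi <= x) / n
    return f
```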

Bootstrap resampling was formally introduced by Efron (1979); this article also summarizes earlier work. Singh (1981) soon recognized that the ordinary bootstrap yields inconsistent results in a setting with serial dependence. A consistent estimator, θ̂, converges in probability to θ as n increases. Convergence in probability means

lim_{n→∞} prob(|θ̂ − θ| > ε) = 0  ∀ ε > 0.    (3.44)

Textbooks on bootstrap resampling include those written by Efron and Tibshirani (1993), Davison and Hinkley (1997) and Good (2005). Statistical point estimation is covered by Lehmann and Casella (1998).

The moving block bootstrap or MBB was introduced by Künsch (1989) and Liu and Singh (1992). The MBB resamples overlapping blocks. Carlstein (1986) had earlier suggested a method (denoted as NBB) that resamples non-overlapping blocks and does not truncate the final block. This may lead to resamples with a data size less than n, that means, subsampling (see below). Hall (1985) had already considered overlapping and non-overlapping block methods in the context of spatial data. Bühlmann (1994) showed that if

1. X(i) is a stationary Gaussian process with short-range dependence,

2. θ̂ is a smooth function g({x(i)}) of the data (e.g., the mean is a smooth function, but the median is not) and

3. the block length, l, increases with the data size, n, within bounds, l = O(n^(1/2−ε)), 0 < ε < 1/2,

then the MBB produces resamples from a process that converges to the data generating process. The MBB is then called asymptotically valid. The questions regarding the validity and other properties of the MBB and other bootstrap methods under relaxed assumptions (non-Gaussian processes, long-range dependence, etc.) are currently extensively studied in statistical science. For long-range dependence and the sample mean as estimator with an asymptotically Gaussian distribution, the MBB can be modified to provide a valid approximation (Lahiri 1993). For long-range dependence and non-Gaussian limit distributions, the MBB has to be changed to subsampling one single block (Hall et al. 1998). Block length selection is less explored for long-range dependence; intuitively, a larger length should be used than for short-range dependence. See Berkowitz and Kilian (2000), Bühlmann (2002), Politis (2003), Lahiri (2003) and references cited in these overviews.

Other block length selectors for the MBB and also for other nonparametric bootstrap methods have been proposed. Hall et al. (1995a) gave an iterative method based on subsamples and cross-validation. As regards the subsample size, consult Carlstein et al. (1998: p. 309 therein). Although the convergence properties in the general case are unknown, the method performed well in the Monte Carlo simulations shown. Politis and White (2004) developed a rule that selects the block length as two times the smallest integer after which the autocovariance function (Eq. 2.18) "appears negligible." However, for uneven spacing the autocovariance function is not defined and this selector is not directly applicable. A related rule, based on the persistence time, τ, of the AR(1) process for uneven spacing (Section 2.1.2), would set l = NINT(4 · τ̂/d̄); Mudelsee (2003) suggested this rule for correlation estimation of bivariate, unevenly spaced time series (Chapter 7).
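An MBB resampling step, together with the persistence-time block length rule, might be sketched as follows (values only; time-axis handling and the full CI construction are omitted, and the function names are illustrative):

```python
import numpy as np

def mbb_resample(x, l, seed=None):
    """Moving block bootstrap: draw, with replacement, from the
    n - l + 1 overlapping blocks of length l, concatenate, and
    truncate so that the resample has length n."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = -(-n // l)                          # ceil(n / l)
    starts = rng.integers(0, n - l + 1, size=n_blocks)
    return np.concatenate([x[s:s + l] for s in starts])[:n]

def block_length_rule(tau_hat, d_bar):
    """Block length l = NINT(4 * tau_hat / d_bar), the
    persistence-time rule (Mudelsee 2003)."""
    return max(1, round(4.0 * tau_hat / d_bar))
```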

An MBB adaption to uneven spacing was analysed using a Monte Carlo experiment. The following simple rule was employed. Instead of allowing all n — l + 1 blocks to be drawn for insertion, only the 50% blocks closest (plus ties) in the coefficient of variation of the spacing, CVd, were made drawable. This was applied to mean estimation of a Gaussian AR(1) process. The comparison between this MBB adaption and the ordinary MBB was made in terms of coverage accuracy and average CI length (Table 3.7). The experiment used the BCa CI and employed the block length selector after Eq. (3.28) for the MBB and its adaption. The result (Table 3.7) exhibits a reduced coverage accuracy of the MBB adaption. The following deficit outweighed the advantage of the adaption (increased similarity of CVd between sample and resample).

Table 3.7. Monte Carlo experiment, moving block bootstrap adaption to uneven spacing. nsim = 47,500 random samples were generated from the Gaussian AR(1) process, {X(i)}ⁿᵢ₌₁, after Eq. (2.9) with τ = 1. The start was set to t(1) = 1; the time spacing, d(i), was drawn from a gamma distribution (Eq. 2.48) with order parameter 16, that means, a distribution with a coefficient of variation equal to 16^(−1/2) = 0.25, and subsequently scaled to d̄ = 1. Bootstrap BCa CIs for the estimated mean were constructed with B = 1999 and α = 0.025. The ordinary MBB resampling algorithm was compared with an MBB adaption to uneven spacing. The adaption made drawable only the 50% blocks closest (plus ties) in the coefficient of variation of the spacing. Both the MBB and its adaption to uneven spacing yield clearly larger coverage errors than the ARB because in that Monte Carlo experiment (Table 3.5) the prescribed AR(1) dependence matches the assumption made by the ARB (Section 3.3.2).

        γx̄ᵃ                             ⟨CI length⟩ᵇ
 n      MBB     Adapted MBB   Nominal   MBB     Adapted MBB
 10     0.591   0.623         0.950     0.836   0.864
 20     0.799   0.788         0.950     0.915   0.890
 50     0.874   0.861         0.950     0.685   0.672
 100    0.901   0.888         0.950     0.510   0.505
 200    0.913   0.903         0.950     0.374   0.372
 500    0.929   0.920         0.950     0.244   0.244
 1000   0.935   0.923         0.950     0.176   0.175

ᵃ Standard error of γx̄ is nominally 0.001. ᵇ Average value over nsim simulations.


Reducing the drawable blocks to 50% reduced, in comparison with the ordinary MBB, the variation between resamples. This in turn reduced the variation between the replications (sample means of resamples). This led to narrower CIs from the adapted MBB algorithm (last two columns in Table 3.7). The CIs from the adapted MBB, finally, contained the true μ value less often than the CIs from the ordinary MBB. This means a reduced accuracy because the empirical coverages were, in this case of mean estimation, always less than the nominal value.

Other nonparametric bootstrap resampling methods than the MBB have been proposed. The circular block bootstrap (CBB) (Politis and Romano 1992a) "wraps" the data {x(i)}ⁿᵢ₌₁ around a circle such that x(n) (Algorithm 3.1) has a successor, x(1). The CBB then resamples overlapping blocks of length l from this periodic structure. That overcomes the deficit of the MBB that data near the edges, x(1) or x(n), have a lower probability of being resampled than data in the centre.
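The wrapping can be implemented with modulo indexing (illustrative sketch; the function name is not from the text):

```python
import numpy as np

def cbb_resample(x, l, seed=None):
    """Circular block bootstrap: wrap the data around a circle
    (x(n) is followed by x(1)), so that all n block starts are
    drawable and edge points are no longer under-represented."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = -(-n // l)                          # ceil(n / l)
    starts = rng.integers(0, n, size=n_blocks)     # any start allowed
    idx = np.concatenate([np.arange(s, s + l) % n for s in starts])
    return x[idx[:n]]
```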

Also the stationary bootstrap (SB) (Politis and Romano 1994) uses the periodic structure to ensure stationarity of the resampling process. The SB also uses overlapping blocks; however, the block length is not constant but geometrically distributed. Similar selectors as for the MBB (Section 3.3.1.1) can be used for adjusting the average block length. As regards the choice among MBB, NBB, CBB and SB, Lahiri (1999) showed that (1) overlapping blocks (MBB, CBB, SB) are better than non-overlapping blocks (NBB) in terms of RMSE of estimation of variance and related quantities like the bootstrap standard error and (2) nonrandom block lengths (MBB, CBB) are, under the same criterion, at least as good as random block lengths (SB). For estimation of the distribution function and related quantities like CI points, less is known, but there are indications that also here MBB and CBB perform better (Lahiri 2003: Chapter 5 therein). Some recent developments are the following. The matched-block bootstrap (MaBB) (Carlstein et al. 1998) introduces dependence between blocks to reduce bias in the bootstrap variance by imposing probability rules. One rule prefers resampling blocks such that block values at the endpoints, where the blocks are concatenated, show a higher agreement than under the MBB. The tapered block bootstrap (TaBB) (Paparoditis and Politis 2001) tapers (weights) data by means of a function before concatenating blocks. The idea is to give reduced weight to data near the block endpoints. This could make the TaBB have lower estimation bias than the MBB or CBB (Paparoditis and Politis 2001). Advanced block bootstrap methods could be better than the MBB for analysing equidistant climate time series, especially in the case of the MaBB, which shows good theoretical and simulation results when X(i) is an AR(p) process (Carlstein et al. 1998). For uneven spacing, it could be more important to enhance the MBB by matching blocks in terms of their spacing structure.
This point deserves further study by means of Monte Carlo experiments. Subsampling refers to a procedure where the bootstrap resample size is less than the data size. The NBB can lead to subsampling. Also the jackknife (Efron 1979), where l = n − 1 and one block only is resampled, is a subsampling variant. A detailed account is given by Politis et al. (1999). We finally mention the wild bootstrap, which attempts to reconstruct the distribution of a residual r(i) (Eq. 3.29) by means of a two-point distribution (Wu 1986; Härdle and Marron 1991). The adaption of the wild bootstrap to nonparametric autoregression by Neumann and Kreiss (1998) has not yet been extended to uneven spacing, however.
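A stationary bootstrap resampling step with geometrically distributed block lengths can be sketched as follows (illustrative names; the mean block length would in practice come from a selector such as Eq. 3.28):

```python
import numpy as np

def sb_resample(x, mean_block_length, seed=None):
    """Stationary bootstrap (Politis and Romano 1994): at each step,
    with probability p = 1/mean_block_length start a new block at a
    random position; otherwise continue with the circular successor.
    Block lengths are thus geometric with mean 1/p."""
    rng = np.random.default_rng(seed)
    n = len(x)
    p = 1.0 / mean_block_length
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for i in range(1, n):
        if rng.random() < p:
            idx[i] = rng.integers(n)               # start a new block
        else:
            idx[i] = (idx[i - 1] + 1) % n          # continue current block
    return x[idx]
```

The circular successor rule is what makes the resampling process stationary, in contrast to the MBB.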

The autoregressive bootstrap or ARB was developed in the early 1980s; relevant early papers include Freedman and Peters (1984), Peters and Freedman (1984), Efron and Tibshirani (1986) and Findley (1986). Then Bose (1988) showed second-order correctness of the ARB method for estimating stationary AR(p) models (not necessarily Gaussian) with even spacing. Validity of the ARB for nonstationary AR(p) models (e.g., random walk or unit-root processes) requires subsampling, that is, drawing less than n resamples at Step 12 of the ARB (Algorithm 3.4), see Lahiri (2003: Chapter 8 therein). The ARB was extended to stationary ARMA(p, q) models with even spacing by Kreiss and Franke (1992). It seems difficult to generate theoretical knowledge about ARB performance for time series models with uneven spacing.

Other parametric bootstrap resampling methods than the ARB have been proposed. The sieve bootstrap (Kreiss 1992; Bühlmann 1997) assumes an AR(∞) process. Because of the high number of terms, this model is highly flexible and can approximate persistence models other than the AR(p) with p < ∞. The sieve bootstrap could therefore also be called a semi-parametric bootstrap method. Its deficit regarding application to climate time series is that it is restricted to even spacing. The parametric bootstrap for Gaussian ARFIMA processes was shown to yield asymptotic coverage errors of CIs for covariance estimation similar to those in the case of independent processes (Andrews and Lieberman 2002).
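The distinguishing feature of the sieve bootstrap, an AR fit whose order grows with the sample size, can be sketched as follows. The order rule and the Yule-Walker fitting are illustrative assumptions, not the choices of the cited references.

```python
import numpy as np

def sieve_bootstrap(x, rng=None):
    """One sieve-bootstrap resample for an even-spaced series.

    The AR order p grows with the data size n (the rule below is an
    illustrative choice). Coefficients come from the Yule-Walker
    equations with the biased autocovariance estimator.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    p = max(1, int(np.floor(4.0 * np.log10(n))))          # sieve order grows with n
    xc = x - x.mean()
    acov = np.array([xc[:n - k] @ xc[k:] for k in range(p + 1)]) / n
    R = np.array([[acov[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, acov[1:])                    # Yule-Walker fit
    resid = np.array([xc[t] - phi @ xc[t - p:t][::-1] for t in range(p, n)])
    resid = resid - resid.mean()                          # centred residuals
    x_star = np.zeros(n)
    x_star[:p] = xc[:p]                                   # initialize with data
    draws = rng.choice(resid, size=n)                     # resample residuals
    for t in range(p, n):
        x_star[t] = phi @ x_star[t - p:t][::-1] + draws[t]
    return x_star + x.mean()
```

Letting p increase with n is what allows the finite AR fit to approximate the assumed AR(∞) process.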

The frequency-domain bootstrap is explained in Chapter 5.

The surrogate data approach comes from dynamical systems theory in physics (Theiler et al. 1992). Contrary to the assertion in the review on surrogate time series by Schreiber and Schmitz (2000: p. 352 therein), this approach is not the common choice in the bootstrap literature. The so-called Monte Carlo approach (Press et al. 1992: Section 15.6 therein) is essentially the same as the surrogate data approach.

Bootstrap CIs, their construction and statistical properties are reviewed in the above-mentioned textbooks and by DiCiccio and Efron (1996) and Carpenter and Bithell (2000). The challenging question "why not replace [CIs] with more informative tools?" has been raised by Hall and Martin (1996: p. 213 therein). This is based on their criticism that "the process of setting confidence intervals merely picks two points off a bootstrap histogram, ignoring much relevant information about shape and other important features." It has yet to be seen whether graphical tools such as those described by Hall and Martin (1996) will be accepted by the scientific communities. The percentile CI was proposed by Efron (1979), the BCa CI by Efron (1987). A numerical approximation to the BCa interval, called ABC interval, was introduced by DiCiccio and Efron (1992). See Section 3.9 on numerical issues concerning construction of BCa intervals. Götze and Künsch (1996) show the second-order correctness of BCa CIs for various estimators and the MBB for serially dependent processes. Hall (1988) determined theoretical coverage accuracies of various bootstrap CI types for estimators that are smooth functions of the data. Bootstrap-t CIs are formed using the standard error, se(θ̂*b), of a single bootstrap replication (Efron and Tibshirani 1993).

For simple estimators like μ̂ = X̄, plug-in estimates can be used for se(θ̂*b). However, for more complex estimators, no plug-in estimates are at hand; a second bootstrap loop (bootstrapping from bootstrap samples) would have to be invoked, which increases computing costs.
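For the mean, where a plug-in standard error is available, a bootstrap-t interval can be sketched without a second bootstrap loop. This iid sketch ignores serial dependence (in the time series setting, the resampling step would be replaced by a block or ARB resample); the function name is an assumption.

```python
import numpy as np

def bootstrap_t_ci_mean(x, b=1999, alpha=0.025, rng=None):
    """Bootstrap-t CI for the mean; iid sketch with plug-in standard errors.

    Each replication is studentized with its own plug-in standard error
    s*/sqrt(n); the quantiles of the studentized replications are then
    inverted to give the interval.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, se = x.mean(), x.std(ddof=1) / np.sqrt(n)
    t_star = np.empty(b)
    for j in range(b):
        xs = rng.choice(x, size=n)                  # iid resample
        se_star = xs.std(ddof=1) / np.sqrt(n)       # plug-in se of replication
        t_star[j] = (xs.mean() - mean) / se_star
    lo_q, hi_q = np.quantile(t_star, [alpha, 1.0 - alpha])
    return mean - hi_q * se, mean - lo_q * se       # note the quantile swap
```

The swap of the upper and lower quantiles in the last line is the usual inversion of the studentized pivot.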

Bootstrap calibration can strongly increase CI coverage accuracy. Consider that a single CI point is sought, say, the lower bound, θ̂_l, for an estimate, θ̂. Let the bound, θ̂_l*b(λ), be calculated for each bootstrap sample, b = 1, ..., B, and over a grid of confidence levels, λ. For each λ, compute

p̂(λ) = #{θ̂_l*b(λ) ≤ θ̂} / B.

Finally, solve p̂(λ) = α for λ. In case a two-sided, equi-tailed CI is sought, the calibration curve p̂(λ) = 1 − 2α, where

p̂(λ) = #{θ̂_l*b(λ) ≤ θ̂ ≤ θ̂_u*b(λ)} / B,

is solved for λ. Calculating the CI points for a bootstrap sample requires performing a second bootstrap-estimation loop. Analysing second-loop bootstrap methods like calibration or bootstrap-t interval construction may therefore require enormous computing costs. Relevant papers on calibrated bootstrap CIs include Hall (1986), Loh (1987, 1991), Hall and Martin (1988), Martin (1990) and Booth and Hall (1994). Regarding resampling data from serially dependent processes, Choi and Hall (2000) report that the sieve or AR(∞) bootstrap has a significantly better performance than blocking methods in CI calibration. However, the sieve bootstrap is not applicable to unevenly spaced time series. This book presents a Monte Carlo experiment on calibrated bootstrap CIs for correlation estimation (Chapter 7), with satisfying coverage performance despite the MBB resampling used.
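The double-loop character of calibration can be sketched for an equi-tailed percentile CI for the mean. This iid sketch is illustrative only: the lambda grid, the loop sizes and the "closest coverage" selection rule are assumptions, and it solves the calibration curve by grid search rather than root finding.

```python
import numpy as np

def calibrated_percentile_ci(x, alpha=0.025, b1=199, b2=99, rng=None):
    """Calibrated equi-tailed percentile CI for the mean (double bootstrap).

    For each nominal level lambda on a grid, the second loop estimates
    the actual coverage p(lambda); the lambda whose estimated coverage
    is closest to 1 - 2*alpha is then applied to first-loop replications.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n, theta = len(x), x.mean()
    lambdas = np.linspace(0.002, 0.2, 25)               # illustrative grid
    cover = np.zeros(len(lambdas))
    for _ in range(b1):
        xs = rng.choice(x, size=n)                      # first-loop resample
        reps = np.array([rng.choice(xs, size=n).mean() for _ in range(b2)])
        for k, lam in enumerate(lambdas):               # second-loop CI per lambda
            lo, hi = np.quantile(reps, [lam, 1.0 - lam])
            cover[k] += (lo <= theta <= hi)
    k_best = np.argmin(np.abs(cover / b1 - (1.0 - 2.0 * alpha)))
    lam = lambdas[k_best]                               # calibrated level
    reps1 = np.array([rng.choice(x, size=n).mean() for _ in range(b1)])
    return tuple(np.quantile(reps1, [lam, 1.0 - lam]))
```

The b1 × b2 inner resamples are what makes calibration so much more expensive than a single bootstrap loop.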

Bootstrap hypothesis tests are detailed by Davison and Hinkley (1997: Chapter 4 therein), see also Efron and Tibshirani (1993: Chapter 15 therein) and Lehmann and Romano (2005: Chapter 15 therein).

The relation between making a test statistic pivotal and bootstrap CI calibration is described by Beran (1987, 1988). Guidelines for bootstrap hypothesis testing are provided by Hall and Wilson (1991). An extension of MBB hypothesis testing of the mean from univariate to multivariate time series has been presented by Wilks (1997). The dimensionality may be rather high, and the method may therefore be applicable to time-dependent climate fields such as gridded temperature output from a mathematical climate model. Beersma and Buishand (1999) compare variances of bivariate time series using jackknife resampling. They find significantly higher variability of future northern European precipitation amounts in the computer simulation with elevated greenhouse gas concentrations than in the simulation without (control run). Huybers and Wunsch (2005) test the hypothesis that Earth's obliquity variations influence glacial terminations during the late Pleistocene using parametric resampling of the timescale (Section 4.1.7).

Multiple hypothesis tests may be performed when analysing a hypothesis that consists of several sub-hypotheses. This situation arises in spectrum estimation (Chapter 5), where a range of frequencies is examined. The traditional method is adjusting the P-values of the individual tests to yield the desired overall P-value. A recent paper (Storey 2007: p. 347 therein) states "that one can improve the overall performance of multiple significance tests by borrowing information across all the tests when assessing the relative significance of each one, rather than calculating P-values for each test individually."
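The traditional adjustment of individual P-values can be illustrated with the Bonferroni correction, one common (and conservative) choice; the function name is an assumption.

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni-adjusted multiple testing.

    With m sub-hypotheses, each individual P-value is multiplied by m
    (capped at 1) before comparison with the overall level alpha, so
    the family-wise error rate stays at or below alpha.
    """
    m = len(pvals)
    return [min(1.0, m * p) <= alpha for p in pvals]

pvals = [0.001, 0.02, 0.04, 0.30]
print(bonferroni(pvals))  # → [True, False, False, False]
```

Here only the smallest P-value survives the adjustment, although three of the four would have been individually significant at the 0.05 level; methods that borrow information across tests, as in the quoted Storey (2007) approach, aim to reduce this conservatism.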

The anthropogenic warming signal has stimulated much work applying various types of hypothesis tests using measured and AOGCM temperature data. More details on the fingerprint approach are contained in the following papers: Hasselmann (1997), Hegerl and North (1997) and Hegerl et al. (1997). Correlation approaches to detect the anthropogenic warming signal are described by Folland et al. (1998) and Wigley et al. (2000). A recent overview is given by Barnett et al. (2005).

3.9 Technical issues

The standard normal (Gaussian) distribution has the following PDF:

f(x) = (2π)^(−1/2) exp(−x²/2).

Figure 3.1 shows the distributional shape. The distribution function,

F(x) = ∫ from −∞ to x of f(x′) dx′,

cannot be expressed in closed analytical form. We use

F(x) = 1 − erfcc(x/√2)/2,

where for x > 0 the complementary error function, erfcc, is approximated (Press et al. 1992: Section 6.2 therein) via

erfcc(u) ≈ v exp(−u² − 1.26551223 + v(1.00002368 + v(0.37409196 + v(0.09678418 + v(−0.18628806 + v(0.27886807 + v(−1.13520398 + v(1.48851587 + v(−0.82215223 + v ∙ 0.17087277))))))))),

with v = 1/(1 + u/2).

For x < 0, use the symmetry, F(−x) = 1 − F(x). For all x, this approximation has a relative error of less than 1.2 × 10⁻⁷ (Press et al. 1992). The inverse function of F(x) defines the percentage point on the x axis, z(β), with 0 < β < 1. Approximations are used for calculating z(β); for the Monte Carlo simulation experiments in this book, the formula given by Odeh and Evans (1974) is employed.
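The erfcc-based evaluation of F(x) can be written out directly; the coefficients below are those of the Press et al. (1992) rational approximation quoted above, and the function names are illustrative.

```python
import math

def erfcc(u):
    """Rational approximation of the complementary error function, u >= 0
    (Press et al. 1992: Section 6.2 therein)."""
    v = 1.0 / (1.0 + 0.5 * u)
    return v * math.exp(-u * u - 1.26551223 + v * (1.00002368 + v * (0.37409196
        + v * (0.09678418 + v * (-0.18628806 + v * (0.27886807 + v * (-1.13520398
        + v * (1.48851587 + v * (-0.82215223 + v * 0.17087277)))))))))

def norm_cdf(x):
    """Standard normal distribution function F(x) = 1 - erfcc(x/sqrt(2))/2
    for x >= 0; for x < 0, the symmetry F(-x) = 1 - F(x) is used."""
    if x < 0.0:
        return 1.0 - norm_cdf(-x)
    return 1.0 - 0.5 * erfcc(x / math.sqrt(2.0))
```

As a check, norm_cdf(0.0) returns 0.5 to within the stated accuracy, and norm_cdf(1.0) agrees with the tabulated value 0.8413 of the standard normal distribution function.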
