where

$$\alpha_1 = F\left(\hat{z}_0 + \frac{\hat{z}_0 + z(\alpha)}{1 - \hat{a}\,[\hat{z}_0 + z(\alpha)]}\right), \qquad \alpha_2 = F\left(\hat{z}_0 + \frac{\hat{z}_0 + z(1-\alpha)}{1 - \hat{a}\,[\hat{z}_0 + z(1-\alpha)]}\right),$$

F(·) is the standard normal distribution function (Eq. 3.49). The bias correction, ẑ0, is computed as

$$\hat{z}_0 = F^{-1}\left(\#\{\hat{\theta}^{*b} < \hat{\theta}\}\,/\,B\right),$$

where #{θ̂*b < θ̂} means the number of replications where θ̂*b < θ̂ and F⁻¹(·) is the inverse function of F(·). The acceleration, â, is computed (Efron and Tibshirani 1993) as

$$\hat{a} = \frac{\sum_{j=1}^{n} \left[\hat{\theta}_{(\cdot)} - \hat{\theta}_{(j)}\right]^3}{6 \left\{\sum_{j=1}^{n} \left[\hat{\theta}_{(\cdot)} - \hat{\theta}_{(j)}\right]^2\right\}^{3/2}},$$

where θ̂(j) is the jackknife value of θ̂. Consider the original sample with the jth point removed, that is, {t(i), x(i)}, i = 1, ..., n, i ≠ j. The jackknife value is then the value of θ̂ calculated using this sample of reduced size. The average, θ̂(·), is given by Σⱼ θ̂(j)/n.
ẑ0 corrects for the median estimation bias; for example, if just half of the replications have θ̂*b < θ̂, then ẑ0 = 0. The acceleration, â, takes into account scale effects, which arise when the standard error of θ̂ itself depends on the true parameter value, θ.
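The two correction terms can be computed directly from the bootstrap replications and the jackknife values. A minimal sketch (the function name and the use of NumPy and the standard library's normal distribution are illustrative, not from the book):

```python
import numpy as np
from statistics import NormalDist

def bca_correction_terms(theta_hat, theta_boot, theta_jack):
    """Bias correction z0_hat and acceleration a_hat for the BCa CI.

    theta_hat  : point estimate from the full sample
    theta_boot : the B bootstrap replications theta*_b
    theta_jack : the n jackknife values theta_(j)
    """
    theta_boot = np.asarray(theta_boot)
    theta_jack = np.asarray(theta_jack)
    B = len(theta_boot)
    # z0_hat = F^{-1}( #{theta*_b < theta_hat} / B ), F = standard normal
    z0_hat = NormalDist().inv_cdf(np.sum(theta_boot < theta_hat) / B)
    # a_hat from the jackknife values (Efron and Tibshirani 1993)
    theta_dot = theta_jack.mean()
    a_hat = (np.sum((theta_dot - theta_jack) ** 3)
             / (6.0 * np.sum((theta_dot - theta_jack) ** 2) ** 1.5))
    return z0_hat, a_hat
```

With a symmetric bootstrap distribution around θ̂ and symmetric jackknife values, both terms vanish and the BCa interval reduces to the percentile interval.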
3.5 Examples
In the first, theoretical example, we compare classical and bootstrap CIs in terms of coverage accuracy (Table 3.5). The mean of AR(1) processes with uneven spacing was estimated for two distributional shapes, normal and lognormal. The classical CI employed the effective data size for mean estimation, the bootstrap CI used the ARB algorithm and the BCa method.
The classical CI performed better for the normal than for the lognormal shape. This is because the normal assumption made at CI construction is violated in the case of the lognormal shape. With increasing data size, the lognormal approaches the normal distribution (Johnson et al. 1994: Chapter 14 therein) and the difference in performance decreases. However, this difference is still significant for n = 1000 in the example.
Also the bootstrap CI performed better for the normal than for the lognormal shape. This may be because persistence time estimation (τ̂) and persistence time bias correction (τ̂′) are less accurate for nonnormally distributed data.
Table 3.5. Monte Carlo experiment, mean estimation of AR(1) noise processes with uneven spacing, normal and lognormal shape. nsim = 47,500 random samples were generated from the Gaussian AR(1) process, {X(i)}, i = 1, ..., n, after Eq. (2.9) with τ = 1. The samples from the lognormal AR(1) process were generated by taking exp[X(i)]. The start was set to t(1) = 1; the time spacing, d(i), was drawn from a gamma distribution (Eq. 2.48) with order parameter 16, that means, a distribution with a coefficient of variation equal to 16^(−1/2) = 0.25, and subsequently scaled to d̄ = 1. Two CI types for the estimated mean were constructed, classical and bootstrap. The classical CI employed n′μ calculated from Eq. (2.7) with ā = exp(−d̄/τ̂′) plugged in for a, and the t distribution (Eq. 3.18). The bootstrap CI used the ARB (Algorithm 3.5) and the BCa method (Section 3.4.4) with B = 1999 and α = 0.025.
Empirical coverage, γ̄x̄ᵃ

n        Normal                      Lognormal                   Nominal
         Classical    Bootstrap     Classical    Bootstrap
10       0.918        0.863         0.835        0.789          0.950
20       0.929        0.903         0.845        0.845          0.950
50       0.938        0.929         0.876        0.888          0.950
100      0.943        0.941         0.897        0.909          0.950
200      0.942        0.943         0.914        0.922          0.950
500      0.947        0.948         0.926        0.930          0.950
1000     0.947        0.949         0.933        0.937          0.950

ᵃ Standard error of γ̄x̄ is nominally 0.001.
For small sample sizes (n ≤ 50 (normal distribution) or n ≤ 20 (lognormal distribution)) the classical CI performed better than the bootstrap CI. This advantage is likely in part owing to the fact that a formula for the effective data size for mean estimation is known; it may disappear for more complex estimators, where no formula for the effective data size exists. For larger sample sizes (n ≥ 100 (normal distribution) or n ≥ 50 (lognormal distribution)) the bootstrap CI is as good as the classical CI (normal shape) or better (lognormal shape).
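The data generation used in this Monte Carlo experiment can be sketched as follows. The function name is illustrative; the innovation form, a(i) = exp(−d(i)/τ) with unit-variance Gaussian innovations, is an assumption consistent with the AR(1) model for uneven spacing described in the caption of Table 3.5:

```python
import numpy as np

rng = np.random.default_rng(42)

def uneven_ar1_sample(n, tau=1.0, shape=16.0, lognormal=False):
    """One sample {t(i), x(i)}: gamma-distributed spacing (coefficient
    of variation 16**-0.5 = 0.25) scaled to mean 1, start t(1) = 1,
    Gaussian AR(1) process with persistence time tau."""
    d = rng.gamma(shape, 1.0 / shape, size=n - 1)
    d /= d.mean()                          # scale spacing to d_bar = 1
    t = np.concatenate(([1.0], 1.0 + np.cumsum(d)))
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for i in range(1, n):
        a = np.exp(-d[i - 1] / tau)        # AR(1) for uneven spacing
        x[i] = a * x[i - 1] + np.sqrt(1.0 - a * a) * rng.standard_normal()
    return (t, np.exp(x)) if lognormal else (t, x)
```

The lognormal samples of the experiment are obtained by setting `lognormal=True`, that is, by taking exp[X(i)].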
In the second, practical example, Fig. 3.4 shows the transition from a glacial (MIS 6) to the last interglacial (MIS 5) in the Vostok CO2 record. The mean CO2 concentration was estimated for the time intervals from 140 to 177 ka (glacial) and from 115 to 130 ka (interglacial). Student's t CIs (Section 3.4.2) were constructed using nonparametric stationary bootstrap resampling, a variant of the MBB, where the block length is
Figure 3.4. Determination of mean CO2 levels in the Vostok record (Fig. 1.3b) during a glacial and an interglacial. The interval from 140 to 177 ka represents the glacial (MIS 6), the interval from 115 to 130 ka the interglacial (marine isotope substage 5.5). The 95% bootstrap CIs for the estimated means are shown as shaded bars.
not constant (Section 3.8). The number of resamples was B = 2000. The average block length was adjusted to NINT(4 · τ̂/d̄).
The mean glacial CO2 level was determined as 192.8 ppmv with 95% CI [188.3 ppmv; 197.3 ppmv]; the mean interglacial CO2 level was 271.9 ppmv with 95% CI [268.8 ppmv; 275.0 ppmv]. Because of the reduced data sizes in the intervals (glacial, n = 13; interglacial, n = 24), the accuracy of the CIs may also be reduced. The enormous glacial-interglacial amplitude in CO2 documents the importance of this greenhouse gas for late Pleistocene climate changes, the ice ages. The relation between CO2 and temperature changes is analysed in Chapters 7 and 8.
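The stationary bootstrap CI construction used here can be sketched as follows. The function name is illustrative, and for brevity the standard normal percentage point replaces the Student's t quantile of Section 3.4.2:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def sb_mean_ci(x, mean_block_len, B=2000, alpha=0.025):
    """CI for the mean via stationary bootstrap resampling: overlapping
    blocks from the series wrapped around a circle, block lengths
    geometrically distributed with mean mean_block_len."""
    n = len(x)
    p = 1.0 / mean_block_len               # geometric parameter
    means = np.empty(B)
    for b in range(B):
        idx = []
        while len(idx) < n:
            start = int(rng.integers(n))   # random block start
            length = int(rng.geometric(p)) # random block length >= 1
            idx.extend((start + k) % n for k in range(length))
        means[b] = x[np.array(idx[:n])].mean()
    se = means.std(ddof=1)                 # bootstrap standard error
    z = NormalDist().inv_cdf(1.0 - alpha)  # normal approx to t quantile
    return x.mean() - z * se, x.mean() + z * se
```

For the CO2 example, `mean_block_len` would be set to NINT(4 · τ̂/d̄) as in the text.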
3.6 Bootstrap hypothesis tests
By the analysis of climate time series, {t(i), x(i)}, i = 1, ..., n, we make, generally speaking, a statistical inference of properties of the climate system. One type of inference is estimation of a climate parameter, θ. In addition to a point estimate, θ̂, an interval estimate, CIθ̂,1−2α, helps to assess how accurate θ̂ is. The bootstrap is used to construct CIs in complex situations regarding the data properties of shape, persistence and spacing. The second type of inference is testing a hypothesis, a statement about the climate system, using the data sample. Again, this can be a difficult task (shape, persistence, spacing), and again, the bootstrap can be a powerful tool in such a situation. Hypothesis tests are also called significance tests or statistical tests.
A hypothesis test involves the following procedure. A null hypothesis (or short: null), H0, is formulated. H0 is tested against an alternative hypothesis, H1. The hypotheses H0 and H1 are mutually exclusive. H0 is a simple null hypothesis if it completely specifies the data generating process. An example would be "X(i) is a Gaussian white-noise process with zero mean and unit standard deviation." H0 is a composite null hypothesis if some parameter of X(i) is unspecified, for example, "Gaussian white-noise process with zero mean." Next, a test statistic, U, is calculated. Any meaningful construction lets U be a function of the data process, U = g({T(i), X(i)}, i = 1, ..., n). On the sample level, u = g({t(i), x(i)}, i = 1, ..., n). In the example H0: "Gaussian white-noise process with μ = 0" one could take U = X̄ = Σᵢ X(i)/n, the sample mean. U is a random variable with a distribution function, F0(u), where the index "0" indicates that U is computed "under H0," that is, as if H0 were true. F0(u) is the null distribution. In the example, F0(u) would be Student's t distribution function (Section 3.9). If in the example the alternative were H1: "μ > 0," then a large, positive u value would speak against H0 and for H1. Using F0(u) and plugging in the data {t(i), x(i)}, i = 1, ..., n, the one-sided significance probability or one-sided P-value results as

$$P = \text{prob}(U \geq u \mid H_0) = 1 - F_0(u).$$
The P-value is the probability that under H0 a value of the test statistic greater than or equal to the observed value, u, is observed. If P is small, then H0 is rejected and H1 accepted; otherwise H0 is accepted and H1 rejected. The two-sided P-value is, for a null distribution that is symmetric about zero,

$$P = \text{prob}\left(|U| \geq |u| \,\middle|\, H_0\right) = 2\left[1 - F_0(|u|)\right].$$
In the example, a two-sided test would be indicated for H1: "Gaussian white-noise process with μ ≠ 0." Besides the P-value, a second result of a statistical test is the power. In the one-sided test example:

$$\text{power} = \text{prob}(U \geq u \mid H_1).$$
A type-2 error is accepting H0 although it is a false statement and H1 is true. The probability of a type-2 error is β = 1 − power. A type-1 error is rejecting H0 against H1 although H0 is true. P, the significance probability, is therefore also denoted as type-1-error probability or false-alarm probability; u is also denoted as false-alarm level.
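The running example can be sketched in code. The test statistic here is the studentized sample mean, and the standard normal is used as a large-n approximation to the Student's t null distribution; the function name is illustrative:

```python
import numpy as np
from statistics import NormalDist

def mean_test_pvalues(x):
    """One- and two-sided P-values for H0: 'Gaussian white noise with
    mu = 0', using the studentized sample mean as test statistic."""
    n = len(x)
    u = x.mean() / (x.std(ddof=1) / np.sqrt(n))
    F0 = NormalDist().cdf                 # approximate null distribution
    p_one = 1.0 - F0(u)                   # against H1: mu > 0
    p_two = 2.0 * (1.0 - F0(abs(u)))      # against H1: mu != 0
    return u, p_one, p_two
```

For a sample with x̄ = 0 exactly, u = 0 and the one- and two-sided P-values are 0.5 and 1.0, respectively.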
Although H0 can be a composite null, it is usually more explicit than H1. In climatological practice, the selection of H1 should be guided by prior climatological knowledge. H1 also determines whether a test should be designed as one- or two-sided. For example, if H0 were "no temperature change in a climate model experiment studying the effects of doubled CO2 concentrations, ΔT = 0," then a one-sided test against H1: "ΔT > 0" would be appropriate because physics would not let one expect a temperature decrease. Because H1 is normally rather general, it is difficult to quantify the test power. Therefore, more emphasis is put on accurate P-values. Various test statistics, U1, U2, ..., may be appropriate for testing H0 against H1. The statistic of choice has for a given data set a small type-1-error probability (small P-value) as first quality criterion. The second quality criterion is a small type-2-error probability (large power), preferably calculated for some realistic, explicit alternative. We can say that a test does not intend to prove that a hypothesis is true but rather that it tries to reject a null hypothesis. A null hypothesis becomes more "reliable" after it has been tested successfully against various realistic alternatives using various data samples, see Popper (1935). It is important that H0 and H1 are established independently of the data to prevent circular reasoning, see von Storch and Zwiers (1999: Section 6.4 therein). As a final general remark, it is more informative to give P-values than to report merely whether they are below certain specified significance levels, say P < 0.1, 0.05 or 0.01.
Figure 3.5. Hypothesis test and confidence interval. The parametric null hypothesis H0: "θ ≤ 0" cannot be rejected against H1: "θ > 0" with a P-value of α.
When H0 concerns a particular parameter value (U = θ̂), a CI can be used to derive the P-value (Efron and Tibshirani 1993: Section 15.4 therein). Suppose that a test observes u = θ̂ < 0. Then select α such that the upper CI bound equals zero. Nominally, prob(θ > 0) = α (Fig. 3.5). This gives a P-value of α for the test of H0: "θ ≤ 0" against H1: "θ > 0." An example from a bivariate setting with data {x(i), y(i)}, i = 1, ..., n, would be the comparison of the means of X and Y. If the CI at level 1 − 2α for the difference of means does contain zero, then H0: "equal means" cannot be rejected against H1: "unequal means" at that level in this two-sided test. A criticism of this CI method of hypothesis testing would be that the CIs are not necessarily constructed as if H0 were true. There might be scale changes, and F0(u) might depend on H0. However, the BCa CI provides a correction to this effect (Efron and Tibshirani 1993: p. 216 therein). Another option would be to construct a test statistic, U, such that F0(u) is the same for all H0. Such a statistic is called a pivot.
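For the simple percentile CI, this CI method of testing reduces to counting replications: the α at which the upper bound, the (1 − α) quantile of the replications, hits zero is just the fraction of replications at or above zero. A minimal sketch (function name illustrative; the BCa correction discussed in the text is omitted):

```python
import numpy as np

def pvalue_from_percentile_ci(theta_boot):
    """P-value for H0: 'theta <= 0' vs H1: 'theta > 0', given a negative
    point estimate, via the percentile-CI method: the alpha for which
    the upper CI bound (the (1 - alpha) quantile of the bootstrap
    replications) equals zero is the fraction of replications >= 0."""
    theta_boot = np.asarray(theta_boot)
    return np.mean(theta_boot >= 0.0)
```

With replications [−3, −2, −1, 1], for instance, one of four replications lies at or above zero, giving P = 0.25.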
Davison and Hinkley (1997: Chapter 4 therein) explain construction of hypothesis tests by approximating F0(u) with F̂0(u) obtained from bootstrap resampling or the bootstrap surrogate data approach (Section 3.3.3). The permutation test, developed in the 1930s (Edgington 1986), is the bootstrap test with the difference that no replacement is done for drawing the random samples. This book puts more emphasis on bootstrap CIs than on bootstrap hypothesis tests because CIs contain more quantitative information. We subscribe to Efron and Tibshirani's (1993: p. 218 therein) view that "hypothesis tests tend to be overused and confidence intervals underused in statistical applications."
An illustrative example is the case where θ is the anthropogenic signal proportion in the increase of the global temperature over the past 150 years. Specifically, θ can be defined as ΔTwith − ΔTwithout, where ΔTwith is the temperature change calculated using an AOGCM and taking human activities such as fossil fuel consumption into account, and ΔTwithout is the temperature change without the effects of human activities ("control run"). Hasselmann (1993) and Hegerl et al. (1996) developed the "fingerprint" approach to derive a powerful test statistic from the high-dimensional, gridded AOGCM output, and showed that H0: "θ = 0" can be rejected against H1: "θ > 0." One task was to quantify the natural temperature variability in the temporal and spatial domains, in order to derive the null distribution. This is difficult because the observed variability contains both natural and anthropogenic portions. It was solved using AOGCM experiments without simulated anthropogenic forcings and a surrogate data approach (Section 3.3.3), that means, several control runs with perturbed initial conditions. It is evident that an estimate, θ̂, with confidence interval, CIθ̂,1−2α, for the anthropogenic signal proportion would mean a step further towards quantification.
3.7 Notation
Table 3.6 summarizes the notation.
Table 3.6. Notation.
X(T)               Climate variable, continuous time, process level
Xtrend(T)          Trend component, continuous time, process level
Xout(T)            Outlier component, continuous time, process level
S(T)               Variability, continuous time
Xnoise(T)          Noise component, continuous time, process level
T                  Continuous time
X(i)               Climate variable, discrete time, process level
Xtrend(i)          Trend component, discrete time, process level
Xout(i)            Outlier component, discrete time, process level
S(i)               Variability, discrete time
Xnoise(i)          Noise component, discrete time, process level
T(i)               Discrete time
i                  Index
j                  Index
                   Gaussian noise process with mean μ and standard deviation σ, discrete time
x(i)               Climate variable, discrete time, sample level
t(i)               Discrete time, sample level
{t(i), x(i)}, i = 1, ..., n    Data or sample, discrete time series
d(i)               Time spacing, sample level
d̄                  Average time spacing, sample level
n                  Data size
θ                  (Climate) parameter
θ̂                  Estimator of (climate) parameter, process and sample levels; estimate
θ̂1, θ̂2             Other estimators
f(·)               Probability density function
f(θ̂)               PDF of θ̂
F(·)               Probability distribution function
F⁻¹(·)             Inverse probability distribution function
Femp(·)            Empirical distribution function
E(·)               Expectation operator
VAR(·)             Variance operator
g(·)               Function
Γ(·)               Gamma function
NINT(·)            Nearest integer function
seθ̂                Standard error of θ̂
biasθ̂              Bias of θ̂
RMSEθ̂              Root mean squared error of θ̂
CVθ̂                Coefficient of variation of θ̂
CI                 Confidence interval
CIθ̂,1−2α           Confidence interval for θ̂ of level 1 − 2α
θ̂l                 Lower bound of CI for θ̂
θ̂u                 Upper bound of CI for θ̂
γl                 Coverage, below lower CI bound
γu                 Coverage, above upper CI bound
γ                  Coverage of CI
μ                  Mean
μ̂                  Mean estimator
X̄                  Sample mean, process level
x̄                  Sample mean, sample level
γx̄                 Coverage of CIx̄,1−2α
σ                  Standard deviation
σ̂                  Standard deviation estimator
Sn−1               Sample standard deviation, process level
sn−1               Sample standard deviation, sample level
γsn−1              Coverage of CIsn−1,1−2α
z(β) = zβ          Percentage point at β of the standard normal distribution
tν(β)              Percentage point at β of the t distribution function with ν degrees of freedom
χ²ν(β)             Percentage point at β of the chi-squared distribution function with ν degrees of freedom
β                  Probability
nsim               Number of (Monte Carlo) simulations
c                  Constant
c                  Counter
C                  Coverage error
O(·)               Order of
⟨·⟩                Average
AR(1)              Autoregressive process of order 1
AR(p)              Autoregressive process of order p
MA(q)              Moving average process of order q
ARMA(p, q)         Mixed autoregressive moving average process
n′                 Effective data size
n′μ                Effective data size for mean estimation
n′σ²               Effective data size for variance estimation
n′ρ                Effective data size for correlation estimation
a                  AR(1) autocorrelation parameter (even spacing)
â                  AR(1) autocorrelation parameter (even spacing), estimator
â′                 AR(1) autocorrelation parameter (even spacing), bias-corrected estimator
τ                  AR(1) persistence time (uneven spacing)
τ̂                  AR(1) persistence time (uneven spacing), estimator
τ̂′                 AR(1) persistence time (uneven spacing), bias-corrected estimator
ā                  AR(1) equivalent autocorrelation parameter (uneven spacing)
ā̂                  AR(1) equivalent autocorrelation parameter (uneven spacing), estimator
ā̂′                 AR(1) equivalent autocorrelation parameter (uneven spacing), bias-corrected estimator
t*, t*(i)          Bootstrap version of discrete time, sample level
t*b(i)             Indexed bootstrap version of discrete time, sample level
b = 1, ..., B      Index
B                  Number of bootstrap resamples
x*, x*(i)          Bootstrap version of climate variable, discrete time, sample level
x*b(i)             Indexed bootstrap version of climate variable, discrete time, sample level
d*(i)              Bootstrap version of time spacing, sample level
{t*(i), x*(i)}, i = 1, ..., n    Bootstrap resample
θ̂*                 Bootstrap replication
θ̂*b                Indexed bootstrap replication
MBB                Moving block bootstrap
ARB                Autoregressive bootstrap
NBB                Nonoverlapping block bootstrap
CBB                Circular block bootstrap
SB                 Stationary bootstrap
MaBB               Matched-block bootstrap
TaBB               Tapered block bootstrap
l                  Block length
lopt               Optimal block length
lsearch            Block length search value
Y(i)               Variable (lopt selector after Bühlmann and Künsch (1999))
IF̂(X(i))           Estimated influence function
R(h)               Function (lopt selector after Bühlmann and Künsch (1999))
R̂(h)               Autocovariance estimator (Chapter 2)
ρ̂(h)               Autocorrelation estimator (Chapter 2)
h                  Lag
b0, b1, b2, b3, b4, b    Parameters (lopt selector after Bühlmann and Künsch (1999))
wSC(·)             Split-cosine window
wTH(·)             Tukey-Hanning window
z                  Auxiliary variable
CVd                Coefficient of variation of the spacing
β                  Percentage of drawable blocks (adaption of MBB to uneven spacing)
x̂trend(i)          Estimated trend component, discrete time, sample level
x̂out(i)            Estimated outlier component, discrete time, sample level
Ŝ(i)               Estimated variability, discrete time
r(i)               Residual of climate equation, discrete time (Eq. 1.2)
r*(i)              Bootstrap version of residual of climate equation, discrete time (Eq. 1.2)
ε(i)               White-noise residual, discrete time
ε̃(i)               Centred white-noise residual, discrete time
ε̃*(i)              Bootstrap version of centred white-noise residual, discrete time
a′(i)              Abbreviation (ARB algorithm)
BCa CI             Bias-corrected and accelerated CI
ABC CI             Approximate BCa CI
ŝe*                Estimated bootstrap standard error
θ̂*(α)              Percentage point at α of the empirical distribution of θ̂*
α1, α2             Other α values
ẑ0                 Bias correction
â                  Acceleration
#{·}               Number of cases
θ̂(j)               Jackknife value of θ̂
H0                 Null hypothesis
H1                 Alternative hypothesis
U                  Test statistic, process level
u                  Test statistic, sample level (u is also denoted as false-alarm level)
U1, U2             Other test statistics, process level
F0(u)              Null distribution
F̂0(u)              Estimated null distribution
P                  P-value, probability of a type-1 error or false-alarm probability
β                  Probability of a type-2 error
M                  Median
M̂                  Sample median, process level
m̂                  Sample median, sample level
X′(i)              Size-sorted X(i)
ε                  Small value
Table 3.6. Notation (continued).
θ̂l*b(λ)            Indexed lower bootstrap CI bound over a grid of confidence levels
λ                  Variable, determines confidence level
T(λ)               Empirical probability (bootstrap calibration)
y, p0, p1, p2, p3, p4, q0, q1, q2, q3, q4    Parameters (z(β) approximation)
u, v, w            Parameters (error function approximation)
b, s               Parameters (lognormal distribution)
p, q               Parameters (geometric distribution)
Z                  Set of whole numbers
S                  Set of numbers
AOGCM              Atmosphere-Ocean General Circulation Model
MIS                Marine isotope stage (sometimes also loosely used for marine isotope substage)
ΔT                 Modelled temperature change
ΔTwith             Modelled temperature change, with fossil fuel consumption
ΔTwithout          Modelled temperature change, without fossil fuel consumption
3.8 Background material
We use RMSE instead of the mean squared error (given by RMSE²). RMSE, with the same units as the data, is a handy parameter.
Standard deviation estimation for Gaussian white noise seems to have raised more interest in previous decades than today, as the discussion from 1968 in the journal The American Statistician illustrates (Cureton 1968b,a; Bolch 1968; Markowitz 1968a,b; Jarrett 1968). For example, the choice σ̂ = c · Sn−1, with c given by Eq. (3.24), yields minimal RMSEσ̂ among all σ estimators for Gaussian white noise (Goodman 1953). Or, σ̂ = c1 · Sn−1 yields biasσ̂ = 0 for Gaussian white noise, see for example Holtzman (1950). Today, it appears for practical purposes rather arbitrary whether or not to scale Sn−1, or whether to use n − 1 or n. The resulting differences are likely much smaller than the effects of violations of the Gaussian assumption.
The median of a distribution is defined via F(M) = 0.5. (F(·) is the distribution function, see Eq. (3.49).) The sample median as estimator of M is on the process level

$$\hat{M} = \begin{cases} X'\left((n+1)/2\right) & \text{for odd } n,\\ \left[X'(n/2) + X'(n/2+1)\right]/2 & \text{for even } n,\end{cases}$$

where X′(i) are the size-sorted X(i). On the sample level, m̂ results from using x(i).
A robust estimation procedure "performs well not only under ideal conditions, the model assumptions that have been postulated, but also under departures from the ideal" (Bickel 1988). In the context of this book, the assumptions regard distributional shape, persistence and spacing; the performance regards an estimator and its properties such as RMSE or CI coverage accuracy. Under ideal conditions, robust estimation procedures can be less efficient (have higher seθ̂) than nonrobust procedures. For example, for Gaussianity and n → ∞, sem̂ → (π/2)^(1/2) · sex̄ (Chu 1955). Robust estimators can require sorting operations, which often makes it difficult to deduce their distribution. The term "robust" was coined by Box (1953); relevant papers on robust location estimation include Huber (1964) and Hampel (1985); for more details see Tukey (1977) or Huber (1981). Unfortunately, today's usage of "robust" in the climate research literature is rather arbitrary.
The empirical distribution function of a sample {x(i)}, i = 1, ..., n, is given by

$$F_{\text{emp}}(x) = \#\{x(i) \leq x\}\,/\,n.$$

Femp(x) is the sample analogue of the theoretical distribution function, for example, Eq. (3.49).
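In code, the empirical distribution function is a one-liner; the function name is illustrative:

```python
import numpy as np

def f_emp(sample, x):
    """Empirical distribution function: the proportion of sample
    values less than or equal to x."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)
```

For the sample {1, 2, 3, 4}, for example, Femp(2.5) = 2/4 = 0.5.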
Bootstrap resampling was formally introduced by Efron (1979); this article also summarizes earlier work. Singh (1981) soon recognized that the ordinary bootstrap yields inconsistent results in a setting with serial dependence. A consistent estimator, θ̂, converges in probability to θ as n increases. Convergence in probability means

$$\lim_{n \to \infty} \text{prob}\left(\left|\hat{\theta} - \theta\right| > \varepsilon\right) = 0 \quad \forall\, \varepsilon > 0. \qquad (3.44)$$
Textbooks on bootstrap resampling include those written by Efron and Tibshirani (1993), Davison and Hinkley (1997) and Good (2005). Statistical point estimation is covered by Lehmann and Casella (1998).
The moving block bootstrap or MBB was introduced by Künsch (1989) and Liu and Singh (1992). The MBB resamples overlapping blocks. Carlstein (1986) had earlier suggested a method (denoted as NBB) that resamples nonoverlapping blocks and does not truncate the final block. This may lead to resamples with data size less than n, that means, subsampling (see below). Hall (1985) had already considered overlapping and nonoverlapping block methods in the context of spatial data. Bühlmann (1994) showed that if
1. X(i) is a stationary Gaussian process with shortrange dependence,
2. θ̂ is a smooth function g({x(i)}) of the data (e.g., the mean is a smooth function, but the median is not) and
3. the block length, l, increases with the data size, n, within bounds, l = O(n^(1/2−ε)), 0 < ε < 1/2,

then the MBB produces resamples from a process that converges to the data generating process. The MBB is then called asymptotically valid. The questions of the validity and other properties of the MBB and other bootstrap methods under relaxed assumptions (non-Gaussian processes, long-range dependence, etc.) are currently extensively studied in statistical science. For long-range dependence and the sample mean as estimator with an asymptotically Gaussian distribution, the MBB can be modified to provide a valid approximation (Lahiri 1993). For long-range dependence and non-Gaussian limit distributions, the MBB has to be changed to subsampling one single block (Hall et al. 1998). Block length selection is less explored for long-range dependence; intuitively, a larger length should be used than for short-range dependence. See Berkowitz and Kilian (2000), Bühlmann (2002), Politis (2003), Lahiri (2003) and references cited in these overviews.
Other block length selectors for the MBB and also for other nonparametric bootstrap methods have been proposed. Hall et al. (1995a) gave an iterative method based on subsamples and cross-validation. As regards the subsample size, consult Carlstein et al. (1998: p. 309 therein). Although the convergence properties in the general case are unknown, the method performed well in the Monte Carlo simulations shown. Politis and White (2004) developed a rule that selects the block length as two times the smallest integer after which the autocovariance function (Eq. 2.18) "appears negligible." However, for uneven spacing the autocovariance function is not defined and this selector is not directly applicable. A related rule, based on the persistence time, τ, of the AR(1) process for uneven spacing (Section 2.1.2), would set l = NINT(4 · τ̂/d̄); Mudelsee (2003) suggested this rule for correlation estimation of bivariate, unevenly spaced time series (Chapter 7).
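The MBB with this persistence-time based block length rule can be sketched as follows; the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def block_length_rule(tau_hat, d_bar):
    """l = NINT(4 * tau_hat / d_bar), the persistence-time based rule."""
    return max(1, int(round(4.0 * tau_hat / d_bar)))

def mbb_resample(x, l):
    """One moving block bootstrap resample: draw overlapping blocks of
    fixed length l with replacement, concatenate and truncate to n."""
    n = len(x)
    n_blocks = int(np.ceil(n / l))
    starts = rng.integers(0, n - l + 1, size=n_blocks)
    out = np.concatenate([x[s:s + l] for s in starts])
    return out[:n]
```

With τ̂ = 1 and d̄ = 1, the rule gives l = 4; the resample has the original data size and consists only of values taken from the sample.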
An MBB adaption to uneven spacing was analysed using a Monte Carlo experiment. The following simple rule was employed. Instead of allowing all n — l + 1 blocks to be drawn for insertion, only the 50% blocks closest (plus ties) in the coefficient of variation of the spacing, CVd, were made drawable. This was applied to mean estimation of a Gaussian AR(1) process. The comparison between this MBB adaption and the ordinary MBB was made in terms of coverage accuracy and average CI length (Table 3.7). The experiment used the BCa CI and employed the block length selector after Eq. (3.28) for the MBB and its adaption. The result (Table 3.7) exhibits a reduced coverage accuracy of the MBB adaption. The following deficit outweighed the advantage of the adaption (increased similarity of CVd between sample and resample).
Table 3.7. Monte Carlo experiment, moving block bootstrap adaption to uneven spacing. nsim = 47,500 random samples were generated from the Gaussian AR(1) process, {X(i)}, i = 1, ..., n, after Eq. (2.9) with τ = 1. The start was set to t(1) = 1; the time spacing, d(i), was drawn from a gamma distribution (Eq. 2.48) with order parameter 16, that means, a distribution with a coefficient of variation equal to 16^(−1/2) = 0.25, and subsequently scaled to d̄ = 1. Bootstrap BCa CIs for the estimated mean were constructed with B = 1999 and α = 0.025. The ordinary MBB resampling algorithm was compared with an MBB adaption to uneven spacing. The adaption made drawable only the 50% blocks closest (plus ties) in the coefficient of variation of the spacing. Both the MBB and its adaption to uneven spacing yield clearly larger coverage errors than the ARB because in that Monte Carlo experiment (Table 3.5) the prescribed AR(1) dependence matches the assumption made by the ARB (Section 3.3.2).
n        γ̄x̄ᵃ                         Nominal    ⟨CI length⟩ᵇ
         MBB       Adapted MBB                   MBB       Adapted MBB
10       0.591     0.623             0.950      0.836     0.864
20       0.799     0.788             0.950      0.915     0.890
50       0.874     0.861             0.950      0.685     0.672
100      0.901     0.888             0.950      0.510     0.505
200      0.913     0.903             0.950      0.374     0.372
500      0.929     0.920             0.950      0.244     0.244
1000     0.935     0.923             0.950      0.176     0.175

ᵃ Standard error of γ̄x̄ is nominally 0.001. ᵇ Average value over nsim simulations.
Reducing the drawable blocks to 50% reduced, in comparison with the ordinary MBB, the variation between resamples. This in turn reduced the variation between the replications (sample means of resamples). This led to narrower CIs from the adapted MBB algorithm (last two columns in Table 3.7). The CIs from the adapted MBB, finally, contained the true value, μ, less often than the CIs from the ordinary MBB. This means a reduced accuracy because the empirical coverages in this case of mean estimation were always less than the nominal value.
Other nonparametric bootstrap resampling methods than the MBB have been proposed. The circular block bootstrap (CBB) (Politis and Romano 1992a) "wraps" the data {x(i)}, i = 1, ..., n, around a circle such that x(n) (Algorithm 3.1) has a successor, x(1). The CBB then resamples overlapping blocks of length l from this periodic structure. That overcomes the deficit of the MBB that data near the edges, x(1) or x(n), have a lower probability of being resampled than data in the centre.
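The wrapping idea can be sketched as follows; the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def cbb_resample(x, l):
    """One circular block bootstrap resample: blocks of length l drawn
    from the series wrapped around a circle, so that x(n) is followed
    by x(1) and every data point is equally likely to be resampled."""
    n = len(x)
    n_blocks = int(np.ceil(n / l))
    out = np.empty(n_blocks * l)
    for b in range(n_blocks):
        s = int(rng.integers(n))           # any start index is allowed
        out[b * l:(b + 1) * l] = x[(s + np.arange(l)) % n]
    return out[:n]
```

Compared with the MBB sketch above in this section, the only change is the modulo indexing, which is what removes the edge effect.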
The stationary bootstrap (SB) (Politis and Romano 1994) also uses the periodic structure, to ensure stationarity of the resampling process. The SB likewise uses overlapping blocks; however, the block length is not constant but geometrically distributed. Similar selectors as for the MBB (Section 3.3.1.1) can be used for adjusting the average block length. As regards the choice among MBB, NBB, CBB and SB, Lahiri (1999) showed that (1) overlapping blocks (MBB, CBB, SB) are better than nonoverlapping blocks (NBB) in terms of RMSE of estimation of variance and related quantities like the bootstrap standard error and (2) nonrandom block lengths (MBB, CBB) are, under the same criterion, at least as good as random block lengths (SB). For estimation of the distribution function and related quantities like CI points, less is known, but there are indications that also here MBB and CBB perform better (Lahiri 2003: Chapter 5 therein).

Some recent developments are the following. The matched-block bootstrap (MaBB) (Carlstein et al. 1998) introduces dependence between blocks, to reduce bias in the bootstrap variance, by imposing probability rules. One rule prefers resampling blocks such that block values at the endpoints, where the blocks are concatenated, show a higher agreement than under the MBB. The tapered block bootstrap (TaBB) (Paparoditis and Politis 2001) tapers (weights) the data by means of a function before concatenating blocks. The idea is to give reduced weight to data near the block endpoints. This could make the TaBB have lower estimation bias than the MBB or CBB (Paparoditis and Politis 2001). Advanced block bootstrap methods could be better than the MBB for analysing equidistant climate time series, especially the MaBB, which shows good theoretical and simulation results when X(i) is an AR(p) process (Carlstein et al. 1998). For uneven spacing, it could be more important to enhance the MBB by matching blocks in terms of their spacing structure.
This point deserves further study by means of Monte Carlo experiments.

Subsampling refers to a procedure where the bootstrap resample size is less than the data size. The NBB can lead to subsampling. Also the jackknife (Efron 1979), where l = n − 1 and one block only is resampled, is a subsampling variant. A detailed account is given by Politis et al. (1999). We finally mention the wild bootstrap, which attempts to reconstruct the distribution of a residual r(i) (Eq. 3.29) by means of a two-point distribution (Wu 1986; Härdle and Marron 1991). The adaption of the wild bootstrap to nonparametric autoregression by Neumann and Kreiss (1998) has not yet been extended to uneven spacing, however.
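A wild bootstrap step can be sketched with Mammen's two-point distribution, a common concrete choice (the text does not specify which two-point distribution is meant, and the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def wild_bootstrap_residuals(r):
    """Wild bootstrap: multiply each residual r(i) by an independent
    draw w(i) from Mammen's two-point distribution, which has mean 0,
    variance 1 and third moment 1, so that r*(i) = r(i) * w(i) mimics
    the distribution of each individual residual."""
    sqrt5 = np.sqrt(5.0)
    a = -(sqrt5 - 1.0) / 2.0               # taken with probability p_a
    b = (sqrt5 + 1.0) / 2.0                # taken with probability 1 - p_a
    p_a = (sqrt5 + 1.0) / (2.0 * sqrt5)
    w = np.where(rng.random(len(r)) < p_a, a, b)
    return r * w
```

Each output value is the corresponding residual scaled by one of the two points, so heteroscedasticity in the residuals is preserved.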
The autoregressive bootstrap or ARB was developed in the early 1980s; relevant early papers include Freedman and Peters (1984), Peters and Freedman (1984), Efron and Tibshirani (1986) and Findley (1986). Then Bose (1988) showed second-order correctness of the ARB method for estimating stationary AR(p) models (not necessarily Gaussian) and even spacing. Validity of the ARB for nonstationary AR(p) models (e.g., random walk or unit-root processes) requires subsampling, that is, drawing fewer than n resample values at Step 12 of the ARB (Algorithm 3.4), see Lahiri (2003: Chapter 8 therein). The ARB was extended to stationary ARMA(p, q) models with even spacing by Kreiss and Franke (1992). It seems difficult to generate theoretical knowledge about ARB performance for time series models with uneven spacing.
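A condensed sketch of the ARB resampling idea for an unevenly spaced AR(1) series follows; it does not reproduce the numbered steps of the book's algorithm, the function name is illustrative, the mean is assumed to be zero and the estimation of the persistence time is not shown:

```python
import numpy as np

rng = np.random.default_rng(4)

def arb_resample(t, x, tau_hat):
    """One autoregressive bootstrap resample: compute white-noise
    residuals from the fitted AR(1) model for uneven spacing, centre
    them, draw with replacement and rebuild the series on the
    original, unchanged times.  tau_hat is the (bias-corrected)
    persistence time estimate."""
    n = len(x)
    d = np.diff(t)
    a = np.exp(-d / tau_hat)               # a'(i) in the ARB algorithm
    eps = (x[1:] - a * x[:-1]) / np.sqrt(1.0 - a * a)
    eps = eps - eps.mean()                 # centred white-noise residuals
    eps_star = rng.choice(eps, size=n - 1, replace=True)
    x_star = np.empty(n)
    x_star[0] = x[0]
    for i in range(1, n):
        x_star[i] = (a[i - 1] * x_star[i - 1]
                     + np.sqrt(1.0 - a[i - 1] ** 2) * eps_star[i - 1])
    return x_star
```

Repeating this B times and recomputing the estimator on each resample yields the bootstrap replications used, for example, by the BCa method.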
Other parametric bootstrap resampling methods than the ARB have been proposed. The sieve bootstrap (Kreiss 1992; Bühlmann 1997) assumes an AR(∞) process. Because of the high number of terms, this model is highly flexible and can approximate other persistence models than the AR(p) with p < ∞. The sieve bootstrap could therefore also be called a semiparametric bootstrap method. The drawback of this method regarding application to climate time series is that it is restricted to even spacing. The parametric bootstrap for Gaussian ARFIMA processes was shown to yield similar asymptotic coverage errors of CIs for covariance estimation as in the case of independent processes (Andrews and Lieberman 2002).
The frequencydomain bootstrap is explained in Chapter 5.
The surrogate data approach comes from dynamical systems theory in physics (Theiler et al. 1992). Contrary to the assertion in the review on surrogate time series by Schreiber and Schmitz (2000: p. 352 therein), this approach is not the common choice in the bootstrap literature. The so-called Monte Carlo approach (Press et al. 1992: Section 15.6 therein) is the same as the surrogate data approach.
Bootstrap CIs, their construction and statistical properties are reviewed in the above-mentioned textbooks and by DiCiccio and Efron (1996) and Carpenter and Bithell (2000). The challenging question "why not replace [CIs] with more informative tools?" has been raised by Hall and Martin (1996: p. 213 therein). This is based on their criticism that "the process of setting confidence intervals merely picks two points off a bootstrap histogram, ignoring much relevant information about shape and other important features." It has yet to be seen whether graphical tools such as those described by Hall and Martin (1996) will be accepted by the scientific communities. The percentile CI was proposed by Efron (1979), the BCa CI by Efron (1987). A numerical approximation to the BCa interval, called the ABC interval, was introduced by DiCiccio and Efron (1992). See Section 3.9 on numerical issues concerning construction of BCa intervals. Götze and Künsch (1996) show the second-order correctness of BCa CIs for various estimators and the MBB for serially dependent processes. Hall (1988) determined theoretical coverage accuracies of various bootstrap CI types for estimators that are smooth functions of the data. Bootstrap-t CIs are formed using the standard error, se*_b, of a single bootstrap replication (Efron and Tibshirani 1993). For simple estimators like the sample mean, plug-in estimates can be used for se*_b. However, for more complex estimators, no plug-in estimates are at hand. A second bootstrap loop (bootstrapping from bootstrap samples) would have to be invoked, which would increase computing costs.
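For the simple case of the mean, where a plug-in standard error is available, a bootstrap-t interval can be sketched in a few lines of Python. This is our own illustration, not the book's code; it uses iid resampling for brevity, whereas serially dependent data would require block or AR resampling:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_t_ci(x, B=1999, alpha=0.025):
    """Bootstrap-t CI for the mean (iid sketch): each replication is
    studentized with its own plug-in standard error s*_b / sqrt(n)."""
    n = len(x)
    mu_hat = x.mean()
    se_hat = x.std(ddof=1) / np.sqrt(n)
    t_star = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)
        se_b = xb.std(ddof=1) / np.sqrt(n)   # plug-in se of this replication
        t_star[b] = (xb.mean() - mu_hat) / se_b
    lo_q, hi_q = np.quantile(t_star, [alpha, 1 - alpha])
    # note the reversed roles of the two quantiles
    return mu_hat - hi_q * se_hat, mu_hat - lo_q * se_hat

x = rng.standard_normal(100)
lo, hi = bootstrap_t_ci(x)
print(lo, hi)
```

For a complex estimator without a plug-in se_b, the line computing se_b would itself need an inner bootstrap loop over each resample, which is where the computing costs escalate.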
Bootstrap calibration can strongly increase CI coverage accuracy. Consider that a single CI point is sought, say, the lower bound, θ̂_l, for an estimate, θ̂. Let the bound be calculated for each bootstrap sample, b = 1,..., B, and over a grid of confidence levels, λ.
For each λ, compute the fraction of bootstrap samples whose lower bound fails to cover the estimate, p̂(λ) = #{θ̂*_l,b(λ) > θ̂}/B.
Finally, solve p̂(λ) = α for λ. In case a two-sided, equi-tailed CI is sought, the calibration curve p̂(λ) = 1 − 2α, where p̂(λ) = #{θ̂*_l,b(λ) ≤ θ̂ ≤ θ̂*_u,b(λ)}/B, is solved for λ. Calculating the CI points for each bootstrap sample requires performing a second bootstrap-estimation loop. Applying second-loop bootstrap methods like calibration or bootstrap-t interval construction may therefore require enormous computing costs. Relevant papers on calibrated bootstrap CIs include Hall (1986), Loh (1987, 1991), Hall and Martin (1988), Martin (1990) and Booth and Hall (1994). Regarding resampling data from serially dependent processes, Choi and Hall (2000) report that the sieve or AR(∞) bootstrap has a significantly better performance than blocking methods in CI calibration. However, the sieve bootstrap is not applicable to unevenly spaced time series. This book presents a Monte Carlo experiment on calibrated bootstrap CIs for correlation estimation (Chapter 7), with satisfying coverage performance despite the MBB resampling used.
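The double-loop structure of calibration can be sketched for a two-sided percentile CI of the mean. This is an illustrative Python fragment with iid resampling; B1, B2 and the λ grid are small, arbitrary values chosen for speed, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(2)

def calibrate_level(x, alpha=0.025, B1=199, B2=199,
                    grid=np.linspace(0.001, 0.1, 25)):
    """Double-bootstrap calibration sketch: for each candidate level lam,
    estimate the fraction of first-loop resamples whose percentile CI
    covers the original estimate, then pick the lam whose estimated
    coverage is closest to the target 1 - 2*alpha."""
    n = len(x)
    theta_hat = x.mean()
    cover = np.zeros(len(grid))
    for b in range(B1):
        xb = rng.choice(x, size=n, replace=True)
        # second bootstrap loop: replications drawn from the resample
        theta_star = np.array([rng.choice(xb, size=n, replace=True).mean()
                               for _ in range(B2)])
        for k, lam in enumerate(grid):
            lo, hi = np.quantile(theta_star, [lam, 1 - lam])
            cover[k] += (lo <= theta_hat <= hi)
    cover /= B1
    return grid[np.argmin(np.abs(cover - (1 - 2 * alpha)))]

x = rng.standard_normal(50)
lam = calibrate_level(x)
print(lam)
```

The B1 × B2 resampling effort in the inner list comprehension is exactly the cost explosion mentioned above.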
Bootstrap hypothesis tests are detailed by Davison and Hinkley (1997: Chapter 4 therein), see also Efron and Tibshirani (1993: Chapter 15 therein) and Lehmann and Romano (2005: Chapter 15 therein).
The relation between making a test statistic pivotal and bootstrap CI calibration is described by Beran (1987, 1988). Guidelines for bootstrap hypothesis testing are provided by Hall and Wilson (1991). An extension of MBB hypothesis testing of the mean from univariate to multivariate time series has been presented by Wilks (1997). The dimensionality may be rather high, and the method may therefore be applicable to time-dependent climate fields such as gridded temperature output from a mathematical climate model. Beersma and Buishand (1999) compare variances of bivariate time series using jackknife resampling. They find significantly higher variability of future northern European precipitation amounts in the computer simulation with elevated greenhouse gas concentrations than in the simulation without (control run). Huybers and Wunsch (2005) test the hypothesis that Earth's obliquity variations influence glacial terminations during the late Pleistocene using parametric resampling of the timescale (Section 4.1.7).
Multiple hypothesis tests may be performed when analysing a hypothesis that consists of several subhypotheses. This situation arises in spectrum estimation (Chapter 5), where a range of frequencies is examined. The traditional method is adjusting the P-values of the individual tests to yield the desired overall P-value. A recent paper (Storey 2007: p. 347 therein) states "that one can improve the overall performance of multiple significance tests by borrowing information across all the tests when assessing the relative significance of each one, rather than calculating P-values for each test individually."
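Two standard adjustments illustrate the contrast drawn by Storey (2007): the classical Bonferroni correction treats each P-value individually, whereas the Benjamini-Hochberg step-up procedure ranks all P-values jointly, one way of borrowing information across the tests. The Python sketch below is our own, with made-up P-values:

```python
import numpy as np

def bonferroni(p, alpha=0.05):
    """Reject H0 for each test whose P-value survives the classical
    Bonferroni adjustment (multiply by the number of tests)."""
    p = np.asarray(p)
    return p * len(p) < alpha

def benjamini_hochberg(p, q=0.05):
    """Benjamini-Hochberg step-up procedure: compares the ranked P-values
    against increasing thresholds q*k/m, controlling the false discovery
    rate across all m tests jointly."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank under its threshold
        reject[order[:k + 1]] = True
    return reject

p = np.array([0.001, 0.008, 0.039, 0.041, 0.60])
print(bonferroni(p).sum(), benjamini_hochberg(p).sum())
```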
The anthropogenic warming signal has stimulated much work applying various types of hypothesis tests using measured and AOGCM temperature data. More details on the fingerprint approach are contained in the following papers: Hasselmann (1997), Hegerl and North (1997) and Hegerl et al. (1997). Correlation approaches to detect the anthropogenic warming signal are described by Folland et al. (1998) and Wigley et al. (2000). A recent overview is given by Barnett et al. (2005).
3.9 Technical issues
The standard normal (Gaussian) distribution has the following PDF:

f(x) = (2π)^{−1/2} exp(−x²/2).

Figure 3.1 shows the distributional shape. The distribution function,

F(x) = ∫_{−∞}^{x} f(x′) dx′,

cannot be expressed in closed analytical form. We use

F(x) = 1 − erfcc(x/2^{1/2})/2,
where for x ≥ 0 the complementary error function, erfcc, is approximated (Press et al. 1992: Section 6.2 therein) via

erfcc(u) ≈ v exp(−u² − 1.26551223 + v(1.00002368 + v(0.37409196 + v(0.09678418 + v(−0.18628806 + v(0.27886807 + v(−1.13520398 + v(1.48851587 + v(−0.82215223 + v · 0.17087277)))))))))

with u = x/2^{1/2} and v = 1/(1 + u/2).
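The approximation can be checked directly against a library implementation. The following Python sketch (our own transcription of the polynomial from Press et al. 1992) builds F(x) from erfcc:

```python
import math

def erfcc(x):
    """Chebyshev-fit approximation to the complementary error function
    (Press et al. 1992, Section 6.2); relative error below 1.2e-7."""
    z = abs(x)
    v = 1.0 / (1.0 + 0.5 * z)
    ans = v * math.exp(-z * z - 1.26551223 + v * (1.00002368 + v * (0.37409196
        + v * (0.09678418 + v * (-0.18628806 + v * (0.27886807
        + v * (-1.13520398 + v * (1.48851587 + v * (-0.82215223
        + v * 0.17087277)))))))))
    return ans if x >= 0.0 else 2.0 - ans   # symmetry for negative arguments

def std_normal_cdf(x):
    """Standard normal distribution function, F(x) = 1 - erfcc(x/sqrt(2))/2."""
    return 1.0 - 0.5 * erfcc(x / math.sqrt(2.0))

print(std_normal_cdf(0.0))   # 0.5 up to the approximation error
```

Comparing erfcc against the exact math.erfc confirms the quoted error bound over the usual range of arguments.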
For x < 0, use the symmetry, F(x) = 1 − F(−x). For all x, this approximation has a relative error of less than 1.2 · 10^{−7} (Press et al. 1992). The inverse function of F(x) defines the percentage point on the x axis, z(β), with 0 < β < 1. Approximations are used for calculating z(β); for the Monte Carlo simulation experiments in this book, the formula given by Odeh and Evans (1974) is employed:
where the auxiliary quantities and polynomial coefficients are as listed by Odeh and Evans (1974).
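Where the coefficients of a closed-form approximation such as that of Odeh and Evans (1974) are not at hand, z(β) can also be obtained by simple bisection on F(x) = β. A Python sketch (our own, using the library erfc for F):

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function via the complementary
    error function: F(x) = 1 - erfc(x/sqrt(2))/2."""
    return 1.0 - 0.5 * math.erfc(x / math.sqrt(2.0))

def z(beta, tol=1e-10):
    """Percentage point z(beta), 0 < beta < 1, by bisection on
    F(x) = beta; slower than a closed-form approximation but robust."""
    lo, hi = -40.0, 40.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if std_normal_cdf(mid) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(z(0.975), 4))   # 1.96, the familiar two-sided 95% point
```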