

boot(boot)                                   R Documentation

_B_o_o_t_s_t_r_a_p _R_e_s_a_m_p_l_i_n_g

_D_e_s_c_r_i_p_t_i_o_n_:

     Generate `R' bootstrap replicates of a statistic
     applied to data.  Both parametric and nonparametric
     resampling are possible.  For the nonparametric
     bootstrap, possible resampling methods are the ordinary
     bootstrap, the balanced bootstrap, antithetic
     resampling, and permutation.  For nonparametric
     multi-sample problems stratified resampling is used.
     This is specified by including a vector of strata in
     the call to boot.  Importance resampling weights may be
     specified.

_U_s_a_g_e_:

     boot(data, statistic, R, sim="ordinary", stype="i",
          strata=rep(1,n), L=NULL, m=0, weights=NULL,
          ran.gen=function(d, p) d, mle=NULL, ...)

_A_r_g_u_m_e_n_t_s_:

    data: The data as a vector, matrix or data frame.  If it
          is a matrix or data frame then each row is
          considered as one multivariate observation.

statistic: A function which when applied to data returns a
          vector containing the statistic(s) of interest.
          When `sim="parametric"', the first argument to
          `statistic' must be the data.  For each replicate
          a simulated dataset returned by `ran.gen' will be
          passed.  In all other cases `statistic' must take
          at least two arguments.  The first argument passed
          will always be the original data. The second will
          be a vector of indices, frequencies or weights
          which define the bootstrap sample.  Further, if
          predictions are required, then a third argument is
          required which would be a vector of the random
          indices used to generate the bootstrap
          predictions.  Any further arguments can be passed
          to `statistic' through the `...' argument.

       R: The number of bootstrap replicates.  Usually this
          will be a single positive integer.  For importance
          resampling, some resamples may use one set of
          weights and others use a different set of weights.
          In this case `R' would be a vector of integers
          where each component gives the number of resamples
          from each of the rows of weights.

     sim: A character string indicating the type of
          simulation required.  Possible values are
          `"ordinary"' (the default), `"parametric"',
          `"balanced"', `"permutation"', or `"antithetic"'.
          Importance resampling is specified by including
          importance weights; the type of importance
          resampling must still be specified but may only be
          `"ordinary"' or `"balanced"' in this case.

   stype: A character string indicating what the second
          argument of statistic represents.  Possible values
          of stype are `"i"' (indices - the default), `"f"'
          (frequencies), or `"w"' (weights).

  strata: An integer vector or factor specifying the strata
          for multi-sample problems.  This may be specified
          for any simulation, but is ignored when `sim' is
          `"parametric"'. When `strata' is supplied for a
          nonparametric bootstrap, the simulations are done
          within the specified strata.

       L: Vector of influence values evaluated at the
          observations.  This is used only when `sim' is
          `"antithetic"'.  If not supplied, they are
          calculated through a call to `empinf'.  This will
          use the infinitesimal jackknife provided that
          `stype' is `"w"'; otherwise the usual jackknife is
          used.

       m: The number of predictions which are to be made at
          each bootstrap replicate.  This is most useful for
          (generalized) linear models.  This can only be
          used when `sim' is `"ordinary"'.  `m' will usually
          be a single integer but, if there are strata, it
          may be a vector with length equal to the number of
          strata, specifying how many of the errors for
          prediction should come from each stratum.  The
          actual predictions should be returned as the final
          part of the output of `statistic', which should
          also take a vector of indices of the errors to be
          used for the predictions.

 weights: Vector or matrix of importance weights.  If a
          vector then it should have as many elements as
          there are observations in `data'.  When simulation
          from more than one set of weights is required,
          `weights' should be a matrix where each row of the
          matrix is one set of importance weights.  If
          `weights' is a matrix then `R' must be a vector of
          length `nrow(weights)'.  This parameter is ignored
          if `sim' is not `"ordinary"' or `"balanced"'.

 ran.gen: This function is used only when `sim' is
          `"parametric"', in which case it describes how
          random values are to be generated.  It should be a
          function of two arguments.  The first argument
          should be the observed data and the second
          argument consists of any other information needed
          (e.g. parameter estimates).  The second argument
          may be a list, allowing any number of items to be
          passed to `ran.gen'.  The returned value should be
          a simulated data set of the same form as the
          observed data, which will be passed to `statistic'
          to get a bootstrap replicate.  It is important
          that the returned value be of the same shape and
          type as the original dataset.  If `ran.gen' is not
          specified, the default is a function which returns
          the original `data', in which case all simulation
          should be included as part of `statistic'.  Use of
          `sim="parametric"' with a suitable `ran.gen'
          allows the user to implement any type of
          nonparametric resampling which is not supported
          directly.

     mle: The second argument to be passed to `ran.gen'.
          Typically these will be maximum likelihood
          estimates of the parameters.  For efficiency `mle'
          is often a list containing all of the objects
          needed by `ran.gen' which can be calculated using
          the original data set only.

     ...: Any other arguments for `statistic' which are
          passed unchanged each time it is called.  Any such
          arguments to `statistic' must follow the arguments
          which `statistic' is required to have for the
          simulation.

_D_e_t_a_i_l_s_:

     The statistic to be bootstrapped can be as simple or
     complicated as desired as long as its arguments
     correspond to the dataset and (for a nonparametric
     bootstrap) a vector of indices, frequencies or weights.
     `statistic' is treated as a black box by the `boot'
     function and is not checked to ensure that these
     conditions are met.
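
     As an illustration of the three conventions, the sketch
     below writes the ratio of means for the `city' data
     once for each `stype'.  The names `ratio.i', `ratio.f'
     and `ratio.w' are chosen here for illustration only.
     Resetting the seed before each call is assumed to
     reproduce the same resamples, so the three sets of
     replicates should agree up to rounding.

          library(boot)
          data(city)
          # second argument is a vector of row indices
          ratio.i <- function(d, i) sum(d$x[i])/sum(d$u[i])
          # second argument is a vector of frequencies
          ratio.f <- function(d, f) sum(d$x * f)/sum(d$u * f)
          # second argument is a vector of weights
          ratio.w <- function(d, w) sum(d$x * w)/sum(d$u * w)

          set.seed(1); b.i <- boot(city, ratio.i, R=199)
          set.seed(1); b.f <- boot(city, ratio.f, R=199, stype="f")
          set.seed(1); b.w <- boot(city, ratio.w, R=199, stype="w")
          all.equal(b.i$t, b.f$t)
          all.equal(b.i$t, b.w$t)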

     The first order balanced bootstrap is described in
     Davison, Hinkley and Schechtman (1986).  The antithetic
     bootstrap is described by Hall (1989) and is
     experimental, particularly when used with strata.  The
     other nonparametric simulation types are the ordinary
     bootstrap (possibly with unequal probabilities), and
     permutation, which returns random permutations of
     cases.  All of these methods work independently within
     strata if that argument is supplied.
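
     First-order balance can be checked from the frequency
     array returned by `boot.array'.  A minimal sketch,
     using the `city' data with an illustrative statistic
     named `ratio.i' (not part of the package):

          library(boot)
          data(city)
          ratio.i <- function(d, i) sum(d$x[i])/sum(d$u[i])
          city.bal <- boot(city, ratio.i, R=99, sim="balanced")
          # one row per replicate, one column per observation
          f <- boot.array(city.bal)
          # under first-order balance each observation is used
          # R times in total across the replicates
          colSums(f)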

     For the parametric bootstrap it is necessary for the
     user to specify how the resampling is to be conducted.
     The best way of accomplishing this is to specify the
     function `ran.gen' which will return a simulated data
     set from the observed data set and a set of parameter
     estimates specified in `mle'.

_V_a_l_u_e_:

     The returned value is an object of class `"boot"',
     containing the following components:

      t0: The observed value of `statistic' applied to
          `data'.

       t: A matrix with `R' rows, each of which is a
          bootstrap replicate of `statistic'.

       R: The value of `R' as passed to `boot'.

    data: The `data' as passed to `boot'.

    seed: The value of `.Random.seed' when `boot' was
          called.

statistic: The function `statistic' as passed to `boot'.

     sim: Simulation type used.

   stype: Statistic type as passed to `boot'.

    call: The original call to `boot'.

  strata: The strata used.  This is the vector passed to
          `boot' if it was supplied, or a vector of ones if
          there were no strata.  It is not returned if `sim'
          is `"parametric"'.

 weights: The importance sampling weights as passed to
          `boot' or the empirical distribution function
          weights if no importance sampling weights were
          specified.  It is omitted if `sim' is not one of
          `"ordinary"' or `"balanced"'.

  pred.i: If predictions are required (`m>0') this is the
          matrix of indices at which predictions were
          calculated as they were passed to `statistic'.
          Omitted if `m' is `0' or `sim' is not
          `"ordinary"'.

       L: The influence values used when `sim' is
          `"antithetic"'.  If no such values were specified
          and `stype' is not `"w"' then `L' is returned as
          consecutive integers corresponding to the
          assumption that data is ordered by influence
          values.  This component is omitted when `sim' is
          not `"antithetic"'.

 ran.gen: The random generator function used if `sim' is
          `"parametric"'. This component is omitted for any
          other value of `sim'.

     mle: The parameter estimates passed to `boot' when
          `sim' is `"parametric"'.  It is omitted for all
          other values of `sim'.
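
     A short sketch of inspecting some of these components,
     using the `city' data and an illustrative statistic
     named `ratio.i' (not part of the package):

          library(boot)
          data(city)
          ratio.i <- function(d, i) sum(d$x[i])/sum(d$u[i])
          city.boot <- boot(city, ratio.i, R=99)
          city.boot$t0       # observed ratio on the original data
          dim(city.boot$t)   # R rows, one column per statistic
          city.boot$sim      # "ordinary"
          city.boot$stype    # "i"
          # no strata were supplied, so a vector of ones is stored
          all(city.boot$strata == 1)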

_R_e_f_e_r_e_n_c_e_s_:

     There are many references explaining the bootstrap and
     its variations.  Among them are:

     Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced
     importance resampling for the bootstrap. Annals of
     Statistics, 21, 286-298.

     Davison, A.C. and Hinkley, D.V. (1997) Bootstrap
     Methods and Their Application. Cambridge University
     Press.

     Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986)
     Efficient bootstrap simulation. Biometrika, 73,
     555-566.

     Efron, B. and Tibshirani, R. (1993) An Introduction to
     the Bootstrap. Chapman & Hall.

     Gleason, J.R. (1988) Algorithms for balanced bootstrap
     simulations. American Statistician, 42, 263-266.

     Hall, P. (1989) Antithetic resampling for the
     bootstrap. Biometrika, 76, 713-724.

     Hinkley, D.V. (1988) Bootstrap methods (with
     Discussion). Journal of the Royal Statistical Society,
     B, 50, 312-337, 355-370.

     Hinkley, D.V. and Shi, S. (1989) Importance sampling
     and the nested bootstrap.  Biometrika, 76, 435-446.

     Johns, M.V. (1988) Importance sampling for bootstrap
     confidence intervals. Journal of the American
     Statistical Association, 83, 709-714.

     Noreen, E.W. (1989) Computer Intensive Methods for
     Testing Hypotheses. John Wiley & Sons.

_S_e_e _A_l_s_o_:

     `boot.array', `boot.ci', `boot.object', `censboot',
     `empinf', `jack.after.boot', `tilt.boot', `tsboot'

_E_x_a_m_p_l_e_s_:

     # usual bootstrap of the ratio of means using the city data
     data(city)
     ratio <- function(d, w)
          sum(d$x * w)/sum(d$u * w)
     boot(city, ratio, R=999, stype="w")

     # Stratified resampling for the difference of means.  In this
     # example we will look at the difference of means between the final
     # two series in the gravity data.
     data(gravity)
     diff.means <- function(d, f)
     {    n <- nrow(d)
          gp1 <- 1:table(as.numeric(d$series))[1]
          m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
          m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
          ss1 <- sum(d[gp1,1]^2 * f[gp1]) -
                 (m1 *  m1 * sum(f[gp1]))
          ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) -
                 (m2 *  m2 * sum(f[-gp1]))
          c(m1-m2, (ss1+ss2)/(sum(f)-2))
     }
     grav1 <- gravity[as.numeric(gravity[,2])>=7,]
     boot(grav1, diff.means, R=999, stype="f", strata=grav1[,2])

     #  In this example we show the use of boot in a prediction from
     #  regression based on the nuclear data.  This example is taken
     #  from Example 6.8 of Davison and Hinkley (1997).  Notice also
     #  that two extra arguments to statistic are passed through boot.
     data(nuclear)
     nuke <- nuclear[,c(1,2,5,7,8,10,11)]
     nuke.lm <- glm(log(cost)~date+log(cap)+ne+ ct+log(cum.n)+pt, data=nuke)
     nuke.diag <- glm.diag(nuke.lm)
     nuke.res <- nuke.diag$res*nuke.diag$sd
     nuke.res <- nuke.res-mean(nuke.res)

     #  We set up a new dataframe with the data, the standardized
     #  residuals and the fitted values for use in the bootstrap.
     nuke.data <- data.frame(nuke,resid=nuke.res,fit=fitted(nuke.lm))

     #  Now we want a prediction of plant number 32 but at date 73.00
     new.data <- data.frame(cost=1, date=73.00, cap=886, ne=0,
                            ct=0, cum.n=11, pt=1)
     new.fit <- predict(nuke.lm, new.data)

     #  The formula is evaluated inside nuke.fun, so `inds' is found in
     #  the function's own environment and no global copy is needed.
     nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
     {
          lm.b <- glm(fit+resid[inds] ~date+log(cap)+ne+ct+
               log(cum.n)+pt, data=dat)
          pred.b <- predict(lm.b,x.pred)
          c(coef(lm.b), pred.b-(fit.pred+dat$resid[i.pred]))
     }

     nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,
          fit.pred=new.fit, x.pred=new.data)
     #  The bootstrap prediction error would then be found by
     mean(nuke.boot$t[,8]^2)
     #  Basic bootstrap prediction limits would be
     new.fit-sort(nuke.boot$t[,8])[c(975,25)]

     #  Finally a parametric bootstrap.  For this example we shall look
     #  at the air-conditioning data.  In this example our aim is to test
     #  the hypothesis that the true value of the index is 1 (i.e. that
     #  the data come from an exponential distribution) against the
     #  alternative that the data come from a gamma distribution with
     #  index not equal to 1.
     air.fun <- function(data)
     {    ybar <- mean(data$hours)
          para <- c(log(ybar),mean(log(data$hours)))
          ll <- function(k) {
               if (k <= 0) out <- 1e200 # not NA
               else out <- lgamma(k)-k*(log(k)-1-para[1]+para[2])
               out
          }
          khat <- nlm(ll,ybar^2/var(data$hours))$estimate
          c(ybar, khat)
     }

     air.rg <- function(data, mle)
     #  Function to generate random exponential variates.  mle will contain
     #  the mean of the original data
     {    out <- data
          out$hours <- rexp(nrow(out), 1/mle)
          out
     }

     data(aircondit)
     air.boot <- boot(aircondit, air.fun, R=999, sim="parametric",
          ran.gen=air.rg, mle=mean(aircondit$hours))

     # The bootstrap p-value can then be approximated by
     sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)

