Title: | Simulation-Based Inference for Regression Models |
---|---|
Description: | Performs simulation-based inference as an alternative to the delta method for obtaining valid confidence intervals and p-values for regression post-estimation quantities, such as average marginal effects and predictions at representative values. This framework for simulation-based inference is especially useful when the resulting quantity is not normally distributed and the delta method approximation fails. The methodology is described in King, Tomz, and Wittenberg (2000) <doi:10.2307/2669316>. 'clarify' is meant to replace some of the functionality of the archived package 'Zelig'; see the vignette "Translating Zelig to clarify" for replicating this functionality. |
Authors: | Noah Greifer [aut, cre] , Steven Worthington [aut] , Stefano Iacus [aut] , Gary King [aut] |
Maintainer: | Noah Greifer <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.1 |
Built: | 2024-10-27 05:07:36 UTC |
Source: | https://github.com/iqss/clarify |
misim()
simulates model parameters from multivariate normal or t distributions after multiple imputation that are then used by sim_apply()
to calculate quantities of interest.
misim(fitlist, n = 1000, vcov = NULL, coefs = NULL, dist = NULL)
misim(fitlist, n = 1000, vcov = NULL, coefs = NULL, dist = NULL)
fitlist |
a list of model fits, one for each imputed dataset, or a |
n |
the number of simulations to run for each imputed dataset; default is 1000. More is always better but resulting calculations will take longer. |
vcov |
a square covariance matrix of the coefficient covariance estimates, a function to use to extract it from |
coefs |
a vector of coefficient estimates, a function to use to extract it from |
dist |
a character vector containing the name of the multivariate distribution(s) to use to draw simulated coefficients. Should be one of |
misim()
essentially combines multiple sim()
calls applied to a list of model fits, each fit in an imputed dataset, into a single combined pool of simulated coefficients. When simulation-based inference is to be used with multiply imputed data, many imputations are required; see Zhou and Reiter (2010).
A clarify_misim
object, which inherits from clarify_sim
and has the following components:
sim.coefs |
a matrix containing the simulated coefficients with a column for each coefficient and a row for each simulation for each imputation |
coefs |
a matrix containing the original coefficients extracted from |
fit |
the list of model fits supplied to |
imp |
a identifier of which imputed dataset each set of simulated coefficients corresponds to. |
The "dist"
attribute contains "normal"
if the coefficients were sampled from a multivariate normal distribution and "t({df})"
if sampled from a multivariate t distribution. The "clarify_hash"
attribute contains a unique hash generated by rlang::hash()
.
Zhou, X., & Reiter, J. P. (2010). A Note on Bayesian Inference After Multiple Imputation. The American Statistician, 64(2), 159–163. doi:10.1198/tast.2010.09109
sim()
for simulating model coefficients for a single dataset
sim_apply()
for applying a function to each set of simulated coefficients
sim_ame()
for computing average marginal effects in each simulation draw
sim_setx()
for computing marginal predictions and first differences at typical values in each simulation draw
data("africa", package = "Amelia") # Multiple imputation using Amelia a.out <- Amelia::amelia(x = africa, m = 10, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0) fits <- with(a.out, lm(gdp_pc ~ infl * trade)) # Simulate coefficients s <- misim(fits) s
data("africa", package = "Amelia") # Multiple imputation using Amelia a.out <- Amelia::amelia(x = africa, m = 10, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0) fits <- with(a.out, lm(gdp_pc ~ infl * trade)) # Simulate coefficients s <- misim(fits) s
sim_adrf()
plot.clarify_adrf()
plots the output of sim_adrf()
. For the average dose-response function (ADRF, requested with contrast = "adrf"
in sim_adrf()
), this is a plot of the average marginal mean of the outcome against the requested values of the focal predictor; for the average marginal effects function (AMEF, requested with contrast = "amef"
in sim_adrf()
), this is a plot of the instantaneous average marginal effect of the focal predictor on the outcome against the requested values of the focal predictor.
## S3 method for class 'clarify_adrf' plot( x, ci = TRUE, level = 0.95, method = "quantile", baseline, color = "black", ... )
## S3 method for class 'clarify_adrf' plot( x, ci = TRUE, level = 0.95, method = "quantile", baseline, color = "black", ... )
x |
a |
ci |
|
level |
the confidence level desired. Default is .95 for 95% confidence intervals. |
method |
the method used to compute confidence bands. Can be |
baseline |
|
color |
the color of the line and confidence band in the plot. |
... |
for |
These plots are produced using ggplot2::geom_line()
and ggplot2::geom_ribbon()
. The confidence bands should be interpreted pointwise (i.e., they do not account for simultaneous inference).
A ggplot
object.
summary.clarify_est()
for computing p-values and confidence intervals for the estimated quantities.
## See help("sim_adrf") for examples
## See help("sim_adrf") for examples
clarify_est
objectssummary()
tabulates the estimates and confidence intervals and (optionally) p-values from a clarify_est
object. confint()
computes confidence intervals. plot()
plots the "posterior" distribution of estimates.
## S3 method for class 'clarify_est' plot( x, parm, ci = TRUE, level = 0.95, method = "quantile", reference = FALSE, ncol = 3, ... ) ## S3 method for class 'clarify_est' summary(object, parm, level = 0.95, method = "quantile", null = NA, ...) ## S3 method for class 'clarify_est' confint(object, parm, level = 0.95, method = "quantile", ...)
## S3 method for class 'clarify_est' plot( x, parm, ci = TRUE, level = 0.95, method = "quantile", reference = FALSE, ncol = 3, ... ) ## S3 method for class 'clarify_est' summary(object, parm, level = 0.95, method = "quantile", null = NA, ...) ## S3 method for class 'clarify_est' confint(object, parm, level = 0.95, method = "quantile", ...)
parm |
a vector of the names or indices of the estimates to plot. If unspecified, all estimates will be displayed. |
ci |
|
level |
the confidence level desired. Default is .95 for 95% confidence intervals. |
method |
the method used to compute p-values and confidence intervals. Can be |
reference |
|
ncol |
the number of columns used when wrapping multiple plots; default is 3. |
... |
for |
object , x
|
a |
null |
the values of the parameters under the null hypothesis for the p-value calculations. Should have length equal to the number of quantities estimated, or one, in which case it will be recycled, or it can be a named vector with just the names of quantities for which null values are to be set. Set values to |
summary()
uses the estimates computed from the original model as its estimates and uses the simulated parameters for inference only, in line with the recommendations of Rainey (2023).
When method = "wald"
, the standard deviation of the simulation estimates is used as the standard error, which is used in the z-statistics and the confidence intervals. The p-values and confidence intervals are valid only when the sampling distribution of the resulting statistic is normal (which can be assessed using plot()
). When method = "quantile"
, the confidence interval is calculated using the quantiles of the simulation estimates corresponding to level
, and the p-value is calculated as twice the proportion of simulation estimates less than or greater than null
, whichever is smaller; this is equivalent to inverting the confidence interval but is only truly valid when the true sampling distribution is only a location shift from the sampling distribution under the null hypothesis and should therefore be interpreted with caution. Using "method = "quantile"
(the default) is recommended because the confidence intervals will be valid even if the sampling distribution is not Normally distributed. The precision of the p-values and confidence intervals depends on the number of simulations requested (the value of n
supplied to sim()
).
The plots are produced using ggplot2::geom_density()
and can be customized with ggplot2 functions. When reference = TRUE
, a reference Normal distribution is produced using the empirical mean and standard deviation of the simulated values. A blue references line is plotted at the median of the simulated values. For Wald-based inference to be valid, the reference distribution should overlap with the empirical distribution, in which case the quantile-based and Wald-based intervals should be similar. For quantile-based inference to be valid, the median of the estimates should overlap with the estimated value; this is a necessary but not sufficient condition, though.
For summary()
, a summary.clarify_est
object, which is a matrix containing the coefficient estimates, standard errors, test statistics, p-values, and confidence intervals. Not all columns will be present depending on the arguments supplied to summary()
.
For confint()
, a matrix containing the confidence intervals for the requested quantities.
For plot()
, a ggplot
object.
Rainey, C. (2023). A careful consideration of CLARIFY: Simulation-induced bias in point estimates of quantities of interest. Political Science Research and Methods, 1–10. doi:10.1017/psrm.2023.8
sim_apply()
for applying a function to each set of simulated coefficients
data("lalonde", package = "MatchIt") fit <- glm(I(re78 > 0) ~ treat + age + race + nodegree + re74, data = lalonde) s <- sim(fit, n = 100) # Compute average marginal means for `treat` est <- sim_ame(s, var = "treat", verbose = FALSE) coef(est) # Compute average marginal effects on risk difference # (RD) and risk ratio (RR) scale est <- transform(est, RD = `E[Y(1)]` - `E[Y(0)]`, RR = `E[Y(1)]` / `E[Y(0)]`) # Compute confidence intervals and p-values, # using given null values for computing p-values summary(est, null = c(`RD` = 0, `RR` = 1)) # Same tests using normal approximation and alternate # syntax for `null` summary(est, null = c(NA, NA, 0, 1), normal = TRUE) # Plot the RD and RR with a reference distribution plot(est, parm = c("RD", "RR"), reference = TRUE, ci = FALSE) # Plot the RD and RR with quantile confidence bounds plot(est, parm = c("RD", "RR"), ci = TRUE)
data("lalonde", package = "MatchIt") fit <- glm(I(re78 > 0) ~ treat + age + race + nodegree + re74, data = lalonde) s <- sim(fit, n = 100) # Compute average marginal means for `treat` est <- sim_ame(s, var = "treat", verbose = FALSE) coef(est) # Compute average marginal effects on risk difference # (RD) and risk ratio (RR) scale est <- transform(est, RD = `E[Y(1)]` - `E[Y(0)]`, RR = `E[Y(1)]` / `E[Y(0)]`) # Compute confidence intervals and p-values, # using given null values for computing p-values summary(est, null = c(`RD` = 0, `RR` = 1)) # Same tests using normal approximation and alternate # syntax for `null` summary(est, null = c(NA, NA, 0, 1), normal = TRUE) # Plot the RD and RR with a reference distribution plot(est, parm = c("RD", "RR"), reference = TRUE, ci = FALSE) # Plot the RD and RR with quantile confidence bounds plot(est, parm = c("RD", "RR"), ci = TRUE)
sim_setx()
plot.clarify_sext()
plots the output of sim_setx()
, providing graphics similar to those of plot.clarify_est()
but with features specifically for plot marginal predictions. For continues predictors, this is a plot of the marginal predictions and their confidence bands across levels of the predictor. Otherwise, this is is a plot of simulated sampling distribution of the marginal predictions.
## S3 method for class 'clarify_setx' plot( x, var = NULL, ci = TRUE, level = 0.95, method = "quantile", reference = FALSE, ... )
## S3 method for class 'clarify_setx' plot( x, var = NULL, ci = TRUE, level = 0.95, method = "quantile", reference = FALSE, ... )
x |
a |
var |
the name of the focal varying predictor, i.e., the variable to be on the x-axis of the plot. All other variables with varying set values will be used to color the resulting plot. See Details. Ignored if no predictors vary or if only one predictor varies in the reference grid or if |
ci |
|
level |
the confidence level desired. Default is .95 for 95% confidence intervals. |
method |
the method used to compute confidence intervals or bands. Can be |
reference |
|
... |
for |
plot()
creates one of two kinds of plots depending on how the reference grid was specified in the call to sim_setx()
and what var
is set to. When the focal varying predictor (i.e., the one set in var
) is numeric and takes on three or more unique values in the reference grid, the produced plot is a line graph displaying the value of the marginal prediction (denoted as E[Y|X]
) across values of the focal varying predictor, with confidence bands displayed when ci = TRUE
. If other predictors also vary, lines for different values will be displayed in different colors. These plots are produced using ggplot2::geom_line()
and ggplot2::geom_ribbon()
When the focal varying predictor is a factor or character or only takes on two or fewer values in the reference grid, the produced plot is a density plot of the simulated predictions, similar to the plot resulting from plot.clarify_est()
. When other variables vary, densities for different values will be displayed in different colors. These plots are produced using ggplot2::geom_density()
.
Marginal predictions are identified by the corresponding levels of the predictors that vary. The user should keep track of whether the non-varying predictors are set at specified or automatically set "typical" levels.
A ggplot
object.
summary.clarify_est()
for computing p-values and confidence intervals for the estimated quantities.
## See help("sim_setx") for examples
## See help("sim_setx") for examples
sim()
simulates model parameters from a multivariate normal or t distribution that are then used by sim_apply()
to calculate quantities of interest.
sim(fit, n = 1000, vcov = NULL, coefs = NULL, dist = NULL)
sim(fit, n = 1000, vcov = NULL, coefs = NULL, dist = NULL)
fit |
a model fit, such as the output of a call to |
n |
the number of simulations to run; default is 1000. More is always better but resulting calculations will take longer. |
vcov |
either a square covariance matrix of the coefficient covariance estimates or a function to use to extract it from |
coefs |
either a vector of coefficient estimates or a function to use to extract it from |
dist |
a string containing the name of the multivariate distribution to use to draw simulated coefficients. Should be one of |
When dist
is NULL
, sim()
samples from a multivariate normal or t distribution depending on the degrees of freedom extracted from insight::get_df(., type = "wald")
. If Inf
, a normal distribution will be used; otherwise, a t-distribution with the returned degrees of freedom will be used. Models not supported by insight
will use a normal distribution.
When a multivariate normal is used, it is sampled from with means equal to the estimated coefficients and the parameter covariance matrix as the covariance matrix using mvnfast::rmvn()
. When a multivariate t distribution is used, it is sampled from with means equal to the estimated coefficients and scaling matrix equal to cov*(df - 2)/df
, where cov
is the parameter covariance matrix and df
is the residual degrees of freedom for the model, using mvnfast::rmvt()
.
A clarify_sim
object, which has the following components:
sim.coefs |
a matrix containing the simulated coefficients with a column for each coefficient and a row for each simulation |
coefs |
the original coefficients extracted from |
vcov |
the covariance matrix of the coefficients extracted from |
fit |
the original model fit supplied to |
The "dist"
attribute contains "normal"
if the coefficients were sampled from a multivariate normal distribution and "t(df)"
if sampled from a multivariate t distribution. The "clarify_hash"
attribute contains a unique hash generated by rlang::hash()
.
misim()
for simulating model coefficients after multiple imputation
sim_apply()
for applying a function to each set of simulated coefficients
sim_ame()
for computing average marginal effects in each simulation draw
sim_setx()
for computing marginal predictions and first differences at typical values in each simulation draw
sim_adrf()
for computing average dose-response functions in each simulation draw
data("lalonde", package = "MatchIt") fit <- lm(re78 ~ treat * (age + race + nodegree + re74), data = lalonde) # Simulate coefficients s <- sim(fit) s ## Could also use a robust covariance matrix, e.g., s <- sim(fit, vcov = "HC3") # Simulated coefficients assuming a normal distribution # for coefficients; default for `lm` objects is a t- # distribution s <- sim(fit, dist = "normal") s
data("lalonde", package = "MatchIt") fit <- lm(re78 ~ treat * (age + race + nodegree + re74), data = lalonde) # Simulate coefficients s <- sim(fit) s ## Could also use a robust covariance matrix, e.g., s <- sim(fit, vcov = "HC3") # Simulated coefficients assuming a normal distribution # for coefficients; default for `lm` objects is a t- # distribution s <- sim(fit, dist = "normal") s
sim_adrf()
is a wrapper for sim_apply()
that computes average dose-response functions (ADRFs) and average marginal effect functions (AMEFs). An ADRF describes the relationship between values a focal variable can take and the expected value of the outcome were all units to be given each value of the variable. An AMEF describes the relationship between values a focal variable can take and the derivative of ADRF at each value.
sim_adrf( sim, var, subset = NULL, by = NULL, contrast = "adrf", at = NULL, n = 21, outcome = NULL, type = NULL, eps = 1e-05, verbose = TRUE, cl = NULL ) ## S3 method for class 'clarify_adrf' print(x, digits = NULL, max.ests = 6, ...)
sim_adrf( sim, var, subset = NULL, by = NULL, contrast = "adrf", at = NULL, n = 21, outcome = NULL, type = NULL, eps = 1e-05, verbose = TRUE, cl = NULL ) ## S3 method for class 'clarify_adrf' print(x, digits = NULL, max.ests = 6, ...)
sim |
a |
var |
the name of a variable for which the ADRF or AMEF is to be computed. This variable must be present in the model supplied to |
subset |
optional; a vector used to subset the data used to compute the ADRF or AMEF. This will be evaluated within the original dataset used to fit the model using |
by |
a one-sided formula or character vector containing the names of variables for which to stratify the estimates. Each quantity will be computed within each level of the complete cross of the variables specified in |
contrast |
a string naming the type of quantity to be produced: |
at |
the levels of the variable named in |
n |
when |
outcome |
a string containing the name of the outcome or outcome level for multivariate (multiple outcomes) or multi-category outcomes. Ignored for univariate (single outcome) and binary outcomes. |
type |
a string containing the type of predicted values (e.g., the link or the response). Passed to |
eps |
when |
verbose |
|
cl |
a cluster object created by |
x |
a |
digits |
the minimum number of significant digits to be used; passed to |
max.ests |
the maximum number of estimates to display. |
... |
optional arguments passed to |
The ADRF is composed of average marginal means across levels of the focal predictor. For each level of the focal predictor, predicted values of the outcome are computed after setting the value of the predictor to that level, and those values of the outcome are averaged across all units in the sample to arrive at an average marginal mean. Thus, the ADRF represent the relationship between the "dose" (i.e., the level of the focal predictor) and the average "response" (i.e., the outcome variable). It is the continuous analog to the average marginal effect computed for a binary predictor, e.g., using sim_ame()
. Although inference can be at each level of the predictor or between two levels of the predictor, typically a plot of the ADRF is the most useful relevant quantity. These can be requested using plot.clarify_adrf()
.
The AMEF is the derivative of the ADRF; if we call the derivative of the ADRF at each point a "treatment effect" (i.e., the rate at which the outcome changes corresponding to a small change in the predictor, or "treatment"), the AMEF is a function that relates the size of the treatment effect to the level of the treatment. The shape of the AMEF is usually of less importance than the value of the AMEF at each level of the predictor, which corresponds to the size of the treatment effect at the corresponding level. The AMEF is computed by computing the ADRF at each level of the focal predictor specified in at
, shifting the predictor value by a tiny amount (control by eps
), and computing the ratio of the change in the outcome to the shift, then averaging this value across all units. This quantity is related the the average marginal effect of a continuous predictor as computed by sim_ame()
, but rather than average these treatment effects across all observed levels of the treatment, the AMEF is a function evaluated at each possible level of the treatment. The "tiny amount" used is eps
times the standard deviation of var
.
A clarify_adrf
object, which inherits from clarify_est
and is similar to
the output of sim_apply()
, with the additional attributes "var"
containing
the variable named in var
, "by"
containing the names of the variables specified in by
(if any), "at"
containing values at which the ADRF or AMEF is evaluated, and "contrast"
containing the argument supplied to contrast
. For an ADRF, the average marginal means will be named
E[Y({v})]
, where {v}
is replaced with the values in at
. For an AMEF, the average marginal effects will be
named dY/d({x})|{a}
where {x}
is replaced with var
and {a}
is replaced by the values in at
.
plot.clarify_adrf()
for plotting the ADRF or AMEF; sim_ame()
for computing average marginal effects; sim_apply()
, which provides a general interface to computing any
quantities for simulation-based inference; summary.clarify_est()
for computing
p-values and confidence intervals for the estimated quantities.
marginaleffects::avg_slopes()
and marginaleffects::avg_predictions()
for delta method-based implementations of computing average marginal effects and average marginal means.
data("lalonde", package = "MatchIt") # Fit the model fit <- glm(I(re78 > 0) ~ treat + age + race + married + re74, data = lalonde, family = binomial) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # ADRF for `age` est <- sim_adrf(s, var = "age", at = seq(15, 55, length.out = 6), verbose = FALSE) est plot(est) # AMEF for `age` est <- sim_adrf(s, var = "age", contrast = "amef", at = seq(15, 55, length.out = 6), verbose = FALSE) est summary(est) plot(est) # ADRF for `age` within levels of `married` est <- sim_adrf(s, var = "age", at = seq(15, 55, length.out = 6), by = ~married, verbose = FALSE) est plot(est) ## Difference between ADRFs est_diff <- est[7:12] - est[1:6] plot(est_diff) + ggplot2::labs(y = "Diff")
data("lalonde", package = "MatchIt") # Fit the model fit <- glm(I(re78 > 0) ~ treat + age + race + married + re74, data = lalonde, family = binomial) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # ADRF for `age` est <- sim_adrf(s, var = "age", at = seq(15, 55, length.out = 6), verbose = FALSE) est plot(est) # AMEF for `age` est <- sim_adrf(s, var = "age", contrast = "amef", at = seq(15, 55, length.out = 6), verbose = FALSE) est summary(est) plot(est) # ADRF for `age` within levels of `married` est <- sim_adrf(s, var = "age", at = seq(15, 55, length.out = 6), by = ~married, verbose = FALSE) est plot(est) ## Difference between ADRFs est_diff <- est[7:12] - est[1:6] plot(est_diff) + ggplot2::labs(y = "Diff")
sim_ame()
is a wrapper for sim_apply()
that computes average
marginal effects, the average effect of changing a single variable from one
value to another (i.e., from one category to another for categorical
variables or a tiny change for continuous variables).
sim_ame( sim, var, subset = NULL, by = NULL, contrast = NULL, outcome = NULL, type = NULL, eps = 1e-05, verbose = TRUE, cl = NULL ) ## S3 method for class 'clarify_ame' print(x, digits = NULL, max.ests = 6, ...)
sim_ame( sim, var, subset = NULL, by = NULL, contrast = NULL, outcome = NULL, type = NULL, eps = 1e-05, verbose = TRUE, cl = NULL ) ## S3 method for class 'clarify_ame' print(x, digits = NULL, max.ests = 6, ...)
sim |
a |
var |
either the names of the variables for which marginal effects are to be computed or a named list containing the values the variables should take. See Details. |
subset |
optional; a vector used to subset the data used to compute the
marginal effects. This will be evaluated within the original dataset used
to fit the model using |
by |
a one-sided formula or character vector containing the names of
variables for which to stratify the estimates. Each quantity will be
computed within each level of the complete cross of the variables specified
in |
contrast |
a string containing the name of a contrast between the
average marginal means when the variable named in |
outcome |
a string containing the name of the outcome or outcome level for multivariate (multiple outcomes) or multi-category outcomes. Ignored for univariate (single outcome) and binary outcomes. |
type |
a string containing the type of predicted values (e.g., the link
or the response). Passed to |
eps |
when the variable named in |
verbose |
|
cl |
a cluster object created by |
x |
a |
digits |
the minimum number of significant digits to be used; passed to |
max.ests |
the maximum number of estimates to display. |
... |
optional arguments passed to |
sim_ame()
computes average adjusted predictions or average marginal effects depending on which variables are named in var
and how they are specified. Canonically, var
should be specified as a named list with the value(s) each variable should be set to. For example, specifying var = list(x1 = 0:1)
computes average adjusted predictions setting x1
to 0 and 1. Specifying a variable's values as NULL
, e.g., list(x1 = NULL)
, is equivalent to requesting average adjusted predictions at each unique value of the variable when that variable is binary or a factor or character and requests the average marginal effect of that variable otherwise. Specifying an unnamed entry in the list with a string containing the value of that variable, e.g., list("x1")
is equivalent to specifying list(x1 = NULL)
. Similarly, supplying a vector with the names of the variables is equivalent to specifying a list, e.g., var = "x1"
is equivalent to var = list(x1 = NULL)
.
Multiple variables can be supplied to var
at the same time to set the corresponding variables to those values. If all values are specified directly or the variables are categorical, e.g., list(x1 = 0:1, x2 = c(5, 10))
, this computes average adjusted predictions at each combination of the supplied variables. If any one variable's values is specified as NULL
and the variable is continuous, the average marginal effect of that variable will be computed with the other variables set to their corresponding combinations. For example, if x2
is a continuous variable, specifying var = list(x1 = 0:1, x2 = NULL)
requests the average marginal effect of x2
computed first setting x1
to 0 and then setting x1
to 1. The average marginal effect can only be computed for one variable at a time.
Below are some examples of specifications and what they request, assuming x1
is a binary variable taking on values of 0 and 1 and x2
is a continuous variable:
list(x1 = 0:1)
, list(x1 = NULL)
, list("x1")
, "x1"
– the average adjusted predictions setting x1
to 0 and to 1
list(x2 = NULL)
, list("x2")
, "x2"
– the average marginal effect of x2
list(x2 = c(5, 10))
– the average adjusted predictions setting x2
to 5 and to 10
list(x1 = 0:1, x2 = c(5, 10))
, list("x1", x2 = c(5, 10))
– the average adjusted predictions setting x1
and x2
in a full cross of 0, 1 and 5, 10, respectively (e.g., (0, 5), (0, 10), (1, 5), and (1, 10))
list(x1 = 0:1, "x2")
, list("x1", "x2")
, c("x1", "x2")
– the average marginal effects of x2
setting x1
to 0 and to 1
The average adjusted prediction is the average predicted outcome
value after setting all units' value of a variable to a specified level. (This quantity
has several names, including the average potential outcome, average marginal mean, and standardized mean). When exactly two average adjusted predictions are requested, a contrast
between them can be requested by supplying an argument
to contrast
(see Effect Measures section below). Contrasts can be manually computed using transform()
afterward as well; this is required when multiple average adjusted predictions are requested (i.e., because a single variable was supplied to var
with more than two levels or a combination of multiple variables was supplied).
A marginal effect is the instantaneous rate of change
corresponding to changing a unit's observed value of a variable by a tiny amount
and considering to what degree the predicted outcome changes. The ratio of
the change in the predicted outcome to the change in the value of the variable is
the marginal effect; these are averaged across the sample to arrive at an
average marginal effect. The "tiny amount" used is eps
times the standard
deviation of the focal variable.
The difference between using by
or subset
vs. var
is that by
and subset
subset the data when computing the requested quantity, whereas var
sets the corresponding variable to given a value for all units. For example, using by = ~v
computes the quantity of interest separately for each subset of the data defined by v
, whereas setting var = list(., "v")
computes the quantity of interest for all units setting their value of v
to its unique values. The resulting quantities have different interpretations. Both by
and var
can be used simultaneously.
The effect measures specified in contrast
are defined below. Typically only
"diff"
is appropriate for continuous outcomes and "diff"
or "irr"
are
appropriate for count outcomes; the rest are appropriate for binary outcomes.
For a focal variable with two levels, 0
and 1
, and an outcome Y
, the
average marginal means will be denoted in the below formulas as E[Y(0)]
and
E[Y(1)]
, respectively.
contrast |
Description | Formula |
"diff" /"rd" |
Mean/risk difference | E[Y(1)] - E[Y(0)] |
"rr" /"irr" |
Risk ratio/incidence rate ratio | E[Y(1)] / E[Y(0)] |
"sr" |
Survival ratio | (1 - E[Y(1)]) / (1 - E[Y(0)]) |
"srr" /"grrr" |
Switch risk ratio | 1 - sr if E[Y(1)] > E[Y(0)] |
rr - 1 if E[Y(1)] < E[Y(0)] |
||
0 otherwise |
||
"or" |
Odds ratio | O[Y(1)] / O[Y(0)] |
where O[Y(.)] = E[Y(.)] / (1 - E[Y(.)]) |
||
"nnt" |
Number needed to treat | 1 / rd |
The log(.)
versions are defined by taking the log()
(natural log) of the
corresponding effect measure.
A clarify_ame
object, which inherits from clarify_est
and is
similar to the output of sim_apply()
, with the additional attributes
"var"
containing the variable values specified in var
and "by"
containing the
names of the variables specified in by
(if any). The average adjusted
predictions will be named E[Y({v})]
, where {v}
is replaced with the
values the variables named in var
take on. The average marginal effect for a
continuous var
will be named E[dY/d({x})]
where {x}
is replaced with
var
. When by
is specified, the average adjusted predictions will be named
E[Y({v})|{b}]
and the average marginal effect E[dY/d({x})|{b}]
where
{b}
is a comma-separated list of of values of the by
variables at which
the quantity is computed. See examples.
sim_apply()
, which provides a general interface to computing any
quantities for simulation-based inference; plot.clarify_est()
for plotting the
output of a call to sim_ame()
; summary.clarify_est()
for computing
p-values and confidence intervals for the estimated quantities.
marginaleffects::avg_predictions()
, marginaleffects::avg_comparisons()
and marginaleffects::avg_slopes()
for delta method-based implementations of computing average marginal effects.
data("lalonde", package = "MatchIt") # Fit the model fit <- glm(I(re78 > 0) ~ treat + age + race + married + re74, data = lalonde, family = binomial) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # Average marginal effect of `age` est <- sim_ame(s, var = "age", verbose = FALSE) summary(est) # Contrast between average adjusted predictions # for `treat` est <- sim_ame(s, var = "treat", contrast = "rr", verbose = FALSE) summary(est) # Average adjusted predictions for `race`; need to follow up # with contrasts for specific levels est <- sim_ame(s, var = "race", verbose = FALSE) est <- transform(est, `RR(h,b)` = `E[Y(hispan)]` / `E[Y(black)]`) summary(est) # Average adjusted predictions for `treat` within levels of # `married`, first using `subset` and then using `by` est0 <- sim_ame(s, var = "treat", subset = married == 0, contrast = "rd", verbose = FALSE) names(est0) <- paste0(names(est0), "|married=0") est1 <- sim_ame(s, var = "treat", subset = married == 1, contrast = "rd", verbose = FALSE) names(est1) <- paste0(names(est1), "|married=1") summary(cbind(est0, est1)) est <- sim_ame(s, var = "treat", by = ~married, contrast = "rd", verbose = FALSE) est summary(est) # Average marginal effect of `age` within levels of # married*race est <- sim_ame(s, var = "age", by = ~married + race, verbose = FALSE) est summary(est, null = 0) # Comparing AMEs between married and unmarried for # each level of `race` est_diff <- est[4:6] - est[1:3] names(est_diff) <- paste0("AME_diff|", levels(lalonde$race)) summary(est_diff) # Average adjusted predictions at a combination of `treat` # and `married` est <- sim_ame(s, var = c("treat", "married"), verbose = FALSE) est # Average marginal effect of `age` setting `married` to 1 est <- sim_ame(s, var = list("age", married = 1), verbose = FALSE)
data("lalonde", package = "MatchIt") # Fit the model fit <- glm(I(re78 > 0) ~ treat + age + race + married + re74, data = lalonde, family = binomial) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # Average marginal effect of `age` est <- sim_ame(s, var = "age", verbose = FALSE) summary(est) # Contrast between average adjusted predictions # for `treat` est <- sim_ame(s, var = "treat", contrast = "rr", verbose = FALSE) summary(est) # Average adjusted predictions for `race`; need to follow up # with contrasts for specific levels est <- sim_ame(s, var = "race", verbose = FALSE) est <- transform(est, `RR(h,b)` = `E[Y(hispan)]` / `E[Y(black)]`) summary(est) # Average adjusted predictions for `treat` within levels of # `married`, first using `subset` and then using `by` est0 <- sim_ame(s, var = "treat", subset = married == 0, contrast = "rd", verbose = FALSE) names(est0) <- paste0(names(est0), "|married=0") est1 <- sim_ame(s, var = "treat", subset = married == 1, contrast = "rd", verbose = FALSE) names(est1) <- paste0(names(est1), "|married=1") summary(cbind(est0, est1)) est <- sim_ame(s, var = "treat", by = ~married, contrast = "rd", verbose = FALSE) est summary(est) # Average marginal effect of `age` within levels of # married*race est <- sim_ame(s, var = "age", by = ~married + race, verbose = FALSE) est summary(est, null = 0) # Comparing AMEs between married and unmarried for # each level of `race` est_diff <- est[4:6] - est[1:3] names(est_diff) <- paste0("AME_diff|", levels(lalonde$race)) summary(est_diff) # Average adjusted predictions at a combination of `treat` # and `married` est <- sim_ame(s, var = c("treat", "married"), verbose = FALSE) est # Average marginal effect of `age` setting `married` to 1 est <- sim_ame(s, var = list("age", married = 1), verbose = FALSE)
sim_apply()
applies a function that produces quantities of
interest to each set of simulated coefficients produced by sim()
; these
calculated quantities form the posterior sampling distribution for the
quantities of interest. Capabilities are available for parallelization.
sim_apply(sim, FUN, verbose = TRUE, cl = NULL, ...)
sim_apply(sim, FUN, verbose = TRUE, cl = NULL, ...)
sim |
a |
FUN |
a function to be applied to each set of simulated coefficients. See Details. |
verbose |
|
cl |
a cluster object created by |
... |
optional arguments passed to |
sim_apply()
applies a function, FUN
, to each set of simulated
coefficients, similar to apply()
. This function should return a numeric
vector containing one or more estimated quantities. This should be a named
vector to more easily keep track of the meaning of each estimated quantity.
Care should be taken to ensure that the returned vector is the same length
each time FUN
is called. NA
s are allowed in the output but should be
avoided if possible.
The arguments to FUN
can be specified in a few ways. If FUN
has an
argument called coefs
, a simulated set of coefficients will be passed to
this argument, and FUN
should compute and return a quantity based on the
coefficients (e.g., the difference between two coefficients if one wants to
test whether two coefficients are equal). If FUN
has an argument called
fit
, a model fit object of the same type as the one originally supplied
to sim()
(e.g., an lm
or glm
object) will be passed to this argument,
where the coefficients of the fit object have been replaced by the
simulated coefficients generated by sim()
, and FUN
should compute and
return a quantity based on the model fit (e.g., a computation based on the
output of predict()
). If neither coefs
nor fit
are the names of
arguments to FUN
, the model fit object with replaced coefficients will be
supplied to the first argument of FUN
.
When custom coefficients are supplied to sim()
, i.e., when the coefs
argument to sim()
is not left at its default value, FUN
must accept a
coefs
argument and a warning will be thrown if it accepts a fit
argument. This is because sim_apply()
does not know how to reconstruct
the original fit object with the new coefficients inserted. The quantities
computed by sim_apply()
must therefore be computed directly from the
coefficients.
If FUN
is not supplied at all, the simulated values of the coefficients will be returned in the output with a warning. Set FUN
to NULL
or verbose
to FALSE
to suppress this warning.
sim_apply()
with multiply imputed dataWhen using misim()
and sim_apply()
with multiply imputed data, the
coefficients are supplied to the model fit corresponding to the imputation
identifier associated with each set of coefficients, which means if FUN
uses a dataset extracted from a model (e.g., using insight::get_data()
), it will do so from the model fit in
the corresponding imputation.
The original estimates (see Value below) are computed as the mean of the
estimates across the imputations using the original coefficients averaged
across imputations. That is, first, the coefficients estimated in the
models in the imputed datasets are combined to form a single set of pooled
coefficients; then, for each imputation, the quantities of interest are
computed using the pooled coefficients; finally, the mean of the resulting
estimates across the imputations are taken as the "original" estimates.
Note this procedure is only valid for quantities with symmetric sampling
distributions, which excludes quantities like risk ratios and odds ratios,
but includes log risk ratios and log odds ratios. The desired quantities
can be transformed from their log versions using
transform()
.
A clarify_est
object, which is a matrix with a column for each
estimated quantity and a row for each simulation. The original estimates
(FUN
applied to the original coefficients or model fit object) are stored
in the attribute "original"
. The "sim_hash"
attribute contains the
simulation hash produced by sim()
.
sim()
for generating the simulated coefficients
summary.clarify_est()
for computing p-values and confidence intervals for
the estimated quantities
plot.clarify_est()
for plotting estimated
quantities and their simulated posterior sampling distribution.
data("lalonde", package = "MatchIt") fit <- lm(re78 ~ treat + age + race + nodegree + re74, data = lalonde) coef(fit) set.seed(123) s <- sim(fit, n = 500) # Function to compare predicted values for two units # using `fit` argument sim_fun <- function(fit) { pred1 <- unname(predict(fit, newdata = lalonde[1,])) pred2 <- unname(predict(fit, newdata = lalonde[2,])) c(pred1 = pred1, pred2 = pred2) } est <- sim_apply(s, sim_fun, verbose = FALSE) # Add difference between predicted values as # additional quantity est <- transform(est, `diff 1-2` = pred1 - pred2) # Examine estimates and confidence intervals summary(est) # Function to compare coefficients using `coefs` # argument sim_fun <- function(coefs) { setNames(coefs["racewhite"] - coefs["racehispan"], "wh - his") } est <- sim_apply(s, sim_fun, verbose = FALSE) # Examine estimates and confidence intervals summary(est) # Another way to do the above: est <- sim_apply(s, FUN = NULL) est <- transform(est, `wh - his` = `racewhite` - `racehispan`) summary(est, parm = "wh - his")
data("lalonde", package = "MatchIt") fit <- lm(re78 ~ treat + age + race + nodegree + re74, data = lalonde) coef(fit) set.seed(123) s <- sim(fit, n = 500) # Function to compare predicted values for two units # using `fit` argument sim_fun <- function(fit) { pred1 <- unname(predict(fit, newdata = lalonde[1,])) pred2 <- unname(predict(fit, newdata = lalonde[2,])) c(pred1 = pred1, pred2 = pred2) } est <- sim_apply(s, sim_fun, verbose = FALSE) # Add difference between predicted values as # additional quantity est <- transform(est, `diff 1-2` = pred1 - pred2) # Examine estimates and confidence intervals summary(est) # Function to compare coefficients using `coefs` # argument sim_fun <- function(coefs) { setNames(coefs["racewhite"] - coefs["racehispan"], "wh - his") } est <- sim_apply(s, sim_fun, verbose = FALSE) # Examine estimates and confidence intervals summary(est) # Another way to do the above: est <- sim_apply(s, FUN = NULL) est <- transform(est, `wh - his` = `racewhite` - `racehispan`) summary(est, parm = "wh - his")
sim_setx()
is a wrapper for sim_apply()
that computes predicted values of
the outcome at specified values of the predictors, sometimes called marginal
predictions. One can also compute the difference between two marginal
predictions (the "first difference"). Although any function that accepted
clarify_est
objects can be used with sim_setx()
output objects, a
special plotting function, plot.clarify_setx()
, can be used to plot marginal
predictions.
sim_setx( sim, x = list(), x1 = list(), outcome = NULL, type = NULL, verbose = TRUE, cl = NULL ) ## S3 method for class 'clarify_setx' print(x, digits = NULL, max.ests = 6, ...)
sim_setx( sim, x = list(), x1 = list(), outcome = NULL, type = NULL, verbose = TRUE, cl = NULL ) ## S3 method for class 'clarify_setx' print(x, digits = NULL, max.ests = 6, ...)
sim |
a |
x |
a data.frame containing a reference grid of predictor values or a named list of values each predictor should take defining such a
reference grid, e.g., For |
x1 |
a data.frame or named list of the value each predictor should take to compute the
first difference from the predictor combination specified in |
outcome |
a string containing the name of the outcome or outcome level for multivariate (multiple outcomes) or multi-category outcomes. Ignored for univariate (single outcome) and binary outcomes. |
type |
a string containing the type of predicted values (e.g., the link or the response). Passed to |
verbose |
|
cl |
a cluster object created by |
digits |
the minimum number of significant digits to be used; passed to |
max.ests |
the maximum number of estimates to display. |
... |
optional arguments passed to |
When x
is a named list of predictor values, they will be crossed
to form a reference grid for the marginal predictions. Any predictors not
set in x
are assigned their "typical" value, which, for factor,
character, logical, and binary variables is the mode, for numeric variables
is the mean, and for ordered variables is the median. These values can be
seen in the "setx"
attribute of the output object. If x
is empty, a
prediction will be made at a point corresponding to the typical value of
every predictor. Estimates are identified (in summary()
, etc.) only by
the variables that differ across predictions.
When x1
is supplied, the first difference is computed, which here is
considered as the difference between two marginal predictions. One marginal
prediction must be specified in x
and another, ideally with a single
predictor changed, specified in x1
.
a clarify_setx
object, which inherits from clarify_est
and is similar to the output of sim_apply()
, with the following additional attributes:
"setx"
- a data frame containing the values at which predictions are to be made
"fd"
- whether or not the first difference is to be computed; set to TRUE
if x1
is specified and FALSE
otherwise
sim_apply()
, which provides a general interface to computing any
quantities for simulation-based inference; plot.clarify_setx()
for plotting the
output of a call to sim_setx()
; summary.clarify_est()
for computing
p-values and confidence intervals for the estimated quantities.
data("lalonde", package = "MatchIt") fit <- lm(re78 ~ treat + age + educ + married + race + re74, data = lalonde) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # Predicted values at specified values of values, typical # values for other predictors est <- sim_setx(s, x = list(treat = 0:1, re74 = c(0, 10000)), verbose = FALSE) summary(est) plot(est) # Predicted values at specified grid of values, typical # values for other predictors est <- sim_setx(s, x = list(age = c(20, 25, 30, 35), married = 0:1), verbose = FALSE) summary(est) plot(est) # First differences of treat at specified value of # race, typical values for other predictors est <- sim_setx(s, x = data.frame(treat = 0, race = "hispan"), x1 = data.frame(treat = 1, race = "hispan"), verbose = FALSE) summary(est) plot(est)
data("lalonde", package = "MatchIt") fit <- lm(re78 ~ treat + age + educ + married + race + re74, data = lalonde) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # Predicted values at specified values of values, typical # values for other predictors est <- sim_setx(s, x = list(treat = 0:1, re74 = c(0, 10000)), verbose = FALSE) summary(est) plot(est) # Predicted values at specified grid of values, typical # values for other predictors est <- sim_setx(s, x = list(age = c(20, 25, 30, 35), married = 0:1), verbose = FALSE) summary(est) plot(est) # First differences of treat at specified value of # race, typical values for other predictors est <- sim_setx(s, x = data.frame(treat = 0, race = "hispan"), x1 = data.frame(treat = 1, race = "hispan"), verbose = FALSE) summary(est) plot(est)
clarify_est
objectstransform()
modifies a clarify_est
object by allowing for the calculation of new quantities from the existing quantities without re-simulating them. cbind()
binds two clarify_est
objects together.
## S3 method for class 'clarify_est' transform(`_data`, ...) ## S3 method for class 'clarify_est' cbind(..., deparse.level = 1)
## S3 method for class 'clarify_est' transform(`_data`, ...) ## S3 method for class 'clarify_est' cbind(..., deparse.level = 1)
_data |
the |
... |
for |
deparse.level |
ignored. |
For transform()
, the expression on the right side of the =
should use the names of the existing quantities (e.g., `E[Y(1)]` - `E[Y(1)]`
), with `
appropriately included when the quantity name include parentheses or brackets. Alternatively, it can use indexes prefixed by .b
, e.g., .b2 - .b1
, to refer to the corresponding quantity by position. This can aid in computing derived quantities of quantities with complicated names. (Note that if a quantity is named something like .b1
, it will need to be referred to by position rather than name, as the position-based label takes precedence). See examples. Setting an existing value to NULL
will remove that quantity from the object.
cbind()
does not rename the quanities or check for uniqueness of the names, so it is important to rename them yourself prior to combining the objects.
A clarify_est
object, either with new columns added (when using transform()
) or combining two clarify_est
objects. Note that any type attributes corresponding to the sim_apply()
wrapper used (e.g., sim_ame()
) is lost when using either function. This can affect any helper functions (e.g., plot()
) designed to work with the output of specific wrappers.
data("lalonde", package = "MatchIt") # Fit the model fit <- lm(re78 ~ treat * (age + educ + race + married + re74 + re75), data = lalonde) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # Average adjusted predictions for `treat` within # subsets of `race` est_b <- sim_ame(s, var = "treat", verbose = FALSE, subset = race == "black") est_b est_h <- sim_ame(s, var = "treat", verbose = FALSE, subset = race == "hispan") est_h # Compute differences between adjusted predictions est_b <- transform(est_b, diff = `E[Y(1)]` - `E[Y(0)]`) est_b est_h <- transform(est_h, diff = `E[Y(1)]` - `E[Y(0)]`) est_h # Bind estimates together after renaming names(est_b) <- paste0(names(est_b), "_b") names(est_h) <- paste0(names(est_h), "_h") est <- cbind(est_b, est_h) est # Compute difference in race-specific differences est <- transform(est, `diff-diff` = .b6 - .b3) summary(est, parm = c("diff_b", "diff_h", "diff-diff")) # Remove last quantity by using `NULL` transform(est, `diff-diff` = NULL)
data("lalonde", package = "MatchIt") # Fit the model fit <- lm(re78 ~ treat * (age + educ + race + married + re74 + re75), data = lalonde) # Simulate coefficients set.seed(123) s <- sim(fit, n = 100) # Average adjusted predictions for `treat` within # subsets of `race` est_b <- sim_ame(s, var = "treat", verbose = FALSE, subset = race == "black") est_b est_h <- sim_ame(s, var = "treat", verbose = FALSE, subset = race == "hispan") est_h # Compute differences between adjusted predictions est_b <- transform(est_b, diff = `E[Y(1)]` - `E[Y(0)]`) est_b est_h <- transform(est_h, diff = `E[Y(1)]` - `E[Y(0)]`) est_h # Bind estimates together after renaming names(est_b) <- paste0(names(est_b), "_b") names(est_h) <- paste0(names(est_h), "_h") est <- cbind(est_b, est_h) est # Compute difference in race-specific differences est <- transform(est, `diff-diff` = .b6 - .b3) summary(est, parm = c("diff_b", "diff_h", "diff-diff")) # Remove last quantity by using `NULL` transform(est, `diff-diff` = NULL)