Model averaging via mixture modeling, pseudo-BMA, pseudo-BMA+ and stacking
To build intuition about model averaging methods, this tutorial compares mixture modeling to the different model averaging approaches available in BayesBlend: pseudo-BMA, pseudo-BMA+, maximum-likelihood stacking and Bayesian stacking. BayesBlend is the only software we're aware of that offers stacking using full Bayesian inference out-of-the-box.
Most model averaging methods are closely related to mixture modeling. Traditional Bayesian model averaging (BMA), for instance, averages models, \(\mathcal{M} = \{M_{1}, ..., M_{K}\}\), based on their posterior model probabilities, \(p(M_{k} \mid y) = \frac{p(y \mid M_{k}) p(M_{k})}{\sum_{j = 1}^{K} p(y \mid M_{j}) p(M_{j})}\), where the posterior model probabilities are commonly calculated as a function of the marginal likelihood, \(p(y \mid M_{k}) = \int p(y \mid \theta_{k}, M_{k}) p(\theta_{k} \mid M_{k}) d\theta_{k}\), and user-defined prior model probabilities, \(p(M_{k})\) (Hoeting et al., 1999; Hinne et al., 2020). However, the posterior model weights can also be estimated directly by making \(M_k\) a parameter of a mixture model (Kruschke, 2014; Keller & Kamary, 2017). Notice that the mixture modeling/BMA approach allocates posterior weight to models only within the space of candidate models. This makes most sense when the true model can be conceived of as one of the candidate models, a setting referred to as \(\mathcal{M}\)-closed.
Model averaging approaches implemented in BayesBlend and elsewhere, however, differ from the mixture modeling approach by splitting the problem into two separate steps: one to estimate the quantity of interest from each candidate model, and another to estimate the model weights and blend predictions. In the first step, it is common to use cross-validation to obtain estimates of candidate model predictive performance that generalize to data not seen in the training set, effectively opening up the set of candidate models being compared. This is particularly powerful in \(\mathcal{M}\)-open problems, where the true model is not one of the candidate models because the true model is unknown, intractable, or too complex to reify (Yao et al., 2018). Two-step model averaging methods have the additional benefits of being computationally faster and less prone to the identifiability and convergence problems that are common in mixture modeling.
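For example, pseudo-BMA converts each model's estimated expected log pointwise predictive density, \(\widehat{\mathrm{elpd}}_{k}\) (obtained here via PSIS-LOO), into a weight with a softmax (Yao et al., 2018):

\[ w_{k} = \frac{\exp(\widehat{\mathrm{elpd}}_{k})}{\sum_{j=1}^{K} \exp(\widehat{\mathrm{elpd}}_{j})}. \]

Pseudo-BMA+ additionally stabilizes the \(\widehat{\mathrm{elpd}}_{k}\) estimates with the Bayesian bootstrap, while stacking instead chooses the weights that maximize the estimated log score of the blended predictive distribution, subject to \(w_{k} \geq 0\) and \(\sum_{k} w_{k} = 1\).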
Simulation example
In this example, we simulate univariate data \(y\) of size \(N = 50\) from a normal distribution with mean \(\alpha\) and, independently, variables \(\mathbf{X} = (\mathbf{x}_{1}, \mathbf{x}_{2})'\) from a standard normal distribution:
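\[ y_{i} \sim \text{Normal}(\alpha, 1), \qquad x_{i1}, x_{i2} \sim \text{Normal}(0, 1), \qquad i = 1, \dots, N, \]

so that neither predictor has any true relationship with \(y\).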
We also simulate \(\tilde{N} = 50\) test data from the same model.
We then use three candidate models that each incorrectly presume that the variables \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are linearly related to the response variable, \(\mathbf{y}\).
Model 1 includes a single predictor \(\mathbf{x}_{1} = (x_{11}, x_{21}, ..., x_{i1}, ..., x_{N1})\):
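\[ M_{1}: \quad y_{i} \sim \text{Normal}(\alpha + \beta_{1} x_{i1}, 1). \]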
Model 2 includes a single predictor \(\mathbf{x}_{2}\):
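\[ M_{2}: \quad y_{i} \sim \text{Normal}(\alpha + \beta_{2} x_{i2}, 1). \]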
Model 3 includes both predictors:
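\[ M_{3}: \quad y_{i} \sim \text{Normal}(\alpha + \beta_{1} x_{i1} + \beta_{2} x_{i2}, 1). \]

In each case the model has its own intercept \(\alpha\) and coefficients \(\beta\), all given standard normal priors, and the residual standard deviation is fixed at 1, matching the Stan code below.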
The mixture model fits all three models simultaneously, where we include a discrete parameter \(z\), which indicates whether to sample from model \(M_{1}\), \(M_{2}\) or \(M_{3}\) above. This parameter is given a categorical distribution with Dirichlet probabilities \(\mathbf{w}\):
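\[ z \sim \text{Categorical}(\mathbf{w}), \qquad \mathbf{w} \sim \text{Dirichlet}(1, \dots, 1), \]

where the \(\text{Dirichlet}(1, \dots, 1)\) prior corresponds to the implicit uniform prior on the simplex \(\mathbf{w}\) in the Stan code below.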
In practice, the categorical parameter, \(z\), is marginalized out of the likelihood expression:
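\[ p(y \mid \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum_{k=1}^{K} w_{k} \prod_{i=1}^{N} \text{Normal}(y_{i} \mid \mu_{ki}, 1), \]

where \(\mu_{ki}\) denotes the mean of observation \(i\) under model \(M_{k}\),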
and we recover the posterior \(p(z \mid y, \boldsymbol{\mu}, \mathbf{w})\) after model fitting.
We simulate \(S = 100\) data sets of training and test data and then, for each simulation \(s\), we:
1. Fit the mixture model and evaluate the log predictive densities of y_tilde from each mixture component, which we denote as \(\log{p(\tilde{y} \mid M_{k})}\).
2. Fit the independent regression models and evaluate both \(\log{p(y \mid M_{k})}\) and \(\log{p(\tilde{y} \mid M_{k})}\).
3. Estimate the Pareto-smoothed importance sampling (PSIS) leave-one-out (LOO) cross-validation predictive densities of \(\log{p(y \mid M_{k})}\).
4. Fit the pseudo-BMA, pseudo-BMA+, stacking-mle (using maximum-likelihood optimization) and stacking-bayes (using full Bayesian inference) blending models to the PSIS-LOO log densities from step 3 to estimate the optimal weights for blending.
5. Blend \(\log{p(\tilde{y} \mid M_{k})}\) according to the weights from the mixture model and those estimated in step 4. We always take the expectation of the weights in cases where we have a posterior distribution over them, \(p(M_k \mid y)\).
6. Calculate the ELPD from the blended out-of-sample log densities to compare the models.
Below, we show the main parts of our workflow and the results. To reproduce the full workflow, the code can be found here.
Simulating the data
We simulate data using the function below, which returns the tuple of training data (y, X) and test data (y_tilde, X_tilde).
from typing import Optional, Tuple
import arviz as az
import numpy as np
import cmdstanpy as csp
import bayesblend as bb
Data = Tuple[np.ndarray, np.ndarray]
SEED = 1234
K = 3
P = 2
N = 50
N_tilde = 50
rng = np.random.default_rng(SEED)
def simulate_data(seed: Optional[int] = None) -> Tuple[Data, Data]:
    # Use a simulation-specific generator if a seed is given,
    # otherwise fall back to the module-level rng
    local_rng = np.random.default_rng(seed) if seed is not None else rng
    alpha = local_rng.normal()
    sigma = 1
    # The response depends only on the intercept alpha;
    # the predictors are simulated independently of y
    X = local_rng.normal(size=(N, P))
    X_tilde = local_rng.normal(size=(N_tilde, P))
    y = local_rng.normal(alpha, sigma, size=N)
    y_tilde = local_rng.normal(alpha, sigma, size=N_tilde)
    return ((y, X), (y_tilde, X_tilde))
Mixture model
We can use Stan to fit the implied mixture model between candidate models.
In the generated quantities section, we evaluate the log predictive
densities on y_tilde
from each model separately, \(\log{p(\tilde{y} \mid M_{k})}\),
not accounting for the weights at this time. We also derive the posterior model probabilities using Bayes' rule in the generated quantities (see the Stan manual entry on Finite mixtures). The weights are applied using BayesBlend later.
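Concretely, the posterior model probabilities in the generated quantities follow from Bayes' rule as

\[ p(z = k \mid y, \boldsymbol{\mu}, \mathbf{w}) = \frac{w_{k} \prod_{i=1}^{N} \text{Normal}(y_{i} \mid \mu_{ki}, 1)}{\sum_{j=1}^{K} w_{j} \prod_{i=1}^{N} \text{Normal}(y_{i} \mid \mu_{ji}, 1)}, \]

which is what exp(lps[k] - log_sum_exp(lps)) computes below.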
data {
int<lower=0> N;
int<lower=0> N_tilde;
int<lower=2> K;
int<lower=1> P;
matrix[N, P] X;
matrix[N_tilde, P] X_tilde;
vector[N] y;
vector[N_tilde] y_tilde;
}
parameters {
vector[K] alpha;
vector[P + 2] beta;
simplex[K] w;
}
transformed parameters {
  // lps[k] accumulates log(w[k]) plus the log likelihood of
  // every observation under model k
  vector[K] lps = log(w);
for(i in 1:N) {
lps = [
lps[1] + normal_lpdf(y[i] | alpha[1] + beta[1] * X[i,1], 1),
lps[2] + normal_lpdf(y[i] | alpha[2] + beta[2] * X[i,2], 1),
lps[3] + normal_lpdf(y[i] | alpha[3] + X[i] * beta[3:], 1)
]';
}
}
model {
alpha ~ std_normal();
beta ~ std_normal();
target += log_sum_exp(lps);
}
generated quantities {
matrix[N_tilde, K] log_lik;
// posterior model probability
simplex[K] pmp;
for(k in 1:K) {
pmp[k] = exp(
lps[k] - log_sum_exp(lps)
);
}
for(j in 1:N_tilde) {
log_lik[j] = [
normal_lpdf(y_tilde[j] | alpha[1] + beta[1] * X_tilde[j,1], 1),
normal_lpdf(y_tilde[j] | alpha[2] + beta[2] * X_tilde[j,2], 1),
normal_lpdf(y_tilde[j] | alpha[3] + X_tilde[j] * beta[3:], 1)
];
}
}
This mixture can be fit with the function below:
def fit_mixture(train, test):
mixture = csp.CmdStanModel(stan_file="mixture.stan")
y, X = train
y_tilde, X_tilde = test
fit = mixture.sample(
data={
"N": len(y),
"N_tilde": len(y_tilde),
"P": P,
"K": K,
"X": X[:, :P],
"X_tilde": X_tilde[:, :P],
"y": y,
"y_tilde": y_tilde,
},
inits=0,
seed=SEED,
)
return fit
Regression models
The regression models are all fit in Stan using the following
model.
In the generated quantities
section,
we evaluate the log predictive densities on both the in-sample
and out-of-sample data. The in-sample log densities are later used to estimate the weights, whereas in the mixture model above the weights are estimated as part of the model.
data {
int<lower=0> N;
int<lower=0> N_tilde;
int<lower=1> P;
matrix[N, P] X;
matrix[N_tilde, P] X_tilde;
vector[N] y;
vector[N_tilde] y_tilde;
}
parameters {
real alpha;
vector[P] beta;
}
transformed parameters {
vector[N] mu = alpha + X * beta;
}
model {
alpha ~ std_normal();
beta ~ std_normal();
y ~ normal(mu, 1);
}
generated quantities {
vector[N] log_lik;
vector[N_tilde] log_lik_tilde;
for(i in 1:N)
log_lik[i] = normal_lpdf(y[i] | mu[i], 1);
for(j in 1:N_tilde) {
real mu_tilde = alpha + X_tilde[j] * beta;
log_lik_tilde[j] = normal_lpdf(y_tilde[j] | mu_tilde, 1);
}
}
Again, the regressions can be fit with the following function:
def fit_regressions(train, test):
    # Compile the regression model shown above
    # (the Stan file name "regression.stan" is assumed here)
    regression = csp.CmdStanModel(stan_file="regression.stan")
    y, X = train
y_tilde, X_tilde = test
predictors = [(X[:, [*p]], X_tilde[:, [*p]]) for p in ([0], [1], [0, 1])]
fits = [
regression.sample(
data={
"N": N,
"N_tilde": N_tilde,
"P": x.shape[1],
"X": x,
"X_tilde": x_tilde,
"y": y,
"y_tilde": y_tilde,
},
seed=SEED,
)
for (x, x_tilde) in predictors
]
return fits
Estimating the weights and blending
For the regression models, we estimate the PSIS-LOO log densities, as recommended by Yao et al. (2018), using arviz.
# One simulated train/test split (repeated S times in the full workflow)
train, test = simulate_data()
regressions = fit_regressions(train, test)
idata = [az.from_cmdstanpy(fit, log_likelihood="log_lik") for fit in regressions]
# Pointwise PSIS-LOO log densities for each regression model,
# keyed to match the prediction draws used for blending below
loo_fits = {
    f"fit{i}": az.loo(data, pointwise=True).loo_i.values
    for i, data in enumerate(idata)
}
Next, we apply our blending models using BayesBlend.
For the mixture model, we use the SimpleBlend
model to blend the results using the previously-estimated
weights.
# The out-of-sample log likelihood draws
# to blend for the regressions
pred_draws = {
f"fit{i}": bb.Draws(
log_lik=fit.log_lik_tilde,
)
for i, fit in enumerate(regressions)
}
# Mixture model blend, using the mixture fit for the same train/test split
mixture = fit_mixture(train, test)
mix = bb.SimpleBlend(
    {f"fit{i}": bb.Draws(log_lik=mixture.log_lik[..., i]) for i in range(K)},
    weights={f"fit{i}": pmp for i, pmp in enumerate(mixture.pmp.T)},
)
mix_blend = mix.predict()
# Pseudo-BMA
pbma = bb.PseudoBma.from_lpd(
loo_fits,
bootstrap=False,
)
pbma.fit()
pbma_blend = pbma.predict(pred_draws)
# Pseudo-BMA+
pbma_plus = bb.PseudoBma.from_lpd(loo_fits, seed=SEED)
pbma_plus.fit()
pbma_plus_blend = pbma_plus.predict(pred_draws)
# Stacking (mle)
stack = bb.MleStacking.from_lpd(loo_fits)
stack.fit()
stack_blend = stack.predict(pred_draws)
# Stacking (bayes)
stack_bayes = bb.BayesStacking.from_lpd(loo_fits, seed=SEED)
stack_bayes.fit()
stack_bayes_blend = stack_bayes.predict(pred_draws)
Evaluate the predictions
To evaluate the ELPD for each blend, we can simply call Draws.lpd.sum() on each blended Draws object.
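For example, using the blended Draws objects from the snippets above (the dictionary labels are arbitrary):

blends = {
    "mixture": mix_blend,
    "pseudo-bma": pbma_blend,
    "pseudo-bma+": pbma_plus_blend,
    "stacking-mle": stack_blend,
    "stacking-bayes": stack_bayes_blend,
}
# ELPD of each blend on the out-of-sample data
elpd = {name: blend.lpd.sum() for name, blend in blends.items()}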
Below is the comparison plot for the models. The top panel shows the mean ELPDs for each model (blue points) with their 95% percentile intervals (vertical lines) across the 100 simulations. The gray open circles are all the ELPDs from each model.
The second panel shows the mean and 95% percentile intervals across the 100 simulations of the estimated weights from each model.
As the ELPD values show, the BayesBlend model averaging methods all perform better than the mixture modeling approach for this example, indicating that the use of PSIS-LOO estimates to obtain the model weights offers a non-trivial performance improvement. The differences between pseudo-BMA, pseudo-BMA+, stacking-mle and stacking-bayes are very small for this example, although this might not always be the case (see Yao et al., 2018).
Across methods, the estimated weights mostly split evenly between models 1 and 2, with model 3 receiving the least weight. Notice that the stacking-mle model has much wider percentile intervals because it often places most of the weight on either model 1 or model 2 rather than dividing it equally, which is expected from a pure optimization process when models 1 and 2 provide similar predictions. The Bayesian stacking implementation (stacking-bayes), however, places a prior over the weights, which means its weights are much more evenly spread between models than those of its MLE counterpart.
The pseudo-BMA weights' percentile intervals are slightly wider than those of the pseudo-BMA+ weights, which is also expected: pseudo-BMA+ regularizes weights slightly away from 0 and 1 compared to pseudo-BMA as a consequence of the Bayesian bootstrapping of the log predictive densities.