The ultimate Bayesian approach to learning from data is embodied by
hierarchical models. In a hierarchical model,
each observation or a group of observations $y_i$ corresponding
to a single item in the data set is conditioned on a parameter
$\theta_i$, and all parameters are conditioned on a
hyperparameter $\tau$:
\begin{equation}
\begin{aligned}
\tau & \sim H \\

\theta_i & \sim D(\tau) \\

y_i & \sim F(\theta_i)
\end{aligned}
\label{eqn:hier}
\end{equation}

A hierarchical model can be thought of as a way of inferring, or
‘learning’, the prior of each $\theta_i$ from all observations
in the data set. Consider the following example problem:
multiple boxes are randomly filled by $K$ marbles from a bag
containing a mixture of blue and white marbles. We are presented
by a few draws with replacement from each of the boxes, $y_{ij}$
being the $j$th draw from the $i$th box; our goal is to infer
the number of blue marbles $\theta_i$ in each box. Intuitively,
since the boxes are filled from the same bag, the posterior
distribution of $\theta_i$ should account both for draws from
the $i$th box and, indirectly, for draws from all other boxes.
This is formalized by the following hierarchical model:
\begin{align}
\begin{aligned}
\tau & \sim \mathrm{Beta}(1, 1) \\

\theta_i & \sim \mathrm{Binomial}(K, \tau) \\

y_{ij} & \sim \mathrm{Bernoulli}\left(\frac {\theta_i} K\right)
\end{aligned}
\label{eqn:marbles}
\end{align}

The above model learns from the data in the sense that
inference for each box is influenced by draws from all boxes.
However, learning from *training data* to improve
inference on *future data* with a hierarchical model is
computationally inefficient — if a new box is presented, one
has to add observations of the new box to the previously
available data and re-run inference on the extended data set.
Inference performance can be improved by employing data
subsampling, but the whole
training data set still needs to be kept and made accessible to the
inference algorithm. A hierarchical model cannot ‘compress’, or
summarize, training data for efficient inference on future
observations. Is there a way to learn from data in probabilistic
programs which is both Bayesian in nature and computationally
efficient?

## Problem: Learning from Data

The challenge we tackle here is re-using inference outcomes on the training data set for inference on new data. Formally, population $\mathcal{Y}$ is a set of sets $y_i \in Y$ of observations $y_{ij} \in y_i$. Members of each $y_i$ are assumed to be drawn from a known distribution $F$ with unobserved parameter $\theta_i$, $y_{ij} \sim F(\theta_i)$. $\theta_i$ are assumed to be drawn from a common distribution $H$. Our goal is to devise a scheme that, given a subset $Y \subset \mathcal{Y}$, the _training set_, infers the posterior distribution of $\theta_k|Y, y_k$ for any $y_k \in \mathcal{Y}$ in a shorter amortized time than running inference on a hierarchical model $Y \cup {y_k}$. By amortized time we mean here average time per $y_k,,k \in 1:K$ as $K \to \infty$.

In other words, we look for a scheme that works in two stages. At the first stage, inference is performed on the training set $Y$ only. At the second stage, the inference outcome of the first stage is used, together with $y_k$, to infer $\theta_k|Y, y_k$. We anticipate a scheme that ‘compresses’ the training set at the first stage, resulting in a shorter running time of the second stage. Such scheme bears similarity to the conventional machine learning paradigm: an expensive computation on the training data results in shorter running times on new data.

## Main Idea: Stump and Fungus

In quest of devising such a scheme, we make two observations which eventually help us arrive at a satisfactory solution:

- In Bayesian modelling, information about data is usually conveyed through conditioning of the model on various aspects of the data.
- In a hierarchical model, influence of the $i$th group of observations on the hyperparameters $\tau$ and, consequently, on other groups, passes exclusively through the group parameters $\theta_i$.

If, instead of conditioning on training data $y_i$, we could
condition on parameters $\theta_i$ corresponding to the training
data, then we could perform inference on new data item $y_k$ at
a lower time and space complexity. Continuing the well known
analogy between a hierarchical model and a tree, with the
hyperparameter $\tau$ at the root and observations ${y_i}_j$ in
the leaves, we can liken a model which receives all $\theta_i$
of the training data and new data item $y_k$ as a *stump* (the
hierarchical model with the trunk cut off just after the
hyperparameters) and a *fungus* growing on the stump —
the new data item. The problem is, of course, that we infer
**distributions**, rather than fixed values, of $\theta_i$,
and the model must be, somewhat unconventionally, conditioned on
the distributions of $\theta_i$.

However, a recently introduced notion of stochastic
conditioning
makes conditioning on distributions of $\theta_i$ possible, both
theoretically and in the practical case when the posteriors of
$\theta_i$ are approximated using Monte Carlo samples. Moreover,
conditioning the model both *stochastically* on the posterior
distributions of $\theta_i$ on training data and
*deterministically* on new data $y_k$ yields the same posterior
distribution of $\theta_k$ as inference on the full hierarchical
model. Based on this, we propose the ‘stump-and-fungus’ pattern
for learning from data in probabilistic programs:

- Training is accomplished through inference on a hierarchical model, in the usual way.
- Training outcomes are summarized as a collection of samples $\tilde \theta$, representing the mixture distribution of $\theta_i$ of all groups.
- For inference on new data item $y$, a stump-and-fungus
model is employed:
\begin{equation}
\begin{aligned}
\tilde \theta & \sim \mathrm{Hierarchical}(Y) \\

—&——————— \\

\tau & \sim H \\

\tilde \theta, \theta &|\tau \sim D(\tau) \\

y&|\theta \sim F(\theta) \end{aligned} \end{equation}

Although two models — hierarchical and stump-and-fungus — are involved in the pattern, the models are in fact two roles fulfilled by the same generative model, combining stochastic conditioning on training data and deterministic conditioning on new data (consisting potentially of multiple data items). This preserves a common practice in machine learning in which the same model is used for both training and inference.

## Case Study: Tumor Incidence in Rats

In this case study, discussed in Chapter 5 of Bayesian Data Analysis, data on tumor incidence in rats in $N=70$ laboratory experiments is used to infer tumor incidence based on outcomes of yet another experiment. A different number of rats $n_i$ was involved in each experiment, and the number of tumor cases $y_i$ was reported. The posteriors of independent models for all experiments are shown in Figure 1.

Figure 1. Posteriors of separate models for all experiments

A schoolbook solution for the problem
is to perform inference on a hierarchical model:
\begin{equation}
\begin{aligned}
\alpha, \beta & \sim \mathrm{Uniform}(0, \infty) \\

p_i&|\alpha, \beta \sim \mathrm{Beta}(\alpha, \beta)
\\

y_i&|p_i \sim Binomial(n_i, p_i)
\end{aligned}
\end{equation}
Inference on this model can be performed
efficiently thanks to summarization of $n$ observations from
$\mathrm{Bernoulli}(p)$ as a single observation from
$\mathrm{Binomial}(n, p)$. In general however, the use of a
hierarchical model would require carrying all observations of
all previous experiments for learned inference on findings of a
new experiment.

The stump-and-fungus pattern is straightforwardly applicable to
the problem:
\begin{equation}
\begin{aligned}
p_{-k} & \sim \mathrm{Hierarchical}(n_{-k}, y_{-k}) \\

—&———————— \\

\alpha, \beta & \sim \mathrm{Uniform}(0, \infty) \\

p&|\alpha, \beta \sim \mathrm{Beta}(\alpha, \beta) \\

y_k&|p_k \sim \mathrm{Binomial}(n_k, p_k)
\end{aligned}
\label{eqn:rats-stump-and-fungus}
\end{equation}
$p_{-k}$ are stochastically observable from the posterior of
the hierarchical model conditioned on all but
the $k$th experiment. Figure 2 shows the posterior distributions
for $p$ inferred on the hierarchical model and through $N+1$
applications of stump-and-fungus~(similar to leave-one-out
cross-validation). The Infergo source
code of the model is provided in the appendix. The inference
was performed with HMC on the hierarchical model and _stochastic
gradient_ HMC on the stump-and-fungus model. 1000 samples were
used to visualize the posterior. One can see that the posteriors
obtained via either method appear to be the same, except for
small discrepancies apparently caused by finite sample size
approximation. The data and code for the case study are
available on
BitBucket.

Hierarchical model | Stump-and-Fungus model |
---|---|

Figure 2: Hierarchical vs Stump-and-Fungus

## Discussion

We presented a probabilistic programming pattern for Bayesian learning from data. The importance of learning from data is well appreciated in probabilistic programming. Along with empirical Bayes, applicable to probabilistic programming as well as to Bayesian generative models in general, probabilistic-programming specific approaches were proposed. Our approach to learning from data in probabilistic programs does not require any particular implementation of probabilistic programming to be used, nor introspection into the structure of probabilistic programs or inference algorithms. Instead, the approach uses inference in ubiquitously adopted hierarchical models for training, and conditioning on observations for incorporation of training outcomes in inference.