I taught a course on Bayesian data analysis, closely following the book by Andrew Gelman et al., but with the twist of using probabilistic programming, either Stan or Infergo, for all examples and exercises. However, it turned out that at least one important problem in the book is beyond the capabilities of Stan.
This case study is inspired by Section 7.6 in Bayesian Data Analysis, originally a paper published in 1983 by Donald Rubin. The original case study evaluated Bayesian inference on the problem of estimating the total population of the 804 municipalities of New York State based on a sample of 100 municipalities. Two samples were given, with different summary statistics, and a power-transformed normal model was fit to the data to make predictions consistent across the samples. The author of the original case study apparently had access to the full data set (the populations of all 100 municipalities in each sample). However, only a summary description of the samples appears in the publication: the mean, the standard deviation, and the quantiles:
 | Population | Sample 1 | Sample 2 |
---|---|---|---|
total | 13,776,663 | 1,966,745 | 3,850,502 |
mean | 17,135 | 19,667 | 38,505 |
sd | 139,147 | 142,218 | 228,625 |
lowest | 19 | 164 | 162 |
5% | 336 | 308 | 315 |
25% | 800 | 891 | 863 |
median | 1,668 | 2,081 | 1,740 |
75% | 5,050 | 6,049 | 5,239 |
95% | 30,295 | 25,130 | 41,718 |
highest | 2,627,319 | 1,424,815 | 1,809,578 |
This is a common way to summarize a data sample, and Pandas, a Python library for data analysis, even has a built-in function, describe, which produces such a summary of the data. Here, we show how such a summary description can be used to perform Bayesian inference, with the help of stochastic conditioning.
The sample summary can be divided into three parts:
- the sample size;
- the mean and the standard deviation;
- the quantiles.
The sample size tells us how much information we have about the distribution. The mean and the standard deviation describe the distribution parametrically — if we knew the formula of the probability density (parameterized by mean and standard deviation), we could substitute these statistics into the formula to fully specify the distribution. Finally, the quantiles approximate the distribution shape non-parametrically — we can sample from each quantile proportionally to the probability mass of the quantile to approximate samples from the distribution.
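For illustration, the quantile summary of Sample 1 can be encoded as pairs of cumulative probability and value; this is the representation assumed by the sampler shown later in this study:

```go
// Quantiles of Sample 1 as (cumulative probability, value) pairs,
// from the lowest to the highest observed value.
var quantiles = [][2]float64{
	{0, 164}, {0.05, 308}, {0.25, 891}, {0.5, 2081},
	{0.75, 6049}, {0.95, 25130}, {1, 1424815},
}
```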
The original case study started by comparing normal and log-normal models, and finally fit a truncated three-parameter power-transformed normal distribution to the data, which helped to reconcile conclusions based on each of the samples while producing results consistent with the total population. Here, we use a model with a log-normal sampling distribution and a normal-inverse-Gamma prior on the mean and the variance. To complete the model, we stochastically condition the model on the piecewise-uniform distribution defined by the quantiles:

$$z_{1\ldots\mathrm{n}} \leftarrow \mathrm{Quantiles}$$

$$m \sim \mathrm{Normal}\left(\mathrm{mean}, \frac{\mathrm{sd}}{\sqrt{\mathrm{n}}}\right),\quad s^2 \sim \mathrm{InvGamma}\left(\frac{\mathrm{n}}{2}, \frac{\mathrm{n}}{2}\mathrm{sd}^2\right)$$

$$\sigma = \sqrt{\log\left(\frac{s^2}{m^2} + 1\right)},\quad \mu = \log m - \frac{\sigma^2}{2}$$

$$z_{1\ldots\mathrm{n}} \sim \mathrm{LogNormal}(\mu, \sigma)$$

By $z_{1\ldots\mathrm{n}}$ we denote $\mathrm{n}$ samples of $z$ from the distribution defined by the quantiles. The expressions for $\sigma$ and $\mu$ follow from moment matching: for $z \sim \mathrm{LogNormal}(\mu, \sigma)$, $E[z] = e^{\mu + \sigma^2/2} = m$ and $\mathrm{Var}[z]/E[z]^2 = e^{\sigma^2} - 1 = s^2/m^2$. Here is the model in Infergo:
```go
// Observe computes the log-posterior of the model parameters
// given x, where x[0] is the mean m and x[1] is the logarithm
// of the variance s² (log-transformed to be unconstrained).
func (m *Model) Observe(x []float64) float64 {
	mean := x[0]
	vari := math.Exp(x[1])
	// Prior: x[1] is the log-Jacobian of the exp transform; the
	// inverse-Gamma prior on the variance is expressed through
	// the Gamma density of the reciprocal.
	logp := x[1] +
		Normal.Logp(m.Mean, math.Sqrt(m.Vari/m.N), mean) +
		Gamma.Logp(m.N/2, m.N/2*m.Vari, 1/vari)
	// Log-normal parameters moment-matched to mean and variance.
	sigma := math.Sqrt(math.Log(vari/(mean*mean) + 1))
	mu := math.Log(mean) - 0.5*sigma*sigma
	// Stochastic conditioning: receive n values drawn from the
	// quantiles over the channel Z and score them under the
	// log-normal; -log z is the change-of-variables term.
	for i := 0.; i != m.N; i++ {
		z := <-m.Z
		logp += -math.Log(z) + Normal.Logp(mu, sigma, math.Log(z))
	}
	return logp
}
```
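The Observe method refers to fields of the Model type. For completeness, here is a minimal sketch of what that type could look like, with field names taken from the code above (the actual definition is in the repository); the channel Z delivers the values drawn from the quantiles:

```go
// Model carries the summary statistics of a sample and a channel
// supplying values drawn from the quantiles.
type Model struct {
	Mean float64      // sample mean
	Vari float64      // sample variance (sd squared)
	N    float64      // sample size
	Z    chan float64 // stream of samples from the quantiles
}
```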
A straightforward way to sample from the quantiles is to sample a quantile proportionally to the probability mass, and then sample a value uniformly from the range of values in the quantile:
```go
// RandQuantile draws a value from the piecewise-uniform
// distribution defined by the quantiles q, given as pairs of
// (cumulative probability, value).
func RandQuantile(q [][2]float64) float64 {
	var z float64
	// Choose a quantile proportionally to its probability mass.
	p := rand.Float64()
	// Locate the quantile containing p and sample a value
	// uniformly from its range.
	for i := 1; i != len(q); i++ {
		if q[i][0] >= p {
			z = q[i-1][1] + rand.Float64()*(q[i][1]-q[i-1][1])
			break
		}
	}
	return z
}
```
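To connect the sampler to the model, one possible arrangement (a sketch under the assumptions above, using Sample 1's statistics and the quantiles table defined earlier) is a goroutine that keeps feeding the channel:

```go
// Feed the model with values drawn from Sample 1's quantiles.
m := &Model{
	Mean: 19667, Vari: 142218 * 142218, N: 100,
	Z: make(chan float64),
}
go func() {
	for {
		m.Z <- RandQuantile(quantiles)
	}
}()
```

Each call to Observe then receives fresh samples from the quantile distribution through m.Z.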
This is all we need to define the stochastically conditioned probabilistic model in Infergo (the complete code and data are on Bitbucket). We fit the model using stochastic gradient Hamiltonian Monte Carlo (sgHMC). The posterior predictive distributions inferred from both samples are quite similar, and consistent with the summary of the total population:
 | Sample 1 | Sample 2 |
---|---|---|
mean | 18,646 | 23,655 |
5% | 82 | 69 |
median | 2,389 | 2,395 |
95% | 66,381 | 80,296 |
The model can be improved by replacing the log-normal with a power-transformed normal distribution. However, the point of this case study is to show how combining parametric and non-parametric summaries can be expressed easily with stochastic conditioning. It is not clear to us how a probabilistic program for this study could be expressed otherwise, using deterministic conditioning only.