I am going to job interviews, again. This time, a frequent
request is: “Tell us about a failed project”. Of course, I never
fail as a data scientist; how could I? A data science task
involves a combination of domain knowledge and data, neither of
which I hold or produce, and a question someone else wants
answered. All I do as a data scientist is encode the domain
knowledge as a model, update the model’s latent variables
based on the data, and compute a quantitative answer to the
question. There are ways to ensure the adequacy of the model,
check the convergence of inference, and express the uncertainty
of the answer.
Just doing all these steps by the book ensures that there is
absolutely no way to fail. Consider the task of classifying
hand-written digits —
although different models may have different accuracy, there is
no way to ‘fail’ as long as one does things as taught. Or is
there?
Read more →
Thanks to the
plague, we
teach over Zoom, and have our lectures
recorded. Many students do not attend in real time and instead
replay the recordings at their convenience, and at 2x speed.
It is easy to label the students as superficial, but
double-speed replay has a perfectly valid, though slightly
embarrassing for us teachers, justification. When I was trained in public
speaking, I was taught this basic technique for preparing a
time-framed lecture:
Read more →
arXiv | code
The ultimate Bayesian approach to learning from data is embodied by
hierarchical models. In a hierarchical model,
each observation or group of observations $y_i$, corresponding
to a single item in the data set, is conditioned on a parameter
$\theta_i$, and all parameters are conditioned on a
hyperparameter $\tau$:
\begin{equation}
\begin{aligned}
\tau & \sim H \\
\theta_i & \sim D(\tau) \\
y_i & \sim F(\theta_i)
\end{aligned}
\label{eqn:hier}
\end{equation}
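As a concrete illustration, here is a minimal sketch in Go of the log joint density of such a model, with Normal distributions chosen for $H$, $D$ and $F$ and made-up observations; the particular choices and numbers are assumptions for the example, not part of the model above.

```go
package main

import (
	"fmt"
	"math"
)

// logNormal returns the log density of Normal(mu, sigma) at y.
func logNormal(y, mu, sigma float64) float64 {
	d := (y - mu) / sigma
	return -0.5*d*d - math.Log(sigma) - 0.5*math.Log(2*math.Pi)
}

// logJoint computes the unnormalized log joint density of the
// hierarchical model for one illustrative choice of H, D and F:
// tau ~ Normal(0, 10), theta_i ~ Normal(tau, 1), y_i ~ Normal(theta_i, sigma_i).
func logJoint(tau float64, theta, y, sigma []float64) float64 {
	lp := logNormal(tau, 0, 10) // hyperprior H
	for i := range y {
		lp += logNormal(theta[i], tau, 1)         // D(tau)
		lp += logNormal(y[i], theta[i], sigma[i]) // F(theta_i)
	}
	return lp
}

func main() {
	// made-up observations and noise levels, for illustration only
	y := []float64{28, 8, -3, 7}
	sigma := []float64{15, 10, 16, 11}
	theta := []float64{10, 10, 10, 10}
	fmt.Println(logJoint(5, theta, y, sigma))
}
```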
Read more →
Probabilistic programs implement statistical models. Commonly,
probabilistic programs follow the Bayesian generative pattern:
\begin{equation}
\begin{aligned}
x & \sim \mathrm{Prior} \\
y & \sim \mathrm{Conditional}(x)
\end{aligned}
\end{equation}
- A prior is imposed on the latent variable $x$.
- Then, observations $y$ are drawn from a distribution conditioned
on $x$.
The program and the observations are passed to an inference
algorithm, which infers the posterior of the latent variable $x$.
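For instance, the generative pattern can be sketched as a plain Go program; the Normal distributions and sizes below are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// generate follows the pattern above with illustrative choices:
// x ~ Normal(0, 1), y_i ~ Normal(x, 0.5).
func generate(rng *rand.Rand) (x float64, y []float64) {
	x = rng.NormFloat64() // x ~ Prior
	y = make([]float64, 10)
	for i := range y {
		y[i] = x + 0.5*rng.NormFloat64() // y_i ~ Conditional(x)
	}
	return x, y
}

func main() {
	rng := rand.New(rand.NewSource(1))
	x, y := generate(rng)
	fmt.Println("latent x:", x)
	fmt.Println("observations y:", y)
}
```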
Read more →
I taught a course on Bayesian data analysis, closely following
the book by Andrew Gelman et
al., but with the
twist of using probabilistic programming, either
Stan or Infergo,
for all examples and exercises. However, it turned out that at
least one important problem in the book is beyond the
capabilities of Stan.
Read more →
Gaussian processes are great for time series forecasting. The
time series does not have to be regular — ‘missing data’ is
not an issue. A kernel can be chosen to express trend,
seasonality, various degrees of smoothness, or non-stationarity.
External predictors can be added as input dimensions. A prior
can be chosen to provide a reasonable forecast when little
or even no data is available.
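As a rough illustration, a kernel combining a smooth trend with a seasonal component might be sketched in Go as follows; the particular kernels, length scales, and period are assumptions for the example.

```go
package main

import (
	"fmt"
	"math"
)

// rbf is the squared-exponential kernel: a smooth trend with length scale l.
func rbf(x, y, l float64) float64 {
	d := (x - y) / l
	return math.Exp(-0.5 * d * d)
}

// periodic is the exp-sine-squared kernel: seasonality with period p.
func periodic(x, y, p, l float64) float64 {
	s := math.Sin(math.Pi*math.Abs(x-y)/p) / l
	return math.Exp(-2 * s * s)
}

// kernel combines trend and seasonality; the covariance is defined for
// any pair of time points, so irregular sampling is not a problem.
func kernel(x, y float64) float64 {
	return rbf(x, y, 30) + 0.5*periodic(x, y, 7, 1)
}

func main() {
	// covariance between an arbitrary (possibly irregular) pair of time points
	fmt.Println(kernel(0, 3.7))
}
```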
Read more →
Go gives the programmer introspection into
every aspect of the language, and
of a running program. But there is one thing to which the
programmer does not have access, and it is the
goroutine identifier. Because the day the programmers know the
goroutine identifier, they create goroutine-local storage
through shared access and mutexes, and shall surely
die.
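For illustration only, the kind of goroutine-local storage being warned against might look like the sketch below; the goroutineID function is hypothetical, since Go deliberately provides no such thing.

```go
package main

import "sync"

// A shared map keyed by goroutine identifier, guarded by a mutex:
// the goroutine-local storage the post warns about.
var (
	mu    sync.Mutex
	local = map[int64]any{}
)

// goroutineID is hypothetical; Go provides no supported way to get it.
func goroutineID() int64 {
	panic("no goroutine identifier in Go")
}

func set(v any) {
	mu.Lock()
	defer mu.Unlock()
	local[goroutineID()] = v
}

func get() any {
	mu.Lock()
	defer mu.Unlock()
	return local[goroutineID()]
}

func main() {}
```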
Read more →
There are so many probabilistic programming
languages that
it is hard to choose one. Because it is so hard to choose one,
a probabilistic programmer has two options:
- invent a new probabilistic programming language, or
- write probabilistic programs in a regular programming
language.
The former choice is easier to make, which is why there are so
many different probabilistic programming languages. But writing
programs is so much easier in a regular language, and programs
in regular languages can do many useful things. Any modern
general-purpose programming language is suitable for
probabilistic programming. Take Go, for
example.
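For example, a probabilistic program in plain Go can be just a type with a method returning the log joint density of the parameters and the data; the model, method name, and Normal distributions below are illustrative assumptions, not any particular library’s API.

```go
package main

import (
	"fmt"
	"math"
)

// Model holds the observations; the parameter is the unknown mean.
type Model struct {
	Data []float64
}

// LogJoint returns the log joint density of the mean mu and the data,
// assuming mu ~ Normal(0, 10) and data ~ Normal(mu, 1).
func (m Model) LogJoint(mu float64) float64 {
	lp := logNormal(mu, 0, 10)
	for _, y := range m.Data {
		lp += logNormal(y, mu, 1)
	}
	return lp
}

// logNormal returns the log density of Normal(mu, sigma) at y.
func logNormal(y, mu, sigma float64) float64 {
	d := (y - mu) / sigma
	return -0.5*d*d - math.Log(sigma) - 0.5*math.Log(2*math.Pi)
}

func main() {
	m := Model{Data: []float64{1.2, 0.7, 1.9}} // made-up observations
	fmt.Println(m.LogJoint(1.0))
}
```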
Read more →
[Poster: html, pdf]
A good part of today’s internet content is created and shaped
for delivering advertisements. Internet pages are interconnected
by links, and a visitor is likely to open multiple pages from
the same publisher. After a while, visitors leave the web site,
either because they click on an advertisement or simply because
they get bored and switch to other content or activities.
Read more →
Sometimes, a data scientist is the first engineer in a software
project. More often, though, a data scientist joins the team when
there is already working code, ready to deploy or even deployed.
Here is how the latter case rolls out:
We write a piece of software. Thanks to continuous delivery,
we fix our bugs quickly and release new, improved versions on
time. Our code is fully tested, easy to change, and the pieces
fit together smoothly.
Read more →