Offtopia — nothing personal

How Data Scientists Fail

I am going to job interviews, again. This time, a frequent request is: “Tell us about a failed project”. Of course, I never fail as a data scientist, how could I? A data science task involves a combination of domain knowledge and data, neither of which is held or produced by me, and a question someone else wants an answer to. All I do as a data scientist is encode the domain knowledge as a model, update the model’s latent variables based on the data, and compute a quantitative answer to the question. There are ways to ensure adequacy of the model, check convergence of inference, and express uncertainty of the answer. Just doing all these steps by the book ensures that there is absolutely no way to fail. Consider the task of classifying hand-written digits: although different models may have different accuracy, there is no way to ‘fail’ as long as one does things as taught. Or is there?

Read more →

Double Speed Replay

Thanks to the plague, we teach over Zoom, and have our lectures recorded. Many students do not attend in real time and instead replay the recordings at their convenience, and at 2x speed.

It is easy to label the students as superficial, but double speed replay has a perfectly valid justification, if a slightly embarrassing one for us teachers. When I was trained in public speaking, I was taught this basic technique for preparing a time-framed lecture:

Read more →

How To Train Your Program

arXiv | code

The ultimate Bayesian approach to learning from data is embodied by hierarchical models. In a hierarchical model, each observation or a group of observations $y_i$ corresponding to a single item in the data set is conditioned on a parameter $\theta_i$, and all parameters are conditioned on a hyperparameter $\tau$: \begin{equation} \begin{aligned} \tau & \sim H \\ \theta_i & \sim D(\tau) \\ y_i & \sim F(\theta_i) \end{aligned} \label{eqn:hier} \end{equation}
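The generative structure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not fix the distributions, so $H$ is taken to be Exponential(1), $D(\tau)$ to be Normal(0, $\tau$), and $F(\theta_i)$ to be Normal($\theta_i$, 1). The function name `sample_hierarchical` is hypothetical.

```python
import random


def sample_hierarchical(n_items, n_obs, rng):
    """Draw one synthetic data set from tau -> theta_i -> y_i.

    All three distributions are assumptions made for this sketch.
    """
    # tau ~ H (assumed: Exponential with rate 1)
    tau = rng.expovariate(1.0)
    # theta_i ~ D(tau) (assumed: Normal with mean 0 and std tau)
    thetas = [rng.gauss(0.0, tau) for _ in range(n_items)]
    # y_i ~ F(theta_i) (assumed: Normal with mean theta_i and std 1);
    # each item gets its own group of n_obs observations
    ys = [[rng.gauss(theta, 1.0) for _ in range(n_obs)] for theta in thetas]
    return tau, thetas, ys


rng = random.Random(0)
tau, thetas, ys = sample_hierarchical(n_items=5, n_obs=3, rng=rng)
```

Every group `ys[i]` depends on the shared `tau` only through its own `thetas[i]`, which is the conditional structure the equations describe.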

Read more →

Stochastic conditioning

Probabilistic programs implement statistical models. Commonly, probabilistic programs follow the Bayesian generative pattern:

\begin{equation} \begin{aligned} x & \sim \mathrm{Prior} \\ y & \sim \mathrm{Conditional}(x) \end{aligned} \end{equation}

  • A prior is imposed on the latent variable $x$.
  • Then, observations $y$ are drawn from a distribution conditioned on $x$.

The program and the observations are passed to an inference algorithm which infers the posterior of latent variable $x$.
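As a minimal sketch of this pattern, assume a standard normal prior on $x$ and unit-variance normal observations centered at $x$; neither choice comes from the post. The inference algorithm here is plain likelihood weighting (sampling $x$ from the prior and weighting by the likelihood of the observations), chosen only because it is short. The helper names are hypothetical.

```python
import math
import random


def log_normal_pdf(value, mean, std):
    """Log density of Normal(mean, std) at value."""
    return -0.5 * ((value - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def posterior_mean(observations, n_samples, rng):
    """Estimate E[x | observations] by likelihood weighting."""
    total_weight = 0.0
    weighted_sum = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)  # x ~ Prior (assumed: Normal(0, 1))
        # y ~ Conditional(x) (assumed: Normal(x, 1)); weight by the likelihood
        log_w = sum(log_normal_pdf(y, x, 1.0) for y in observations)
        w = math.exp(log_w)
        total_weight += w
        weighted_sum += w * x
    return weighted_sum / total_weight


observations = [0.8, 1.2, 1.0]  # made-up data for illustration
estimate = posterior_mean(observations, 20000, random.Random(0))
# Closed-form posterior mean for this conjugate model: sum(y) / (n + 1)
exact = sum(observations) / (len(observations) + 1)
```

For this particular conjugate model the posterior mean is available in closed form (`exact`), so the sampled `estimate` can be checked against it. In models without a closed form, the estimate from the inference algorithm is all there is.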

Read more →

There Are No Outliers

Gaussian processes are great for time series forecasting. The time series does not have to be regular, so ‘missing data’ is not an issue. A kernel can be chosen to express trend, seasonality, various degrees of smoothness, and non-stationarity. External predictors can be added as input dimensions. A prior can be chosen to provide a reasonable forecast when little or even no data is available.

Read more →

A Go Transgression

Go gives the programmer introspection into every aspect of the language, and of a running program. But to one thing the programmer does not have access, and it is the goroutine identifier. Because the day the programmers know the goroutine identifier, they create goroutine-local storage through shared access and mutexes, and shall surely die.

Read more →

Go Programs That Learn

There are so many probabilistic programming languages that it is hard to choose one. Because it is so hard to choose one, a probabilistic programmer has two options:

  • invent a new probabilistic programming language, or
  • write probabilistic programs in a regular programming language.

The former choice is easier to make, which is why there are so many different probabilistic programming languages. But writing programs is so much easier in a regular language, and programs in regular languages can do many useful things. Any modern general-purpose programming language is suitable for probabilistic programming. Take Go, for example.

Read more →

A Small Program Can Be a Big Challenge

[Poster: html, pdf]

A good part of today’s internet content is created and shaped for delivering advertisements. Internet pages are interconnected by links, and a visitor is likely to open multiple pages from the same publisher. After a while, visitors leave the web site, either due to clicking on an advertisement or just because they get bored and switch to other content or activity.

Read more →

How to Hug a Data Scientist

Sometimes, a data scientist is the first engineer in a software project. More often, though, a data scientist joins the team when there is already working code, ready for deployment or even deployed. Here is how the latter case rolls out:

We write a piece of software. Thanks to continuous delivery, we fix our bugs quickly and release new, improved versions on time. Our code is fully tested, easy to change, and its pieces fit each other smoothly.

Read more →