I am going to job interviews, again. This time, a frequent
request is: “Tell us about a failed project”. Of course, I never
fail as a data scientist; how could I? A data science task
involves a combination of domain knowledge and data, neither of
which I hold or produce, and a question someone else wants
answered. All I do as a data scientist is encode the domain
knowledge as a model, update the model’s latent variables
based on the data, and compute a quantitative answer to the
question. There are ways to ensure the adequacy of the model,
check the convergence of inference, and express the uncertainty
of the answer.
Just doing all these steps by the book ensures that there is
absolutely no way to fail. Consider the task of classifying
hand-written digits —
although different models may have different accuracy, there is
no way to ‘fail’ as long as one does things as taught. Or is
there?
Read more →
Thanks to the
plague, we
teach over Zoom, and have our lectures
recorded. Many students do not attend in real time and instead
replay the recordings at their convenience, and at 2x speed.
It is easy to label the students as superficial, but
double-speed replay has a perfectly valid, though slightly
embarrassing for us teachers, justification. When I was trained in public
speaking, I was taught this basic technique for preparing a
time-framed lecture:
Read more →
arXiv | code
The ultimate Bayesian approach to learning from data is embodied by
hierarchical models. In a hierarchical model,
each observation or group of observations $y_i$, corresponding
to a single item in the data set, is conditioned on a parameter
$\theta_i$, and all parameters are conditioned on a
hyperparameter $\tau$:
\begin{equation}
\begin{aligned}
\tau & \sim H \\
\theta_i & \sim D(\tau) \\
y_i & \sim F(\theta_i)
\end{aligned}
\label{eqn:hier}
\end{equation}
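As a concrete illustration, here is a minimal sketch in Go of the log joint density of such a model, with Normal distributions chosen for $H$, $D$ and $F$ and made-up observations; the particular choices and numbers are assumptions for the example, not part of the model above.

```go
package main

import (
	"fmt"
	"math"
)

// logNormal returns the log density of Normal(mu, sigma) at y.
func logNormal(y, mu, sigma float64) float64 {
	d := (y - mu) / sigma
	return -0.5*d*d - math.Log(sigma) - 0.5*math.Log(2*math.Pi)
}

// logJoint computes the unnormalized log joint density of the
// hierarchical model for one illustrative choice of H, D and F:
// tau ~ Normal(0, 10), theta_i ~ Normal(tau, 1), y_i ~ Normal(theta_i, sigma_i).
func logJoint(tau float64, theta, y, sigma []float64) float64 {
	lp := logNormal(tau, 0, 10) // hyperprior H
	for i := range y {
		lp += logNormal(theta[i], tau, 1)         // D(tau)
		lp += logNormal(y[i], theta[i], sigma[i]) // F(theta_i)
	}
	return lp
}

func main() {
	// made-up observations and noise levels, for illustration only
	y := []float64{28, 8, -3, 7}
	sigma := []float64{15, 10, 16, 11}
	theta := []float64{10, 10, 10, 10}
	fmt.Println(logJoint(5, theta, y, sigma))
}
```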
Read more →
Probabilistic programs implement statistical models. Commonly,
probabilistic programs follow the Bayesian generative pattern:
\begin{equation}
\begin{aligned}
x & \sim \mathrm{Prior} \\
y & \sim \mathrm{Conditional}(x)
\end{aligned}
\end{equation}
- A prior is imposed on the latent variable $x$.
- Then, observations $y$ are drawn from a distribution conditioned
on $x$.
The program and the observations are passed to an inference
algorithm, which infers the posterior of the latent variable $x$.
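For instance, the generative pattern can be sketched as a plain Go program; the Normal distributions and sizes below are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// generate follows the pattern above with illustrative choices:
// x ~ Normal(0, 1), y_i ~ Normal(x, 0.5).
func generate(rng *rand.Rand) (x float64, y []float64) {
	x = rng.NormFloat64() // x ~ Prior
	y = make([]float64, 10)
	for i := range y {
		y[i] = x + 0.5*rng.NormFloat64() // y_i ~ Conditional(x)
	}
	return x, y
}

func main() {
	rng := rand.New(rand.NewSource(1))
	x, y := generate(rng)
	fmt.Println("latent x:", x)
	fmt.Println("observations y:", y)
}
```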
Read more →
I taught a course on Bayesian data analysis, closely following
the book by Andrew Gelman et
al., but with the
twist of using probabilistic programming, either
Stan or Infergo,
for all examples and exercises. However, it turned out that at
least one important problem in the book is beyond the
capabilities of Stan.
Read more →
Gaussian processes are great for time series forecasting. The
time series does not have to be regular — ‘missing data’ is
not an issue. A kernel can be chosen to express trend,
seasonality, various degrees of smoothness, or non-stationarity.
External predictors can be added as input dimensions. A prior
can be chosen to provide a reasonable forecast when little
or even no data is available.
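As a rough illustration, a kernel combining a smooth trend with a seasonal component might be sketched in Go as follows; the particular kernels, length scales, and period are assumptions for the example.

```go
package main

import (
	"fmt"
	"math"
)

// rbf is the squared-exponential kernel: a smooth trend with length scale l.
func rbf(x, y, l float64) float64 {
	d := (x - y) / l
	return math.Exp(-0.5 * d * d)
}

// periodic is the exp-sine-squared kernel: seasonality with period p.
func periodic(x, y, p, l float64) float64 {
	s := math.Sin(math.Pi*math.Abs(x-y)/p) / l
	return math.Exp(-2 * s * s)
}

// kernel combines trend and seasonality; the covariance is defined for
// any pair of time points, so irregular sampling is not a problem.
func kernel(x, y float64) float64 {
	return rbf(x, y, 30) + 0.5*periodic(x, y, 7, 1)
}

func main() {
	// covariance between an arbitrary (possibly irregular) pair of time points
	fmt.Println(kernel(0, 3.7))
}
```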
Read more →
Go gives the programmer introspection into
every aspect of the language, and
of a running program. But there is one thing to which the
programmer does not have access, and it is the
goroutine identifier. Because the day the programmers know the
goroutine identifier, they create goroutine-local storage
through shared access and mutexes, and shall surely
die.
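For illustration only, the kind of goroutine-local storage being warned against might look like the sketch below; the goroutineID function is hypothetical, since Go deliberately provides no such thing.

```go
package main

import "sync"

// A shared map keyed by goroutine identifier, guarded by a mutex:
// the goroutine-local storage the post warns about.
var (
	mu    sync.Mutex
	local = map[int64]any{}
)

// goroutineID is hypothetical; Go provides no supported way to get it.
func goroutineID() int64 {
	panic("no goroutine identifier in Go")
}

func set(v any) {
	mu.Lock()
	defer mu.Unlock()
	local[goroutineID()] = v
}

func get() any {
	mu.Lock()
	defer mu.Unlock()
	return local[goroutineID()]
}

func main() {}
```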
Read more →
There are so many probabilistic programming
languages that
it is hard to choose one. Because it is so hard to choose one,
a probabilistic programmer has two options:
- invent a new probabilistic programming language, or
- write probabilistic programs in a regular programming
language.
The former choice is easier to make, which is why there are so
many different probabilistic programming languages. But writing
programs is so much easier in a regular language, and programs
in regular languages can do many useful things. Any modern
general-purpose programming language is suitable for
probabilistic programming. Take Go, for
example.
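For example, a probabilistic program in plain Go can be just a type with a method returning the log joint density of the parameters and the data; the model, method name, and Normal distributions below are illustrative assumptions, not any particular library’s API.

```go
package main

import (
	"fmt"
	"math"
)

// Model holds the observations; the parameter is the unknown mean.
type Model struct {
	Data []float64
}

// LogJoint returns the log joint density of the mean mu and the data,
// assuming mu ~ Normal(0, 10) and data ~ Normal(mu, 1).
func (m Model) LogJoint(mu float64) float64 {
	lp := logNormal(mu, 0, 10)
	for _, y := range m.Data {
		lp += logNormal(y, mu, 1)
	}
	return lp
}

// logNormal returns the log density of Normal(mu, sigma) at y.
func logNormal(y, mu, sigma float64) float64 {
	d := (y - mu) / sigma
	return -0.5*d*d - math.Log(sigma) - 0.5*math.Log(2*math.Pi)
}

func main() {
	m := Model{Data: []float64{1.2, 0.7, 1.9}} // made-up observations
	fmt.Println(m.LogJoint(1.0))
}
```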
Read more →
[Poster: html, pdf]
A good part of today’s internet content is created and shaped
for delivering advertisements. Internet pages are interconnected
by links, and a visitor is likely to open multiple pages from
the same publisher. After a while, visitors leave the web site,
either because they click on an advertisement or simply because
they get bored and switch to other content or activities.
Read more →
Sometimes, a data scientist is the first engineer in a software
project. More often, though, a data scientist joins the team when
there is already working code, ready to deploy or even deployed.
Here is how the latter case rolls out:
We write a piece of software. Thanks to continuous delivery,
we fix our bugs quickly and release new, improved versions on
time. Our code is fully tested, easy to change, and the pieces
fit together smoothly.
Read more →