Datacast

Episode 73: Datasets for Software 2.0 with Taivo Pungas

Episode Summary

Taivo Pungas is a tech entrepreneur working on a stealth-mode startup. Previously, he built the AI team at Veriff from scratch to 20+ people and contributed to various ML/data roles at Starship and other Estonian startups. On the side, he advises startups and writes a blog at taivo.ai.

Episode Notes

Timestamps

(01:30) Taivo shared briefly about his experience going through the Estonian K-12 system, as argued in his blog post written in Estonian.
(05:34) Taivo described his undergraduate experience studying Computer Science at the University of Tartu and exposing to Machine Learning.
(08:15) Taivo discussed his time interning at Skype and TransferWise.
(10:01) Taivo went over his Master's Degree in Computer Science at ETH Zurich, where he worked on a thesis called "Uncertainty-based active imitation learning" at the Learning and Adaptive Systems Group.
(17:17) Taivo talked about his time working at Starship Technologies as a Perception Engineer.
(21:26) Taivo unpacked the Data Specification Manifesto that entails 3 principles for iteratively solving complex problems.
(27:21) Taivo unpacked "The Two Loops Of Building Algorithmic Products" from his experience at Veriff - an Estonian startup that develops an identity verification platform.
(32:11) Taivo discussed how his team at Veriff developed automation-heavy products.
(36:45) Taivo shared lessons learned as a Product Manager at Veriff: leading the go-to-market strategy, establishing communication between the product and sales division, and building a unique DataOps team that creates good datasets.
(44:31) Taivo described the key characteristics and properties of a tool that can address the whole data annotation workflow (Read his article "Data Loops Are The Bottleneck In Applied AI").
(49:33) Taivo predicted the evolution of the DataOps discipline for AI teams in the upcoming years (Read his article "Your AI Team Needs DataOps").
(54:01) Taivo untangled the relationship between sampling and labeling, and their importance in the AI development process (Read his article "Datasets Carve The Terrain of AI").
(56:36) Taivo talked about the tools that he's most excited about during the transition to Software 2.0.
(01:00:04) Taivo shared his journey thus far as the founder of a stealth startup.
(01:06:21) Taivo revealed insider insights about the #EstonianMafia startup ecosystem.
(01:09:36) Taivo shared the productivity tips that have been most useful to his personal/professional growth.
(01:14:10) Closing segment.

Taivo's Contact

Mentioned Content

Blog Posts

Data Specification Manifesto
"Building Automation-Heavy Products" (April 2019)
"Data Loops Are The Bottleneck In Applied AI" (June 2019)
"Your AI Team Needs DataOps" (July 2020)
"Datasets Carve The Terrain of AI" (Nov 2020)

Talks

"The Two Loops Of Building Algorithmic Products" (April 2019)
"How To Build Your AI Startup" (June 2020)
"Datasets: The Source Code of Software 2.0" (Nov 2020)

People

Andrej Karpathy (The Senior Director of AI at Tesla, who coined the term Software 2.0)
Mike Bostock (The Creator of D3.js)

Book

"Surely You're Joking, Mr. Feynman" (by Richard Feynman)

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Taivo:

On The Estonian K-12 System

I went to some of the top schools in Estonia for both middle and high school, but I felt pretty bored for some reason. The pace was either too slow or too fast, which meant that I wasn’t engaged and instead spent time misbehaving/disturbing other students. I think the problem, to a large extent, was the format of the school — where classes are lectures. You might as well have a 45-minute video playing to the students and maybe a small number of interactions, but it’s been thoroughly shown that a lecture is not a great way to teach (especially if the students are passively listening). To learn effectively, you have to do a lot of work to understand the materials. For me, I always found it much easier to do it individually.

From statistics, Estonia scores very high on academic performance but low on student happiness. You might guess that this is a tradeoff — you have to put a lot of work into being a student and maybe do not have time to play. But this is not true. Singapore, Switzerland, Japan, and other countries score high on both axes. So I pitched this idea that maybe we shouldn’t stick to the schooling method invented 100 years ago. Perhaps we should arrange it according to the best knowledge we have today, meaning more personalized curricula and more one-on-one interactions. In general, there’s just so much low-hanging fruit, and it was frustrated to see that schools are not better organized.

On Studying at ETH Zurich

One of my favorite courses was large-scale machine learning. It was well presented, taking relatively simple algorithms and proposing ways to apply them to massive datasets. For example, I learned about locality-sensitive hashing (via the implementation within Spotify’s annoy library), which has been helpful in my career.
Another course was based on matrix factorization and the different kinds of analysis that can be done on it. Matrix factorization methods force me to understand the latent structure in the data.
There’s the popular Machine Learning course with over 800 people that focused on statistical ML. The professor was an old-school ML person who taught rigorous mathematical concepts and was not much into neural networks.
Then, I completed a course on multimedia processing about how image and audio data are structured, encoded, and compressed at a high level. Understanding how JPEG encoding works is pretty interesting, and it’s been useful to see the common principles between JPEG encoding and matrix factorization.

My thesis was based on a simple idea: when you are trying to learn something from a teacher, you should have them show you the things they are most uncertain about. I did simple experiments to see whether it’s beneficial to have a teacher show you something that the teacher expects to be available. One thing that I learned from writing up this research was that: I am not suited to academic work. It’s difficult for me to motivate myself to work on things in which the tangible value is on such a long time horizon, and the outcome is so uncertain. My dopamine system is incompatible with academia, so I probably have to stay away from doing a Ph.D. or a very research-oriented project.

I came from Estonia, where it wasn’t challenging for me to get perfect grades in every class. Based on that experience, I saw myself as easily able to learn almost anything. Then I went to ETH, and the expected amount of learning per credit was so much higher. Therefore, I only aimed for the halfway grade between maximum and failing. If I manage that on average, I’d be happy.

Furthermore, I went there with impostor syndrome. I took the same courses with people with CVs full of world-class names, so it was intimidating. But then I realized that they were not mountains above me, and we were roughly on the same level. This was a reinforcement of my confidence.

On The Data Specification Manifesto

Let me simplify the problem we had at Starship Technologies: Given a robot, you want it to cross a road. The most dangerous time in the drive is to figure out whether a car is coming or not. So this is a binary classification problem: is it safe for the robot to cross or not? You can approach this analytically by looking at a bunch of crossings and figuring out some major patterns: cars coming from the left, cars coming from the right, cars that are occluded by bushes. However, if we abstract the problems too much, we lose the detail.

We came up with a different approach to curating a good collection of examples that we intended to solve. This is common in machine learning as we always evaluate our models directly on the data. It turns out that this is not very common in software engineering. When you write tests, you might write out typical scenarios as kinds of unit tests. Then you solve these unit tests in testing and development. When you take the approach of looking at the data points and solving them directly (instead of an abstraction layer in between), you have a really good empirical loop of solving the problem.

On “The Two Loops Of Building Algorithmic Products”

The origin of this talk is quite related to the Data Specification Manifesto idea. When building up the AI team at Veriff, most of our problems actually surrounded the data. It was pretty easy to find a standard neural network model to train, build a micro-service, or deploy our models. The hardest part was making sure that we had good datasets available in a timely way. The talk was an experiment around the idea of organizing the loops around data, as inspired by the concept of Software 2.0 (coined by Andrej Karpathy).

In software 2.0, you still have the source code used to train your model. But the data defines the behavior of your ML system. If you think about it this way, it makes a huge amount of sense to focus much more on the data, not on the code to train your model. Somehow deep learning people tend to focus more on tinkering with the best model — like revamping the architecture to squeeze out 0.5% improvement. This can only be justified if you already have good datasets or if you have exhausted all other opportunities from working with data. Or if you work on super cutting-edge technologies (like most self-driving companies do). But in most cases, if you work on standard tasks like image classification or object detection, you can find many implemented architectures on GitHub or your standard neural network library. You don’t need to roll your own architecture because you’re unlikely to find a better one than those published before.

Thus, it would be best if you put effort into making a really good dataset. I started to think about how to use the two development loops to improve the dataset (in the process of deploying a new model and serving the result). In software 1.0, you release the software, find bugs, and change the source code. Similarly, in software 2.0, you release the software, find new cases, add them to your dataset, find bugs in your dataset, and change them. These are activities that you can focus on explicitly to make better datasets and build a data engine (a machine that produces good datasets for software 2.0).

On Developing Automation-Heavy Products at Veriff

The product is an API where you send a picture of your driver’s license and get back a response containing a binary decision about whether it looks fraudulent or not. A lot of data is extracted from the license (such as your name or your date of birth). Banks or fintech companies are legally obligated to obtain that information in order to verify your identity. You might already see that this is a perfect supervised learning problem without a time dimension: given completely independent images, find the correct answer. Humans can give counter labels, which makes this process even easier. The context here was that we have to reduce the amount of human work that goes into processing the license documents. In one extreme, for every image that comes in, a human has to make all the decisions. On the other extreme, all the decisions are made automatically. This was a pure case study of automation.

The first tip is defining a unit of automation: You have to figure out the appropriate level at which to automate. In the identity verification case, you don’t get just one image. You get images of the front side, the backside, and the person’s face. So a natural unit of automation is the combination of 3 images. This unit of automation might be obvious, but there might be different options, and you might have to choose between those.
The second tip is staying algorithm-agnostic: This is relatively straightforward. You might have a tendency to go for a fancy algorithm, whereas something relatively simple might suffice. Instead of having a neural network that tells you what kind of document it is, you can use a standard OCR library and look for the text like “Estonia” to figure out what country it is. It won’t be accurate enough but would give you a decent baseline possibly. And it’s probably 100x faster to implement.
The third tip is building a final decision maker: The problem with a distributed decision-making system is that you need an aggregation point of all the decisions at some point. If you don’t think about it beforehand, you might have a hacky aggregation race condition — where you have two algorithms running in parallel and creating a huge mess.

On Being a Product Manager at Veriff

There was no handbook on building good datasets. How could I help build an organization that could produce really good datasets that are large enough for training and work for different kinds of problems? I created a team called DataOps, which is a combination of a domain expert, a data labeler, and a project manager. The idea was that: if I have a person in-house as part of the team working directly with the data scientist, then the loop of improving datasets would be much faster (as opposed to having a team in a silo where I send a request for dataset and receive it in a week later). This approach worked well for Veriff, even though it created a lot of chaos. The contrast is to have a completely separate labeling team or even outsource the labeling process.

Regarding the enterprise aspect of the company, most of Veriff’s revenue comes from large deals. In these scenarios, there is a natural friction between sales and product organizations. The salesman is incentivized to sell big deals, thereby asking for deliveries at whatever cost to help close these deals. Sometimes, this might mean building particular features for the client that would be 20% of the core product. On the other hand, the product person typically has in his mind a strategy to execute based on market trends, thereby often saying no to requests from the sales team. To resolve such a built-in tension between sales and product, I created a forum to get both sides talking and working together to solve challenging, difficult go-to-market questions. I learned that it was difficult to make short-term versus long-term tradeoffs: how much could we sacrifice revenue today to potentially have much higher growth 12–18 months from now?

On Dataset Tools

There are two extremes of data annotation tools:

On the one hand, there are tools in which you upload images into a folder, and they help you draw boxes in the image and save the files somewhere. Most of these are either old, bad, or costly.
On the other hand, there are full-fledged enterprise AI platforms that do everything from storing data, labeling the data, assuring label quality, to training models, running experiments, monitoring models.

Given the current state of dataset tools, it seems to me that you have basically these two choices: either you buy a whole platform, or you take this tiny piece that does box drawing and build everything else around it. As part of my own startup that I’m working on right now, I was investigating this exact problem: can I build some product in-between? My target audience was small-to-medium businesses, which can’t afford large platforms and don’t have a large enough team to build everything in-house. How do they manage? The answer is that they have no solutions.

I hope in the long term, there will be a vibrant ecosystem of interoperable tools (which exists in Software 1.0). You won’t need a single tool. The functionality of tools will be exactly the same as today, but I’d love to see choices between different libraries that help practitioners design customized Software 2.0 stacks.

On DataOps for AI

The primary function of DataOps is to build good datasets. If you look at how this is done today, the data scientist is typically responsible for this task. However, this person is also responsible for a huge range of things (especially at small companies): figuring out the labeling guidelines, building the infrastructure for different tools, writing training code, running modeling experiments, deploying microservices, etc. It’s evident that almost no one is good or even capable of all of these tasks at a high enough level of quality.

One solution is to outsource dataset construction to another company as a whole, which I think is a bad idea. If the model is a business-critical component, then the dataset is even more important. Another solution is to hire low-qualified labor to build datasets, which only work for really simple tasks. In such scenarios, you won’t even need ML in the first place.

Ultimately, this is not a technical question. This is a business question. At least from my observations, companies that initially started with outsourcing their DataOps function eventually brought back in-house due to privacy/quality/cost concerns.

On Data Labeling and Data Sampling

Labeling means deciding which exact label should be given per an input image. Sampling means choosing which input to add into the queue to be labeled. These two processes define your dataset and how your model will perform.

People think about labeling perhaps a bit more because sampling is similar to defining the weight you give to each category of examples. Random sampling might seem like a good conservative choice, but if you have a long tail of rare examples, you might want them to be more common in your dataset than they are proportionately in the real distribution.
Labeling-wise, you need some policy to which you can label the ground truth. When your policy changes, you need to relabel your dataset. Managing your policy well is challenging: how to describe your annotation guide in a versionable and explainable way?

On Label Store

I would like to see something equivalent to Git for ML datasets. Different tools are not interoperable because they each build their own features to manage images/labels. I could imagine a git-like system that stores the labels, and then every tool can use the labels from the system. They would never have to worry about versioning, branching, etc. This label store will be a central backbone of how we work with data.

On #EstonianMafia

It happened so that the founding engineers of Skype were based in Estonia. When Skype exited, many of these people probably made millions of euros. These people, in turn, started their own companies and invested in other companies. Estonia thus had a second generation of companies, such as Playtech, TransferWise, and pipedrive. This multiplication effect snowballs the Estonian startup ecosystem.

Estonia is also a small country with slightly more than a million people. If you work at some startup in Estonia, you probably know someone who knows any other person you want to reach (investors, early employees, executives, etc.) Since the network is relatively small, founders are supportive of each other, and there are no competitions between companies doing roughly the same thing. It will be difficult for us to get the scale of Silicon Valley, but Estonia is a world leader in startup-per-capita.