Datacast

Episode 93: Open-Source Development, Human-Centric AI, and Modern ML Infrastructure with Ville Tuulos

Episode Summary

Ville Tuulos has been developing tools and infrastructure for data science and machine learning for over two decades. At Netflix, he led the machine learning infrastructure team. Currently, he is the CEO and co-founder of Outerbounds, where he is building a modern, human-centric ML infrastructure stack and continuing development of Metaflow, the open-source framework he created and managed during his Netflix days.

Episode Notes

Show Notes


Notes

My conversation with Ville was recorded back in October 2021. Since then, many things have happened at Outerbounds.

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are the highlights from my conversation with Ville:

On His Education in Helsinki

Higher education in Finland is free, so people have a lot of freedom to choose what they study. It is a unique system in which the government even provides students with a stipend. I did my undergraduate degree in computer science, math, and psychology, which, in hindsight, was a good combination for ML infrastructure. I enjoyed various CS courses, even foundational topics like the theory of computation, the philosophy of AI, and distributed computing.

During university, I was part of a research group called Complex Systems Computation and was exposed to academic research in agent modeling and statistical information retrieval. That was a fantastic learning experience in which I got to build many interesting prototypes with different ML techniques. I ended up staying there for seven years.

As a graduate student, I thought maybe I would do a Ph.D. and stay in academia. But then I started going to conferences like NeurIPS and SIGIR and saw that people at large companies had access to large compute resources and large datasets. I thought the future of ML was going to happen at companies, so if I wanted to do the most exciting work, I had to join one of them. Unfortunately, that has proven true even to this day.

On Working As A Researcher at Nokia

To paint you a picture: back in 2007, this was just before the iPhone was launched. Before the iPhone, Nokia was by far the largest smartphone manufacturer globally. In Finland, Nokia used to be the largest company. They founded a new research lab in Silicon Valley with the idea that the future of smartphones would be driven by software and that data would play a significant role.

In 2007, I thought this mobile thing was going to be big. People back then were already talking about ordering taxis, watching videos, or listening to music on the phone. Those things ended up happening, but Nokia did not make them happen. A professor of mine joined Nokia and eventually became its CTO, so I had the opportunity to follow in his footsteps and end up there. Overall, Nokia was a great learning experience, both technically and business-wise: a lesson in how the world’s leading company lost its way over a short period of time.

Nokia’s vision then was: assuming we collect terabytes of data from the real world through mobile phones, how will we store it? This was about the time when Hadoop and MapReduce started becoming big. The original MapReduce paper by Google was published in 2004, and Hadoop started around 2006. This was also when key-value stores became a big thing, after the original paper about Amazon’s Dynamo came out. Overall, many companies were rethinking how to store data at a large scale, following in the footsteps of Google and Amazon. Those were exactly the questions we tried to answer at Nokia Research. There were not many mature off-the-shelf solutions available, especially in open source, so we decided to build our own (such as Disco and Ringo). That was a good learning experience about building open-source communities.
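For readers unfamiliar with the model: MapReduce structures a computation as a map phase that emits key-value pairs, a shuffle that groups pairs by key, and a reduce phase that aggregates each group. Here is a minimal single-machine word-count sketch in pure Python; this is illustrative only and not the actual API of Hadoop or Disco, which distribute these phases across machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values; here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cat sat", "the dog sat"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

The appeal of the paradigm is that the map and reduce functions contain all the application logic, while the framework handles distribution, grouping, and fault tolerance.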

On Co-Founding Bitdeli

At Nokia, I led a research team working on open-source approaches that competed directly with the quickly growing Hadoop stack. We had the feeling that what we were doing with Python and a cloud-based stack would be way easier and more agile than the heavyweight approach. We thought we could make the stuff we had been building at Nokia more widely available through Bitdeli. Our vision was that the product would be cloud-based and Python-based. Back in 2011, many large companies were not yet in the cloud, and Python was definitely not as prevalent for data science and ML, especially in private environments, as it is today. One could say the timing was not quite perfect with that startup. Still, I am proud of the product we built.

Honestly, it is the usual entrepreneurial story of finding product-market fit. What is the one use case we should go after? We were thinking ML, which was way too new at the time. ETL would have been a good focus area but was also still nascent. Eventually, we could not find product-market fit and were acquired by AdRoll.

On Building TrailDB

Imagine the problem of selling anything on the Internet. A key part is understanding your customers. If you want to do this in a data-driven fashion, you can follow and record your customers’ behavior. On the technical side, how do you get the data in the first place? Often, the most interesting modeling problems are: How does the behavior evolve over time? What are the steps people take before buying something? You get a trail of events and can construct a data schema from that: a primary ID (the user ID) and a bunch of events, ordered by time, associated with that user (basically the trail this user has taken to reach an outcome).

You can definitely store data like this in a relational database. However, it was surprisingly hard to query: analyzing behavior over time requires window functions or intense SQL queries. That was the motivation for TrailDB. I wanted to make something super easy and simple to deploy, so it is a C library that you can integrate with different languages. The deployment story, therefore, was much easier than before. The beautiful part is that TrailDB ended up producing tens of millions of dollars in revenue for AdRoll.
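The core idea can be sketched in a few lines of Python (TrailDB itself is a C library, and the field and event names below are hypothetical): reconstructing per-user trails from a flat event log means grouping by user ID and ordering each group by timestamp.

```python
from collections import defaultdict

# Flat event log: (user_id, timestamp, event) tuples, in arrival order.
events = [
    ("u1", 3, "purchase"),
    ("u2", 1, "visit"),
    ("u1", 1, "visit"),
    ("u1", 2, "add_to_cart"),
    ("u2", 2, "visit"),
]

def build_trails(events):
    """Group events by user and order each user's events by timestamp,
    yielding the 'trail' each user took toward an outcome."""
    by_user = defaultdict(list)
    for user_id, ts, event in events:
        by_user[user_id].append((ts, event))
    return {uid: [e for _, e in sorted(trail)] for uid, trail in by_user.items()}

trails = build_trails(events)
print(trails["u1"])  # ['visit', 'add_to_cart', 'purchase']
```

In SQL, the same per-user ordering is what a window clause like `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts)` provides, which is exactly the kind of query that motivated a purpose-built, trail-oriented store.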

On Leading Data and ML Efforts at AdRoll

We have discussed how I started at Nokia building open-source technologies, worked on my startup and could not find product-market fit, and built a new product from scratch and saw it being launched. As the Head of Data at AdRoll, it finally clicked in my head how important it is to focus on the product. It is a classic story for any technical person that you start being fascinated by the tech, and over time, as you climb that ladder, you start seeing the bigger picture and understand how important it is to focus on solving the right problems. Sometimes, the right problems are different from the interesting ones. As we presented complex solutions to the users and tried to understand their point of view, I learned that we do not always have to choose a fancy approach. Sometimes something simple can be perfectly adequate if presented in the right context and the right way.

Another big lesson for me is that ML systems do not work in isolation but need an interface with human beings. AdRoll had various predictive systems to acquire more customers and drive more sales. Oftentimes, an account manager or salesperson handled the relationship with the customer and sat between the ML models and the customer. The challenge was that if you have an ML model that constantly evolves based on incoming data, it can be hard for the human stakeholder to understand and explain the results. This can be hugely problematic. Designing the interface and interaction between the model and the human is a huge part of the ML modeling process.

I also learned about my strengths and weaknesses. I started developing a more refined understanding of the difference between management and leadership. Some individuals are absolutely amazing managers, and some are absolutely amazing leaders. Occasionally, these are not the same people. You can be a good manager without being a great leader, and vice versa. There is certainly some cross-pollination, but they require different skills. That is a useful distinction to make, and it took me a while to understand fully.

On Joining Netflix’s ML Infrastructure Team

My not-so-secret plan at AdRoll was that, given how fast the company was growing at the time, I was hoping they could make a quick exit, and I could found another company because I still had many ideas from the Bitdeli era. But the company was still around and doing well, so I started thinking about what to do next.

On The Motivation Behind Metaflow

Netflix has been applying ML for a long time. Famously, they have a recommendation system that recommends TV shows and movies to users, and it is the crown jewel of all ML efforts at Netflix. At the same time, it is only the tip of the iceberg. Many people do not realize that most ML projects at Netflix are not directly visible in the product. Especially as the company became more global and started producing original content, they realized there were many opportunities to apply ML (computer vision, NLP, classical statistics, etc.).

Netflix had the same situation many companies find themselves in these days. On the one hand, they had plenty of infrastructure (data warehouses, orchestration systems, compute platforms, etc.). On the other hand, they had data scientists who were not necessarily software engineers. These scientists were not the ones writing Dockerfiles or interacting directly with CI/CD systems or Kubernetes; they were domain experts in topics like NLP and vision. Something was missing in the middle. How were we supposed to give these scientists tools so they could benefit from the existing infrastructure and get their projects to production as quickly as possible? That was the setup and the problem space at the time. In 2017, the term MLOps did not exist at all. There was no template to derive best practices from. I remember joining Netflix and asking people who had been thinking about this problem space what we should build. Nobody knew exactly what the infrastructure was supposed to look like.

On The Philosophy of Metaflow

A key idea behind Metaflow is human centricity: nothing else matters as much as the user experience. From the beginning, we started with the premise that everything is technically already possible, but nothing is easy enough. Our job is to make these tasks, be it using TensorFlow or running XGBoost at scale, easy enough that people can use their creativity and domain knowledge to solve real business problems. That is why human centricity has been a core value for us.

Another one is the product mindset. I mentioned before that I had become obsessed with the importance of coherent product design. Oftentimes, what happens with engineering is that you build features (distributed training, experiment tracking, hyperparameter optimization, etc.) and slap them together, but the end result does not have any cohesion. Anybody who has used a beautifully designed product like Apple’s knows the importance of composability: you can take pieces apart and put them back together seamlessly.

Given Netflix’s culture, it was critical that Metaflow be a pragmatic system and not a pie-in-the-sky research project. Netflix ultimately is not in the business of building or selling infrastructure. Everything serves the end purpose of entertaining the world. That is why we needed to build something that helps solve actual business use cases.

I had always been an enthusiastic Python user (going back to the Disco days), so we started with a Python-first approach. It became so much easier after TensorFlow was released in 2015 and Python was ready for prime time. Finally, cloud integration was the last piece of the puzzle. Netflix is a 100% AWS shop, so integrations with the best parts of the cloud would be vital for our backend storage.

On The Data Infrastructure’s Hierarchy of Needs

This diagram encapsulates the philosophy behind Metaflow, as well as Netflix’s value of freedom and responsibility. Netflix has this idea that every employee has the freedom to choose the best tools for the job (the best modeling approaches in the case of data scientists). We wanted to give them a lot of freedom at the top of the stack and let them work on things they enjoy working on.

On the other hand, there were things at the bottom of the stack: foundational stuff that simply must work. If you are an engineer working with Amazon S3, you want to be sure that S3 files never disappear and are always available; you could not care less how that is actually accomplished. The same goes for containers: you want to execute a container, and you do not care where the server that runs it comes from. You can keep going up the stack, and it is a continuum. This diagram reflects the interaction between data scientists and engineers. Oftentimes, there is a distinction in which aspects they care about, but both concerns are important and complementary.

On Finding Metaflow’s Early Adopters

I had the benefit that Metaflow was my third big open-source project, so I had already invested time in learning these lessons. Back when I worked on Disco, I had started with the idea of using the programming language Erlang. It is an amazing language, and I think it was technically the right tool for the job. But the unfortunate fact is that if you use any esoteric language (Erlang, OCaml, Haskell, or whatever your favorite language is), it will restrict adoption. So the first lesson is to make the project easily approachable.

The other thing I learned from TrailDB is that an open-source project can be a good technical solution, but you also have to provide everything around it. Internally at Netflix, we adopted the mindset that we had to sell Metaflow just like a startup would. Of course, the dynamics inside a big company are very different, but we had to invest in things beyond the code itself. Any open-source developer needs to invest in documentation and support. One of the big success factors for Metaflow internally at Netflix was the Slack channel we provided, where people could ask anything. It was not only that we were responsive on the channel, which I think is important, but also the style in which we did it: being human-centric and bringing a high level of empathy to user interactions.

It is an awkward situation when you try to use something and hit a roadblock. You do not know what to do, so you open an issue on GitHub or post a question on Stack Overflow. On the one hand, hitting the issue makes you feel stupid (maybe everybody else has figured it out). On the other hand, it is just annoying that your work is blocked because of someone else’s stupidity. Either way, a lot of emotional baggage comes with that support interaction. Understanding how to navigate that and giving friendly answers fast are massively important for open-source adoption.

On Prioritizing Metaflow’s Future Roadmap

There are surprisingly many failures in open-source product management. A role model for us is the Linux project. I have always followed Linux kernel development with great interest. When it comes to product management, Linux is hardly a wonderful role model of behaving well and being inclusive. But if you look at Linux as a technical achievement, you can argue that it is probably the most successful open-source project ever. There are lessons about leadership to be learned there.

To maintain the long-term health of Metaflow, we think about a few things:

The beautiful thing about open source is that the barrier to entry is low, since anybody can fork a project. I like a model in which people with a certain point of view push the project as far as possible. That is how we get the healthiest competition, with a rich ecosystem of different approaches. With design-by-committee, everybody gets averaged out, and you get mediocre solutions that all look the same. That is much less interesting and less useful than having amazing individuals with a strong vision.

Usage and adoption are undoubtedly important for us. The biggest lever is choosing which features to implement. Recently, we added support for Kubernetes. Kubernetes is an interesting example because I do not think data scientists should use Kubernetes directly, but it is highly relevant for engineers. For us, Kubernetes support is how users can access different clouds (AWS, GCP, Azure). For companies that have already deployed Kubernetes clusters, it makes sense to meet their needs: their ML platforms do not need to run in a totally separate environment from the rest of their infrastructure. Ultimately, ML should integrate with the rest of the software infrastructure, which is how we provide value to our users.

On Founding Outerbounds

Metaflow was open-sourced at Netflix in 2019. It was always a bit of an experiment to see if anybody cared. Towards the end of 2020, we started getting many questions from companies outside Netflix asking for more support as they adopted Metaflow. Obviously, Netflix is not in the business of supporting other companies. As the team manager, I faced the problem of how much we could prioritize the open-source work and support these other companies, which occasionally had feature requests that were not relevant for Netflix at all. For instance, many companies were asking for Azure support, and Netflix did not use Azure, so it was not a high priority. I faced a soul-searching moment: should we tell the community that we cannot support them (which would not be great for the community’s long-term health), or should we start doing this full-time? Eventually, I decided to make the leap and leave a really nice job at Netflix. It was not easy, but the opportunity to help other companies felt even more exciting. Of course, we still work closely with Netflix in an active joint community.

I knew Oleg Avdeev from AdRoll and had worked closely with him for many years. Savin Goyal was on my team at Netflix, leading the open-source development of Metaflow, and I had also known him for a long time. Interestingly enough, Oleg had been working at a feature store company called Tecton, which is very much in the same domain. Honestly, I felt this was an amazing founding team. The big question was: could I get them excited enough to do this together? Luckily, both of them said yes.

On Community Engagement

Metaflow would not have been possible without the very tight interaction we had at Netflix. A couple of things were critical in the early days.

  1. We were exposed to a wide diversity of applications. Oftentimes, what I see happening at other companies is that one key application (like Netflix’s recommendations) drives the development of the ML platform, and they overfit the platform to it. We had a different situation, where we got to interact with many different kinds of problems, which was really important.
  2. Data scientists were open enough to allow us to work closely with them. Sitting next to them was important.

We definitely want to continue doing the same at Outerbounds for other companies in the world. But I realize that the dynamic is quite different. At Netflix, everybody got paid a monthly salary by the same company, so there was no commercial incentive per se. It is more complicated in the outside world. For now, we have a Slack community at slack.outerbounds.co, where we invite everybody to join and provide the same level of support that teams at Netflix got from us.

We are curious to learn not only how people use Metaflow but also what real-world problems they are facing. The challenge is that it is very hard to get high-quality information. What are the actual pain points? There are so many psychological biases and social dynamics in what people reveal and do not reveal. One of my favorite examples is dependency management. Imagine you have to install TensorFlow or PyTorch. People spend an inordinate amount of time fighting with their virtual environments to get them installed. Eventually, they manage to do it and build a model. When you ask them, “What was the hardest thing about the model?”, they might say something about figuring out the right activation function. Well, that was not the hardest part. The hardest part was installing PyTorch in the first place. Somehow, that does not feel glorious at all.

We really want to understand how people actually spend their time. What are the business problems they are solving? I mean the basic linear regression problems out in the world that are super valuable for businesses. I am equally fascinated by those use cases as by the fanciest GANs people build for demos.

On Hiring

In this day and age, people are used to the fact that all information is available online. You can look at people’s social media accounts, blogs, and so forth. This can be hugely valuable. The thing is that making these career choices is very important in many ways, both financially and socially. I have been on the job-seeking side many times, evaluating different career options, so I empathize with them. I am sure that smart people will do their homework to understand the startup idea and the founders’ track record: Who are these people? What have they been doing? Why are they doing it? Have they done it before? Are they behaving nicely? Does this idea resonate with me?

In Outerbounds’ case, open source helps tremendously since every single line of code and every commit is public. People can evaluate the technical quality, the types of interactions, and the blog posts to form a cohesive picture of us. For anybody thinking about starting a startup eventually, maybe not today: share more about yourself, assuming that people want to know and learn about you. Resist the temptation to build a fancy public persona, because people can see the lack of authenticity. Then self-selection happens.

We have all kinds of ambitions moving forward. It is a hugely exciting time in the industry because so many companies are trying to find solutions for their technical stack. Some parts are maturing fast, while some others are lagging behind. Looking at the positions we have open today:

  1. If you are a software engineer knowing Python and want to build ML infrastructure, definitely reach out to us.
  2. If you are a front-end engineer interested in building delightful web applications, we would like to chat with you.
  3. Talking about the product mindset again, we take our documentation seriously. We want people who also have a background in data science and ML, not because we would necessarily become an ML consultancy and build solutions ourselves, but so that we can empathize with users’ pains.
  4. We look for people with cloud experience (AWS, Kubernetes) on the infrastructure side. Building solutions that resonate with both data scientists and engineers is hugely important.

On Writing

Writing a book requires a crazy amount of work. When Manning first approached me in the summer of 2020, I was hesitant. At Netflix at the time, I felt documentation was really important, and I asked myself: what is the point of writing technical books these days when all the material is available online? Then I realized that these topics are quite complex, and every blog article or presentation only scratches the surface. At the very least, it would be useful to have one place with all this information in a single source. To me, writing “Effective Data Science Infrastructure” seemed like a good challenge.

When writing a chapter, I always start by developing the examples first. Coming up with compact, fully standalone examples that illustrate the point actually takes quite some time. People do not appreciate that aspect of technical books. The big challenge is that somebody needs to write these examples and keep them from becoming obsolete too fast.

On The Modern Stack of ML Infrastructure

These days, you hear the term “MLOps” dropped everywhere. It felt silly that people suddenly started slapping the label onto different contexts. It is unclear what it even means, and it confuses more than it clarifies. We wanted to answer the question: do we even need a label like MLOps? To answer that, we started thinking about the differences between developing ML applications and traditional software engineering. Maybe we are just writing a new type of software, and eventually it will become as mainstream as anything else, so we do not need any new terms.

The key premise of our O’Reilly article is that there is one key point that differentiates ML development from traditional software development: data.

It is fascinating to think about the long arc of computing and programming overall: what programming looked like in the 50s, the 70s, and the 90s. Some things have stayed the same while others have changed. I believe data-driven programming is a new paradigm that is here to stay.