Datacast

Episode 55: Making Apache Spark Developer-Friendly and Cost-Effective with Jean-Yves Stephan

Episode Summary

Jean-Yves (or "JY") Stephan is the CEO & Co-Founder of Data Mechanics, a Y Combinator-backed startup building a data engineering platform that makes Apache Spark more developer-friendly and more cost-effective. Before Data Mechanics, he was a software engineer at Databricks, the unified analytics platform created by Apache Spark's founders. JY did his undergraduate studies in Computer Science & Applied Math at École Polytechnique (Paris, France) before pursuing a Master's at Stanford in Management Science & Engineering.

Episode Notes

Timestamps

His Contact Info

His Recommended Resources

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Jean-Yves:

On Doing His Undergraduate at Ecole Polytechnique

The undergraduate math and computer science classes that I attended were theoretical.

I wasn’t sure at that time what I wanted to do. Startups weren’t popular yet. Most of the engineers graduating from my school went into consulting and finance.

On Doing His Master's at Stanford

The first class that I took was CS 229 — “Machine Learning” with Andrew Ng. That started my interest in the field. Having a background in Math helped a lot.

However, the class that I enjoyed the most was CS 246 — “Mining Massive Datasets” with Jure Leskovec. It was a big data class focused on distributed engines and big data technologies, with a mix of computer systems and math knowledge.

I also had the chance to be a Teaching Assistant for these two classes — grading homework and holding office hours. There were students with very different backgrounds. I enjoyed being a good listener and coming up with examples to explain abstract concepts to students in small groups.

On Leading Spark Infrastructure at Databricks

Databricks was a small startup when I joined, with about 40 employees. Even though the company was small, Apache Spark was starting to become famous. Having worked with Hadoop during my summer internship, I could see how much faster and more efficient Spark was at processing data in memory. What also made a strong impression on me was their interview process, which was really hard. I figured the team must also be very strong.

I initially joined the Cloud Automation team, managing the entire cloud infrastructure of Databricks. Over time, I became the lead of another team, called Cluster.

When I joined, Databricks had only about 20 customers. By the time I left three years later, they had maybe a thousand. We were launching hundreds of thousands of nodes into the cloud every day.

This meant we had a lot of support and firefighting work. Still, we fixed many bugs, gradually made the product more stable and efficient, and kept launching more machines into the cloud every day.

The last challenge was becoming data-driven. In the beginning, as a startup, we didn’t have a big observability stack for our software. As we scaled, my team helped define metrics and then measure and improve those KPIs. I learned a lot of engineering skills through that period of growth.

On Founding Data Mechanics

My co-founder, Julien, was a long-time friend. He was a Spark user, while I had more experience with Spark as an infrastructure provider. Together, we were frustrated that Databricks and its competitors did not go far enough in solving our pain points. We had the feeling that we could build a data platform that solved these problems for profiles like ours.

We wanted to make Spark more developer-friendly. We also wanted to make data infrastructure more cost-effective. Our goal was to build a data platform with automation that solves these problems — automatically choosing the type of cluster instance, automatically sizing the cluster, and automatically tuning the Spark configurations to be more efficient. The end users can focus on building their applications.
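To make that concrete, here is a minimal PySpark sketch of the kinds of knobs such automation takes off the user's plate. This is not Data Mechanics' actual API; the values are hypothetical, and instance-type selection happens at the infrastructure layer rather than in Spark configuration.

```python
from pyspark.sql import SparkSession

# Hypothetical values: the settings a platform like the one described would
# pick automatically, instead of users tuning them by trial and error.
spark = (
    SparkSession.builder
    .appName("nightly-etl")
    # Cluster sizing: let Spark scale the number of executors with the workload.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Per-executor resources, normally matched to the chosen instance type.
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "14g")
    # A classic efficiency knob that is easy to get wrong by hand.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)
```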

On Three Core Tenets of Data Mechanics

On Spark-On-Kubernetes

Since 2018, Spark can be deployed on K8s as an alternative to Hadoop YARN. With the release of Spark 3.1, the Spark-on-K8s integration is now generally available and production-ready.

The first benefit is native containerization:

The second benefit is cost reduction: a single shared infrastructure with very fast startup times makes your setup cost-effective.

The third benefit is the K8s ecosystem: K8s is very popular, and there are many tools for K8s monitoring, security, networking, and CI/CD. You get all of these tools for free when you deploy Spark on K8s.

I would say the main drawback today is that most commercial Spark platforms still run on YARN. If you need to run Spark on K8s, you probably need to do it yourself using open-source code. That requires expertise.
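For illustration, this is roughly what “doing it yourself” looks like with open-source Spark: pointing a session at a Kubernetes API server and supplying a container image. The API server address, namespace, image, and service account below are placeholders, and a real deployment also involves building the image, RBAC, and networking — the expertise referred to above.

```python
from pyspark.sql import SparkSession

# Placeholder values for the K8s API server, namespace, image, and service account.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-demo")
    .master("k8s://https://my-cluster.example.com:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.1.1")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "3")
    .getOrCreate()
)

# A trivial job to verify that executor pods come up in the cluster.
print(spark.range(10_000).selectExpr("sum(id) AS total").collect())
```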

That’s why we built Data Mechanics: to make Spark-on-K8s easy to use and cost-effective. We manage the K8s clusters so that our users don’t need to become K8s experts. In fact, they don’t need to interact with K8s at all. They just use our API or web UI instead.

On Data Mechanics Delight

Everyone complains about the Spark UI because it’s hard to know what’s going on, and reading it requires a bit of Spark expertise. It also lacks metrics about CPU usage, memory usage, I/O, disk usage, etc. Typically, data engineers use a separate system like Prometheus or Datadog to view these metrics. However, these separate systems don’t know anything about Spark, so engineers end up jumping back and forth between the Spark UI and their metrics monitoring system.
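As a rough illustration of that gap, Spark's monitoring REST API (served by the driver UI) exposes some executor-level counters, but CPU, I/O, and memory profiles over time still have to come from a separate system. A minimal sketch, assuming a driver UI reachable on localhost:4040:

```python
import requests

# Spark's monitoring REST API is served by the driver UI (port 4040 by default).
base = "http://localhost:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
for executor in requests.get(f"{base}/applications/{app_id}/executors").json():
    print(
        executor["id"],
        f"cores={executor['totalCores']}",
        f"memoryUsed={executor['memoryUsed']} bytes",
        f"gcTime={executor['totalGCTime']} ms",
    )
```

Correlating numbers like these with Prometheus or Datadog dashboards is the back-and-forth described above.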

That’s why we built Delight: to give a better bird's-eye view of what’s going on in your Spark application.

On Doing Y Combinator

We learned how to be bold:

We learned to talk to our users (a popular YC mantra):

We learned to focus:

On Getting Early Customers

There were a couple of things that helped:

On Hiring

For the first few hires, we hired entirely through our network. I would ask my friends, past colleagues, and our investors whether they knew someone who might be a good fit. These hires were more engaged and easier to convert.

Another lesson I learned is to make everyone an owner by giving stock options to our employees. When you join a very early-stage startup, the startup becomes your baby. It’s important that you are rewarded with ownership of the company.

Lastly, it’s crucial to define the culture — including the kind of people we want to work with. Then, trust our instincts to identify those people.

On The Tech Community in France

There are a few assets in France: (1) talented engineers from great engineering schools, and (2) attractive government grants and tax subsidies for startups.

However, most of our customers are in the US: