Datacast

Episode 77: Delivering Modern Data Engineering with Einat Orr

Episode Summary

Einat Orr is the CEO and Co-founder of Treeverse, the company behind lakeFS, an open-source platform that delivers resilience and manageability to object-storage-based data lakes. She received her Ph.D. in Mathematics from Tel Aviv University, specializing in optimization in graph theory. Einat previously led several engineering organizations, most recently as the CTO at SimilarWeb.

Episode Notes

Timestamps

Einat’s Contact Info

Mentioned Content

lakeFS

Blog Posts

People

Book

Notes

My conversation with Einat was recorded back in April 2021. Since the podcast was recorded, a lot has happened at Treeverse! I’d recommend:

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Einat:

On Studying Mathematics

I’m very dyslexic, so it’s tough for me to read. The only subject I could succeed at academically was mathematics. University was the first place where the only thing I was studying was mathematics. I actually started with engineering, but I quickly shifted most of my courses to mathematics. Eventually, I got my Bachelor’s, Master’s, and Ph.D. degrees all in mathematics. It was a happy occasion for me to focus on something that didn’t require any reading.

As the years went by, more research and less studying in classes also suited me well. I enjoyed lying on the sofa, looking at the ceiling, and thinking about deep problems that I was trying to solve. But I knew from the get-go that I wasn’t looking for an academic career. I didn’t think I was talented enough as a mathematician, but I enjoyed learning math.

Conceptually, there were two big jumps in my academic journey. The first was calculus, and the second was measure theory. Probability came out of the latter. I was excited about the application of probability theory in the real world. Later on, in my Ph.D. work, I combined probability with graph theory and algorithms on graphs.

On Her Engineering Career

Working while being a Ph.D. student is quite common in Israel. I got a fantastic job at Compugen, a desirable employer for many students. I had intense 16-hour days combining work and study, but it was very satisfying. It was the first place I ever coded professionally. I moved from C to C++. I learned how to develop algorithms with the best coding practices from the best people. In general, it’s not very common to see a bunch of mathematicians adhering to processes like that. Years later, I still referenced things I had studied at Compugen when I worked with other people who did not know those best practices.

I started as an algorithm developer at Correlix. As the company grew and the opportunity presented itself, I was given software developers to manage. Within two years, I found myself managing the entire R&D team. I was responsible for delivering the product, and the first time I did it, it was a dramatic experience. Being a VP of R&D is not a simple position. The best you can do is maybe meet what is expected of you, because expectations are always high. It’s hard to explain to business people who know nothing about development why things are complex, and when they should be complex versus when they shouldn’t. It’s an interesting challenge, and I learned a lot from it.

The first time that I was a manager, I was horrible. I brought results to the organization, but I was a top-down manager. I did not make optimal use of the talent of the people who worked with me. I dictated solutions many times. I was stressed, and I didn’t listen. Because I was focused on the results, I was less focused on the people and how they brought those results. Over time, I learned that I could get to the result by focusing on what people could do to help get there. That completely changed how I manage people. I helped them succeed and listened more than I talked.

Similarly, at SimilarWeb, I was responsible for delivering the product. My job was to give my managers the context they needed to make decisions and to explain to my colleagues and the CEO the limitations and innovations that we could offer. The work mainly revolved around ensuring alignment between the R&D, product, and sales functions. I would also bring in new startups with ideas that I thought might be relevant to SimilarWeb’s product.

On Co-Founding Treeverse

We had a very data-intensive operation at SimilarWeb. There was a lot of legacy technology when I joined in 2014. My journey choosing data technologies made one thing very clear: while there were a lot of tools in the market, there were also pain points that had not been addressed yet. Furthermore, SimilarWeb had become too big for me after 5.5 years. When communication started to revolve around the politics of an enterprise, I stopped enjoying my work. That’s why I decided to move on.

After I left, one of the people who reported to me left as well. He had the idea of lakeFS and shared with me the initial design. One thing led to another. I fell in love with the idea that he had, and that’s how it all started.

On Capabilities of lakeFS

At SimilarWeb, we had a data lake over S3. We used various tools such as Databricks, Hive, Presto, Redshift, and more for different use cases of consuming data from the lake. We wanted to optimize the technology for the use cases. We thought that we had the right architecture, but we still had a lot of pain around resilience and manageability.

Working over object storage is error-prone. Let’s face it: we managed the data in a shared folder. Seven petabytes of data used by 40+ people in parallel couldn’t be managed just by permissions. Not if you want to democratize the data in the organization and allow as much insight as possible to be gained from it. While we loved our architecture, we felt a need for something that allows for better manageability. We knew the patches we were using to avoid problems and prevent errors, and we thought we could replace all of those patches with one conceptual idea: allowing git-like operations over the object storage.

As we spoke to potential users (who later on became users), we realized that if we could test our data pipelines over production data before deploying them to production, our quality would be much higher. If we can test the data continuously, in a continuous integration mindset, it will solve data quality problems such as missing data, improper metadata, and incorrect data schemas. Organizations address these issues in many different ways.

Think about having a Git for your data in production. You can quickly revert to the last stable state of your data in a synchronized way across all your data collections. When done manually or with scripts, this task is highly error-prone. With lakeFS, you accomplish it with one atomic action taking a few milliseconds. You get the same safety you have when managing code, applied to the data of a data-intensive application.
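To make those git-like operations concrete, here is a minimal sketch (mine, not from the conversation) that drives lakeFS’s lakectl CLI from Python. The repository URI, branch names, and commit message are hypothetical, and exact flags may vary across lakeFS versions:

```python
import subprocess

REPO = "lakefs://example-repo"  # hypothetical repository URI


def lakectl(*args: str) -> None:
    """Run a lakectl command and fail loudly if it errors."""
    subprocess.run(["lakectl", *args], check=True)


# Branch off production data; in lakeFS this is a metadata-only
# operation, so no objects are copied.
lakectl("branch", "create", f"{REPO}/etl-test", "--source", f"{REPO}/main")

# ... run the pipeline against the isolated etl-test branch here ...

# Commit the results on the branch.
lakectl("commit", f"{REPO}/etl-test", "-m", "test run over production data")

# If validation passed, merge atomically back into production.
lakectl("merge", f"{REPO}/etl-test", f"{REPO}/main")

# And if a bad change ever lands on main, a single atomic revert
# restores the last stable state across all collections, e.g.:
#   lakectl("branch", "revert", f"{REPO}/main", "<commit-id>")
```

Because branches are metadata-only, the test branch above can expose petabytes of production data without copying a byte.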

On Data Versioning-as-an-Infrastructure

For us, data versioning is the means, not the end. The end is to bring the right workflow and application lifecycle to the work of data engineers. In order to have that, you need to be able to version the data. When you use lakeFS, you can decide how many versions of the data to keep (if any), depending on how far back you want to revert or how much reproducibility you need. lakeFS versions the data to allow branching, committing, merging, and the use of pre-merge and pre-commit hooks, thus enabling development environments and CI/CD for data.

This is a critical difference between us and other tools that version data for ML experiments (a specific, narrow need for versioning). Instead of looking at versioning vertically, per application, we look at the problem horizontally and believe that versions are a property of the data. Versions are not a property of any application using the data. If you have two teams analyzing the same data using different tools, they need to be able to communicate about the versions of this data. You don’t want the versions of the data to come from whichever tools they happen to be using. You want the versions of the data to come from a horizontal system that provides them with workflow tools and enriches anything they do with git-like operations. The key here is to look at git-like operations not as a feature but as a horizontal property of the data that should be holistically managed for the organization by one system. It’s also crucial for auditing, lineage, and governance over PII (among other things).
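One way to picture this horizontal property (an illustrative sketch under my own assumptions, not a setup described in the episode): lakeFS exposes an S3-compatible endpoint, so any S3 client can pin the same version of the same dataset regardless of which tool sits on top. The endpoint, credentials, repository, commit ID, and object path below are all hypothetical:

```python
import boto3

# lakeFS speaks the S3 protocol: the repository acts as the bucket,
# and the key is prefixed with a ref (a branch name or commit ID).
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # hypothetical endpoint
    aws_access_key_id="<lakefs-key-id>",
    aws_secret_access_key="<lakefs-secret-key>",
)

COMMIT = "c4f1a2b"  # hypothetical commit ID both teams agree on

# A Spark team and a Python analysis team can each read this exact
# version, with no tool-specific versioning layer in between.
obj = s3.get_object(Bucket="example-repo", Key=f"{COMMIT}/events/part-0000.parquet")
```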

On Data Mesh

I see data mesh as a good intuition for applying lessons learned from the world of software application development to the world of data-intensive application development. But taking data mesh down from this excellent concept to day-to-day work is very complicated. I don’t think one solution fits all here, just as with microservices: every organization manages its services, not necessarily “micro,” in its own way and with its own logic. It all depends on the business, the R&D structure, the talent density, and the application itself. The same is true with data.

I believe that looking at things in an agile way, building cross-functional data teams, and having a data product manager/owner are all extremely important. Every organization should take the data mesh concept and adjust it to its own needs, very much like agile development and microservices have been adopted in different ways according to the needs of organizations. There is a continuum between microservices and monolithic applications in software development, and every software team can find its sweet spot in between. The same thing should happen with data.

On Data Quality Testing

Understanding where and how to mind data quality is paramount.

  1. We first need to look at the level of the record itself. It’s best to enforce quality at the collection point and do as much testing as possible there. If you find quality issues later on, the damaged records are already a given fact. So there’s a lot to be said about properly collecting data, whether automatically or manually.
  2. When you go further down the data pipeline, you would want to check the properties of datasets, which should be done using statistical tests and anomaly detection (see the sketch after this list).
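As an illustration of the two levels (a generic sketch, not tooling mentioned in the episode), the record-level check validates a schema at the collection point, while the dataset-level check compares a batch against historical statistics. The column names and thresholds are hypothetical:

```python
import pandas as pd

# Level 1: record-level checks, enforced at the collection point.
# The expected schema below is a hypothetical example.
EXPECTED_DTYPES = {"user_id": "int64", "event_time": "datetime64[ns]", "url": "object"}


def record_level_errors(df: pd.DataFrame) -> list[str]:
    """Schema and per-record constraints on a newly collected batch."""
    errors = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "user_id" in df.columns and df["user_id"].isna().any():
        errors.append("null user_id values found")
    return errors


def dataset_level_errors(df: pd.DataFrame, daily_row_counts: pd.Series) -> list[str]:
    """Statistical checks against the dataset's historical behavior."""
    errors = []
    mean, std = daily_row_counts.mean(), daily_row_counts.std()
    # Crude anomaly check: flag batches whose size is more than three
    # standard deviations away from the historical daily row count.
    if abs(len(df) - mean) > 3 * std:
        errors.append(f"row count {len(df)} is anomalous vs. history")
    return errors
```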

To get quality over a data lake, you need the trinity of a tool with “Git-like” capabilities, a testing tool, and an orchestration tool. In this picture, lakeFS provides the isolation and the management of artifacts. You first ingest data into a branch, then create a pre-merge hook that calls the testing tool. Once the data is merged, the orchestration tool can call the first job that uses this dataset. This ensures that any data coming in and used within your daily analysis is tested for quality. If the test fails, the data is not merged, and the jobs won’t run. lakeFS preserves a snapshot of the data at the time of failure, so you can debug the problem and understand what happened to the data and why the quality test failed.
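As a sketch of how the pre-merge gate in that trinity could be wired up: lakeFS supports webhook hooks declared in action files on the branch, and a failing webhook blocks the merge. The server below is a hypothetical stand-in for the testing tool, not the setup discussed in the episode:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def data_quality_passes(payload: bytes) -> bool:
    # Placeholder: parse the lakeFS event payload, locate the staged
    # data on the source branch, and invoke the real testing tool.
    return True


class PreMergeHook(BaseHTTPRequestHandler):
    """Webhook endpoint for a lakeFS pre-merge action to call."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # lakeFS treats a non-2xx response as a hook failure,
        # which blocks the merge.
        self.send_response(200 if data_quality_passes(body) else 412)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PreMergeHook).serve_forever()
```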

On Underinvested Areas in the Data Engineering Ecosystem

  1. Metastore: Everyone is either using the Hive Metastore or a managed version of it. And everyone is crying while doing that. Hive will probably be the only relic of the Hadoop ecosystem still surviving in a few years, because no one has figured out how to replace it with something better. It’s painful to use and not scalable. This is a pain point that I think is currently overlooked. Not enough is happening there to allow data engineers to work smoothly.
  2. Data Quality: Data quality tools are emerging, so we need to see how they evolve to address the painful problem of ensuring quality in complex data environments.
  3. Data Discovery: In the past few years, more and more open-source tools have been released to the world by big enterprises. Discovery is becoming a bigger problem because the amount of data grows even in smaller organizations. Questions arise like: Where is the data? Who owns the data? What does the data look like? How is the data used in the organization? All these questions need to be answered by a very good system that manages the metadata. I expect the discovery category to keep growing and meet the need that is currently emerging across the industry.

On The Open-Core Model

I believe the data ecosystem relies heavily on open source, and hence the model is open-core. At least when it comes to data infrastructure, it’s already a basic expectation of data engineers to be able to use the tool as open source. It is also a basic expectation of enterprises to have a permissively licensed open-source product they can build trust in before committing to its paid version and enterprise features. I see a few companies not taking that path, and it’ll be interesting to see how they end up competing with other open-source tools.

After you have adoption of the open-source project and understand that you have built something people need, you can offer it as a managed service. Everyone appreciates this because there is such a shortage of DataOps people. If you build a managed product and relieve that burden for your customers, companies will always be interested in paying for it.

Using an open-core model would be the best way to succeed in the data domain.

On Engaging Open-Source Contributors

We initially turned to the Go community to find early adopters. We told them that lakeFS is an amazing Go project and that they should help build it. It did work. However, they are not our potential users. They are enthusiastic about the language, but they are not data engineers. Data engineers don’t use Go. They use Java, Scala, and Python.

It is natural for our users to contribute using the languages they are familiar with. As the project evolves, we also create additional parts of lakeFS in those languages. Overall, I’d suggest that if you build your project in the languages and architectures your users use, they will gladly help you shape your product by contributing code.

On Hiring Philosophy

  1. My cofounder, the employees we have hired, and I are all very experienced and known by a lot of people. As a team, we have a strong professional network that we can pull into Treeverse.
  2. We frame lakeFS as a masterpiece of data engineering: anyone can walk into the museum and look at this beautiful thing you helped build. This attracts talented people to work on lakeFS (much easier for us than for startups that make closed-source enterprise software).
  3. We are passionate about a vision: the entire data engineering world should use Git over its object-storage data lakes and data sources.

On The Data Community in Israel

It’s getting stronger. Today, very strong data companies in Israel educate data engineers and data scientists. The need for these professions extends to security companies (an area where Israel is extremely strong) and other high-tech verticals. As a vibrant tech hub, Israel has made a huge jump in the last five to seven years and become competitive with any other data community worldwide. We develop technologies around data as an ecosystem. We are early adopters of cutting-edge technologies. And we have data challenges that many people can learn from.