Datacast

Episode 72: Folding Data with Gleb Mezhanskiy

Episode Summary

Gleb Mezhanskiy is the CEO & Co-founder of Datafold - a data observability platform that helps companies unlock growth through more effective and reliable use of their analytical data. As a founding member of the Data teams at Autodesk and Lyft and the Head of Product at Phantom Auto, Gleb has built some of the world's largest and most sophisticated data platforms and has developed tools to improve productivity and data quality in organizations with hundreds of data users.

Episode Notes

Timestamps

(01:42) Gleb shared briefly about his upbringing and studying Economics in university in Russia.
(04:15) Gleb discussed his move to the US to pursue a Master of Information Systems Management at Carnegie Mellon University.
(07:07) Gleb went over his summer internship as a Business Analyst at Autodesk.
(08:40) Gleb shared the details of his project architecting data model/ETL pipelines as a PM at Autodesk.
(11:34) Gleb unpacked the evolution of his career at Lyft — from an individual data analyst to a PM on data tooling and a high-impact project that he worked on.
(16:54) Gleb shared valuable lessons from the experience of leading multiple cross-functional teams of engineers and growing the data organization significantly.
(19:48) Gleb mentioned his time as a Product Manager at Phantom Auto, leading the development of a teleoperation product for autonomous vehicles over cellular networks.
(25:28) Gleb emphasized the critical factors to consider when choosing a working environment: trusted managers/colleagues, maturity of tools/processes, and the function of data teams within the organization.
(29:10) Gleb shared the story behind the founding of Datafold, whose mission is to help companies effectively leverage their data assets while making Data Engineering & Analytics a creative and enjoyable experience.
(33:04) Gleb dissected the pain points with regression testing and the benefits of using Data Diff (Datafold’s first product) for data engineers.
(36:54) Gleb unpacked the data monitoring feature within Datafold’s data observability platform.
(39:45) Gleb discussed how to choose data warehousing solutions for your use cases (and made the distinction between data warehouse and data lake).
(47:03) Gleb gave insights on the need for BI and data observability/quality management tools within the modern analytics stack.
(50:40) Gleb emphasized the importance of tooling integration for Datafold’s roadmap.
(52:07) Gleb has been hosting Data Quality meetups to discuss the under-explored area of data quality.
(54:02) Gleb shared his learnings from going through the YC incubator in summer 2020.
(55:45) Gleb discussed the hurdles he had to jump through to find early customers of Datafold.
(57:47) Gleb emphasized valuable lessons he has learned to attract the right people who are excited about Datafold’s mission.
(59:17) Gleb shared his advice for founders who are in the process of finding the right investors for their companies.
(01:02:11) Closing segment.

Mentioned Content

Course

Harvard’s CS50: Introduction to Computer Science

Blog Posts

Modern Analytics Stack (June 2020)
Choosing Data Warehouse for Analytics (June 2020)
3 Ways To Be Wrong About Open-Source Data Warehousing Software (June 2020)
Buy Not Build (Aug 2020)
Datafold Raises a $2.1M Seed Round Led by NEA (Nov 2020)
Datafold + dbt: The Perfect Stack for Reliable Data Pipelines (Feb 2021)

People

Maxime Beauchemin (Founder and CEO at Preset, creator of Apache Superset and Apache Airflow)
Tobias Macey (Host of the Data Engineering Podcast)

Books

“How To Measure Anything” (by Douglas Hubbard)
“Lean Analytics” (by Benjamin Yoskovitz and Alistair Croll)

Notes

My conversation with Gleb was recorded back in March 2021. Since the podcast was recorded, a lot has happened at Datafold! I’d recommend:

Reading Gleb’s open-source edition of the modern data stack.
Listening to Gleb’s appearance on the Data Engineering podcast.
Watching the lightning talks and panel discussions from recent Data Quality meetups number 4 and number 5.

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Gleb:

On Studying Economics at University

I grew up in a family of entrepreneurs. After the Soviet Union collapsed at the beginning of the 1990s, my parents were among the first wave of entrepreneurs. I was raised very much in the spirit of entrepreneurship: getting things done and seizing opportunities. I chose economics as my undergraduate degree because it felt like a multi-disciplinary field of study (including math, statistics, social sciences, humanities) that would maximize my opportunities going forward. Looking back, that turned out to be the right decision. In my data career, the exposure to economics and statistics has been tremendously helpful.

Fundamentally, a data-driven organization makes decisions based on data. A considerable challenge is to understand what drives the business and what the cost of measurement is. Topics such as microeconomics and investment analysis forced me to understand the business impact and ROI, which is very important in analytics.

On Moving to the US

As I graduated from university, I knew that I wanted to go deeper into tech. I found this exciting program called Information Systems Management at Carnegie Mellon University. The program contained complex subjects in computer science (such as distributed systems, object-oriented programming, data science, data mining, databases, etc.). But it also offered them with a tight connection to the business world. You can think of the program as a combination of Computer Science and MBA. Ultimately, the analytics world is not just about technology. It is also about working with people and structuring analytical processes inside an organization.

I was a complete rookie in programming until my senior year of college. Via a friend, I learned about a course called “CS50: Introduction to Computer Science” offered by Harvard on edX. This free online course is the most popular one at Harvard and is taken by 100,000 people annually. It took me from absolute zero knowledge in computer science to a point where I could pursue a Master’s degree in the tech field fairly confidently. I did as well as many classmates who had studied computer science as undergraduates. CS50 is the most impactful course that I have taken throughout my entire educational path.

On Introducing Looker to Autodesk

Back in 2015, Autodesk did not have any standard BI tool that would allow a large number of people to see analytics and dashboards. So we introduced Looker, an emerging product at the time. As a one-man analytical team, I was able to roll out Looker to the whole 100-person organization. Not because I was particularly smart, but because I believe Looker was well-architected to provide self-service analytics to end-users.

As a one-man data shop, I was given full autonomy to make technical decisions. I got buy-in quickly since my colleagues went from having disparate spreadsheets flying around to a centralized BI tool. Everyone was happy and on board. As a matter of fact, Looker was later adopted by a larger part of Autodesk after it proved successful in the consumer group division.

On The Evolution of His Career at Lyft

While data was impactful in my role at Autodesk, the types of decisions made based on data were quite high-level. I saw that Autodesk wasn’t yet making data-driven decisions (data was informative yet not critical to the company’s success). As a result, I deliberately sought a place where data would be the number one priority for the entire company. Lyft, at that time, was at a high-growth stage, with about 600 employees, and was racing to catch up with Uber.

When I joined Lyft, the entire company essentially ran off dashboards. There were executive meetings scheduled around reviewing a given dashboard and making $1M allocations to stimulate drivers, provide promotions to passengers, or balance the markets. That posed a very challenging problem to the data team. I came on board as analyst number 13. By the time I left almost three years later, Lyft had about 4,600 employees (more than 7x growth), and the data team had over 250 people. That was tremendous growth.

I started as an individual contributor working on whatever projects had the highest impact at the moment, ranging from building forecasting models that helped city managers understand their markets and forecast their metrics to building reports for execs and product managers. But every time I did a project, I was frustrated with my own productivity. It took a long time to build an ETL pipeline or an ML model, as there was a lot of friction in the data workflow. Naturally, I drifted toward building tools to make the data team more productive. That became the emphasis of my role at Lyft. I became the first data engineer focused explicitly on building tools and eventually a product manager who directed ten engineers, enabling productivity for data engineers and data scientists.

On Seeking Opportunities in a High-Growth and Data-Driven Environment

In a high-growth environment (such as Lyft), the opportunity to make an impact is limitless. You should always ask to work on the most impactful things. You should work with leaders and managers to explore what strengths you can apply to help grow the organization. Many of these opportunities are unknown to the leadership, and if you see them, then it’s almost like intrapreneurship.

When I joined Lyft as a data analyst, I quickly realized that the entire analytics team was spending an unreasonable number of hours developing and testing ETL pipelines. I built a command-line tool that made this process much easier, in the realm of what dbt does today. Even though it was a simple tool, it was tremendously impactful for the team. No one expected me to do this, since I was a data analyst and not a software engineer.

Don’t be afraid to explore and focus on high-impact projects. Try to absorb everything in the organization because you’ll see opportunities to help the company grow, even outside of your role description.

If you’re working in a data-driven company as an analyst, you will probably get exposed to all areas of the organization. Throughout my Lyft career, I worked closely with finance, product, engineering, operations, legal, etc. In general, the benefit of working in the data/analytics space is that your role is critical to the organization because it empowers so many different roles. It’s a great place to start your career because you can learn what everyone else is doing and what decisions they are making. Later, you can switch to other roles because you know from within what questions they are asking.

On Choosing Where To Work

When you work at a large company, you see more things around you and interact with more people. You also potentially get better training opportunities — as large companies tend to have more infrastructure, more resources, and better capacity to formally mentor and train someone who starts their career.

Startups and early-stage companies are the opposite of that. That is not to say that you can’t learn there. You will probably learn even more than in a corporate environment, but it’s less predictable. In such an uncertain and hustling environment, you’ll be thrown into many different projects. You will likely be incredibly inefficient in solving problems, but you will also learn a lot independently.

It depends a lot on the person’s mindset and nature regarding what environment would work best for them. The more important thing is not necessarily the type or size of the company, but the team you’ll be joining and the people you’ll be working with (both colleagues and management).

You want to look for an environment where you can trust management to know what they’re doing. You trust them to teach you best practices. There are many teams that are positioned or motivated in the wrong way, and joining one can hurt your career.
In the data space, it’s also important to look at the types of technologies, solutions, tools, and processes that the teams are using. Data analytics, just like software engineering, is evolving quickly. If you join a company that is still using legacy technology, you will waste a lot of time instead of making an actual impact. You will also potentially miss certain important principles that would make you a strong candidate later in your career.
The final factor lies in the role of the data team in the organization. Is the team there to build dashboards with vanity metrics showing an ever-increasing number of total users? Or is it actually on the critical path of making important decisions for the business? If it’s the former, you likely won’t find a lot of satisfaction in your job. If it’s the latter, you will quickly find yourself in front of executives who will look to you to help make decisions.

On Founding Datafold

A general theme across my career is that analytics has become increasingly important for companies. Businesses invest a lot in collecting, storing, and processing data: buying databases, processing power, and BI tools, and hiring data people. It’s no longer a problem to have petabytes of data and visualize them. But the abundance and rapid accumulation of datasets and the explosion of analytics inside companies create a different set of problems. How to manage this complexity? What data to trust? How to find data? Today, most companies that are serious about analytics have 10–20x more datasets than they have employees. That’s a lot of complexity for anyone to navigate.

On the other hand, businesses are putting more and more demands on analytics. The expectations are that data should be accurate, reliable, and available (let’s say by 9 AM when people gather and look at dashboards). That puts a lot of pressure on data teams to deliver high-quality, reliable data products (be they dashboards, ML models, or reports) without having the tools to manage that complexity.

Datafold provides tools to solve some of the most painful workflows for data practitioners: How to test changes to data processing pipelines? How to find data? How to understand what each dataset looks like? How to understand the dataset quality? Adding up all the friction points that we can solve results in a lot of value and time saved, allowing companies to use data faster and better.

On Data Diff

All analytics and ML are fundamentally based on some atomic pieces of data, like events that describe certain actions happening in the software system. You deal with raw data that is highly noisy, far from ideal, and disparate. Companies apply a lot of transformations: steps that take the raw data and combine, merge, group, and clean it, eventually making it usable for end-users to plot dashboards or feed into ML. These transformations contain a lot of business logic. Given the complexity of all the business logic in a modern data pipeline, making changes to it is highly error-prone. There’s not enough visibility into how the changes you make to the source code impact the produced data. To make things even harder, you can change things in one step of the pipeline and have completely unexpected repercussions at the end of the pipeline (as there may be 4 or 5 steps applied after the change).

Today, testing changes takes a lot of time for data professionals. At larger companies where the stakes of making the wrong decision or reporting wrong numbers are high, data engineers can spend up to a week of manual work just to test one simple change, and that’s not a good use of their time. Data Diff gives data engineers an easy way to see how a change in the data transformation recipe affects the produced data, both in terms of statistics and specific values. That saves them a lot of time and allows them to move with higher confidence, reducing the probability of breaking something for the business.
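
To make the idea of diffing data concrete, here is a minimal sketch in Python. It is not Datafold’s implementation; the function names (diff_tables, column_mean_shift), the toy tables, and the column names are all hypothetical. It compares the output of a transformation before and after a code change, keyed by a primary key, and reports both specific changed values and one summary statistic.

```python
from statistics import mean


def diff_tables(before, after, key):
    """Compare two versions of a table (lists of dicts) by primary key.

    Returns the keys that were added, the keys that were removed, and,
    for keys present in both versions, the columns whose values differ.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    changed = {}
    for k in before_by_key.keys() & after_by_key.keys():
        old, new = before_by_key[k], after_by_key[k]
        cols = {
            c: (old.get(c), new.get(c))
            for c in old.keys() | new.keys()
            if old.get(c) != new.get(c)
        }
        if cols:
            changed[k] = cols
    return {"added": added, "removed": removed, "changed": changed}


def column_mean_shift(before, after, column):
    """How much the average of a numeric column moved between versions."""
    return mean(r[column] for r in after) - mean(r[column] for r in before)


# Hypothetical output of the same pipeline step: current code vs. changed code.
prod = [
    {"user_id": 1, "rides": 3, "city": "SF"},
    {"user_id": 2, "rides": 5, "city": "NYC"},
    {"user_id": 3, "rides": 1, "city": "LA"},
]
dev = [
    {"user_id": 1, "rides": 3, "city": "SF"},
    {"user_id": 2, "rides": 6, "city": "NYC"},
    {"user_id": 4, "rides": 2, "city": "LA"},
]

result = diff_tables(prod, dev, key="user_id")
shift = column_mean_shift(prod, dev, "rides")
print(result)
print(shift)
```

In this toy case the diff shows that user 3 disappeared, user 4 appeared, and the ride count for user 2 changed, which is the kind of row-level and statistical visibility described above. A real tool would run the comparison inside the warehouse rather than in memory.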

On Data Monitoring

Data monitoring is something that teams have been doing manually. Let’s say I’m a data engineer who owns a particular set of tables that are used by some executive or PM-facing dashboards. I would probably check maybe a couple of times a day to make sure that everything is fine. That’s a daunting process, and I’m not using this time to create new things.

With Datafold’s metrics monitoring, we created an easy way to define any metric. An example would be how many rows a table containing users gains per day. We want this number to be normal. The issue is that normal is a vague definition because you may have more users on special occasions. We made a simple workflow where a data user can define this metric using SQL, a common language for expressing analytics. Then we pull out a time series and feed it to an ML model that learns the behavior of this metric and alerts the user whenever it is outside of a predicted interval. We take into account all the fluctuations due to seasonality effects. Basically, the user can focus on creative tasks and have Datafold monitor the data for them.
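
As a rough illustration of alerting on a metric that falls outside a predicted interval, here is a small Python sketch. The conversation does not say which model Datafold uses, so this swaps in a deliberately simple stand-in: a per-weekday mean and standard deviation, which captures weekly seasonality only. The function names, the 3-sigma width, and the sample numbers are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_weekly_baseline(history):
    """Learn a per-weekday baseline from (weekday, value) pairs.

    weekday is 0..6 (Monday..Sunday). Each weekday needs at least two
    observations so that a standard deviation can be computed.
    Returns {weekday: (mean, standard deviation)}.
    """
    by_weekday = defaultdict(list)
    for weekday, value in history:
        by_weekday[weekday].append(value)
    return {d: (mean(v), stdev(v)) for d, v in by_weekday.items()}


def is_anomalous(baseline, weekday, value, width=3.0):
    """True if value falls outside mean +/- width * stdev for that weekday."""
    mu, sigma = baseline[weekday]
    return abs(value - mu) > width * sigma


# Hypothetical daily counts of new user rows over four weeks:
# roughly 1,000 on weekdays and 1,500 on weekends, growing slowly.
history = [
    (weekday, (1500 if weekday >= 5 else 1000) + 10 * week)
    for week in range(4)
    for weekday in range(7)
]
baseline = fit_weekly_baseline(history)

saturday_alert = is_anomalous(baseline, 5, 1500)  # typical for a Saturday
tuesday_alert = is_anomalous(baseline, 1, 1500)   # far too high for a Tuesday
print(saturday_alert, tuesday_alert)
```

The same value of 1,500 new rows is treated as normal on a Saturday but flagged on a Tuesday, which is why a fixed threshold is not enough once seasonality is involved.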

On The Modern Analytics Stack

An area where I think we’ll see more innovation is business intelligence (BI). Traditional BI tools optimize for creating dashboards in the easiest way or enabling data users to explore data without typing SQL. Looker and Tableau are great products for that. But we are starting to see a gap in BI tools, which is the lack of intelligence. It’s not enough to show a dashboard to a stakeholder. People are ultimately interested in not just the change in a metric (sales increased by 10%) but the question of why. What was the main driver behind that increase? Sometimes, it’s not enough to dissect this by cities or products. Sometimes, we have to look at whether sales increased because the conversion rate on our site increased. These kinds of insights are currently not well supported by modern BI tools. We are seeing certain players coming in and offering such a deep dive in an automated fashion to help understand what drivers lie behind the metric movements. Eventually, we will see a hybrid solution where we can visualize things and see the why behind them.

Another area that I’m excited about in the modern data stack is data observability and quality management. This is essentially the area that Datafold falls into. This is not a step in the data value chain but rather a vertical component that integrates with every single step and provides you with a complete view of the data along the value chain, all the way from instrumentation and warehousing to BI. This includes having visibility into the data flow, understanding when and why things break, and having the ability to trace incidents. This area will be very impactful in helping data teams move faster and with higher confidence, and it is one that I am excited about building.

On Data Quality

The topic of data quality has been largely under-explored, and there are many more questions than answers in this domain. I see our job at Datafold as facilitating the discussion so that everyone can express their challenges and share best practices. The trend that I’ve noticed from the lightning talks and panel discussions in our Data Quality Meetups is that data quality has become increasingly important on the data roadmap and is even being brought up to the organizational level (OKRs and KPIs). Teams are paying increased attention to it. Best practices from software engineering such as continuous integration, automated testing, and code review are now propagating into the data world, because data products are gaining as much importance as software products within companies (if not more).

On Going Through YC

At a high level, YC provides an excellent framework for thinking about your business. They force you to answer uncomfortable questions for yourself: How do you know that people actually like what you are building? Do you have any evidence for that? If not, what is the problem that people actually care about? By answering these questions, you propel your business and avoid wasting time on bad decisions. That was the most helpful thing for me as a first-time founder.

The second great thing about YC is the community. Being part of YC is being part of a vibrant, active, and supportive community. I made great friends through the community. I got customers, answers for my questions, and various lessons from founders in previous/current batches. That’s been super helpful for me and Datafold.

On Finding Early Customers

There are two key challenges:

Our first product offering is Data Diff. It was counter-intuitive to many teams, as they did not see how to use it in their workflows. We didn’t know exactly how to market or explain the product. From personal experience at Lyft, I knew that it would be useful. Eventually, we stumbled upon a few data teams that immediately understood the value of Data Diff. We rolled the product out to these teams and iterated from there. Then, it was a matter of speaking with more teams, listening to their pain points, making the narrative in our pitch better, and finding the audience that it clicks with.
The second challenge lies in the security and privacy concerns for modern data tools. Unlike five years ago, companies won’t plug their database credentials into random tools because they risk exposing very sensitive (potentially regulated) data, which can be disastrous for the business. As part of this trend, as an early-stage startup, we had to do various things: from filling out 300-question security surveys from certain customers, to finding a way to deploy our product in their cloud environment so that data never leaves that environment. These activities simplified many things for us in the customer acquisition process. Today, it’s not that hard to do if you architect your product in the right way.

On Hiring

Hiring is very hard. What has worked for us is a combination of (1) the network that we have built over the years of working in the industry and having good working relationships with people from previous companies; and (2) cold emails/cold outreach. I found many people in the data community who were also excited about building tools. For them, the opportunity to work for Datafold was not just about engineering, but also about solving problems they care about.

I think for any startup, reaching out to people who are likely to identify with your mission (and are not just good engineers) is probably the most effective use of your time. At a very early stage, your product is unlikely to make sense to anyone except people who have actually experienced the problems. Those are the people who are both passionate about the mission and likely to have ideas about the problem space.

On Fundraising

Fundraising is a full-time activity: You should dedicate at least one or two months to doing just that (meaning going headfirst and focusing on it). Take as many meetings as possible. Get as many options as you can in terms of investor offers. Then decide. Every conversation with investors is a pitch.
There is an abundance of capital in the industry right now that is looking for good ideas and good teams: If you have the right signals (like industry domain expertise, the product, the traction, the vision), you shouldn’t have problems raising. It will still be challenging, but you’ll be able to find the money. You’ll need to make sure that you have everything in check to present your idea and your company coherently. It took me quite a few meetings to figure out exactly what I was doing wrong during the fundraising process.
If you want to learn about this topic, I’d highly recommend watching YC videos and reading Paul Graham’s essays. Paul Graham is a co-founder of YC and an amazing computer scientist/thinker/essayist. He provides a lot of wisdom about startup fundraising in a very succinct way.