Datacast

Episode 94: Modern Metadata Management, Open-Source Adoption, and Early-Stage Culture with Mars Lan

Episode Summary

Mars Lan is the co-founder and CTO of Metaphor Data, a startup that offers a Modern Metadata Platform to solve many complex organizational data challenges, such as data discovery and data governance. He also co-created the popular open-source project DataHub while working as the Tech Lead of the metadata team at LinkedIn. Mars received his Ph.D. in Computer Science from the University of California, Los Angeles.

Notes

My conversation with Mars was recorded back in January 2022. Since then, many things have happened at Metaphor Data. I’d recommend:

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.


Key Takeaways

Here are the highlights from my conversation with Mars:

On Studying in New Zealand

I grew up in New Zealand and went to The University of Auckland to study Computer Systems Engineering — a degree that sits at the intersection of electrical engineering and computer science. The New Zealand education system is pretty solid. It stresses a good balance between theoretical and practical work. In fact, to graduate, I was required to complete 500 hours of practical work (i.e., internships). I certainly enjoyed my studies there, as I gained a solid understanding of the fundamentals of EE and CS.

On His Ph.D. at UCLA

I remember taking my first class on algorithms for my graduate degree. I was shocked by how vocal the students were. A professor asked a question, and everybody tried to raise their hands to answer it. At that moment, I felt like I was the dumbest person in the room — I had not fully comprehended the question yet, and people were already trying to answer. But that is part of the US education culture, which teaches students not to be afraid of voicing their opinions, even if they are not sure about being right. Later on, I realized that was actually the case: most people who raised their hands did not know the correct answer. They were just eager to express themselves. The US education system's strength is building up people who can voice their opinions and have constructive conversations with others. UCLA is top-notch in terms of its graduate program. I learned so much from taking classes, doing research, working on various internships, etc. Those opportunities are much harder to come by outside of the US, in my opinion.

My advisor started in the area of algorithms with a focus on electronic routing. But then he quickly moved into applying computer science to healthcare-related issues — challenging healthcare problems via remote monitoring and machine learning. As a result, my research specialized in E-Health:

  1. I developed a robust and scalable lane departure warning system for smartphones. Many standard cars these days have a feature where, if people doze off and start to drift out of their lane without signaling, they either get a vibration on the wheel or a warning sound telling them to wake up. I attempted to implement that entire system on the very first Android phone, which was very underpowered. But that was also where the challenge was: how to implement a good image processing algorithm on very limited resources.
  2. I built SmartFall, an automatic fall detection system to help protect the elderly from falls. The walking cane could detect, via its sensors, when an elderly person falls in an uncontrolled fashion. This was also the early days of machine learning. People were not doing deep neural nets or anything like that, so the ML algorithm was a lot easier to implement back then. The challenge was building a robust system that could reliably detect, through the sensors in the cane, when someone falls.
  3. I designed WANDA, an end-to-end remote health monitoring and analytics system for heart failure patients. In particular, we collected sensor EKG data and analyzed it to see if we could detect early-onset patterns of heart failure. We got the data from UK NIH and built a remote monitoring capability to watch the incoming data stream in real-time.

On Engineering Best Practices

Most of the time, the goal in academia is to build something quickly and prove a point — verifying something without too much concern about scalability and practicality on the other end (unless you happen to work in that field of study). In industry, software engineering is a group sport. Very few people can do everything on their own. They have to collaborate with others. The emphasis, therefore, is on writing code that is maintainable over a long time.

For anything serious that is built with production-grade quality in mind, the aim is not to churn out code quickly but to make sure the code is super maintainable. If you leave the team tomorrow, someone else should be able to step in, quickly pick it up, and carry on with the work. That means there are certain requirements on code quality and documentation, and best practices that must be followed throughout the company. At Metaphor, for example, we will not tolerate code below the bar for the sake of shipping things. Most of the time, such shortcuts become tech debt that we have to pay back in the future with interest.

On Working at Google

In the first part of my career at Google, I focused on a specific product called Google App Engine. This was the marquee product for GCP in the early days and one of the key revenue drivers as well; Snapchat, for example, was built entirely on top of it. You can think of Google App Engine as essentially serverless, as people generally call it nowadays. The developer only needs to focus on the application logic without worrying about scaling the infrastructure or the security considerations around it. All of that is taken care of by the infrastructure itself, in what is sometimes called function-as-a-service.

I worked specifically on enabling PHP to run in Google App Engine. Back then, and probably even today, PHP was one of the most popular languages on the web — despite many comments about its quality. Having PHP support was one of the top feature requests from developers when I joined. However, PHP, for better or worse, is inherently a bit insecure. There are a lot of security concerns around the language and the various frameworks built on it. Trying to run something like that inside Google was a huge red flag for many people. One of the biggest challenges was ensuring we had a good sandbox environment where we could run these things safely. This was before Docker, so to a great extent, we were building a mini version of Docker inside Google App Engine to ensure you could run your application freely without worrying about the issues mentioned above.

The second project I worked on was personalization for Google Keyboard (Gboard), the standard Android keyboard. How can we make predictive typing suggestions on the keyboard based on a person's personalized context? You can think of it as essentially a small search engine within your device that indexes your Gmail contact information. When you type, we influence and bias the language model toward how you usually write in your email, so that, for example, people's names can smartly auto-complete. This was an interesting challenge because of privacy concerns. People are not comfortable uploading this data to the cloud. Therefore, everything had to be done on the device itself, which is very resource-constrained. We did not want to wear down the device but still wanted to provide the best user experience possible.

On Joining LinkedIn’s Metadata Infrastructure Team

I spent about six calendar years in total at Google, so I felt like I was ready to move on to something else with a different environment and a different challenge. LinkedIn, at the time, presented an exciting opportunity: they had just started to explore the field of metadata. They had literally just formed a new metadata team with a couple of engineers and needed a tech lead to spearhead it. It was an opportunity for me to step up my career and gain leadership experience — which was much harder to do as a senior software engineer at Google. That is why I jumped on the opportunity.

I did not have much of a data background at the time. I had done some ML and whatnot, but it was not quite the same as what people talk about today. I was more of an infrastructure guy at the time. So I said: “Data is the future, but I do not feel like I want to become a pure data scientist. That is not my background by training in any way. Being able to bring my own experience and expertise to the field would be great. This is a good opportunity as it marries the two fields.” That is why I decided to move from Google to LinkedIn and lead their new team.

On The Motivation Behind DataHub

Literally one month after joining LinkedIn, I received an order from leadership to drop everything we were supposed to work on and shift our focus to GDPR. This was a year and a half before GDPR enforcement. The original project I was sold on became sidelined. The focus was now on building an infrastructure system to solve LinkedIn's GDPR problem. DataHub was born out of that.

At the time, LinkedIn had a legacy system called WhereHows, which was also open-sourced but did not get a whole lot of traction. It was a pure data catalog, probably one of the first in this area. The team's goal was to take WhereHows and scale it into an infrastructure that could enforce GDPR compliance at LinkedIn. To a great extent, we rewrote the whole system from the ground up multiple times. DataHub was version 3 of those iterations.

We followed consumer Internet practices (having a 3-tier system, building micro-services, being mostly stateless, etc.) to build this infrastructure. We could essentially support both the original search and discovery use cases and the new compliance use cases on the same platform. By the end of the GDPR effort, we realized we had built a platform that gathers and organizes metadata in a way that opens up new use cases for LinkedIn. We decided to open-source DataHub so that other people could find it useful too.

We worked with a team inside LinkedIn that built ML DevOps tooling on top of DataHub to help people manage the lifecycle of models and features. That entire system is essentially metadata at the end of the day, driving various components, and it was built on top of the DataHub platform. LinkedIn also had its own proprietary BI software, which migrated onto DataHub, so we were able to link the metadata between dashboards and datasets. By the time I left, we had worked with probably 20 different teams across the company, and 40 different projects depended on DataHub one way or another. DataHub ended up becoming a cornerstone of LinkedIn's data ecosystem.

On The Architecture of DataHub

DataHub is designed to address key scalability challenges in four different areas: modeling, ingestion, serving, and indexing. Modeling is a hard problem that most data practitioners can relate to. There are not many best practices to follow, and you keep evolving your data model over time, so it is important to think about metadata the same way. We tried to make DataHub a generalized system that can take in any metadata and utilize it. Adding a new type of metadata, or modifying an existing one, should involve minimal work. This is challenging because there are many things associated with each additional piece of metadata. It is not just a matter of storing it: you want to index it in such a way that you can serve it back to the user through a proper API.

To a great extent, you can think of the system as a bunch of code generators. You start with your model, press a button or run a command, and your entire infrastructure gets generated from that model. That is the goal of DataHub. I cannot claim that we achieved 100% of it, but we managed to reduce the cost of bringing in any additional metadata to a very low point. That is why we could onboard so many teams without needing a huge team at LinkedIn.

DataHub, for better or worse, chose a LinkedIn-specific language called Pegasus as its data modeling language. It is not a super popular language, but inside LinkedIn, it is the standard. Think of it as what Protobuf is to Google, essentially the same sort of relationship. It is like an Avro++ language: you write Pegasus files and use them to generate the storage engine, the index engine, and the serving engine.
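The "model once, generate the rest" idea can be sketched in a few lines. This is a hypothetical Python illustration, not DataHub's actual Pegasus/PDL tooling: a single aspect schema (the `AspectSchema` class and the `Ownership` aspect are invented for this example) is declared once, and both the storage-side validator and the search-index document are derived from it.

```python
# Hypothetical sketch of model-driven metadata infrastructure:
# one declarative schema drives both validation and index generation.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AspectSchema:
    """Declarative model for one metadata aspect, defined once."""
    name: str
    fields: dict[str, type]                    # field name -> expected type
    searchable: list[str] = field(default_factory=list)

    def validate(self, record: dict[str, Any]) -> dict[str, Any]:
        """'Generated' storage-side check, derived from the model."""
        for key, typ in self.fields.items():
            if not isinstance(record.get(key), typ):
                raise TypeError(f"{self.name}.{key} must be {typ.__name__}")
        return record

    def index_document(self, record: dict[str, Any]) -> dict[str, Any]:
        """'Generated' search document: only the searchable fields."""
        return {k: record[k] for k in self.searchable}


# Adding a new aspect is just a new declaration; no bespoke plumbing.
ownership = AspectSchema(
    name="Ownership",
    fields={"dataset": str, "owners": list},
    searchable=["dataset", "owners"],
)

rec = ownership.validate({"dataset": "tracking.page_views",
                          "owners": ["metadata-team"]})
doc = ownership.index_document(rec)
```

The point of the sketch is the cost model Mars describes: onboarding a new metadata type touches only the declaration, while the storage and indexing paths come for free.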

On Finding DataHub’s Adopters

When GDPR hit, a lot of people felt like it was the worst thing ever. Most engineers are not excited about working on compliance-related stuff; they are more interested in building new features and user-facing things. But GDPR was actually a blessing in disguise. Before I joined, the team had WhereHows and went to various teams asking for integrations. The goal was to bring metadata into WhereHows and make it searchable. That conversation generally did not go very well because, for most infrastructure teams, this was not a top priority. They had more pressing needs, like scaling and running their infrastructure. It was always an uphill battle for internal adoption.

But when GDPR hit, it was the complete opposite. Because the order came from upper management, every team had to work with the metadata team to become GDPR-compliant. Now they came to us to figure out the integration. Through that process, we established a pipeline that ships metadata from multiple systems into ours. Once that pipeline existed, it became very easy to add additional things on top of it and form a virtuous cycle of bringing in richer metadata to solve even more use cases. Because of that whole process, we were able to get a lot of internal adoption and create momentum. The team at its peak had about 16 engineers, but we had so many backlog items that we essentially had to say no to some internal customers. I think you need that critical mass to get the wheel turning. Before that, it was hard, and GDPR definitely helped us get over that hurdle internally.

Externally is an interesting story. For most full-time engineers working at a company, open-source is a side gig. It is never going to be their main bread-and-butter. We never really pushed super hard for external adoption of DataHub; it mostly came from the goodwill of a couple of engineers interested in doing open-source. We were able to convince upper management to allocate a certain amount of team resources to open-source. That way, we could resolve open-source issues, run regular town halls, and polish the external documentation. Upper management agreed it was a good idea, specifically for LinkedIn's brand-building purposes. Once we had that agreement to foster the project, we could literally allocate people to work on certain parts of it rather than running it as a charity. That is how DataHub became the most popular open-source project in this area: we could leverage some of LinkedIn's resources and branding to push it forward.

An excellent book called “Working in Public” talks about open-source communities. Most open-source projects coming out of companies in the data discovery and search area fall into the “stadium” model: a small group of people (generally from the originating company) contributes the majority of the work, while everybody else is a user who occasionally fixes bugs. If the originating company is unwilling to invest resources, it is tough for such a project to get off the ground. Once the project moves to a foundation, so that it is no longer a LinkedIn-owned project, it can have a life of its own. Before that, someone or some team has to push it forward.

On The Founding Story of Metaphor Data

When we open-sourced DataHub, we obviously got a lot of interest from the industry. We knew it would be useful beyond LinkedIn. But the question was: “Can it be useful in general?” Is it only beneficial for a huge company, or can it be useful for mid-sized companies or even growth-stage startups? Open-source helped us validate that point. It also put us on the radar. As soon as we open-sourced DataHub, a couple of VCs started chasing us, asking when we would leave LinkedIn to start a company around the project. At the time, I felt that I did not want to start a company just because other people wanted to invest. That would be the wrong reason. You put the next 5 or 10 years of your life into this thing, so you had better make sure it is something you really want to do, not just something someone else wants to invest in.

We did a lot of research and talked to nearly 100 people, data practitioners and data leaders in the industry, in order to answer three questions:

  1. Is this a problem that they have at their scale?
  2. Is there a product out in the market that solves that problem for them?
  3. Are we confident that we can build that product to solve that problem for them?

Only once we had conviction on all three points did we feel it was a good time to start a company. At the end of 2020, Pardhu Gunnam, Seyi Adebajo, and I decided to leave LinkedIn and take the project to the next level with Metaphor Data.

  1. The third question was the easiest to answer. From early on, we knew we had the right team to build it because we built it for LinkedIn and were one of the very first teams in the industry to build a solution like this. So we felt like we were somewhat qualified for that.
  2. The second question was also reasonably easy to answer. Certain products were in the market, but they were not solving actual problems.
  3. The first question was a bit tricky. There is a book called “The Mom Test,” which I recommend to every entrepreneur who wants to start a company. The gist of the book is that if you pitch your mother any dumb startup idea, she will tell you it is the greatest idea ever. Getting an honest answer out of people is extremely hard, especially when they know your intrinsic motivation. Most of the time, people try to be polite and do not tell you the truth when you ask: “Do you have a problem like this? Would you pay to solve a problem like this?” When you ask those hypothetical questions, you often do not get an honest answer; you only get the answer you want. Open-source was a great way to evaluate this instead. If someone spends a considerable amount of engineering effort to adopt the project, there must be value in it. Seeing the adoption of the open-source project is a good way to assess whether there is a real pain point people are trying to solve and whether there is actual value in solving it.

On Metaphor’s Modern Metadata Platform

The metadata platform is essentially the same as the data platform. A platform, in a nutshell, gathers metadata from various sources, transforms them into a standardized format, and serves them to downstream use cases. The data warehouse is exactly that: You get data from different sources (such as user-facing applications and online transactional databases), put them into tables inside the warehouse and standardize them, and eventually serve them for downstream use cases (such as computing metrics or doing ML). Anything required to have a good data stack can also be applied to the metadata stack.
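The extract, standardize, and serve shape Mars describes applies to metadata just as it does to data. Here is a minimal hedged sketch in Python; the function names and record shapes are invented for illustration and are not Metaphor's actual API:

```python
# Minimal sketch of a metadata platform's pipeline:
# gather raw metadata -> standardize it -> serve downstream use cases.
from typing import Iterable, Iterator


def extract() -> Iterator[dict]:
    """Pull raw metadata from heterogeneous sources (stubbed here)."""
    yield {"src": "warehouse", "table": "ORDERS", "rows": 1200}
    yield {"src": "bi_tool", "dashboard": "Revenue"}


def standardize(raw: Iterable[dict]) -> Iterator[dict]:
    """Normalize every source-specific record into one common format."""
    for r in raw:
        name = r.get("table") or r.get("dashboard")
        kind = "dataset" if "table" in r else "dashboard"
        yield {"entity": name.lower(), "type": kind, "source": r["src"]}


def serve(entities: list[dict], query: str) -> list[dict]:
    """Downstream use case: naive search over standardized entities."""
    return [e for e in entities if query in e["entity"]]


catalog = list(standardize(extract()))
hits = serve(catalog, "orders")
```

The analogy to the data warehouse holds step for step: `extract` plays the Fivetran-like role, `standardize` is the common-format layer, and `serve` stands in for a use case such as search and discovery.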

You have probably heard of Fivetran. They are so successful because they serve the critical role of getting data from various sources and putting it into the warehouse. By the same token, the metadata platform must also be good at ingestion, because metadata sources are just as varied, and you want to bring them in in a timely manner.

But that alone is not enough, because it still does not solve any actual business use case. That is why we provide not only the platform but also use-case solutions on top of it. The biggest ones we are focusing on at this very moment are search and discovery, plus a bit of data governance. We surface your data ecosystem to the different interested parties: data engineers who care about certain parts of it, data scientists who want to find out what kind of data is available, and business analysts who want to standardize metrics and build consistent reports. Various stakeholders all want a piece of the action in the data ecosystem, and our application lets them collaborate effectively.

On Challenges with Metadata Management

It is not enough to just build cool technology. There have to be compelling reasons behind building something. The catalyst in our case is the modern data stack. It helps democratize data for many companies, something that used to be the bread and butter of huge tech companies only. You can sign up for Snowflake, Fivetran, dbt, and other tools to quickly stitch together a data stack in a short period of time and with a minimal amount of money.

The joke used to be that everybody used Informatica for their stack. Now for every piece of the stack, you can find best-of-breed vendors specializing in that. They will be very good at their vertical. Theoretically, you put these vendors together to get the best-of-breed system as a whole. But the problem now is heterogeneity because everything is different.

The modern data stack does not, by itself, ensure that you are providing a great experience for people — a productive environment where they can collaborate. Thus, you need a tool that cuts across the stack and talks to everybody working with data, bringing that unified view to them.

On Partnership Integrations

Since we aim to give you a complete picture of your data ecosystem, we cannot have holes in it. On the one hand, it is important for us to have good out-of-the-box integrations with the most popular options in the market, such as Snowflake, BigQuery, Airflow, Delta Lake, etc. But there will always be less popular vendors that people are using, and we cannot turn around and tell them we do not support those. And in bigger enterprises especially, there will be proprietary systems as well.

The ease of integration, to some extent, is even more important than out-of-the-box integrations. For us, it is equally important to provide the easiest way for you to integrate with our platform. We spent a lot of effort thinking through this entire process to make sure that it is super easy for someone else to integrate with us, not just as a one-off effort but on an ongoing basis. If an API changes, we want the integration to keep evolving with a minimal amount of maintenance.

On Finding Design Partners

Any early-stage startup is going to face the problem of finding early customers. It is even a bit more challenging for us than for traditional enterprise software. You can break data products into two groups:

  1. The first focuses on product-led growth, where you can start from the grassroots. dbt is a good example: if you have a one-person team, you can use dbt happily and find great value in it, and you find even more value with a 10-person team. Fivetran is another example: if you have one API, you can pull data from it and get value even with a small amount of data. You only pay for whatever you use, no matter where you are on the spectrum, though the bigger companies generally get more value.
  2. For the metadata platforms and data catalogs of the world, you need to have a certain amount of people to make it worthwhile. If you have a one-person team, you probably do not need a data catalog. If you have a 3-person team, then maybe or maybe not. As soon as you hit some critical mass, then the problem becomes super important. If you talk about big enterprises, this is a no-brainer since all of them have this problem. They needed a solution yesterday.

For us, it is about finding the right company at the right time. When we talk to Bank of America or Chase, for example, they have problems like this, but are they going to work with an early-stage company? Probably not. You have to work with them for a year or even two to score that contract. That is not the position most early-stage companies want to be in; they want to be able to scale quickly early on. As a result, it is about finding companies that are not so big, do not have as much red tape, and can still find value in our solution.

In other words, we want to find the sweet spot: companies that are small and nimble enough that, when they start having this problem, they are willing to try an early-stage startup as a vendor.

On Hiring

Early-stage startups are not built for everyone, to be honest. You need the right mentality to work in one. That generally translates to the ability to be a bit scrappy, to deal with a fast pace, and to wear multiple hats. Those characteristics are unique to early-stage startups and some later-stage ones. Hiring the right kind of people, who can thrive in this environment, is crucial.

If you are very used to the pace of big companies, joining a startup can be a cultural shock. People working in a big company for many years tend to be less suitable. Generally, people who dabble in startups frequently also tend to perform better in startups. People in the middle of their careers, who have already learned a lot from established companies and want to take on brand new and completely different challenges, are the ones most suited for this stage.

At the same time, this is the most exciting stage for the company, in my opinion. You build stuff from 0 to 1 and scale things you have never seen before. If the company is on a successful path, the growth will be phenomenal, and it will be something you will never be able to see in a big company. If those things interest you, joining early-stage startups such as Metaphor will be a good choice.

On Culture Building

As an early-stage startup founder, you are often putting out fires left and right. You are trying to build the product, bring in customers, and hire the team as quickly as possible. But if that is all your focus and attention, it is a huge mistake. Culture is one of those things that, if you do not deliberately cultivate it, will just grow wild in every direction; by the time you recognize it, it is already too late to correct. The first 10 or 20 people you hire essentially determine the cultural values of the entire company for the rest of its lifespan. It is critical to strike a balance between being practical in getting things off the ground as quickly as possible and spending a good amount of time thinking about what kind of company you are trying to build and what kind of culture you are trying to instill.

Pardhu and I have spent significant time making sure that Metaphor Data has good cultural values. Cultural values do not just exist on some static wiki pages; you have to actually practice them in real life as well. You must keep reminding people of, and iterating on, what is important to the company and what is not. Especially in tough times, of which a startup has plenty, when we face difficult decisions, we go back to those values and ask whether we are acting based on what we believe we should be doing. Almost always, we can find our answer there.

Every company is going to have different values. But if you stick to yours, you end up growing the company you want to grow. Shaping company culture entails setting it up, making sure you follow through, repeating it over and over, and using it as a guidepost for your most critical decisions. If everything works out well, you will have the company you want rather than one you cannot recognize.

On Fundraising

Picking investors is super critical. If you pick an average investor, you will probably be okay, but if you pick a bad investor, you will have a very bad time. We are lucky to have top-notch investors who make our lives much easier. That being said, one piece of advice we received during our fundraising was to make sure you run an actual process. VCs and investment firms come to you out of the blue, on their own timelines, not a timeline you set; you meet some early and some late. Thus, you want to run an actual process that gives everybody an equal chance and evaluates them fairly and squarely. Do not just take the first offer that comes in. When it comes time to make a decision, you then have all the data points from the different conversations in one place to make an educated choice. That said, if you go with top-tier VCs, you generally cannot go wrong.

We have really enjoyed working with a16z and Amplify. They have seen a lot of things and been able to tell us not to repeat mistakes they have seen before. That alone in itself is super valuable. The joke is that early-stage startups which succeed are those making the fewest mistakes. Being able to get wisdom from people who have been in the field for a long time is super helpful.