Datacast

Episode 69: DataPrepOps, Active Learning, and Team Management with Jennifer Prendki

Episode Summary

Dr. Jennifer Prendki is the founder and CEO of Alectio, the first startup fully focused on DataPrepOps. She and her team are on a fundamental mission to help ML teams build models with less data. Before Alectio, Jennifer was the Vice President of ML at Figure Eight. She also built an entire ML function from scratch at Atlassian and led multiple Data Science projects on the Search team at Walmart Labs. She is recognized as one of the top industry experts on Active Learning and ML lifecycle management. She is an accomplished speaker who enjoys addressing both technical and non-technical audiences.

Episode Notes

Show Notes

Jennifer’s Contact Info

Alectio’s Resources

Mentioned Content

Talks

Articles

1 — Women vs. The Workplace Series

2 — Management Series

3 — Responsible AI Series

Book

Notes

Jennifer told me that Alectio is about to launch a community version this fall, in which people will be able to compete to build the best model with the minimum amount of data. Be sure to check out their blog and follow them on LinkedIn!

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Jennifer:

On Studying Physics

As far back as I can remember, I wanted to be a physicist. The great physicists of the past inspired me deeply, and my entire childhood was geared toward that goal. I eventually got a Ph.D. in Astrophysics and Particle Physics at Sorbonne University. During my Ph.D., I also studied matter and antimatter at the Stanford Linear Accelerator Center. I started realizing that there were many better research opportunities in the US than in Europe.

Unfortunately, I graduated in 2009, at the beginning of the economic recession. I was very particular about the type of physics that I wanted to work on. Another way of studying the principles of matter and antimatter was via neutrino physics, so I did my Post-Doc on neutrino physics at Duke University. However, a few months after I joined, there were more restrictions on academic grants, and we received less funding than we expected. It became increasingly clear that I wouldn’t be able to work on the research that I felt passionate about. It was heartbreaking for me, but in hindsight, I wouldn’t be where I am today if that hadn’t happened.

On Transitioning From Academia to Industry

Even though I was trained as a physicist, I already had a lot of experience collecting data, analyzing data, and building statistical models for my physics experiments. It was natural for me to recycle my skills as a physicist by going into industry (typically into finance back then).

To be completely honest, I didn’t like my job as a Quantitative Research Scientist at Quantlab Financial very much. Fortunately, I was still able to gain relevant technical skills. During my second year, I worked on an NLP model to predict what happens in the stock market by analyzing the news. After that, I started to interview for machine learning roles in the industry.

On Getting Into Data Science

In finance, I often worked with the Black-Scholes equation or time-series models. I felt the urge to work on more sophisticated modeling methods. Around 2014, data science started to become popular, and the big tech companies started to put more emphasis on their data initiatives.

After my time at Quantlab, I tried to move to places where I could do things from the beginning. I decided to go to YuMe because they had recently acquired a company whose CEO was John Wainwright, Amazon’s first customer. John also pioneered pure object-based computer languages, which serve as the core of the game development and 3D animation technologies used today. For me, the opportunity to work with John mattered more than any potential salary or benefits. However, by the time I joined YuMe, John had resigned.

I ended up in a weird situation where I didn’t know exactly what the company wanted me to work on as a data scientist. I was forced into (almost) a managerial position making high-stake decisions for the company when I barely knew about data science myself. I quickly realized that I enjoyed this a lot; for instance, I was very efficient in communicating with engineering teams. I figured that what I was really good at was more data strategy than model development.

YuMe gave me many of the opportunities that a manager has, but they didn’t give me the budget that goes with them. It took a bit of time for somebody to trust me completely. I learned back in those days that if a place is not right for you, you shouldn’t be afraid to move. For me, it’s crucial never to lose sight of what I need for my career.

“If I’m not with the right people or in the right environment, I shouldn’t be afraid to make the decision that enables me to grow.”

On Measuring Data Science ROI at Walmart Labs

When I joined Walmart Labs, I started in an individual contributor role as a Principal Data Scientist. Later on, the muscle I wanted to exercise more was working on larger-scale initiatives (which would be easier at a big company).

During this period, a typical data science team was made up of people with Ph.D. degrees researching model development. However, many companies were struggling to convert those models into something that actually made money. In such a large company as Walmart, there was no communication between the business and research sides. The data scientists were completely cut off from the business goals.

“To become an effective data scientist, you need to understand what the C-suite people want to achieve.”

Inside the Metrics-Measurements-Insights team that I managed, we talked to all the stakeholders in different teams, identified ways to measure success for them, found proxies for different measurements, and communicated whether we were moving the needle for them. You have to keep in mind that a company that started a data science team wants to see the ROI. If you are here building models that do not help people, there’s a chance that you would get laid off. Thus, companies must have this sort of initiative.

For me, it’s never too early for anyone who wants to enter the data space to understand that: Data is at the service of the business. Companies invest in big data initiatives to sell more products, attract more customers, make things easier, etc. If you don’t keep in mind that there is a business goal at the end of the day, you’re going to fail.

On Giving Industry Talks

I came from academia, where I was speaking all the time. I considered public speaking an opportunity to grow, network, and influence the market. When I went into industry, one heartbreak was: “Will I ever have the same opportunity again to be a thought leader?” That was what was missing in my prior positions at YuMe and Ayasdi.

While at Walmart Labs, by chance, my boss was looking for somebody to deliver a talk at MLconf (one of the largest ML conferences in the US). I got my lucky break and delivered my first industry talk there (“Review Analysis: An Approach to Leveraging User-Generated Content in the Context of Retail”). Given my previous experience as a speaker, the talk went relatively well. From that point on, my organization started trusting me to give more talks.

“Giving talks is a unique way to evangelize and convince people to change things.”

My sense back then was that the industry was doing data science inefficiently and was going to face the next AI winter if nobody did anything about it. I wanted to express my own view in front of the market for that exact reason.

On Active Learning

The traditional way of building an ML model is supervised learning: given an acquired dataset, you annotate it and use it to train your model. Active learning is basically a specialized way of doing semi-supervised learning. In semi-supervised learning, you work with some annotated data and some unannotated data. In active learning, you prioritize strategic pieces of data by going back and forth between training and inference. You take a small set of data, annotate it, train your model with it, see how well the model performs, and think about which piece of data to focus on next.

A popular way of doing active learning is to look at some measurement of model uncertainty. You train your model with a little bit of data and perform inference on the rest of the dataset (which is not labeled yet). You can say: “It seems that the model is relatively sure that it makes the right predictions on these classes of data, so I won’t need to focus on them.”
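The loop described above can be sketched in a few lines. This is a minimal, self-contained illustration of uncertainty-based active learning, not Alectio’s actual method: the toy blob data, the nearest-centroid “model,” and the margin-based uncertainty score are all assumptions chosen to keep the example short.

```python
import random

random.seed(0)

# Toy data: two well-separated 2-D Gaussian blobs (one per class).
def make_blobs(n_per_class=100):
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            data.append(((random.gauss(cx, 0.7), random.gauss(cy, 0.7)), label))
    random.shuffle(data)
    return data

def fit_centroids(labeled):
    # "Model" = per-class mean of the labeled points.
    sums, counts = {}, {}
    for (x, y), label in labeled:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {l: (sx / counts[l], sy / counts[l]) for l, (sx, sy) in sums.items()}

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def uncertainty_margin(point, centroids):
    # Small gap between the two nearest centroids = uncertain prediction.
    d = sorted(dist2(point, c) for c in centroids.values())
    return d[1] - d[0]

def active_learning(pool, seed_per_class=2, rounds=5, batch=8):
    # Seed set: pretend a couple of points per class were annotated up front.
    labeled, unlabeled, seen = [], [], {}
    for point, label in pool:
        if seen.get(label, 0) < seed_per_class:
            labeled.append((point, label))
            seen[label] = seen.get(label, 0) + 1
        else:
            unlabeled.append((point, label))  # label hidden until queried
    for _ in range(rounds):
        centroids = fit_centroids(labeled)
        # Query the most uncertain batch, then "annotate" it (reveal labels).
        unlabeled.sort(key=lambda item: uncertainty_margin(item[0], centroids))
        labeled, unlabeled = labeled + unlabeled[:batch], unlabeled[batch:]
    return labeled, fit_centroids(labeled)

def predict(point, centroids):
    return min(centroids, key=lambda l: dist2(point, centroids[l]))

pool = make_blobs()
labeled, centroids = active_learning(pool)
accuracy = sum(predict(p, centroids) == l for p, l in pool) / len(pool)
print(f"labeled {len(labeled)} of {len(pool)} points, accuracy {accuracy:.2f}")
```

With only 44 of the 200 points ever annotated, the model already classifies the full pool well; the alternation between training, scoring the unlabeled pool, and querying the next batch is exactly the back-and-forth described above.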

I started using active learning at Walmart Labs because of our ridiculously small labeling budget. Furthermore, I realized that a big problem with regular active learning is that it needs to be tuned. Active learning operates on the principle of doing things in batches (or loops), but practitioners still don’t quite know how to pick the right number of batches smartly.

I saw an analogy between active learning today and deep learning 10 years ago.

Another challenge with active learning is that it is compute-greedy. Although it saves you on labeling cost, you have to retrain your model regularly. Because you restart from scratch every time, the relationship between the amount of data and compute is N². Given this tradeoff, you have to see whether it makes sense to spend extra compute for the sake of saving labels.
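The N² relationship follows from quick arithmetic: if you label N points in batches of size b and retrain from scratch after each batch, the training-set sizes sum to b + 2b + … + N ≈ N²/(2b). A small sketch (the assumption that training cost is linear in the training-set size is mine, for illustration):

```python
def full_retrain_cost(n_total, batch_size):
    # Total points processed across all from-scratch retrainings,
    # assuming training cost is linear in the training-set size.
    rounds = n_total // batch_size
    return sum(batch_size * k for k in range(1, rounds + 1))

for n in (1_000, 10_000, 100_000):
    overhead = full_retrain_cost(n, batch_size=100) / n  # vs. training once
    print(f"N={n}: retraining costs {overhead:.1f}x a single training run")
```

The overhead factor over a single training run grows linearly with N, i.e., total compute grows quadratically; warm-starting or incremental training would change this tradeoff.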

“At Alectio, we are building cost-effective active learning strategies.”

On Scaling The Search & Smarts Team at Atlassian

You can imagine that building a team like this is similar to an intrapreneurship endeavor. The profile of the people who want to help you is basically the profile of the people you would want to hire at a startup. Obviously, you want people with the right type of diploma and the relevant skill sets.

“But building an ML team is also not just about hiring the people who can build the models.”

On Organizational and Operational Challenges with Enterprise ML

What does it take to be successful with ML as an organization?

There are 3 components: the technology, the organization, and the operation. Today, the industry is good at technology. However, it’s evident everywhere I’ve been that we suck at the organizational and operational aspects. I’ve advised a lot of small and large companies alike. Nobody got it right.

When we think about the ML lifecycle, we have data preparation (getting data into the right shape), model development (by default, what people think of when it comes to ML), and model deployment (putting models into the real world). There have been many investments from the VC community in model development and model deployment tools.

“The one thing that has not been operationalized properly thus far is the data preparation piece.”

Ask any data scientist out there how they spend their time, and they will say 75–80% of it goes to feature engineering, data labeling, data cleaning, etc. I believe there is not enough investment, and not enough companies bother, when it comes to data preparation. It’s also worth noting that data preparation goes beyond data labeling and data storage.

On Agile for Data Science Teams

Agile is already a somewhat old concept that emerged as a response to the waterfall methodology. The core idea behind agile is responding to the unexpected: the end goal stays the same, while the path to it might change. Agile is a brilliant mechanism for engineering teams to handle that reactivity. Atlassian more or less forced every team to adopt the agile methodology, including the ML teams.

“While everybody understands agile for engineering, people in ML research will tell you that they don’t know how long it’ll take to build their models.”

For me, selfishly as a manager, I did think that machine learning needs something similar to agile. By helping researchers break down their tasks into small pieces, we could incorporate more predictability into their workflow. The ML scientists had to think about data collection, model validation, a list of models that might work for the current tasks, etc.

Another major finding for me was that a combination of different agile frameworks is still agile. The ML team consists of both engineers and scientists, so I used a separate framework for each. When a scientist comes up with the right model, he or she assists the engineer with implementing that model at scale. This approach enabled us to meet the agile requirements and made a huge difference in efficiency. I wish more people would talk about the necessity of project planning for researchers in general.

On Joining Figure Eight

Figure Eight used to be called CrowdFlower, the first enterprise-grade labeling company. Going back to my previous job at Walmart, I suffered a lot from not having enough budget for data labeling. I felt very passionate about that problem, which led me to look at research in active learning. Even during my time at Atlassian, I spent a lot of time getting the organizational part right, but not the operational part. I enjoyed my time at Atlassian, but back then, they weren’t ready yet for large-scale data projects. I wasn’t a very patient person, so that experience was frustrating.

Figure Eight reached out to me to discuss smarter ways of doing labeling. Even today, most labeling is still done manually. With datasets becoming larger, ML becomes one potential way to automate (at least partially) the labeling process. I believe that in many circumstances we need less data than we have.

“One way to help customers do things more efficiently is not to scale up labeling but to scale down data.”

The executive team at Figure Eight was very interested in this idea, but the board was not on board. As a labeling company, Figure Eight made money based on the volume of annotated data, so scaling down data signaled a change in the business model. Figure Eight was already 9 years old when I joined, and they quickly received an acquisition offer, which meant that it was way too late for me to pivot the company from a hard-core data labeling company to a data curation company.

Regardless, I enjoyed my time at Figure Eight, as I learned a ton about the labeling space. I strongly believe that sometimes you need to take jobs that don’t seem to make sense for your career, because serendipitous opportunities can come out of them.

On Founding Alectio

I considered myself a reluctant entrepreneur. My original thesis was that we should stop believing big data is the only solution for building better ML systems. I tried to evangelize this thesis at all of my previous employers. Eventually, I realized that nobody was tackling this huge problem. The concept of “Less Is More” is popular in our society, but not so much in big data. There’s no doubt that big data unleashes real opportunities for ML. However, we are currently facing the reverse problem: we build bigger data centers and faster machines to deal with the massive amount of data. To me, this was not a wise approach. From an economic perspective, you can easily understand why some large companies have an incentive to make everybody believe that big data is necessary: the more data, the more money they will make.

I think about the sustainability of ML in two ways: (1) sustainable environments (fewer data centers and less electricity used for servers) and (2) sustainable initiatives from large organizations. Many problems have come from the scale of data that we need to tame. Any dataset is made up of useful, useless, and harmful data:

“For me, ML 2.0 is about demanding higher quality from the data.“

However, there is a distinction between the quality and the value of the data. Value depends on the use case. Data that might be useful for model A won’t necessarily be useful for model B. Thus, you need to perform data management in the context of what you are trying to achieve with the data. None of the data management companies are doing this today.

Alectio’s mission is to urge people to tame their data and, to some extent, come back to sanity.

On Responsible AI

We want AI to be fair to the consumers. Delivering fair AI means that everybody has access to the same technology and benefits from the technology in the same way.

“We want AI to be the solution to the unfair society that we live in.”

One thing that scares me about the progress in AI is the disappearance of blue-collar jobs. In itself, this is not a bad thing: we want people to move on to different jobs that are not dangerous. However, if we continue on our current path, the rich get richer and the poor get poorer. An incredible example of this is data labeling:

There’s a huge opportunity for poor people in those countries to benefit from the AI economy via data labeling. But we need to ensure that we do not increase social disparity because of AI. I think there must be regulations on how much labelers get paid.

On Finding Customers

When starting Alectio, I went to trade shows on my own. When I told people, “Did you know that you could build the same model with less data?”, they would laugh in my face. We were living in an age where people had been taught by academics, bosses, and friends that more is better. For me, in particular, there was a lot of educating to do. There were also a lot of challenges with people not trusting an early-stage company.

Furthermore, people sometimes told me: “If this could be done, somebody would have done it already.” Why is nobody else doing it? Or does that mean I (specifically, as a woman) can’t do it? It was fun at the beginning to push my limits and respond to such objections.

On Hiring

Alectio is building an engine that identifies the useful data in large datasets contextually to a given model. This is a meta-learning problem; essentially, Alectio is a meta-learning platform. It is incredibly easy to attract talent. I would even say that we are the one true data science company, giving people the opportunity to build models and the tools to diagnose their learning mechanisms. As far as I’m concerned, this is the holy grail of ML.

Oftentimes, the people best suited for the job aren’t the ones that you think. I have often interviewed people with Ivy League degrees and impressive resumes and ended up hiring people with less impressive credentials. Trying to push yourself out of your comfort zone is one of Alectio’s values.

“I hire mostly for the ability to push oneself and the ability to learn new things.”

This is true for any technology job. There were countless situations where I interviewed people with Ph.D.s and Post-Docs, and they did worse than people with Master’s and Bachelor’s degrees.

On Taking Advice

On Navigating Tech As a Woman

A mistake that I have made over and over again is to behave like a man. When stepping into a manager position, you are likely to take your boss as a role model. It’s important to be yourself. This is true for everyone, but particularly for women.

Keep learning things. Never shy away from doing something that you are not good at. Go outside of your comfort zone.