Datacast

Episode 118: Overcoming Hardships, Confident Learning, Dataset Improvement, and The Ph.D. Rapper with Curtis Northcutt

Episode Summary

Curtis Northcutt is an American computer scientist and entrepreneur focusing on AI to empower people. He is the CEO and Co-Founder of Cleanlab, building next-generation data-centric AI and open-source technologies that enable AI to work with real-world, messy data. He completed his Ph.D. at MIT, where he invented confident learning to automatically find label issues in any dataset. Curtis received the MIT thesis award, NSF Fellowship, and Goldwater Scholarship for his work. Before Cleanlab, he worked in AI research teams at Google, Oculus, Amazon, Facebook, Microsoft, and NASA.

Episode Notes

Show Notes

(01:56) Curtis reflected on his upbringing in rural Kentucky and his gift of education.
(07:20) Curtis explained how he cultivated mental focus and intellectual fortitude while growing up in Kentucky.
(10:30) Curtis shared his view regarding online misinformation on social media.
(14:27) Curtis recalled his undergraduate experience at Vanderbilt University in the early 2010s.
(22:39) Curtis explained how he learned best via teaching and mentoring.
(24:04) Curtis walked through the research and industry experiences he obtained throughout college.
(32:45) Curtis recalled his decision to embark on a Ph.D. in Computer Science at MIT.
(38:53) Curtis told the story of how he ended up finding his advisor - Professor Isaac Chuang (the inventor of the first working quantum computer).
(40:36) Curtis mentioned how he invented the CAMEO Detection Algorithm to detect “multiple-account” cheating in massive open online courses.
(44:47) Curtis unpacked his Ph.D. research on dataset uncertainty estimation.
(50:08) Curtis dissected confident learning, a family of theories and algorithms for supervised ML with label errors.
(53:22) Curtis encapsulated how he strategically iterated cleanlab at his various graduate internships.
(01:00:22) Curtis recalled his time founding his first startup ChipBrain, before founding Cleanlab.
(01:06:42) Curtis brought up the creation of the labelerrors.com project.
(01:12:12) Curtis provided lessons learned as a second-time founder.
(01:14:25) Curtis elaborated on the open-source roadmap of cleanlab.
(01:17:08) Curtis highlighted the key capabilities of Cleanlab Studio - the no-code, automatic data correction solution for data and engineering teams with robust enterprise features.
(01:18:50) Curtis touched on Cleanlab Vizzy - an interactive visualization of confident learning.
(01:20:29) Curtis shared valuable hiring lessons to attract the right people who are excited about Cleanlab’s mission.
(01:23:23) Curtis gave his thoughts on shaping Cleanlab’s culture.
(01:26:06) Curtis explained the similarity and differences between being a founder and a researcher.
(01:29:09) Curtis mentioned how he had helped researchers build affordable state-of-the-art deep learning machines.
(01:31:46) Curtis brought up his alter ego PomDP the Ph.D. rapper, and how rapping has been an outlet for him to express emotions and creativity.
(01:40:12) Curtis emphasized how his success had been due to a function of grit, resourcefulness, and friends made along the way.
(01:44:04) Closing segment.

Curtis' Contact Info

Academic Website
LinkedIn | Twitter | Facebook | Instagram
Google Scholar | GitHub
PhD Rapper (YouTube | Spotify | SoundCloud | Facebook | Twitter | Instagram)
L7 Machine Learning Blog

Cleanlab's Resources

Website | GitHub | Slack | Twitter | LinkedIn
Blog | Research | Doc
About | Careers
Cleanlab Studio
Cleanlab Vizzy
The Cleanlab Culture

Mentioned Content

Papers

Detecting and preventing “multiple-account” cheating in massive open online courses, Curtis G. Northcutt, Andrew Ho, & Isaac L. Chuang, Computers & Education, 2016. [paper | code | arXiv]
Comment Ranking Diversification in Forum Discussions, Curtis G. Northcutt, Kimberly Leon, & Naichun Chen, Learning at Scale, 2017. [paper | code | free-access]
Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels, Curtis G. Northcutt, Tailin Wu, & Isaac L. Chuang, 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017). [paper | code]
Confident Learning: Estimating Uncertainty for Dataset Labels, Curtis G. Northcutt, Lu Jiang, & Isaac L. Chuang, Journal of Artificial Intelligence Research (JAIR), Vol. 70 (2021). [paper | code | blog]
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, Curtis G. Northcutt, Anish Athalye, & Jonas Mueller, 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks. [paper | demo | code | blog]

Blog Posts

Founder’s Medal recipient chooses MIT over Microsoft (May 2013)
Build a Pro Deep Learning Workstation... for Half the Price (Feb 2019)
An Introduction to Confident Learning: Finding and Learning with Label Errors in Datasets (Nov 2019)
Announcing cleanlab: a Python Package for ML and Deep Learning on Datasets with Label Errors (Nov 2019)
Double Deep Learning Speed by Changing the Position of your GPUs (Dec 2019)
Benchmarking: Which GPU for Deep Learning? (Dec 2019)
The Best 4-GPU Deep Learning Rig only costs $7000 not $11,000 (April 2020)
Pervasive Label Errors in ML Datasets Destabilize Benchmarks (March 2021)
Cleanlab: The History, Present, and Future (April 2022)
cleanlab 2.0: Automatically Find Errors in ML Datasets (April 2022)
How We Built Cleanlab Vizzy (August 2022)

Talks and Podcasts

TEDx Talk: The MIT Rap Challenge (July 2020)
Talk at NLP Summit (March 2022)
Talk at Data + AI Summit (June 2022)
MLOps Coffee Chat (July 2022)
Talk at Snorkel's Future of Data-Centric AI Conference (July 2022)
Open-Source Startup Podcast (March 2023)

People

Book

Play Bigger: How Pirates, Dreamers, and Innovators Create and Dominate Markets (by Al Ramadan, Dave Peterson, Christopher Lochhead, and Kevin Maney)

Notes

My conversation with Curtis was recorded back in August 2022. The Cleanlab team has had some important announcements in 2023 that I recommend looking at:

The launches of CleanVision, Datalab, and ActiveLab
This blog post on using Cleanlab to improve LLMs
Curtis' new single "Clarity In My Vision"
Cleanlab's partnership with Databricks (Video)

Cleanlab is about to announce its Series A. Be on the lookout for it!

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are the highlights from my conversation with Curtis:

On His Upbringing

I grew up in the countryside outside of Lexington, Kentucky. I had one neighbor. After a lot of hard work, I've ended up where I am today.

My dad was a mailman, as was his dad. If I had been a mailman, I would have been a fourth-generation mailman. My mom didn't work when I was young, but that changed due to bad decisions that put us below the poverty line. We shopped at Save-A-Lot, which is known for its low-quality food. Sometimes we didn't have food at all, and we had to make do with cheese packets from Kroger.

Despite the lack of resources, I excelled in school and eventually received a scholarship for $500, which I used to apply for the Governor Scholars Program. From there, I applied to several colleges and was ultimately accepted to Vanderbilt, a top 20 school in the US.

I continued to build up my experiences, working at General Electric and Notre Dame, eventually finding my way to research. Growing up, I felt frustrated and strangled, but through hard work and determination, I was able to break free from my background and start contributing to the world.

On Mental Focus and Intellectual Fortitude

Many people who don't come from big cities may actually have a lot of potential. If you're listening to this podcast and you're not from a place with the best resources, and neither of your parents are doctors, lawyers, professors, or CEOs, you might feel like you don't have a chance. But I believe you may be in a position of greater strength.

The reason for this is that you can focus. You don't have a million opportunities constantly bombarding you, which can be a distraction. You don't have the same societal pressures that come with living in a big city. When I was growing up in rural Kentucky, there was nothing but grass and trees. But I had time to think about who I am, what I want to do in this world, and what my values are. I didn't have people telling me what to think, and the internet was still in its infancy.

Nowadays, everyone has access to social media and the latest apps. However, if you come from a place with fewer opportunities, you have less commercialism, marketing, and information being bombarded at you. You can use this to your advantage by focusing on what you want and seizing the opportunities that come your way. When you view the world from this perspective, you'll realize that what you think is holding you back is actually your strength.

On Fixing Misinformation

I only use Twitter and social media for business, empowerment, and advancing the things I care about, such as making the world a better place through technology and artificial intelligence.

However, when I look at the "recommended for me" section on Twitter, it recommends irrelevant content such as Pamela Anderson's appearance or the latest celebrity gossip. While I support the media and acting genres, that's not why I'm on Twitter. I'm seeking information that helps me achieve my goals, such as empowering people and building technology to help others. I run Cleanlab and want to make it the best company for our customers.

Twitter knows what interests me, yet they recommend content that generates ad revenue instead of empowering me with better information. This is a problem for society, as we're flooded with misinformation that benefits big businesses and media organizations rather than helping us become better individuals and contributors to society.

This is not a unique problem to Twitter, as most social media giants are driven by ad revenue. However, we can do something about it. My company, Cleanlab, addresses misinformation by fixing errors in data sets, label errors, and other issues. Our goal is to have accurate information for machine learning models and individuals, so we can create a better world without biases that only benefit big corporations.

Let's create a world that we truly want, not one that solely generates profits for large companies.

On His Education at Vanderbilt University

My experience at Vanderbilt was figuring out what I wanted to do. When I got there, I wanted to be a patent lawyer because that's what Einstein did. I'm a big fan of Einstein. I consider him a human with superpowers. All humans have superpowers, actually. It just takes time for us to figure out what they are.

A lot of people don't like that terminology. They think it's fantasized. But I completely disagree. I think it's exactly what it is. And part of your undergraduate experience is figuring out your superpowers. And when I say superpowers, I don't mean shooting lightning out of your eyeballs. Your superpower is what you can do that you notice other people can't do as easily.

We're all a little bit better at certain things than other people. If you can harness that energy to create something incredible, you have discovered part of your purpose on Earth. A lot of people struggle, especially in modern society, to know what their purpose is. You get your job, standard tech job, or whatever after college. And then you work at a big company, and after four years, you're like, wait, what am I doing with my life?

If we used college to figure out what our superpower is and what things we're really good at and enjoy, the world might be a place that had a little more certainty and a little more confidence in what we do. I spent Vanderbilt trying to figure out what my superpower was.

I knew I cared a lot about education because education was what got me from rural Kentucky to where I am today and where I was at that time. And I also knew I liked math and science. I thought it was worthwhile, and I felt like that's what the world is built on top of. I wasn't very interested in business at the time because my dad was a mailman. I didn't grow up in a home that had any business. The notion of business was like this thing of rich versus poor.

I grew up in a home that had very little money. When I was growing up, we always thought that people in business were money-focused. That's completely wrong. People in business are trying to create value for customers, and they have to make money to do that. Some people in business want to make money, and that's their goal. Totally fine. That's everyone's right to the pursuit of liberty, happiness, and freedom. And if that implies that they want to make money, then they should absolutely do that.

I want to make money. Lots of people want to make money. But I have bigger goals with society, and I can't achieve those goals without business. I emphasize that now because you asked me about my academic experience at Vanderbilt, and it had nothing to do with business at the time because I did not see just how necessary it was to achieve a greater vision for the public good.

The overall experience for me was to figure out what I liked. I started out wanting to be a patent lawyer but ended up trying every major class in engineering. I went broad, as broad as possible, and studied all of them. I realized that I could create any company in any other field with computer science. There is no field that's not touched by computer science.

So I put all those things together into my degree and used that to figure out what the next steps of my life would be. But I left out something really important, which is that another great part of the academic experience, especially undergrad, is to try to meet people who are unlike anyone you've ever met before. Meet people who change your mind and stretch your sense of what a human being is capable of. Be grateful when you meet someone like that, and try to meet as many people like that as you can. Undergrad is a really great place to do that.

On Teaching and Mentoring

There are many ways to learn: writing, talking, listening, mimicking, watching videos, replaying and interacting with audio clips, and teaching. Personally, I learn best through teaching. As president of Tau Beta Pi, I taught others how to build a society focused on good engineering and building things that benefit society. Teaching solidifies my understanding of a subject and leads to internal discovery.

For example, doing this podcast allows me to reify my own thinking and share ideas in new ways. Anyone who advises or teaches will find that they become an even deeper expert on the subject. This is a huge benefit for me.

On Gaining Research and Industry Experiences

During my undergraduate studies, I constantly worried about the practicality of my learning. I was scared I couldn't apply the theories and equations I was learning to real-world problems. I was taking classes in subjects like number theory, modern algebra, and equations like Riemann and Newton's method. Although I found the theories fascinating, I became increasingly concerned about whether they would help people in the real world.

To gain practical experience, I made a conscious effort to work every summer in places focused on projects meant to help people. After my first year in college, I worked for General Electric, where I built an engineering platform to support all their onboarding of engineers. It was not the job I would have liked, but it gave me experience and helped me to understand the operations of a big company like General Electric.

I then worked on a photoplethysmography project for the NSF, which involved building a technology that could detect heart rate by using a camera to measure blood flushing into the face. It was a fascinating project and helped me understand how technology could be used to solve real-world problems.

After that, I won a coding competition and was offered a summer internship at Microsoft. This was a big change for me, as I had never worked at a big tech company before and didn't know what to expect. It was a great experience and helped me better understand the tech industry.

I was still in the process of building my career, so I declined the offer from Microsoft to pursue a Ph.D. I applied to top computer science schools like CMU, Georgia Tech, and MIT, not knowing if I would get accepted. However, I was accepted to all of them and ended up attending MIT.

Before starting graduate school, I worked as an intern at MIT Lincoln Lab, where I gained some hands-on experience in computer science. It was an incredible experience and helped me learn as much as possible before starting grad school.

Throughout my undergraduate studies, I was constantly reminded of the challenges of being from rural Kentucky and not having the same resources as other students. However, I refused to let that discourage me. I worked hard and applied to every scholarship and fellowship I could find, building up a strong portfolio of accomplishments that helped me get into the best schools and programs.

Looking back, I feel incredibly proud of what I was able to accomplish. I was able to turn my fears about not being able to apply what I was learning into an opportunity to gain practical experience that would help me make a real difference in the world. I hope that my experiences can inspire others to work hard and pursue their dreams, no matter where they come from or what resources they have at their disposal.

On Pursuing a Ph.D.

So, I chose to attend MIT instead of going to Microsoft after undergrad because of a conversation I had with my hiring manager at the time. Microsoft was a big deal for me because of where I'm from, but my hiring manager informed me that I would earn around 150K per year if I worked at Microsoft. Going for a Ph.D. would take six years, meaning I'd miss out on almost a million dollars, not accounting for raises. However, I realized that if I went for a Ph.D., I could always go back to Microsoft. With a Ph.D. from MIT, I'd be more educated and capable of doing any job Microsoft would have offered me straight out of undergrad, and I'd be able to do a lot of other jobs too. This would increase the likelihood of doing something I'm interested in, something that I care about, and something I can do an outstanding job at.

In computer science, you can do internships and make money while doing grad school, which is something my hiring manager did not take into account. You can also start companies and have blogs or websites. You can do all sorts of things if you want to make money. But for me, it was about creating the ability to have upward and horizontal mobility and surrounding myself with people who would enrich me for the rest of my life. While doing a Ph.D. at MIT, the people I spent time with changed my life permanently.

One issue related to this topic is the lack of female role models in STEM. If you search for "best female role models" online, the results are primarily actresses and civil rights activists. This is problematic because if you're a little girl searching for who your role model should be, you might think that your appearance is more important than your brain or that fighting for emotional and social movements is more important than intellectual strength and discourse. However, when I was at MIT, I was surrounded by some of the most incredible people, including women like Regina Barzilay, who built NLP systems to solve cancer; Dina Katabi, who built incredible cybersecurity systems; Daniela Rus, who runs CSAIL; and Tamara Broderick. These faculty at MIT redefine what young women can do in the world. I wanted to be in a place like that, surrounded by world leaders and people who redefine genres.

In terms of wealth, the amount you can generate once you understand how the world works and how to build incredible things is incomparable to working as a PM at Microsoft. In the beginning, you might take a pay cut and lose some money, but in the long term, there's no comparison in terms of what you'll be worth in 30 years. Even if it's not in dollars and cents, the value you'll have, confidence in yourself, and the ability to do incredible things are incomparable. Going to a place like MIT is going to lower your confidence in many ways, but your abilities will be unparalleled compared to working a job as a PM. Plus, you'll work a lot more, but it's worth it because you'll work for yourself, doing something you love.

On Being Advised by Professor Isaac Chuang

Isaac was not only a professor of physics and computer science but also interested in online education and had invented the first working quantum computer. He was a fascinating guy to have at MIT.

At the time, I wanted to do my Ph.D. in education and AI to empower people and help them learn. Isaac was the right person to go to since he shared my interests. Interestingly, he was from Kentucky.

I consider myself a scientist in many ways because I view the world through the lens of asking questions. I find things interesting and ask questions about why they are the way they are and whether they could improve.

If you go through life just observing things without asking questions, it's hard to learn anything without being told. But if you constantly ask questions, you'll often come up with answers on your own and learn entirely on your own, getting smarter without anyone to teach you. It's a cool thing.

Isaac fostered that scientific mindset in me more than anyone else, and it felt very natural for me. We were a great fit working together because of our shared passion for education and science.

On Inventing The CAMEO Detection Algorithm

The problem with online courses, both in the past and present, is cheating. When online courses first came out, people were creating two accounts. On one account, they would copy all the answers by clicking the wrong answer, clicking "show the answer," copying the displayed answer, opening another browser window, signing in with a different account, pasting the answer, and hitting submit. Tens of thousands of people, if not hundreds of thousands, were doing this to earn certificates.

As a researcher at MIT and Harvard, it was my job to investigate the data and discover this cheating. That's when I developed a cheating detection algorithm called the CAMEO Detection Algorithm, which is now used by both universities to detect and prevent cheating and validate certificates.
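
To make the copying pattern described above concrete, here is a minimal Python sketch of how it might show up in course logs. It is not the published CAMEO algorithm. It is a simplified heuristic that flags account pairs where one account reveals an answer and a different account submits a correct answer to the same problem shortly afterward, across several problems. The log format, the function name flag_copy_pairs, and the window_seconds and min_problems parameters are all assumptions made for illustration.

```python
from collections import defaultdict


def flag_copy_pairs(events, window_seconds=60, min_problems=3):
    """Return {(harvester, master): problem_count} for suspicious pairs.

    events: iterable of (account, problem_id, timestamp_seconds, kind),
    where kind is "show_answer" or "correct_submit". This log format is
    hypothetical and chosen only for the sketch.
    """
    shows = defaultdict(list)    # problem_id -> [(account, time)]
    submits = defaultdict(list)  # problem_id -> [(account, time)]
    for account, problem, ts, kind in events:
        if kind == "show_answer":
            shows[problem].append((account, ts))
        elif kind == "correct_submit":
            submits[problem].append((account, ts))

    # Collect the problems on which each ordered pair matches the pattern.
    pair_problems = defaultdict(set)
    for problem, show_list in shows.items():
        for harvester, t_show in show_list:
            for master, t_submit in submits.get(problem, []):
                if master == harvester:
                    continue  # the pattern needs two different accounts
                if 0 <= t_submit - t_show <= window_seconds:
                    pair_problems[(harvester, master)].add(problem)

    # Keep only pairs that repeat the pattern on enough distinct problems.
    return {
        pair: len(problems)
        for pair, problems in pair_problems.items()
        if len(problems) >= min_problems
    }


events = [
    ("acct_a", "p1", 100, "show_answer"), ("acct_b", "p1", 120, "correct_submit"),
    ("acct_a", "p2", 300, "show_answer"), ("acct_b", "p2", 315, "correct_submit"),
    ("acct_a", "p3", 500, "show_answer"), ("acct_b", "p3", 530, "correct_submit"),
    ("acct_c", "p1", 900, "correct_submit"),  # unrelated learner, far outside the window
]
suspicious = flag_copy_pairs(events)
print(suspicious)
```

A real detector would need much more than timing, since honest learners can also answer quickly by coincidence. The thresholds here are placeholders, not values from the paper.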

My goal was to democratize education through online courses like edX and Coursera, empowering people with AI. But if people cheat and earn certificates, it ruins the value of those certificates for those who earned them honestly. I spent the next two to three years validating certificates and building an algorithm to detect cheating so that people who earn certificates can have a better life and have faith in the value of their certificates.

I was shocked to discover how pervasive cheating is. It's a serious problem because it undermines the value of certificates. There were two things I learned from this experience. First, there is a lot of bad data in the world, including certificates that are not meaningful. Misinformation and errors in data have implications for society. Second, machine learning and AI have no solution for training algorithms on noisy data.

I realized that I needed to shift my focus to building a field that makes AI work for real-world, noisy data. This shift was pivotal in shaping my work for the next 10 years.

On His Ph.D. Research

When I began working on cleanlab, my goal was to apply it to the education problem. I discovered that training an ML model on bad data can still provide insights into class distributions and into how confident the predicted probabilities are in each class for a given example. By comparing those probabilities against the given label, you can see if there are any issues. I continued to expand on this idea and built theoretical guarantees showing that, with any model, you can get exact error finding as long as the model's predicted probabilities are reasonable.

We found that this ideal condition does not hold in real-world data sets because the probabilities are pretty bad. So we started doing more work to make the method more robust. I did more statistics and class work to figure out when it works and when it doesn't. We got to the point where we could give a reasonable guess of the errors in the data set for any model and any data set.

It's not perfect, and it would be insane for someone to promise they could find every error in any data set in the world perfectly. There will always be data sets where the labels don't even have anything to do with the features. We showed that you could get guaranteed label error finding within certain bounds of an error on the predicted probabilities. It worked very accurately on several real-world data sets in every modality.

cleanlab works for images, audio, text, and tabular data, and it could work for any model because it takes the labels and predicted probabilities as input and not the model itself. It doesn't use the model or change the loss function or anything like that.
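
As a rough illustration of why only labels and predicted probabilities are needed, here is a minimal pure-Python sketch of the thresholding idea from the confident learning paper. Each class gets a threshold equal to the average predicted probability of that class among the examples labeled with it. An example is then flagged when a different class clears its own threshold. This is a simplified sketch rather than the cleanlab implementation. The function name find_likely_label_issues and the toy data are assumptions, and the sketch assumes every class appears at least once in the labels.

```python
def find_likely_label_issues(labels, pred_probs):
    """Flag examples whose given label disagrees with a confident prediction.

    labels: list of integer class labels (possibly noisy).
    pred_probs: list of per-class probability lists, ideally out-of-sample.
    Returns (thresholds, issues), where issues is a list of
    (index, given_label, likely_label) tuples.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class j
    # over the examples that carry label j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, labels) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j))

    issues = []
    for i, (p, given) in enumerate(zip(pred_probs, labels)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no class clears its threshold, so make no call
        likely = max(confident, key=lambda k: p[k])
        if likely != given:
            issues.append((i, given, likely))
    return thresholds, issues


labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model strongly prefers class 1
    [0.15, 0.85],
    [0.30, 0.70],
    [0.10, 0.90],
]
thresholds, issues = find_likely_label_issues(labels, pred_probs)
print(issues)
```

Because the thresholds are computed per class, a class the model is systematically less confident about is not penalized relative to the others. The model itself never appears in the function, only its outputs.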

In the beginning, people didn't believe that it worked. I found that very motivating and responded by building and creating the open-source package. I released it, and six months later, tens of thousands of people were using cleanlab from many major tech companies, including Microsoft, Google, and Amazon. It became much bigger than a graduate project and would become a company.

On Introducing Confident Learning

Confident learning provides more accurate citations, which is a really useful thing. From a business perspective, many people are already trying to build their own things based on confident learning. I've seen about a hundred companies emerge in the last two years. None of them are as far along or fully focused on just being able to fix or correct data, which is what we do at Cleanlab.

However, I've seen all sorts of things related to data quality. One interesting experience when founding a company was seeing people copy phrases I wrote in grad school on the open-source repo. "So fresh, so clean lab" was a phrase I used, and I see people always copy that and apply it to their own companies. Another phrase I always use for the open source we use at Cleanlab is "automatically find and fix issues in your ML data." If you search for that, you'll see many companies have just copied that language.

So, there are two answers to your question. First, people are using confident learning. For academics, it's more about the pursuit of knowledge and trying to improve and grow it. They see it as a foundational new paradigm, a new way to think about machine learning in terms of the data. And in some ways, people have told me that they see it as a foundational paper for the new notion of data-centric AI, which is nice. I like that Andrew Ng is supporting data-centric AI and therefore supporting confident learning in the field that we invented at MIT.

From the business side, I'm glad to see many people using the work. However, I would prefer they don't copy our language verbatim. But that's part of the startup world, and it can be intense. I don't worry about it too much. The goal is to produce results faster than anyone else, so that even if someone keeps copying you, you're already building the next thing while they're spending all their time in the dust copying what you did previously.

Overall, the academic side has been pretty good, and the business side is interesting but can be challenging.

On Bringing Confident Learning To The Industry

How does a kid from rural Kentucky build a company from their Ph.D. research that gets used by big tech companies like Facebook, Amazon, and Google? And how do you accomplish this without having any business experience or guidance?

For me, the answer was to spend 10 years organizing my life around this goal. At MIT, I did internships every summer except for one. Each internship was related to my Ph.D. research, and I focused on the cleanlab technology. I worked at Facebook during the early days of cleanlab, where I helped them improve their comment rankings. I noticed that their upvote/downvote labeling was noisy, so I added diversification to the rankings based on semantics. This resulted in a more accurate representation of what people wanted to upvote.

At Amazon, I worked on a similar issue with their audio datasets for the Alexa wake-up word. The labels were incomplete, so we used cleanlab technology to estimate the error rate. At Oculus Research, I used cleanlab to clean up noisy data sets for building the metaverse. And at Google, I integrated cleanlab into their speech team's third-party codebase and helped them clean up data for their Google Assistant product.

The technology of cleanlab can empower companies to train more accurate machine learning models by providing clean data for training. It can also help with more accurate data analytics and dealing with misinformation. This is a big deal for every major company training an ML model.

On Founding Cleanlab

I have worked in various industries, including academia and startups. In 2018 or 2019, I began working with a startup called Knowledge AI in Boston. They reached out to me to help build and understand their AI systems, and I was able to provide support as a grad student. I built some things for their fundamentals, a recommendation engine, and a learning paradigm for new students to learn vocabulary.

Although I was interested in their mission to use AI to improve education, I ultimately wanted to found my own company and use my Ph.D. research. So, I founded ChipBrain, which was a brilliant vision, but I made the mistake of founding it with people I had not known for very long. The company didn't work out due to some health issues and other factors.

However, the experience taught me a lot, and I learned that my research on Cleanlab was worth several tens of millions of dollars. I started getting some friends from MIT, whom I had been working with, to join me in founding Cleanlab. With Anish Athalye and Jonas Mueller as co-founders, both brilliant Ph.D.s from MIT, we were able to build a great founding team.

Once we had the team together, I transitioned to focusing full-time on Cleanlab, which has been doing better than expected. We have been able to generate a lot of interest and start doing deals, and I am excited about the company's future.

On Label Errors

That was Anish's idea. I met with Anish at ICML in Sweden in 2018, where he had just won the best paper award. At the time, he was young, around 23 or 24. Anish's award-winning paper at ICML exposed flaws in GANs and defenses against adversarial examples. This sparked his interest in the systematic and systemic ways in which AI and ML are broken.

I was also interested in how ML is broken from the perspective of flawed data sets and their impact on training models. I brought up the errors in the MNIST and ImageNet data sets to Anish, which piqued his interest. I proposed a research paper where we demonstrate that the top ten data sets all have label errors, show what these errors are, and present the corrected test sets. We would then show the implications of these errors on the field of machine learning and how they render benchmarks of machine learning broken.

Anish was interested in working with me on this project, and we worked together for about two years. We published workshop papers, and eventually, I reached out to Jonas Mueller, who had recently graduated from MIT with a Ph.D. in machine learning. Jonas was the smartest guy I knew in the field and was always up-to-date on the latest papers and research. When I proposed the project to him, he was interested in getting involved.

Jonas helped us think about the implications of AI and ML in the real world beyond the curated benchmark test sets. He was invaluable in driving our academic paper home, which ultimately led to the creation of labelerrors.com, where you can check out all the errors for yourself. Anish built the website, and Jonas and I gave input.

Together, the three of us realized we were a pretty cool trifecta in the space. Jonas left Amazon, where he had built AutoGluon as the lead developer behind AWS's AutoML, and joined us as our chief scientist. The space we were in was fantastic, and it was great to start a company with people I've known for over a decade and respect the most in the world, who are absolute stars in their field.

On Lessons Learned As A Second-Time Founder

My biggest takeaway is that if you're going to start a company and it doesn't work out, don't make the same mistake twice. There are dollars and people's livelihoods at stake. When you hire someone, you're responsible for their livelihood. They could be doing anything in the world, but they're choosing to work for your company, so you need to care about and take care of them. There are different opinions on whether employees or customers should be the number one priority; for me, it's both.

Facebook prioritizes its employees, while Amazon prioritizes its customers. I'm constantly trying to do both. I care deeply about providing nurturing growth experiences for our employees. When a company fails, it's a disservice to the people you were trying to provide a livelihood for.

Therefore, it's important not to repeat the same mistakes. My mistake at ChipBrain was founding the company with people I didn't know well enough. I didn't repeat that mistake with Cleanlab. I founded it with people I've known for a decade and who are reliable. When you have a seed-stage or Series A startup, it's crucial to found it with people you can trust for the long term.

On cleanlab 2.0

The previous version of cleanlab focused on providing simple methods to find label errors in data using just a few lines of code. With cleanlab 2.0, we aimed to make the code more consistent and provide strong documentation that anyone can read and learn from. In doing so, we hoped to encourage more people to contribute to cleanlab and build a community around it. We wanted to create a framework for the future of data-centric AI that is accessible to everyone, not just graduate students.

One of the things we are currently working on is out-of-distribution and outlier detection, which has already been added to the package. We are also developing nearest-neighbor solutions for feature-based error detection where labels are not available. Another new feature is multi-annotator support, which enables finding quality labels when multiple annotations are available. These additions were developed by a team of fantastic engineers led by Jonas Mueller.
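As a rough illustration of the nearest-neighbor idea mentioned above, the sketch below scores each example by its average distance to its k nearest neighbors in feature space, so that isolated points get high scores without using any labels. This is a minimal sketch and not cleanlab's implementation: the function name `knn_outlier_scores`, the choice of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math


def knn_outlier_scores(features, k=2):
    """Hypothetical helper: average Euclidean distance from each point
    to its k nearest neighbors. Higher scores suggest outliers.

    Assumes len(features) > k so every point has at least k neighbors.
    """
    scores = []
    for i, x in enumerate(features):
        # Distances from point i to every other point, smallest first.
        dists = sorted(
            math.dist(x, other) for j, other in enumerate(features) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy data (assumed): four points in a tight cluster plus one far away.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(features, k=2)
most_atypical = scores.index(max(scores))
print(most_atypical)
```

In this toy example the far-away point receives the largest score. A real pipeline would typically compute distances on learned embeddings and use an approximate nearest-neighbor index rather than this quadratic loop.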

In the future, we plan to add support for regression and address various data issues that can arise during model training, such as having fewer labels than the number of classes. Additionally, we plan to support semantic segmentation for tagging and NLP tasks, as well as object detection. We have a lot on our roadmap and are excited to continue building cleanlab with the community.

On Cleanlab Studio

In open source, you can identify issues in your data, but you cannot easily fix them with Python alone. You need an interface to visualize, interact with, and manipulate the data. You need to be able to identify the group of data that is incorrect and specify what you want to fix. Doing this through a command line interface is possible but challenging and not ideal.

To address this problem, you would need to build software specifically for your data set, which can be a time-consuming and frustrating process. However, with Cleanlab Studio, you can fix your data set without writing any code. The no-code solution allows you to drag and drop your data set, upload a JSON file, or integrate with Databricks.

With a single click, Cleanlab Studio will automatically identify all the issues in your data, order the data set, correct your labels, and provide a clean model. This enables you to train a clean model on clean data, which can improve the reliability and accuracy of your results. Though you can also write code to interface with Cleanlab Studio, the no-code solution is available and supported for your convenience.

On Cleanlab Vizzy

Many people wonder how confident learning works. We invented this field at MIT, and Cleanlab is built on confident learning. People often ask us how it works, and unlike most organizations and businesses, we are open about our algorithms.

At MIT, we have an open-source culture that we want to share with the world. We hope that the world will also contribute to our efforts. We are willing to give away a lot because we provide a valuable product, Cleanlab Studio. We can afford to release these algorithms because we have much to offer, and Cleanlab Studio has already delivered significant value to many people.

Cleanlab Vizzy is an educational platform that allows you to easily understand how these algorithms work. Through this platform, you can play around and see how Cleanlab Studio operates and how confident learning works at a simple level.
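For readers who prefer code, the core intuition of confident learning can be sketched in a few lines. The version below is a deliberate simplification under stated assumptions, not the cleanlab library's API or the full algorithm: it computes a per-class threshold as the average predicted probability of that class among the examples labeled with it, then flags an example when the class the model is confident about differs from the given label. The function name `flag_suspect_labels` and the toy probabilities are hypothetical.

```python
def flag_suspect_labels(labels, pred_probs):
    """Hypothetical, simplified sketch of the confident learning idea.

    labels: list of given (possibly noisy) integer class labels.
    pred_probs: list of per-class predicted probabilities, one row per example.
    Assumes every class appears at least once in labels.
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the examples whose given label is k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class here
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            suspects.append(i)
    return suspects


# Toy data (assumed): example 2 is labeled 0 but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.05, 0.95],
    [0.40, 0.60],
]
print(flag_suspect_labels(labels, pred_probs))
```

Using per-class thresholds rather than a single global cutoff means a class the model is systematically less sure about is not unfairly flagged. In practice the predicted probabilities should come from out-of-sample predictions (for example, via cross-validation).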

On Hiring

When starting a tech company, especially in an executive role, it becomes apparent that HR is one of the most important things. Building a good culture is often overlooked. My friend Cody, who you also interviewed, always says that our first product is our culture. I agree with him, and I embody this idea. At Cleanlab, we have a really good team.

If you go to Cleanlab.ai and look at the team, you'll see that it's a very good team. About half of us are from MIT, and the rest are from Stanford, University of Washington, Cornell, Harvard, UPenn, and NYU. It makes a huge difference when you have a team like that. The team is only 15 people, but we actually interviewed 800 people.

That's really unheard of for a seed-stage startup, but we had a good idea, a good technology, and a good market fit. We needed to build a good team. Not everyone has to do this, but we're building a revolutionary new technology that makes AI work on any dataset for any model. That's hard. This isn't a quick thing. It's going to take a very special think tank.

So we had to be really careful about the people we hired. We wanted good, moral, and kind people, and we did a really good job of that. That's probably one of the things I'm most proud of that I've been involved in, along with the founding team of Anish and Jonas. The team themselves also vetted everybody, and we've established a special group of people.

That was probably one of the hardest things we did, but also one of the most important things. It's interesting because you think building the business or having the right pricing model or the right marketing team is most important, but we already had a pretty good market fit and pricing model. What we needed were the right people to build something that's going to stand for a while.

On Shaping Cleanlab’s Culture

Part of my goal in working at Google, Amazon, Microsoft, Facebook, and Meta Oculus was to observe how each company works and incorporate the best practices into Cleanlab.

During my interview with Facebook as a researcher, they worked with me to find a manager and project. At Amazon, the project was predetermined before finding a manager, while at Google, the project was driven by me with a team and product in mind. At Microsoft Research, the work was entirely free-flowing and academic.

At Cleanlab, we combine the best practices we observed to create a clear roadmap and culture. We communicate this roadmap to new hires and jump straight into interviews to explain the day-to-day tasks. This approach has helped us find the right people for the job and move quickly toward our goals.

On Being A Researcher vs. Being A Founder

Both a founder and a researcher require vision and can be more effective if they possess good leadership skills. However, there is a significant difference between a leader and a manager. A leader is someone who has a vision that they see before anyone else, and their goal is to get everyone else to see their vision with them. If they achieve this, they are a good leader. A good manager helps people execute that vision and get it done on time. A founder must be a good leader, but they don't necessarily have to be a good manager as long as they hire good managers. However, they could also be both.

A researcher, on the other hand, must be a leader in their field and a good manager of themselves since they are the ones doing the work. In contrast, a founder hires people to do the work since their full-time job involves external communication, investor relations, customer management, team building, and more. Operating a company requires a lot of management, which can significantly reduce the time founders have to contribute to technical work.

Both a founder and a researcher must be good leaders. However, being a good manager is more important for a researcher in terms of managing their own time and projects. Being a manager is less crucial for a founder since they can hire managers.

Effective communication is crucial for both roles. Researchers must be able to communicate their findings mathematically or in standard academic language. In contrast, founders must communicate their vision to the public so that they understand the value of their product and how it can benefit their lives.

On Building SOTA Deep Learning Machines

I worked in a quantum computing lab, but I was doing machine learning. I didn't have other grad students to show me how to train models or teach me about AI. However, I had friends, such as Jonny. I asked him what his group did for their research, and he mentioned that they had access to GPU rigs to run experiments. I realized the importance of having access to such equipment, so I built my own powerful machine with a limited budget of $7,000.

It turned out that we spent $30,000 running experiments for our first papers on a cloud service. This made me realize that people may not be able to afford to run AI experiments and may be prevented from advancing the field of science that benefits humanity. To help others, I wrote the L7 Learning blog at l7.curtisnorthcutt.com on how to build a GPU rig, which I shared for free.

The blog gained attention and was picked up by various news outlets, garnering hundreds of thousands of viewers and readers. Many universities across the nation and globally have built their own GPU rigs based on the blog, which is exactly what I wanted. I am happy that more people can now access affordable AI equipment and conduct ML research. I want to empower as many people as possible to engage in AI and ML, and I think it's awesome.

On PomDP The Ph.D. Rapper

As a rapper, I wouldn't say it's a career. I would love for it to be, but realistically, you can't be the CEO and co-founder of a rapidly growing company like Cleanlab while pursuing a rapper career. It's funny how we tend to doubt that someone with a scientific background could actually be a professional rapper. We want to laugh and deny the possibility, but that's a serious failure of society. I believe that many of the best scientists can also be the best artists, as we've seen in the case of Leonardo da Vinci.

Doubting someone's ability to do both as full-time careers is a different matter, though. I have thousands of unreleased songs, of which about two to three hundred could be released, and 50 to 60 are already radio-ready. However, it takes a lot of time and work to market and produce music, make albums, and do concerts. It's like a full-time job.

Music has been therapeutic for me. I grew up with a lot of pain and hardship, and music was a way to express and purge it. As I got older, I started making songs about science and music itself. I even made the MIT song, which was featured in a TED Talk and played at MIT's commencement ceremony. I've also done concerts in India and founded a record label.

Rap music is something I enjoy because it's so human and about telling a story. I am pretty good at it, and it's a place where I can be confident. It's also a humbling experience to release a song, as everyone is listening and judging. Music is different from science and building a company, which should be approached dispassionately based on facts and truth.

Rapping is an outlet for emotions and creativity without affecting scientific experiments with bias and emotion. It's helpful for people to have a musical outlet. If one day Cleanlab IPOs and my role changes, I'll consider pursuing a rap career seriously.

On His Favorite Rap Artists

There are many great rap artists, but some people are afraid to mention Eminem because he's considered a standard. However, he's one of the best lyricists of all time. I grew up listening to his "8 Mile" soundtrack album, which contains some of the best songs in the world.

The songs, such as "8 Mile," "Rabbit Run," and "Lose Yourself," encapsulate feelings of pain and struggle that many people can relate to. They brought me peace, motivation, and strength when I was growing up in a hard environment. I want to share that feeling with everyone.

Aside from Eminem, there are other great artists that deserve recognition. Some modern ones include Joyner Lucas and Logic, who have good messages to share. I also like some older artists like Ludacris, who has a unique and impactful flow.

I appreciate gangster rap because it makes me feel badass and empowered. It's a cool feeling to listen to a song and suddenly feel stronger. 50 Cent's music and some of the earlier Lil Wayne and Drake songs are great for this. Nowadays, there's more mumble rap, but many great songs and artists are still out there. I have respect for most artists.

On Cultivating Grit, Resourcefulness, and Friendships

You can't run a startup from the seed stage onward unless you aim to become a unicorn, which typically means reaching a billion-dollar valuation. While this may be a reasonable goal for startups that want to grow quickly, it's important to remember that it's a limited point of view. Achieving this goal requires grit, determination, and resourcefulness.

As you build your company, you'll encounter obstacles that require you to be an immovable force with nonstop momentum. No matter what happens, you must continue to go forward relentlessly. Grit is the number one thing you need to succeed, regardless of where you came from.

Resourcefulness is also key. Spending all your time doing things the hard way will only result in wasting time. Instead, you need to be creative and resourceful, finding hacks and tricks to do things quickly and efficiently. Keep track of all the resources you have access to and tap into them whenever possible. This can help you save years in the building of your company.

Finally, friends made along the way are important. You can't do anything alone, especially if you want to build something big that can change the world. You need to get people on board with your vision and make friends who can help you achieve your goals. These friends can enrich your life and help you better understand people, which is essential if you're building a product for people. By knowing people's goals, fears, and desires, you can better contribute to the world and make a difference.