Datacast

Episode 100: Data-Centric Computer Vision, Productizing AI, and Scaling a Global Startup with Hyun Kim

Episode Summary

Hyun Kim is the co-founder and CEO of Superb AI, an ML DataOps platform that helps computer vision teams automate and manage the entire data pipeline: from ingestion and labeling to data quality assessment and delivery. He initially studied Biomedical Engineering and Electrical Engineering at Duke but shifted from genetic engineering to robotics and deep learning. He then pursued a Ph.D. in computer science at Duke with a focus on robotics and deep learning but ended up taking a leave to immerse himself further in the world of AI R&D at a corporate research lab. During this time, he started to experience the bottlenecks and obstacles that many companies still face to this day: data labeling and management were very manual, and the available solutions were nowhere near sufficient.

Episode Notes

Show Notes

Hyun’s Contact Info

Superb AI Resources

Mentioned Content

People

  1. Andrew Ng
  2. Andrej Karpathy
  3. Ian Goodfellow

Book

  1. Zero To One (by Peter Thiel)

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to or browse the full guest list.

Episode Transcription

Key Takeaways

Here are the highlights from my conversation with Hyun:

On His Upbringing

I was born in Seoul, Korea, and moved to Singapore when I was 12. Our family moved together for my dad's work. Initially, I lived in Singapore for about two years, came back to Korea for one year, and then went back to Singapore for five more years, spending a total of seven years there. Right after graduating high school, I went to the US for my undergraduate degree. Living in three countries is a rare experience, and one I am grateful for. Singapore is also known for its diversity, with people from all over Asia, the US, Europe, and Africa. Being able to blend into that cultural mix from a young age opened my eyes to being more global in my future career pursuits.

There are several international schools in Singapore — American, British, Canadian, etc. For my higher education, the United States was my obvious choice. That is why I went to the Singapore American School, which naturally led me to apply to US universities.

On His College Experience

Since my junior and senior years of high school, I had been interested in bioengineering and genetics, which led me to apply to university programs in biomedical engineering and bioengineering. Duke is very open to undergrads volunteering in research labs, so during my freshman year, I majored in biomedical engineering and volunteered in a research lab focused on genetics. At that point, CRISPR-Cas9 was a popular technique, so I researched it a bit. After freshman year, I took a two-year leave and returned to Korea for military service.

After I came back for my sophomore year, I saw the rise of computer programming and computer science and how they were disrupting other domains. One of them was systems engineering, where a lot of simulation and modeling was being applied to genetic engineering. Another reason I switched from biomedical engineering to electrical engineering was this: I had spent a year in a genetics lab trying to engineer bacteria and yeast to produce a specific chemical. My experiments basically failed the entire year, and it was very difficult to debug where I went wrong. That frustration got me into automation. I was spending a lot of time on repetitive tasks like pipetting and running experiments, and I thought most of that could be automated by a machine. That led me to computer science, automation, robotics, machine learning, etc.

In junior year, I started volunteering at a research lab that worked at the intersection of biomedical engineering and computer science. It was a medical imaging ML research lab. I used ML and deep learning to analyze brain MRI images. A year later, I shifted focus to robotics — initially medical/surgical robots. For my graduate school, I applied to many robotics and ML programs and decided to stay at Duke for my Ph.D. — where I studied robotics and computer vision.

On Medical Imaging Research

My medical imaging research lab focused on deep brain stimulation, a way to treat Parkinson's disease. Neurosurgeons insert metal rods into a particular area of the patient's brain and use those rods to deliver electrical stimulation. The problem was that neurosurgeons had to use their best guesses to figure out where in the brain to insert the rods, and if that failed, the fatality rate would be pretty high. So I was trying to better predict the regions within the brain that neurosurgeons should target.

Previously, my colleagues and I used non-deep-learning techniques (such as Random Forests) to improve performance. Around the same time, we saw another research lab working on similar problems using convolutional neural networks, and they were outperforming anything we had. That got me into the whole deep learning and neural network movement. It was clear that, around 2013–14, ConvNets outperformed everything on computer vision tasks. That was quite shocking to me. I realized there was real substance to deep learning and decided to pursue my Ph.D. in this domain.

On Getting Into Machine Learning

As a biomedical engineering major, I was exposed to computer programming, mostly in Matlab rather than Java or Python. After returning from my military service, I started taking classes in computer science. My second major at Duke was electrical and computer engineering, so I was exposed to programming early on. For machine learning, the first course I took was Andrew Ng's Coursera course back in 2014. After that, I sat in on a bunch of graduate-level ML courses. I think I caught less than 20% of everything, but I just sat there trying to absorb as much as possible.

On Pursuing A Ph.D. Program

After my junior year, I was still figuring out what I wanted to do after graduation. I talked to many people who had pursued different careers: pre-med students who went on to become doctors, people who went to law school to practice law in tech sectors, and professors who built academic careers with industry support. After talking with all of these people, it was clear that pursuing a Ph.D. was the right thing for me. Someone told me that if I did not hate studying, I might as well pursue a Ph.D. and do whatever I wanted after that, whether working in industry or staying in academia. At that point, I liked learning new things, studying, and getting good grades, so a Ph.D. was a natural progression.

That came back to haunt me about a year into my Ph.D. I do not think any of my lab mates came directly from their undergrad; they had all worked in industry for several years, so they knew what they wanted to work on during their Ph.D. I did not. I was broadly interested in robotics, automation, and deep learning, but I had no particular research topic in mind when enrolling in the program. I had to discuss with my Ph.D. advisor what would be a good topic to work on. That was not enough motivation to keep me going for five or six years, so I took leave after one year.

On Taking A Leave From His Ph.D. Program

  1. I was born in Korea and raised in Singapore. Even though I consider myself Korean, I had not had much experience living in Korea. I always wanted to know what working at a Korean company in Korea was like.
  2. I was still unsure what I wanted to research during my Ph.D., so I wanted some industry experience to find out.
  3. In 2016, there was the AlphaGo event in Korea — where DeepMind’s AlphaGo defeated Korean Go master Lee Sedol. After that, there was a huge shock across the entire country. Big companies such as SK Telecom, Samsung, and LG started investing a lot in AI. That opened up many opportunities for me.

These three motivations came together, and I decided to take a leave. The plan was to take a one-year leave and return to my Ph.D. program with a clear sense of my research direction. It turned into a two-year leave of absence, and in the end, I never went back to complete my Ph.D.

On Doing Research At SK Telecom

The first research I worked on was StarCraft AI. In StarCraft, you do not have full visibility of the map, and the game is real-time in the sense that all players act simultaneously. Those things make it a more challenging problem than the game of Go. There were two big topics we worked on for StarCraft:

  1. Strategy planning: What kind of strategy is the opponent playing? What is a good counter strategy for that?
  2. Tactical planning: How to fight these battles? How to control your units? This requires reinforcement learning control (such as Multi-Agent RL).

The second research was synthetic image generation. Back then, generative adversarial networks were big, so we used them to create synthetic images.

In academia, researchers have more flexibility and freedom as to what they can work on. As long as it is something new and has an impact on the entire academic research community, it is considered valuable. In the industry, you need to consider the industrial impact — can the research be applied to the product offerings or align with the company’s long-term direction? The research team was set up to pursue SOTA research but did not have full flexibility and freedom on the topic.

On The Founding Story Of Superb AI

Both during my Ph.D. and my two years at SK, I worked on various research projects, such as robotics, StarCraft AI, and synthetic image generation. During these projects, it was clear to me that I was spending a lot of time on data rather than running experiments or writing research papers. That was also the case for basically everyone else I saw in the AI community.

Researchers and engineers spending so much time on data was a huge inefficiency for the entire AI community, and I wanted to solve this problem. I also saw new techniques and technologies in AI that were becoming mature enough to be applied in real-world products and services. Thus, I decided to start a company to tackle the problem. Luckily, I had some good colleagues around me at the time, so I started persuading everyone on my team and was able to convince a few of them, who became my co-founders.

We have five co-founders, including myself, and all five are still at the company. One of the five was more people-oriented and outgoing, so he took the business side of things. Of the remaining four, the three besides myself were more technical, with more programming, AI, and engineering experience, so they took the product and engineering side. As for myself, I had global experience and exposure to various fields in tech: bioengineering, robotics, medical imaging, and machine learning. Given all of that, we decided I would be the person handling overall strategy, fundraising, and company building.

We told ourselves we should be a global company targeting a global audience from day one, and that was my responsibility from the get-go. The market size in Korea is limited, and since we are a company that helps other companies build ML applications, the biggest market was obviously Silicon Valley. From day one, we were determined to have a presence in both the US and Korea. That is why we worked to set up our headquarters in the US.

On Going Through the Y Combinator Winter 2019 Batch

We were lucky to be accepted to YC on our first try; many founders apply several times to get into the program. The whole process of applying and going through YC was a good learning experience for us. The YC application asked very unusual questions and forced me to think differently. Then, in the first week of the actual program, the first assignment I got was to come up with a plan to become a billion-dollar startup. As founders of a newly founded startup that had been around for six months, we had not been thinking about that at all. We thought about how to land the first customer and how to build the first product; that is the kind of thing our minds got bogged down in. But YC forced us to think big and think long-term.

Throughout the three months, they taught us skills like how to hire, how to fundraise, how to build a product, how to pitch it, and how to price it. But the more important lesson was the overall founder mindset, absorbed by talking with successful founders. During our batch, founders of Airbnb, PagerDuty, and Stripe gave talks, and we had Paul Graham as well. These talks gave us a better mental model for how to approach things. That was the more valuable lesson I got from YC.

The YC community is very helpful for getting your startup up and running. I could reach out to other founders who had been through the same issues I was going through; I could post something to the community, and someone would jump on to share their advice. The investor network was helpful in raising the initial funding, and after the YC demo day, we had investors contacting us to get our seed round started. Much of the practical advice in the YC community was written either by the YC organizers or by some of the more successful founders. Founders of the larger companies stay in the community and pay it forward.

On The Evolution of Superb AI’s Labeling Platform

We did not have a platform from day one. The first thing we built was an automatic data labeling algorithm. For our first client, we would get raw images, run our in-house algorithm, have someone edit or review the algorithm’s output, and send the client the results. That was our first business model, even before getting into YC. We had about three clients and some revenue by then.

After YC and during our seed round fundraising, we decided to add the platform component to our business. We saw many companies labeling data, but they were accumulating so much labeled data without any way to manage or collaborate on it. That is when we decided to add the platform capabilities, which include tools for data management, labeling, and collaboration. That is how we added more features to the platform.

Generally speaking, we try to focus on answering the question: “What is the biggest pain point in the industry right now?” The answer to that will drive our product roadmap.

  1. When we first started, data labeling was one of the biggest pain points, where people relied on outsourced manual labeling. So we tried to fix that problem with automatic data labeling.
  2. Another big pain point around 2020 was data management. There was no tool to manage millions of labeled images, so we built one.
  3. After that, there were things like reviewing and auditing labeled data. People were trying to inspect every labeled image manually, but that was not very efficient. That is when we came up with an automated way to review and audit labeled data.

On Custom Auto-Label

The whole point of Custom Auto-Label is this: there are a lot of off-the-shelf pre-trained models for identifying or labeling common objects, but most of our clients and most practitioners work with their own niche datasets. For example, in manufacturing, they might want to identify scratches on metal plates. There are many applications that use niche datasets. Even for common objects, the viewpoint might be different: not at eye level, but at a top-down angle, as in physical security surveillance applications.

For these use cases, a typical off-the-shelf model will not work well. So these companies need to quickly train a model on a small batch of labeled data and leverage it to semi-automate the data labeling process. That means we need to be able to train models using a small number of labeled samples, which leads to few-shot learning, zero-shot learning, and transfer learning techniques that leverage models pre-trained on different datasets.

We also want people without any ML background to be able to train models and apply them to the data labeling pipeline. That means we need to build more AutoML capabilities so users do not have to tune hyperparameters to obtain a well-performing model. The Auto-Label AI will score itself on how certain or uncertain it is about its own output. Users can then sort or filter by that uncertainty score and focus on reviewing images with very high uncertainty.
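To make the uncertainty-score workflow concrete, here is a minimal sketch (hypothetical data and scoring choice, not Superb AI's actual implementation): one common uncertainty measure is the entropy of the model's predicted class probabilities. The flatter the distribution, the less certain the model, and the earlier the image should appear in the review queue.

```python
import math

def entropy_uncertainty(probs):
    """Shannon entropy of a predicted class distribution.

    Higher entropy means the model is less certain about its label.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical model outputs: image id -> predicted class probabilities.
predictions = {
    "img_001.jpg": [0.98, 0.01, 0.01],  # confident prediction
    "img_002.jpg": [0.40, 0.35, 0.25],  # very uncertain prediction
    "img_003.jpg": [0.70, 0.20, 0.10],  # somewhat uncertain
}

# Rank images by uncertainty so reviewers see the least certain ones first.
review_queue = sorted(
    predictions,
    key=lambda k: entropy_uncertainty(predictions[k]),
    reverse=True,
)
print(review_queue)  # most uncertain image comes first
```

With this ranking, a reviewer would inspect `img_002.jpg` first and could skip or spot-check the confident `img_001.jpg`, which is the point of filtering by uncertainty.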

On The Newly Released Superb AI’s DataOps Platform

Our first product, the Labeling Platform, was geared towards labeling and auditing datasets more efficiently. The new DataOps Platform is geared towards ML engineers who want to analyze labeled data, or even raw data before the labeling step. Let's say you are at a self-driving car startup collecting terabytes of data daily. There is no point in labeling every single one of those images; you want to analyze your dataset and see which types of images you need to add to your labeled set. That is one thing we want to solve with the DataOps Platform: analyzing the distribution of your dataset, spotting gaps that would degrade model performance if you trained on it, identifying those gaps before you train your model, and helping you fill them.

Another example: let's say you have a labeled dataset and randomly split it into a 20% test set and an 80% training set. That is not the ideal split. We want to give our users a way to better utilize their labeled dataset by finding the optimal split between training and test sets. That is another feature we provide in DataOps.
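As one illustration of why a purely random split can fall short (a minimal sketch with hypothetical data, not the platform's actual method): stratifying the split by label keeps each class at roughly the same proportion in both the training and test sets, which matters when some classes are rare.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_frac=0.2, seed=7):
    """Split so each class keeps roughly the same proportion in train and test.

    A naive random split can under-represent rare classes in the test set;
    stratifying by label is one simple way to make the split more balanced.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_frac))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical imbalanced dataset: 90 "car" images, 10 "pedestrian" images.
samples = [f"img_{i}.jpg" for i in range(100)]
labels = ["car"] * 90 + ["pedestrian"] * 10
train, test = stratified_split(samples, labels)
print(len(train), len(test))  # 80 20
```

A purely random 80/20 split of this dataset could easily leave the test set with zero or one pedestrian images; the stratified version guarantees the rare class appears in both halves.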

One of the things Andrew Ng mentioned was that we should move away from big data to good data.

  1. In some industry applications of AI, there is an inherent lack of data, so it is impossible to gather millions of samples for these applications. For these cases, we help our clients figure out: if they are going to collect like 1000 images, what kinds of images should they be? That is one way we tackle the data-centric AI problem.
  2. For applications with abundant data, like self-driving, not all data is equally valuable for training models. The key is being able to pick out which images in the raw data will be most valuable to the model. Labeling those selected images is a more efficient way to train models and to approach ML development from a data-centric perspective.

On Superb AI’s Future Roadmap

As we have been doing for the past five years, our product roadmap will always address the biggest pain point in the industry. Some ideas I have include:

  1. As AI technology gets adopted in more traditional industries like manufacturing and agriculture, these companies do not have as many ML engineers as compared to more tech-heavy companies. So there will be a need for training and deploying models without any ML engineers. Superb AI may be able to address that pain point, whether by ourselves or in conjunction with other startups tackling the whole MLOps/DataOps sector.
  2. Another possible option for our product is the data collection side of things, whether raw data comes from edge devices or synthetic data is generated. We will have to see how much value that brings in terms of being able to train models and improving model performance. If we decide that it is a valuable task, we might tackle that area using some partnerships with other companies or building something in-house.

On A Customer Use Case

The use case I am most proud of is a company that manufactures products for Samsung and LG Electronics. Their big pain point was that they had been manually inspecting products for defects in electrical soldering and rubber packaging. They had been relying on a manual workforce to find these defects in their factories, and, interestingly, they had no ML background at all. They were able to purchase our platform and use it to train models from scratch with zero knowledge of ML. They deployed the models to one of their assembly lines and saw real success: our AI models caught errors and product defects that their human workforce had been unable to detect, so they improved their assembly lines and became more efficient as a company.

There is a huge opportunity to expand that application to multiple different factories and different companies in the industrial manufacturing sector. Furthermore, the fact that they could implement AI without any knowledge of AI speaks a lot about our mission of democratizing AI. The whole point is to lower the barrier to AI so that more people and companies can adopt and build AI more efficiently. This use case speaks a lot to that.

On Technical Partnerships

The MLOps problem is too big for a single company to solve. Startups have come together to form partnerships, one of which is the AI Infrastructure Alliance, of which Superb AI is a part. The Alliance is basically trying to come up with the canonical stack for ML; we do not yet have such a stack for ML the way we do for DevOps. As a company, we partner with companies like Pachyderm, WhyLabs, and Arize AI to give our clients a more streamlined way to integrate and use multiple products seamlessly.

I imagine that many of these MLOps solutions assume their clients already have some labeled data or trained models. For example, monitoring platforms require the client to have some model already trained and deployed in production. Or data pipeline companies assume the client already has gathered some dataset. I want Superb AI to be the warehouse for all data, whether raw or labeled data, and integrate with different data pipelines, model training, or model deployment services. You can think of it as GitHub or GitLab in the DevOps world (which has all the software code in them). That is how I see the MLOps industry and Superb AI’s role in it.

On Hiring

The first challenge is building a great company culture. There will be ups and downs in a company's growth; ideally, the company grows every year, but realistically, there will be downturns, and it is important to keep your team continuously motivated to survive them. Building a great culture requires many different things. For example, we need to hire the right people who will keep and improve upon the current culture. It takes more than just stating the company's core values, since every company has those. The real question is: do the CEO and the leadership team actually have those core values ingrained into their day-to-day lives? That trickles down to the entire company. I spend a lot of time on those things.

Since we are a global company with offices across three different countries, we have an additional layer of communication challenges. Someone in the US will have to collaborate with someone in Korea or Japan. That will not be in real-time. It will be over Zoom, and you cannot do that without a great culture to bring everyone together. I spend a lot of time diagnosing: Are there any things that are not going well? How can we address that? How can we improve our culture? How can we hire better people?

The second challenge is empowering others, especially the leadership team. Obviously, I do not have 30 years of industry experience in sales, marketing, strategy, finance, etc. It is my job to hire someone much better than me in these departments and empower them. Once I have them, I will occasionally give my take on things, but I should be able to empower them and give them the safety to make decisions on their own with increasing responsibility. Hopefully, I have been doing that well, but it is something I keep trying to get feedback from my direct reports: Am I empowering you enough? Can I facilitate whatever you are doing?

On Scaling a Global Culture

Interestingly, there are few good resources out there for scaling a global culture, probably because there are not that many global companies at our stage. There are a lot of global enterprises, but not many Series B-stage companies that are spread across the globe, like Superb AI.

On Fundraising

First off, there are better times to raise funding than right now. That is my number one piece of advice.

There is a lot of value in getting funding from well-known, renowned VCs, but there are drawbacks too. I would not optimize purely for name value. Instead, try to build relationships with VCs in advance and see whether this is someone you are willing to build a company with for the next ten years. Especially if the investor is joining your board, that person will make significant decisions with you for the next decade. Consider it almost like finding a co-founder: you will see your board members once every quarter for years and discuss major issues that shape the whole company's direction. So be careful whom you let onto your board.

On Being a Researcher vs. Being A Founder

If you are a researcher, you can work on any topic you want. If you want to get more citations, you obviously want to work on a more popular topic in academia. But if you are a founder, you have to build a product that someone will use. It is not an option. If you build something just because you want to build it and no one actually uses it, your company will die.

In academia, you are free to work on a research topic that will have an impact in 10 years. That still might be worthwhile research. But if you are building a product as a founder, you do not want to build something that people will use in 10 years. You have to time it very well.

As a former academic myself, I was once very focused on technology. In ML, algorithmic performance was a big thing for me. But productizing it, selling it, and doing all the go-to-market work is a whole different story. If you are too bogged down in the technology, you will not be able to productize it and build a business around it. Not all good technology leads to a good company.