Datacast

Episode 117: Vector Databases, The Embeddings Revolution, and Working in China with Frank Liu

Episode Summary

Frank Liu is the Director of Operations at Zilliz with nearly a decade of industry experience in machine learning and hardware engineering. Prior to joining Zilliz, Frank co-founded an IoT startup based in Shanghai and worked as an ML Software Engineer at Yahoo in San Francisco. He presents at major industry events such as Open Source Summit and writes tech content for leading publications such as Towards Data Science and DZone. Frank holds MS and BS degrees in Electrical Engineering from Stanford University.

Episode Notes

Show Notes

Frank's Contact Info

Zilliz's Resources

Mentioned Content

Articles and Presentations

People

  1. Yann LeCun (Chief AI Scientist at Meta, Professor at NYU)
  2. Yangqing Jia (Creator of the Caffe deep learning framework)
  3. Soumith Chintala (Creator of the PyTorch deep learning framework)

Book

Notes

My conversation with Frank was recorded back in August 2022. The Zilliz team has had some important announcements in 2023 that I recommend looking at:

  1. The landing page of Zilliz Cloud
  2. The beta launch of Milvus 2.3
  3. The development of GPTCache
  4. The OSS Chat demo application

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are the highlights from my conversation with Frank:

On His Upbringing

From a very young age, I witnessed the hard work and dedication my parents put in to bring us over from what I consider to be a disadvantaged background. The economy of China was not great back then, and we moved from place to place until I was about eight years old. This upbringing gives me a lot of respect for my parents and grandparents and for the emotional and psychological support they gave during our move to Oregon.

Though I don't remember much from my early childhood, I played chess in middle school, and tennis and table tennis were my go-to sports for physical activity. In high school, I took several engineering classes at our local university, which formed the basis for the work I do today.

These experiences, including being born in China and moving to the US at an early age, have shaped who I am today.

On His Education at Stanford

Stanford is an amazing place, and the four or five years I spent there were some of the most formative years of my life. I met different people, experienced different cultures, and took a variety of classes. However, Stanford can also be a bubble, as many of my fellow graduates would agree. Nevertheless, it was a unique and formative experience.

During my time there, I gained both research and industry experience, which, combined with my coursework, helped me develop many soft skills, such as time management. There are many individual experiences I could talk about, but suffice it to say that the combination of meeting new people, diving deep into coursework, and gaining research and industry experience is what made my time at Stanford so formative.

One particular class that stands out is the Entrepreneurial Engineer. Stanford's connection with entrepreneurship is one of its unique features, and that class opened my eyes to things beyond my major coursework. Another class that formed a significant part of my experience was Introduction to Humanities, which helped me engage with people outside the engineering world. Having a broad, well-rounded experience has definitely helped me out later in life.

On Getting Research and Industry Experience

I had two different research experiences while I was at Stanford. The first was in computer vision, specifically the ML side of things. The second experience was in digital design, which was closer to my major.

As an intern at Intel, I gained valuable industry experience. All of these experiences together influenced some of my decisions later in life. They definitely put me on the path I am on today.

As you mentioned, I co-founded an IoT startup in Shanghai, which required a lot of digital and electrical engineering design work with my team. Additionally, my early academic experience in computer vision research tied directly into the work I did at Yahoo and the work I'm doing now.

Some people claim that being a research scientist and a software engineer are very different fields or roles. However, I strongly disagree. Both roles require you to have a good breadth and depth of knowledge in your field, as well as engineering skills in addition to doing research. The only difference is the mode of communication. For example, if I'm building a new piece of software, I might communicate with the community, users, and customers through various channels. In academia, however, this is more often done through written communication, such as writing a 10-page paper.

Both roles require a strong understanding of your field and a bit of breadth to communicate with other team members. I think these two roles tie in very well together, and there is quite a bit that people who do research can learn from software engineering and vice versa.

On His First Job at Yahoo

I joined Yahoo right after graduating in 2014. I would say that the paper that really kicked off a lot of this work, and a lot of the awesome stuff that we've seen in the past decade in computer vision, especially as it relates to machine learning and deep neural networks, was Alex Krizhevsky's paper back in 2012 on ImageNet classification.

Even two years later, I think the industry was still figuring out how to use machine learning, not only for NLP and computer vision but also for many other applications. We were still trying to get it deployed in production, and I would say that the industry is still very much figuring that out even today. Back then, it was like the wild west of ML and computer vision.

You asked about the accomplishments that I'm most proud of. I won't really talk about myself as an individual, but I'll talk about us as a team. I'm absolutely most proud of the computer vision and machine learning team. Not any of the individual models or architectures that we created or contributed to the broader computer vision or machine learning community, but more proud of our ability to put many of these models in production at a very early stage.

I won't go into too much detail, but in some way, shape, or form, we had just under 10 models being used by various teams in Yahoo. I worked very closely with the Flickr team when Flickr was a part of Yahoo. We had tons and tons of data, so understanding what type of data we should use for these models and how to use that data, especially as it relates to images and image metadata, was a challenge that we had to figure out as a team. I think that's our proudest accomplishment.

We did it when not many folks in the industry knew how to, and I think we did it when it wasn't necessarily optimal or pretty. There are so many pieces of infrastructure we have today that we didn't have back then, and if we had had them, I think it would have sped up our development by maybe five or even 10 times.

Some of the work I'm doing right now with Milvus and with Towhee, these two open-source projects that Zilliz works on, is a big reason I decided to join Zilliz. In computer vision, machine learning, and deep learning, in particular, there was a distinct lack of good tools that you could use to put a lot of these solutions and systems into production. Even today, we are probably three to five years away from having very solid MLOps platforms and great ways of putting many of these models into production and developing these AI applications.

The tooling back then was in very early stages, and going through a lot of that helped me understand some of the problems that the industry faces even today.

On Co-Founding Orion

This ties back to what I was talking about earlier, where I had many experiences at Stanford, not just in computer vision and research but also in industry and digital design. Even today, the IoT industry is still in its infancy in some ways. My co-founder and I saw a significant gap in IoT, particularly in indoor localization. For example, suppose we are in an office building, a shopping center, or even a warehouse. It is often challenging to navigate and locate specific rooms or items, especially if we are unfamiliar with the area.

In a warehouse, there could be tens or hundreds of thousands of items and boxes that need to be tracked, including machinery, goods, or consumer electronics. We spend around 80 to 90% of our time indoors, so there was a significant need for an accurate and efficient indoor localization system. This gap prompted us to start our IoT company, focusing on indoor localization. Because the technology was new, we focused on the hardware side of things initially, which can be quite challenging and requires significant funding.

Beyond the early stages, we found that a lot of growth investors were unwilling to invest in hardware, which has become less popular in Silicon Valley. However, we saw an opportunity, particularly in China, where there was a push to improve the IT industry. We raised some money from investors based in China and the US and established our headquarters in Shanghai.

Today, Orion is in a healthy place, having wrapped up a series A funding round in the middle of the pandemic. The company is self-sustaining, and although I'm not involved in the day-to-day, I still keep in touch with many people there. I consider it one of my proudest career accomplishments so far.

On Risk Tolerance For Hardware Startups

Risk tolerance is something you have to consider with a hardware startup. The risk is much higher if you start a hardware company in Silicon Valley or anywhere in the US, because the lead times are much longer.

If we had a board design in China, we would have gone through about 10 iterations for our product line. The turnaround times for R&D can be much shorter, sometimes less than a week. On the other hand, if we were to do it in the Bay Area or up in Seattle, the turnaround times would be much longer, probably three weeks. During that period, you may find yourself sitting around and twiddling your thumbs, wondering about potential problems with the board design once you get it back from manufacturing.

As an IoT startup, we also had a software and cloud component to our entire stack. The risk is high on both sides of the Pacific, but it is definitely greater here in the US than in China.

On Leaving Silicon Valley for Shanghai

Another big reason I haven't had a chance to elaborate on in previous podcasts is that I was born in Shanghai and wanted to get a closer look at the culture. Having been raised in the States, attending middle school, high school, and college here, I think it can be difficult to understand many worldviews, especially those of East Asia, Europe, or South America, if you don't open your eyes to different perspectives.

A hidden reason why I felt comfortable with doing a startup in Shanghai was that I wanted to gain a much better understanding of the culture and the way people do business there. It is absolutely different from how things are done here in Silicon Valley. There's no single defined way of doing things, but the way of raising money, doing business, and reaching out to customers and users in China is different from what you would find here.

Having gone through these experiences and being able to understand a lot of the tech world on both sides of the Pacific, as well as a lot of the culture in East Asia, in addition to how things are done here in the US, is something I'm glad to have taken away with me. This is an experience that I'm happy to have had, and I'm grateful for the opportunity to have gained a much better understanding of the culture in Shanghai.

On Living and Doing Business in China

I wanted to write these stories because I feel like there's a lot of misunderstanding between the cultures of the US and China.

In China, much of what is done revolves around a social contract where individuals give up some of their freedom in exchange for being cared for by local, provincial, or even central governments. On the other hand, in the US, there is a culture of freedom of expression and individualism, with a focus on entrepreneurship and turning ideas into successful companies.

The stark contrast between these two social contracts can lead to misunderstandings between people who grew up in China and those who grew up in the US. This series of blog posts aims to help people understand the social contract in China and the reasons behind the policies and practices there.

For example, the pandemic response in China was very swift, with high-impact lockdowns in many areas. While I may not agree with these lockdowns, I think it's important to understand why they are being implemented and how they tie into the social contract in China.

One reason for the lockdowns may be to retain talent that has left China in the past for opportunities in other countries. The lockdowns may also have secondary reasons beyond just mitigating the spread of Covid.

I hope this series of blog posts helps people understand the cultural and social differences between the US and China and encourages readers to think about the reasons behind policies and practices in both countries.

On Work Culture Differences Between The East and The West

The non-stop work culture in different forms permeates not just China but also the rest of East Asia, including Taiwan, South Korea, and Japan. While it may not be called 9-9-6 in those countries, there are definitely forms of working long hours and not questioning the work culture or ethic. This ties back to the idea of the social contract.

In my subsequent articles, I want to talk about how 9-9-6 came to be and how Confucianism and the traditional work ethic instilled in children from a young age created a culture of being online at all times or being in the office for 12 hours a day. I also want to discuss how this has impacted the tech industry, millennials, and Generation Z in China and how things are improving.

Many companies in China have made progress in the past two years in getting rid of the 9-9-6 culture and the culture of just focusing on engineering work without asking questions or presenting new ideas. There has been a lot of change in a short period of time.

In China, there is a big problem with the engineering culture and innovation because of the 9-9-6 culture and underlying values that cause engineers to not question how things are done or the current architecture. However, progress is being made on this front.

One thing that makes the US successful in tech is the ability to question everything and improve processes. This is still missing to a great extent in the Chinese tech scene, but there is a major shift happening toward critical thinking and problem-solving.

We can learn things from the Chinese way of doing things and vice versa. I plan to write a series on this topic in Chinese as well.

On Joining Zilliz

Zilliz is now a global company with headquarters located in the San Francisco Bay Area. One of our main goals is to promote the use of vector databases to users in the Bay Area and all over the world, including APAC and EMEA. We hold a unique position in the industry today, thanks to the numerous applications of vector databases and AI/ML that we have been able to take advantage of early on.

I recall instances where our industry made mistakes in the past, such as the auto-tagging solution for photos on Flickr that was based on deep neural networks trained internally at Yahoo. One particular incident that has stuck in my mind was an inappropriate tag given to a photo of a concentration camp with barbed wire. These types of missteps in AI/ML have led the industry to scale back model deployments and be more cautious with the use of AI/ML.

In China, there was less restriction on the use of these technologies, and companies were eager to experiment with them. Zilliz was in a unique position to take advantage of this eagerness, and early on, some users wanted to index millions of vectors or embeddings in Milvus, the vector database created by Zilliz.

This opportunity allowed the Zilliz team to scale out the technology and build something that was tested in production in many user scenarios very early on. This is reflected in the maturity of our vector database technology today.

Overall, the work that Zilliz does with vector databases and AI/ML resonates deeply with the experiences and goals of our team members, such as myself, who have worked on computer vision and machine learning teams in the past. It is a big reason why we are at Zilliz today.

On Vector Databases

Many modern machine learning models, AI applications, and other applications that utilize deep neural networks rely on embeddings taken from intermediate layers of the model to create strong representations of input data. For example, if you have two images of German Shepherds or two images of the Transamerica Pyramid in San Francisco, their embeddings from a properly trained image recognition model would be very close to each other in terms of Euclidean or cosine distance.
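To make the idea of "close in terms of Euclidean or cosine distance" concrete, here is a minimal sketch in plain Python. The three vectors are made-up 4-dimensional stand-ins for real embeddings (which usually have hundreds of dimensions), and the variable names are purely illustrative.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings: two German Shepherd photos and one building photo.
shepherd_1 = [0.9, 0.1, 0.0, 0.2]
shepherd_2 = [0.8, 0.2, 0.1, 0.2]
pyramid = [0.1, 0.9, 0.7, 0.0]

print("shepherd vs shepherd:", euclidean_distance(shepherd_1, shepherd_2))
print("shepherd vs pyramid: ", euclidean_distance(shepherd_1, pyramid))
print("shepherd vs shepherd (cosine):", cosine_distance(shepherd_1, shepherd_2))
print("shepherd vs pyramid  (cosine):", cosine_distance(shepherd_1, pyramid))
```

With these toy values, the two dog vectors are much closer to each other than either is to the building vector under both metrics, which is the property a well-trained model is expected to produce.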

This is where vector databases come in. They are designed to store high-dimensional tensors or embeddings from intermediate layers in machine learning models. This allows you to process unstructured data like images, video, audio, and text, as well as lesser-known forms of unstructured data like graphs, geospatial data, and protein structures. By embedding different types of unstructured data into a single space, you can perform nearest-neighbor searches to find content by its semantics.
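The nearest-neighbor search described above can be sketched with a tiny in-memory store. This is a brute-force illustration under stated assumptions, not how Milvus is implemented: the class name `TinyVectorStore`, the item IDs, and the 3-dimensional vectors are all hypothetical, and a production system would typically use an approximate index rather than scanning every vector.

```python
import heapq
import math


class TinyVectorStore:
    """Hypothetical in-memory store that scans every vector on each search."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def insert(self, item_id, vector):
        self._items.append((item_id, list(vector)))

    def search(self, query, top_k=3):
        """Return the top_k (item_id, distance) pairs closest to the query."""
        scored = ((math.dist(query, vec), item_id) for item_id, vec in self._items)
        return [(item_id, dist) for dist, item_id in heapq.nsmallest(top_k, scored)]


store = TinyVectorStore()
store.insert("dog_photo_1", [0.9, 0.1, 0.0])
store.insert("dog_photo_2", [0.8, 0.2, 0.1])
store.insert("skyline_photo", [0.1, 0.9, 0.7])

# A made-up query embedding that sits near the dog photos.
results = store.search([0.9, 0.1, 0.05], top_k=2)
print([item_id for item_id, _ in results])
```

The query never mentions "dog" as a keyword; the match comes purely from the vectors being near each other, which is what searching content by its semantics means here.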

Traditional relational databases, NoSQL databases, object databases, and wide column stores are designed to store traditional structured data. But with the rise of unstructured data, vector databases like Milvus are becoming increasingly important.

Milvus is an open-source vector database created by Zilliz, and its managed service is currently in early access. Zilliz aims to help developers and organizations process vast amounts of unstructured data through embeddings and turn Milvus into a top-level project.

On Milvus

Milvus 1.0 was primarily a vector database designed to function as a single instance. It had limitations on the amount of resources that could be allocated to it. For instance, if you wanted to index a billion vectors with Milvus 1.0, you would need to scale up your machine. This could involve adding more RAM, CPU, GPU, or some other form of accelerator to improve indexing.

Milvus 2.0 takes a very different approach. It was architected from the ground up and is focused on being production-ready and cloud-native. The storage layer is separate from the compute layer, which covers indexing and querying. Indexing and querying run as separate clusters that can be scaled independently of each other. For example, if your application requires more querying than updating, you can scale out your query nodes. Milvus 2.0 will automatically manage this for you. The same applies to indexing and data clusters.
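As a rough mental model of independent scaling, the sketch below treats the node counts for each component as separate knobs. It is a toy illustration only: `ClusterLayout` and `scale_out` are hypothetical names, not part of any Milvus API or configuration format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterLayout:
    """Hypothetical node counts for the separately scalable components."""
    query_nodes: int
    index_nodes: int
    data_nodes: int


def scale_out(layout, component, extra):
    """Return a new layout with `extra` more nodes for one component only."""
    field_name = f"{component}_nodes"
    if not hasattr(layout, field_name):
        raise ValueError(f"unknown component: {component}")
    return replace(layout, **{field_name: getattr(layout, field_name) + extra})


base = ClusterLayout(query_nodes=2, index_nodes=1, data_nodes=1)

# A read-heavy application adds query capacity and leaves the rest untouched.
read_heavy = scale_out(base, "query", 3)
print(read_heavy)  # ClusterLayout(query_nodes=5, index_nodes=1, data_nodes=1)
```

The point of the sketch is that growing one component does not force the others to grow, which is the contrast with a single-instance design.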

We try to make each component of Milvus scalable and flexible to enable users to deploy various applications with Milvus 2.0. We have features you expect in a modern database, such as replication, failover, and scalability. Milvus 2.0 can be deployed in cluster mode or as a standalone instance.

At Zilliz, we have built a managed cloud platform called Zilliz Cloud around Milvus 2.0. This architecture has been adopted by a large number of our users who have deployed Milvus 2.0 in cluster mode. I encourage everyone to check out Zilliz Cloud.

On Unique Use Cases

I will discuss three of my favorite Milvus use cases, all of which highlight the versatility and uniqueness of vector databases.

The first use case involves Trend Micro, a company that sought to improve its antivirus and cybersecurity capabilities by detecting malware in APKs. Trend Micro generated individual features from the APKs, which were then indexed in Milvus for threat detection. This use case is interesting because it deviates from traditional vector database applications like reverse image search or semantic textual search.

The second use case involved the Cleveland Museum of Art, which launched an AI Art Lens using Milvus for reverse image search. While this is a more common use case for vector databases, what stood out was the museum's ability to bring the system online with minimal effort and without a large engineering team.

The third use case relates to new drug discovery, where users applied Milvus to perform molecular similarity and 3D molecular search to identify potential drugs for tackling specific symptoms. This use case is unique because it is not a traditional application for vector databases and highlights the power of ML infrastructure.

These three use cases demonstrate the power and versatility of vector databases and ML infrastructure. While there are many other application scenarios for vector databases, these three stand out because they are unique and allow users to deploy their systems quickly and efficiently.

On Towhee

Towhee is intended to be part of the greater vector database ecosystem, which includes Milvus. Many users have application-level code that lets them generate their own embeddings and use them as input into Milvus for future queries. However, some users may not have the expertise for embeddings, machine learning, or AI, and that's where Towhee comes in.

Towhee aims to be the upstream data pipeline or embedding data-to-vector platform that leads into Milvus. It's like a new ETL platform for unstructured data, turning it into embeddings or generating tags from videos or images. Towhee's architecture is not as complicated as Milvus's, but it provides hundreds of different operators prepackaged on our Towhee hub. An operator can be as simple as an image transformation or as complex as a full-fledged machine learning model. These individual operators can be chained together to form a pipeline.
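The idea of chaining operators into a pipeline can be sketched in plain Python. This does not use the Towhee API: `pipeline`, `normalize_pixels`, and `toy_embed` are hypothetical names, and the "embedding" step is a trivial stand-in for a real machine learning model.

```python
from functools import reduce


def pipeline(*operators):
    """Chain operators so each one receives the previous one's output."""
    def run(item):
        return reduce(lambda value, op: op(value), operators, item)
    return run


def normalize_pixels(image):
    """Simple transformation operator: scale 0-255 pixel values to 0-1."""
    return [p / 255.0 for p in image]


def toy_embed(pixels):
    """Stand-in for a model operator: summarize pixels as [mean, max]."""
    return [sum(pixels) / len(pixels), max(pixels)]


image_to_vector = pipeline(normalize_pixels, toy_embed)

# A made-up four-pixel grayscale "image".
vector = image_to_vector([0, 255, 255, 0])
print(vector)  # [0.5, 1.0]
```

Swapping `toy_embed` for a real model, or inserting another transformation in the middle, changes the pipeline without touching the other operators, which is the appeal of the operator-based design.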

The Towhee engine is responsible for automatically optimizing these pipelines and running them in production or in a scalable environment. The DataCollection API provides a Pythonic interface that allows users to create their ETL pipeline from operators or pre-built pipelines.

Milvus and Towhee are two independent projects, but they aim to achieve a one-plus-one effect, where Towhee plus Milvus is greater than the sum of its parts. While Milvus is composed of many more moving pieces and requires a lot of engineering work to optimize, Towhee has a much simpler architecture.

On The Embedding Tooling Landscape

I think the industry is moving in a direction where we can embed various types of unstructured data, which we already can do today.

But what will be unique about the future is that we can embed more and more unstructured data into the same embedding space. For those familiar with models such as CLIP or multimodal learning, we may one day reach a point where we can embed tons of different unstructured data into the same space. It is foreseeable that we could embed video, images, and text all into the same space one day. We could even embed text and protein structures into the same space. With that type of embedding, we could do AI drug discovery by simply typing out the symptom or biological process we are looking for.

I think this is the direction of a lot of machine learning work we see today. Things are moving from a single type of data to multiple different types of data being embedded into the same space.

With this, more and more exciting applications will come, especially when it comes to Milvus being able to index a variety of different types of unstructured data and search through all of those different types of unstructured data with different forms of queries.

I could query across an image, but I could also query across my vector database using embeddings from the image or embeddings from the text. I think that's really exciting, and I believe it is the evolution of where we are going today.
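A shared embedding space can be illustrated with a toy index that mixes modalities. Everything here is an assumption made for illustration: the vectors are hand-written rather than produced by a model such as CLIP, and the file names, the `index` layout, and the `search` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# (modality, name, vector) entries that all live in the same 3-d toy space.
index = [
    ("image", "german_shepherd.jpg", [0.9, 0.1, 0.1]),
    ("text", "a photo of a dog", [0.8, 0.2, 0.0]),
    ("video", "city_timelapse.mp4", [0.1, 0.9, 0.3]),
    ("text", "skyscrapers at night", [0.0, 0.8, 0.4]),
]


def search(query_vector, modality=None, top_k=2):
    """Rank entries by similarity, optionally restricted to one modality."""
    candidates = [e for e in index if modality is None or e[0] == modality]
    ranked = sorted(
        candidates,
        key=lambda e: cosine_similarity(query_vector, e[2]),
        reverse=True,
    )
    return [(m, name) for m, name, _ in ranked[:top_k]]


# Pretend this vector came from embedding the text query "dog".
text_query = [0.85, 0.15, 0.05]
print(search(text_query))                    # dog-related image and text
print(search(text_query, modality="image"))  # images only, from a text query
```

Because every entry shares one space, the same text-derived vector can retrieve images, text, or video, and an image-derived vector could be used as the query in exactly the same way.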

On Zilliz Cloud

Zilliz Cloud is a managed version of Milvus, providing an easy way for users to set up a vector database. Users can access the vector database through an API or user interface. While Milvus will always be open source and available for on-prem deployment, Zilliz Cloud leverages public cloud infrastructure, such as AWS, Azure, and GCP, to create a version of Milvus that is easily accessible to a variety of users and organizations.

The ultimate goal of Zilliz Cloud is to onboard other open-source projects that are developed in-house. Towhee, an embedding generation framework, is one such project. Zilliz Cloud will take care of the backend and compute, allowing users to quickly obtain embeddings for use in Milvus. Joint solutions, such as Milvus plus Towhee running in a single framework, may also be available in the future.