Datacast

Episode 58: Deep Learning Meets Distributed Systems with Jim Dowling

Episode Summary

Jim Dowling is the CEO of Logical Clocks AB, an Associate Professor at KTH Royal Institute of Technology, and a Senior Researcher at RISE SICS in Stockholm. His research concentrates on building systems support for machine learning at scale. He is the lead architect of Hops Hadoop, the world's fastest and most scalable Hadoop distribution and the only Hadoop platform with support for GPUs as a resource. He is also a regular speaker at Big Data and AI industry conferences.

Episode Notes

Show Notes

Jim’s Contact Info

Mentioned Content

Research Papers

Articles

Projects

People

Programming Books

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Jim:

On Distributed Systems Research

My interest back in university was first in AI and second in distributed systems. But there were Ph.D. positions in distributed systems available, so I pursued that. I had the pleasure of having Vinny Cahill as my supervisor.

I worked on software engineering for distributed systems, something called “reflective programming.”

While working on statically-typed programming languages (such as C++ and Java), I developed the K-Component model, which lets software systems adapt to a changing environment. Additionally, I used distributed reinforcement learning to build an intelligent component on top of the system that senses the environment and reacts to changes.

Combining reinforcement learning and ant colony optimization, our research lab tackled the routing problem, building an ad-hoc routing network at Trinity College Dublin.

On RISE Research Institutes of Sweden

When I joined, it was called the Swedish Institute of Computer Science; it was recently re-branded as RISE, the Research Institutes of Sweden. Many European countries have nationally-funded research labs, and RISE is Sweden's nationally-funded research organization. RISE was designed to conduct more practical and applied projects than universities do.

I originally applied for funding for reinforcement learning but couldn't get it. Peer-to-peer systems were somewhat similar to collaborative reinforcement learning agents, so I got money for that instead. The lesson here is that research funding agencies often have a huge influence on the direction research takes in different countries. Furthermore, AI had not been funded very well in Sweden, compared to the UK or Ireland, where I came from.

On Building Peer-To-Peer Systems

I worked on something called gradient topology, a simple way of organizing nodes.

The goal in distributed systems research is to get into systems conferences, which means building real-world systems. That’s why it usually takes a long time to do a Ph.D. in distributed systems. While building live systems, we learned a lot about how to encode streams and serialize/deserialize them efficiently. We also partnered with big companies in Stockholm like Ericsson and Scandia to bring our fundamental research into their live systems.

On Teaching at KTH Royal Institute of Technology

My department, the Division of Software and Computer Systems, has a mix of people in distributed systems, programming languages, and real-time embedded systems. We get along very well, as we all focus on building real-world systems.

My course on Deep Learning on Big Data is an interesting one.

On Distributed Deep Learning

At the time (2017), I had taught the Deep Learning on Big Data course for two iterations. We had students with backgrounds in both distributed systems and pure AI. The main weakness I saw in the latter group was that they did not enjoy writing distributed programs.

GPUs were not getting faster, so to reduce training time, we needed to go distributed (from 1 GPU, to many GPUs, to distributed GPUs). There are two types of distributed training:

At the time, people were accustomed to the parameter-server architecture, in which central “parameter servers” collect all the updates. The big bottleneck is that, even if these servers are sharded, the workers do not use all of the available bandwidth. A better option is the Ring-AllReduce architecture, in which the workers are organized in a ring. Each worker sends its gradients to its neighbor, so the gradients travel around the ring until every worker ends up with the aggregated update.
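
To make the ring pattern concrete, here is a minimal single-process simulation of Ring-AllReduce in NumPy. It is a sketch of the idea only, not the implementation used by Hopsworks or any particular framework: each "worker" is just a list entry, and the network sends are plain array assignments.

```python
# Minimal single-process simulation of Ring-AllReduce (illustrative only).
# Each worker's gradient is split into n chunks. In the scatter-reduce phase,
# partial sums travel around the ring; in the all-gather phase, the completed
# chunks travel around the ring again, so every worker ends up with the full
# sum while only ever communicating with its neighbor.
import numpy as np

def ring_allreduce(grads):
    """grads: one equal-length gradient vector per worker; returns the summed
    gradient as seen by each worker."""
    n = len(grads)
    chunks = [[c.copy() for c in np.array_split(g, n)] for g in grads]

    # Phase 1: scatter-reduce. After n-1 steps, worker w holds the fully
    # reduced chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            send = (w - step) % n        # chunk this worker passes on
            dst = (w + 1) % n            # its neighbor in the ring
            chunks[dst][send] += chunks[w][send]

    # Phase 2: all-gather. Each completed chunk is forwarded around the ring
    # until every worker has every reduced chunk.
    for step in range(n - 1):
        for w in range(n):
            send = (w + 1 - step) % n
            dst = (w + 1) % n
            chunks[dst][send] = chunks[w][send].copy()

    return [np.concatenate(c) for c in chunks]

workers = [np.ones(8) * (i + 1) for i in range(4)]  # fake per-worker gradients
reduced = ring_allreduce(workers)
print(reduced[0])  # every worker now holds the same sum: [10. 10. ... 10.]
```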

The Deep Learning hierarchy of scale says that you start at the bottom, training on one GPU with weak scaling. Then, as you move up the ladder, you use distributed training with parameter-server or Ring-AllReduce architectures. At the top, you use a lot of GPUs to reduce your training time. All the rungs follow the same principle of sharing gradient updates between workers.

On Developing HopsFS

HopsFS came about because I worked on MySQL Cluster, a distributed in-memory database; it was a hammer I had learned how to use. What made it unique was that it stores data in memory across many nodes, so even if a node crashes, the database stays up. When we examined the Hadoop File System (HDFS), all of the filesystem metadata was stored on the heap of a single Java virtual machine. That meant it was not highly available: if that machine goes down, your 5,000-node Hadoop cluster also goes down.

HopsFS was a long research project that started in 2012 and resulted in a seminal paper in 2017. Essentially, we replaced HDFS's metadata server with a distributed system: stateless servers to handle the requests and a MySQL Cluster backend to store the filesystem metadata.

HopsFS was quite successful and is run in production at many companies on on-premises hardware. However, we would like people to use HopsFS in the cloud, where S3 storage is much cheaper. The problem with S3 (identified in my blog post) is that people treat it like a filesystem, even though it can be inconsistent at times. Another classic limitation is atomic rename, which S3 cannot do. This is an issue because the existing scalable SQL systems are built on this simple primitive.
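
As a concrete illustration of why atomic rename matters, here is a small sketch contrasting commit-by-rename on a POSIX/HDFS-style filesystem with the copy-then-delete dance needed on plain S3. It assumes boto3 as the S3 client; the bucket and prefixes are placeholders, and this is not the HopsFS implementation.

```python
# Why atomic rename matters: on a real filesystem, a job can commit its output
# by renaming a temporary directory in one atomic metadata operation. Plain S3
# has no rename, so the same "commit" becomes a per-object copy + delete that
# is neither atomic nor cheap, and readers can observe a half-committed state.
import os
import boto3  # assumed installed; any S3 client has the same shape

def commit_posix(tmp_dir: str, final_dir: str) -> None:
    # One metadata operation: readers see either the old state or the new one.
    os.rename(tmp_dir, final_dir)

def commit_s3(bucket: str, tmp_prefix: str, final_prefix: str) -> None:
    # "Rename" emulated on S3: O(number of objects) copies and deletes,
    # with partial results visible if the process crashes midway.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=tmp_prefix):
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            dst_key = final_prefix + src_key[len(tmp_prefix):]
            s3.copy_object(Bucket=bucket, Key=dst_key,
                           CopySource={"Bucket": bucket, "Key": src_key})
            s3.delete_object(Bucket=bucket, Key=src_key)
```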

Recently, we launched HopsFS as a layer over S3, where the file data is stored in S3 buckets. Users get the benefits of atomic rename, consistent file listings, and the ability to store a large amount of data in the metadata layer.

These are the basic advances that we needed to make at the systems level to build a feature store — enabling data scientists to manage their features both for online predictions and offline training.

On Building Hopsworks

When we built the scalable metadata layer into HopsFS, we set out to solve how to store sensitive data on a shared cluster. In particular, we created a project-based multi-tenant security model using a lot of metadata. That was the basis for the Hopsworks platform, which lets data scientists work with big data.

HopsFS itself had solid research results (16x the throughput of HDFS and millions of operations per second). We thought that VCs would give us money for this! But no chance… We couldn't raise any money for HopsFS, at least in Europe.

The data science platform Hopsworks is actually more interesting: we had a good number of users, and AI platforms were all the hype at the time. We found 3 VCs that took a bet on us for our seed funding in September 2018: Inventure in Finland, Frontline Ventures in Ireland, and AI Seed in London. Our first major customer after commercialization was Swedbank, the largest bank in Sweden.

On Public Research vs. VC-Funded Money

I met many people in the startup community who think that public money is a terrible waste and nothing good gets done. But in the big data space, let’s take the Apache Spark project, for example. It came from public money (developed by PhDs and post-docs at Berkeley) and was not developed in the industry (because it would take too long even to get started).

At RISE, we worked on the first version of stream processing for Apache Flink (which originated at TU Berlin), another successful project that was not developed in industry. The same can be said for our filesystem, HopsFS, which tackles scalable metadata, a well-known problem in the industry.

Because the probability of success was too low and the expected time to solve the problem was too high, industry wouldn't invest. Still, public money can attack these difficult problems. VC funds have a timeline: they have to cash out after 5–10 years, so they don't want to invest in anything with a long horizon (even if there can be a high reward at the end).

Furthermore, the success probability for a lot of research problems is not always that high. But I believe there should still be strong support for basic research to develop new systems; otherwise, these won’t be done by industry. Most incremental improvements are made by industry, but public research is the way to go if you want paradigm changes.

On Building a Feature Store

After reading the original feature store article from Uber in September 2017, I told my team that we needed a feature store right away. It took us more than a year to build our own. Uber's original Michelangelo article talks about feature engineering with a domain-specific language (DSL). If you are a data scientist, you learn this DSL and compute your features with whatever capabilities the DSL provides; if you need new capabilities, you are out of luck.

Given my background in programming languages from my Ph.D., I knew that general-purpose programming languages would eventually edge out any DSL. For data science, Python is the default language of choice. Thus, we decided to go with data frames as the API to our feature store. As you know, data frames are available in Pandas, but also in Spark. Therefore, we provided both Python-based and Spark-based clients to our feature store, and data scientists can perform feature engineering in any of the supported languages (Python, Java, or Scala).

A central problem in data science is caching features and making them available. A feature store can store and cache the pre-computed features; you can then use them to create training data or serve them directly to a model in production with low latency.
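
A minimal sketch of what this dataframe-centric workflow can look like. Every name here (my_feature_store, FeatureStore.connect, create_feature_group, get_training_data, get_online_features) is a hypothetical placeholder for illustration, not the actual Hopsworks API.

```python
import pandas as pd
from my_feature_store import FeatureStore  # hypothetical client library

fs = FeatureStore.connect("my_project")

# 1. Feature engineering is ordinary dataframe code.
tx = pd.read_parquet("transactions.parquet")
features = (
    tx.groupby("customer_id")
      .agg(avg_amount=("amount", "mean"), n_purchases=("amount", "count"))
      .reset_index()
)

# 2. Register (cache) the pre-computed features so other teams can reuse them.
fg = fs.create_feature_group("customer_spending", version=1,
                             primary_key=["customer_id"])
fg.insert(features)

# 3a. Offline: join cached features into training data.
train_df = fs.get_training_data(["customer_spending.avg_amount",
                                 "customer_spending.n_purchases",
                                 "labels.churned"])

# 3b. Online: fetch the same features with low latency at serving time.
row = fg.get_online_features({"customer_id": 42})
```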

Hopsworks is the first open-source and full-stack feature store, as far as I know. There is one called Feast, but it was built on top of Google Cloud products.

On Designing a Feature Store

A feature store isn’t for individual Python developers who typically want to cache features for usage. A feature store is often adopted by large enterprises with multiple models that use the same data. If every time someone starts a new project and writes feature engineering code to compute features, there is a huge duplication of effort. Instead, with a feature store, he/she can immediately see features that are ready to use and join them together to create training data. Once features are reused, the cost of rewriting data pipelines to get new features reduces, enabling the organizations to be more effective.

With a feature store, you have two pipelines effectively.

If you do CI/CD, you must pay attention to versioning. Classical CI/CD talks about versioning the code, while model CI/CD also has to version the data used to train the model. So you do need data versioning at some level.
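
As a small illustration of what that versioning can look like in practice (all field names here are hypothetical), a reproducible model build pins the code version and the data version together:

```python
# Hypothetical metadata record for one model build. Classical CI/CD pins only
# the code version; model CI/CD also pins the exact data the model was trained on.
model_build = {
    "model_name": "fraud_classifier",
    "code_version": "git:9f2c1ab",                 # what classical CI/CD already tracks
    "training_dataset_version": 3,                 # immutable snapshot of the training data
    "feature_group_versions": {"customer_spending": 1, "transactions": 2},
    "trained_at": "2021-03-01T12:00:00Z",
}
# Re-running the pipeline from this record can only reproduce the same model
# if the training data itself is versioned, not just the code.
```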

On Data Warehouses vs. Feature Stores

In a normal data warehouse, the data gets overwritten. In a feature store, you can use a time-travel capability: the commit history of the data is stored, so you can go back in time and look at earlier feature values.

The natural language of a data warehouse is SQL; the natural language of a feature store is Python. We know from our experience building an earlier version of Hopsworks that data scientists do not want to learn to write SQL joins. That's not their natural habitat.

Another benefit of the feature store is its ability to provide low-latency access to features during model serving. A traditional data warehouse won't be able to return hundreds of features in milliseconds. A feature store serves as a dual database: an offline store that holds large volumes of historical feature data for creating training datasets, and an online store that serves the latest feature values at low latency to models in production.
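
Here is a minimal in-memory sketch of that dual-database idea, including the time-travel point above. It is a toy for illustration only, not the Hopsworks implementation: the offline side keeps the full commit history for building training data, while the online side keeps only the latest row per key for low-latency lookups.

```python
import pandas as pd

class ToyFeatureStore:
    def __init__(self):
        self.offline = []   # append-only list of (commit_time, DataFrame): full history
        self.online = {}    # primary key -> latest feature row, for fast point lookups

    def insert(self, df, commit_time, key):
        self.offline.append((commit_time, df.assign(_commit=commit_time)))
        for row in df.to_dict("records"):
            self.online[row[key]] = row                  # overwrite with latest values

    def training_data(self, as_of, key):
        # Time travel: replay commits up to `as_of`, keeping the latest row per key.
        history = pd.concat(df for t, df in self.offline if t <= as_of)
        return history.sort_values("_commit").groupby(key, as_index=False).last()

    def get_online(self, value):
        return self.online[value]                        # point lookup at serving time

fs = ToyFeatureStore()
fs.insert(pd.DataFrame({"customer_id": [1, 2], "avg_amount": [10.0, 5.0]}),
          pd.Timestamp("2021-01-01"), key="customer_id")
fs.insert(pd.DataFrame({"customer_id": [1], "avg_amount": [12.5]}),
          pd.Timestamp("2021-02-01"), key="customer_id")

print(fs.training_data(as_of=pd.Timestamp("2021-01-15"), key="customer_id"))  # old values
print(fs.get_online(1))                                                       # latest values
```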

On Product Roadmap for Hopsworks

Customer feedback is key. Because of Logical Clocks’ link to the university, we still have a strategic long-term vision for the product and the integration needed for every system under the sun.

For the longer-term roadmap, we have a framework called Maggy, which unifies distributed training and hyper-parameter tuning with normal single-host training. We also integrated Maggy with our experiment tracking system, which keeps the assets (models, training datasets, and feature groups) in the filesystem and their metadata in the same unified metadata layer.

You write one function with the training code. Then you decide whether to run it on a single GPU or across many servers (for example, to tune hyper-parameters). The same function returns a dictionary that logs all the training information.
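
A sketch of that pattern is below. The run_distributed helper is a stand-in written for this example, not the actual Maggy API; it only shows the shape of the idea, namely that the same training function is reused unchanged for a single-host run and for fanned-out hyper-parameter trials.

```python
import random

def train(lr, batch_size):
    # Ordinary single-host training code; this toy "loss" stands in for a real model.
    loss = (lr - 0.01) ** 2 + 1.0 / batch_size + random.random() * 0.01
    return {"lr": lr, "batch_size": batch_size, "loss": loss}

# Run it directly on one GPU / one host...
print(train(lr=0.01, batch_size=64))

# ...or hand the *same* function to a runner that fans it out over many workers
# for hyper-parameter tuning and collects the returned dictionaries.
def run_distributed(train_fn, search_space, num_trials=8):
    trials = [{k: random.choice(v) for k, v in search_space.items()}
              for _ in range(num_trials)]
    results = [train_fn(**t) for t in trials]   # in reality: one trial per worker
    return min(results, key=lambda r: r["loss"])

best = run_distributed(train, {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]})
print("best trial:", best)
```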

The main focus of the company in 2021 is to be a good citizen in the Azure, AWS, and Databricks ecosystems. Furthermore, we are collaborating with RISE on the Apache Flink project to replace Flink's local state store (RocksDB) with our go-to tool, MySQL's NDB. Flink also supports real-time feature engineering, so part of our effort is to build the best feature engineering/feature store platform on the market.

On Getting Early Customers for Logical Clocks

The real hurdle for anybody who develops products is getting the first user. Logical Clocks has the backing of the university and the research institute, so we recruited a good number of folks from there to try our product and get it into a usable state.

The next phase of crossing the chasm is getting customers to believe in us. Having a developer base helped us get customers. I think the starting point for anyone who develops an open-source project and wants it to be adopted/commercialized is to write great documentation.

On Being a Professor vs. A Founder

In terms of similarities:

In terms of differences:

On The European Tech Ecosystem

I wrote a blog post basically complaining that our friends at data Artisans (who developed the Apache Flink project) were acquired by Alibaba on the cheap (a $100-million transaction). Given that they had the world's leading streaming platform, if they had been based in Silicon Valley, they would have been able to raise more money and grow to become a magnet for talent.

European companies are great at developing the initial technology. But there is a cultural chasm that prevents them from growing into huge companies that commercialize the tech and draw more talent inward. Europeans, in general, value quality of life and tend to sell out early. This is bad for Europe because Europe is really, really weak in the Internet/Big Data/AI space.