Datacast

Episode 58: Deep Learning Meets Distributed Systems with Jim Dowling

Episode Summary

Jim Dowling is the CEO of Logical Clocks AB, an Associate Professor at KTH Royal Institute of Technology, and a Senior Researcher at RISE SICS in Stockholm. His research concentrates on building systems support for machine learning at scale. He is the lead architect of Hops Hadoop, the world's fastest and most scalable Hadoop distribution and the only Hadoop platform with support for GPUs as a resource. He is also a regular speaker at Big Data and AI industry conferences.

Episode Notes

Show Notes

Jim’s Contact Info

Mentioned Content

Research Papers

Articles

Projects

People

Programming Books

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Jim:

On Distributed Systems Research

My interest back in university was first in AI and second in distributed systems. But there were Ph.D. positions in distributed systems available, so I pursued that. I had the pleasure of having Vinny Cahill as my supervisor.

I worked on software engineering for distributed systems, something called “reflective programming.”

While working on statically-typed programming languages (such as C++ and Java), I developed the K-Component model, which lets software systems adapt to a changing environment. Additionally, I used distributed reinforcement learning to build an intelligent component on top of the system that senses the environment and reacts to changes.

Combining reinforcement learning and ant colony optimization, our research lab tackled the routing problem, building an ad-hoc routing network at Trinity College Dublin.

On RISE Research Institutes of Sweden

When I joined, it was called the Swedish Institute of Computer Science; it was recently re-branded as RISE, the Research Institutes of Sweden. Many European countries have nationally-funded research labs, and RISE is Sweden's nationally-funded research organization. RISE was designed to conduct more practical and applied projects than universities do.

I originally applied for funding for reinforcement learning but couldn't get it. Peer-to-peer systems were somewhat similar to collaborative reinforcement learning agents, so I got money for that instead. The lesson here is that research funding agencies often have a huge influence on the direction research takes in different countries. Furthermore, AI had not been funded very well in Sweden, compared to the UK or Ireland, where I came from.

On Building Peer-To-Peer Systems

I worked on something called gradient topology, a simple way of organizing nodes.

The goal in distributed systems research is to get into systems conferences, which means building real-world systems. That’s why it usually takes a long time to do a Ph.D. in distributed systems. While building live systems, we learned a lot about how to encode streams and serialize/deserialize them efficiently. We also partnered with big companies in Stockholm like Ericsson and Scandia to bring our fundamental research into their live systems.

On Teaching at KTH Royal Institute of Technology

My department, the Division of Software and Computer Systems, has a mix of people in distributed systems, programming languages, and real-time embedded systems. We get along very well, as we all focus on building real-world systems.

My course on Deep Learning on Big Data is an interesting one.

On Distributed Deep Learning

At the time (2017), I had taught the Deep Learning on Big Data course for two iterations. We had students with backgrounds in both distributed systems and pure AI. The main weakness I saw in the latter group was that they did not enjoy writing distributed programs.

GPUs were not getting faster, so to reduce training time, we needed to go distributed (from 1 GPU, to many GPUs, to distributed GPUs). There are two types of distributed training:

At the time, people were accustomed to the parameter-server architecture, in which central “parameter servers” collect all the updates. The big bottleneck is that, even if these servers are sharded, the workers do not use all of the available bandwidth. A better option is the Ring-AllReduce architecture, in which the workers are organized in a ring. Each worker sends its gradients to its neighbor, so the gradients travel around the ring until every worker ends up with the aggregated update.
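
To make the ring pattern concrete, here is a minimal single-process simulation of Ring-AllReduce in NumPy. It is a sketch of the idea only, not the implementation used by Hopsworks or any particular framework: each "worker" is just a list entry, and the network sends are plain array assignments.

```python
# Minimal single-process simulation of Ring-AllReduce (illustrative only).
# Each worker's gradient is split into n chunks. In the scatter-reduce phase,
# partial sums travel around the ring; in the all-gather phase, the completed
# chunks travel around the ring again, so every worker ends up with the full
# sum while only ever communicating with its neighbor.
import numpy as np

def ring_allreduce(grads):
    """grads: one equal-length gradient vector per worker; returns the summed
    gradient as seen by each worker."""
    n = len(grads)
    chunks = [[c.copy() for c in np.array_split(g, n)] for g in grads]

    # Phase 1: scatter-reduce. After n-1 steps, worker w holds the fully
    # reduced chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            send = (w - step) % n        # chunk this worker passes on
            dst = (w + 1) % n            # its neighbor in the ring
            chunks[dst][send] += chunks[w][send]

    # Phase 2: all-gather. Each completed chunk is forwarded around the ring
    # until every worker has every reduced chunk.
    for step in range(n - 1):
        for w in range(n):
            send = (w + 1 - step) % n
            dst = (w + 1) % n
            chunks[dst][send] = chunks[w][send].copy()

    return [np.concatenate(c) for c in chunks]

workers = [np.ones(8) * (i + 1) for i in range(4)]  # fake per-worker gradients
reduced = ring_allreduce(workers)
print(reduced[0])  # every worker now holds the same sum: [10. 10. ... 10.]
```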

The Deep Learning hierarchy of scale says that you start at the bottom, training on one GPU with weak scaling. Then, as you move up the ladder, you use distributed training with parameter-server or Ring-AllReduce architectures. At the top, you use a lot of GPUs to reduce your training time. All the rungs follow the same principle of sharing gradient updates between workers.

On Developing HopsFS

HopsFS came about because I worked on MySQL Cluster, a distributed in-memory database; it was a hammer I had learned how to use. What made it unique was that it stores data in memory across many nodes, so even if a node crashes, the database stays up. When we examined the Hadoop File System (HDFS), all of the filesystem metadata was stored on the heap of a single Java virtual machine. That meant it was not highly available: if that machine goes down, your 5,000-node Hadoop cluster also goes down.

HopsFS was a long research project that started in 2012 and resulted in a seminal paper in 2017. Essentially, we replaced HDFS's metadata server with a distributed system: stateless servers to handle the requests and a MySQL Cluster backend to store the filesystem metadata.

HopsFS was quite successful and is run in production at many companies on on-premises hardware. However, we would like people to use HopsFS in the cloud, where S3 storage is much cheaper. The problem with S3 (identified in my blog post) is that people treat it like a filesystem, even though it can be inconsistent at times. Another classic limitation is atomic rename, which S3 cannot do. This is an issue because the existing scalable SQL systems are built on this simple primitive.
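
As a concrete illustration of why atomic rename matters, here is a small sketch contrasting commit-by-rename on a POSIX/HDFS-style filesystem with the copy-then-delete dance needed on plain S3. It assumes boto3 as the S3 client; the bucket and prefixes are placeholders, and this is not the HopsFS implementation.

```python
# Why atomic rename matters: on a real filesystem, a job can commit its output
# by renaming a temporary directory in one atomic metadata operation. Plain S3
# has no rename, so the same "commit" becomes a per-object copy + delete that
# is neither atomic nor cheap, and readers can observe a half-committed state.
import os
import boto3  # assumed installed; any S3 client has the same shape

def commit_posix(tmp_dir: str, final_dir: str) -> None:
    # One metadata operation: readers see either the old state or the new one.
    os.rename(tmp_dir, final_dir)

def commit_s3(bucket: str, tmp_prefix: str, final_prefix: str) -> None:
    # "Rename" emulated on S3: O(number of objects) copies and deletes,
    # with partial results visible if the process crashes midway.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=tmp_prefix):
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            dst_key = final_prefix + src_key[len(tmp_prefix):]
            s3.copy_object(Bucket=bucket, Key=dst_key,
                           CopySource={"Bucket": bucket, "Key": src_key})
            s3.delete_object(Bucket=bucket, Key=src_key)
```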

Recently, we launched HopsFS as a layer over S3, where the file data is stored in S3 buckets. Users get the benefits of atomic rename, consistent file listings, and the ability to store a large amount of data in the metadata layer.

These are the basic advances that we needed to make at the systems level to build a feature store — enabling data scientists to manage their features both for online predictions and offline training.

On Building Hopsworks

When we built the scalable metadata layer into HopsFS, we set out to solve how to store sensitive data on a shared cluster. In particular, we created a project-based multi-tenant security model using a lot of metadata. That was the basis for the Hopsworks platform, which lets data scientists work with big data.

HopsFS itself had solid research results (16x the throughput of HDFS and millions of operations per second). We thought that VCs would give us money for this! But no chance… We couldn't raise any money for HopsFS, at least in Europe.

The data science platform Hopsworks is actually more interesting: we had a good number of users, and AI platforms were all the hype at the time. We found 3 VCs that took a bet on us for our seed funding in September 2018: Inventure in Finland, Frontline Ventures in Ireland, and AI Seed in London. Our first major customer after commercialization was Swedbank, the largest bank in Sweden.

On Public Research vs. VC-Funded Money

I met many people in the startup community who think that public money is a terrible waste and nothing good gets done. But in the big data space, let’s take the Apache Spark project, for example. It came from public money (developed by PhDs and post-docs at Berkeley) and was not developed in the industry (because it would take too long even to get started).

At RISE, we worked on the first version of stream processing for Apache Flink (which originated at TU Berlin), another successful project that was not developed in industry. The same can be said for our filesystem, HopsFS, which tackles scalable metadata, a well-known problem in the industry.

Because the probability of success was too low and the expected time to solve the problem was too high, industry wouldn't invest. Still, public money can attack these difficult problems. VC funds have a timeline: they have to cash out after 5–10 years, so they don't want to invest in anything with a long horizon (even if there can be a high reward at the end).

Furthermore, the success probability for a lot of research problems is not always that high. But I believe there should still be strong support for basic research to develop new systems; otherwise, these won’t be done by industry. Most incremental improvements are made by industry, but public research is the way to go if you want paradigm changes.

On Building a Feature Store

After reading the original feature store article from Uber in September 2017, I told my team that we needed a feature store right away. It took us more than a year to build our own. Uber's original Michelangelo article talks about feature engineering with a domain-specific language (DSL). If you are a data scientist, you learn this DSL and compute your features with whatever capabilities the DSL provides; if you need new capabilities, you are out of luck.

Given my background in programming languages from my Ph.D., I knew that general-purpose programming languages would eventually edge out any DSL. For data science, Python is the default language of choice. Thus, we decided to go with data frames as the API to our feature store. As you know, data frames are available in Pandas, but also in Spark. Therefore, we provided both Python-based and Spark-based clients to our feature store, and data scientists can perform feature engineering in any of the supported languages (Python, Java, or Scala).

A central problem in data science is caching features and making them available. A feature store can store and cache the pre-computed features; you can then use them to create training data or serve them directly to a model in production with low latency.
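
A minimal sketch of what this dataframe-centric workflow can look like. Every name here (my_feature_store, FeatureStore.connect, create_feature_group, get_training_data, get_online_features) is a hypothetical placeholder for illustration, not the actual Hopsworks API.

```python
import pandas as pd
from my_feature_store import FeatureStore  # hypothetical client library

fs = FeatureStore.connect("my_project")

# 1. Feature engineering is ordinary dataframe code.
tx = pd.read_parquet("transactions.parquet")
features = (
    tx.groupby("customer_id")
      .agg(avg_amount=("amount", "mean"), n_purchases=("amount", "count"))
      .reset_index()
)

# 2. Register (cache) the pre-computed features so other teams can reuse them.
fg = fs.create_feature_group("customer_spending", version=1,
                             primary_key=["customer_id"])
fg.insert(features)

# 3a. Offline: join cached features into training data.
train_df = fs.get_training_data(["customer_spending.avg_amount",
                                 "customer_spending.n_purchases",
                                 "labels.churned"])

# 3b. Online: fetch the same features with low latency at serving time.
row = fg.get_online_features({"customer_id": 42})
```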

Hopsworks is the first open-source and full-stack feature store, as far as I know. There is one called Feast, but it was built on top of Google Cloud products.

On Designing a Feature Store

A feature store isn’t for individual Python developers who typically want to cache features for usage. A feature store is often adopted by large enterprises with multiple models that use the same data. If every time someone starts a new project and writes feature engineering code to compute features, there is a huge duplication of effort. Instead, with a feature store, he/she can immediately see features that are ready to use and join them together to create training data. Once features are reused, the cost of rewriting data pipelines to get new features reduces, enabling the organizations to be more effective.

With a feature store, you have two pipelines effectively.

If you do CI/CD, you must pay attention to versioning. Classical CI/CD talks about versioning the code, while model CI/CD also has to version the data used to train the model. So you do need data versioning at some level.
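
As a small illustration of what that versioning can look like in practice (all field names here are hypothetical), a reproducible model build pins the code version and the data version together:

```python
# Hypothetical metadata record for one model build. Classical CI/CD pins only
# the code version; model CI/CD also pins the exact data the model was trained on.
model_build = {
    "model_name": "fraud_classifier",
    "code_version": "git:9f2c1ab",                 # what classical CI/CD already tracks
    "training_dataset_version": 3,                 # immutable snapshot of the training data
    "feature_group_versions": {"customer_spending": 1, "transactions": 2},
    "trained_at": "2021-03-01T12:00:00Z",
}
# Re-running the pipeline from this record can only reproduce the same model
# if the training data itself is versioned, not just the code.
```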

On Data Warehouses vs. Feature Stores

In a normal data warehouse, the data gets overwritten. In a feature store, you can use a time-travel capability: the commit history of the data is stored, so you can go back in time and look at earlier feature values.

The natural language of a data warehouse is SQL; the natural language of a feature store is Python. We know from our experience building an earlier version of Hopsworks that data scientists do not want to learn to write SQL joins. That's not their natural habitat.

Another benefit of the feature store is its ability to provide low-latency access to features during model serving. A traditional data warehouse won't be able to return hundreds of features in milliseconds. A feature store serves as a dual database: an offline store that holds large volumes of historical feature data for creating training datasets, and an online store that serves the latest feature values at low latency to models in production.
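
Here is a minimal in-memory sketch of that dual-database idea, including the time-travel point above. It is a toy for illustration only, not the Hopsworks implementation: the offline side keeps the full commit history for building training data, while the online side keeps only the latest row per key for low-latency lookups.

```python
import pandas as pd

class ToyFeatureStore:
    def __init__(self):
        self.offline = []   # append-only list of (commit_time, DataFrame): full history
        self.online = {}    # primary key -> latest feature row, for fast point lookups

    def insert(self, df, commit_time, key):
        self.offline.append((commit_time, df.assign(_commit=commit_time)))
        for row in df.to_dict("records"):
            self.online[row[key]] = row                  # overwrite with latest values

    def training_data(self, as_of, key):
        # Time travel: replay commits up to `as_of`, keeping the latest row per key.
        history = pd.concat(df for t, df in self.offline if t <= as_of)
        return history.sort_values("_commit").groupby(key, as_index=False).last()

    def get_online(self, value):
        return self.online[value]                        # point lookup at serving time

fs = ToyFeatureStore()
fs.insert(pd.DataFrame({"customer_id": [1, 2], "avg_amount": [10.0, 5.0]}),
          pd.Timestamp("2021-01-01"), key="customer_id")
fs.insert(pd.DataFrame({"customer_id": [1], "avg_amount": [12.5]}),
          pd.Timestamp("2021-02-01"), key="customer_id")

print(fs.training_data(as_of=pd.Timestamp("2021-01-15"), key="customer_id"))  # old values
print(fs.get_online(1))                                                       # latest values
```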

On Product Roadmap for Hopsworks

Customer feedback is key. Because of Logical Clocks’ link to the university, we still have a strategic long-term vision for the product and the integration needed for every system under the sun.

For the longer-term roadmap, we have a framework called Maggy, which unifies distributed training and hyper-parameter tuning with normal single-host training. We also integrated Maggy with our experiment tracking system, which keeps the assets (models, training datasets, and feature groups) in the filesystem and their metadata in the same unified metadata layer.

You write one function with the training code. Then you decide whether to run it on a single GPU or across many servers (for example, to tune hyper-parameters). The same function returns a dictionary that logs all the training information.
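
A sketch of that pattern is below. The run_distributed helper is a stand-in written for this example, not the actual Maggy API; it only shows the shape of the idea, namely that the same training function is reused unchanged for a single-host run and for fanned-out hyper-parameter trials.

```python
import random

def train(lr, batch_size):
    # Ordinary single-host training code; this toy "loss" stands in for a real model.
    loss = (lr - 0.01) ** 2 + 1.0 / batch_size + random.random() * 0.01
    return {"lr": lr, "batch_size": batch_size, "loss": loss}

# Run it directly on one GPU / one host...
print(train(lr=0.01, batch_size=64))

# ...or hand the *same* function to a runner that fans it out over many workers
# for hyper-parameter tuning and collects the returned dictionaries.
def run_distributed(train_fn, search_space, num_trials=8):
    trials = [{k: random.choice(v) for k, v in search_space.items()}
              for _ in range(num_trials)]
    results = [train_fn(**t) for t in trials]   # in reality: one trial per worker
    return min(results, key=lambda r: r["loss"])

best = run_distributed(train, {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]})
print("best trial:", best)
```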

The main focus of the company in 2021 is to be a good citizen in the Azure, AWS, and Databricks ecosystems. Furthermore, we are collaborating with RISE on the Apache Flink project to replace Flink's local state store (RocksDB) with our go-to tool, MySQL's NDB. Flink also supports real-time feature engineering, so part of our effort is to build the best feature engineering/feature store platform on the market.

On Getting Early Customers for Logical Clocks

The real hurdle for anybody who develops products is getting the first user. Logical Clocks has the backing of the university and the research institute, so we recruited a good number of folks from there to try our product and get it into a usable state.

The next phase of crossing the chasm is getting customers to believe in us. Having a developer base helped us get customers. I think the starting point for anyone who develops an open-source project and wants it to be adopted/commercialized is to write great documentation.

On Being a Professor vs. A Founder

In terms of similarities:

In terms of differences:

On The European Tech Ecosystem

I wrote a blog post basically complaining that our friends at data Artisans (who developed the Apache Flink project) were acquired by Alibaba on the cheap (a $100-million transaction). Given that they had the world's leading streaming platform, if they had been based in Silicon Valley, they would have been able to raise more money and grow to become a magnet for talent.

European companies are great at developing the initial technology. But there is a cultural chasm that prevents them from growing into huge companies that commercialize the tech and draw more talent inward. Europeans, in general, value quality of life and tend to sell out early. This is bad for Europe because Europe is really, really weak in the Internet/Big Data/AI space.