Kubernetes : the runtime for AI

Lucian Cioranu
Co-Founder @ DataGrid Software

PROGRAMMING

Kubernetes : the runtime for AI

Having an AI solution, beyond any software and techniques, is about time spent in obtaining meaningful results, and time translates into hardware requirements and management. These requirements may vary depending on the task at hand. In some cases, a model can be trained on a single CPU/GPU machine, while in other cases you will need a highly scalable infrastructure to cope with the load that can be either influenced by the amount of data, or the techniques used to train the model. Having an infrastructure that can quickly adapt to the load is essential in getting the results, and this is precisely what Kuberneetes is all about.

To explain what Kubernetes is and why it is relevant for AI, let's look at what building an AI solution requires:

training the model: it may take many hours (even days) and it may require specialized hardware
managing the models: usually more than one model needs to be trained, tested and deployed. Deploying a new model must be made in a non-disruptive way and you must be able to test, monitor and possibly revert your changes easily.
using the model: predictions need to be made quickly, using a multitude of models
availability and scalability: a good model is useless if it's not easily accessible by users and applications
efficiency and cost: you need to balance all the performance requirements with the budget constraints
security: this is listed last but it should be first

There are many things to talk about Kubernetes on what it is or what it is not. In this article I'll walk you through general aspects of Kubernetes and how it is used to build fast, scalable and reliable solutions. This is by no means meant to be a Kubernetes tutorial but rather to give you a high-level view on why it is a platform suited for AI.

All things have a start and this start is the need for containerization. What is containerization and why do we need it? Well, those who remember the "good" old days of building and deploying applications will also remember the pains of managing and scaling these applications. One fundamental problem was that the processes that we deployed had access to all system resources with few mechanisms to constrain them. By resources I mean things like CPU, RAM, Disk, Network and so on. Why does this matter? Well, it turns out that this is critical because you want to use data centers in the most efficient and profitable way. Even though data centers use powerful machines (server-class machines) with lots of CPUs, RAM, DISK, GPUs, TPUs, FPGA, these resources are not infinite nor free.

What about using commodity hardware? These are less capable machines, but they are considerably cheaper and you could use lots of them. Well, this is not very practical because it leads to physical space problems and power consumption, which leads to increased costs and larger carbon footprints. This solution simply does not scale and practice has shown it.

Returning to the need of containers and managing resources, the very notion of public clouds wouldn't have existed today without solutions for resource management and security constraints.

We also cannot talk about containerization without talking about virtual machines (VM). VMs have been around for a long, long time but containers are slightly newer. VMs allow for partitioning the physical machine resources in a way that allows different operating systems (OS) on the same physical machine (aka bare-metal).

Fig 1 : VM vs Container

As shown in the picture above, each VM comes with its own OS. You can have multiple VMs with the same or different OS on the same physical machine. On the other hand, we see that containers share the same OS. Yes, a single OS manages multiple containers, so in this case you are bound to the same OS. Linux is most common in the container world, but Windows also supports containers.

Much like a VM, each container is bound to the system resource allocated, but unlike VMs they are cheaper to manage and significantly faster to start. This is why Containers quickly became so popular. In practice, VMs still play a crucial role for cluster level separation, as one can create clusters using different OS distribution and version. This means that consumers are still insulated from the bare-metal machines.

But what are containers exactly? We've still not answered this question. Without getting into details, Linux containers are based on CGroups, which is a Linux kernel feature that allows limiting system resources that processes can "see". However, almost nobody uses cgroups directly, as there are better technologies that allow us to use this capability, like Docker. Furthermore, there is the Open Container Initiative which is a specification for containerization which, in turn, makes Doker just an implementation. In reality, all cloud providers predominantly support and use Docker. You may have also seen articles that Kubernetes is deprecating the support for Docker. This is because, thanks to OCI (Open Container Initiative), Docker (built on top of containers) can be easily replaced by other container runtimes such CRI-O. The Docker files that everyone is used to are part of OCI, so there is no impact on existing applications. While we are on this topic, Kata Containers are very interesting to look at.

So, how does Kubernetes relate to all this? Isn't Docker enough if we want to deploy containers and manage resources? The short answer is no. Many service architectures rely on micro-services concepts or the solution is composed of various types of components. These components need to communicate with each other, to manage failover, roll updates, control security etc. Managing all these aspects at container level is difficult. Kubernetes allows us to define higher level constructs to better manage deploying, scaling and handling failures. Concepts such as PODs, Services, Jobs, Deployments and more allow us to better modularize and package the software components and the workloads.

Kubernetes resources

There are plenty of materials and online tutorials about Kubernetes (k8s), so describing all these here may seem redundant. However, a brief overview cannot hurt anyone: :)

Pods

A logical grouping of multiple containers that form a single unit. PODs are the atomic deployable construct. It's important to understand that PODs share the same volume or network resources. The containers of a POD are always co-located and deployed together. A POD is described by a yaml file.

Deployments

A Deployment provides a declarative way on how PODs are being created and managed. For instance this is how you'd create multiple replicas of a POD that runs a server. Remember the old days where servers need to scale horizontally to meet the scalability and throughput requirements but also provide high-availability (HA).

Services

Provides a way to expose an application (a server) to be accessible from outside the Kubernetes cluster. A Service is also described by a yaml file.

Jobs

Provide a way to define one or more PODs that perform a certain task and the cluster resources are released. Examples ETL jobs, machine learning training jobs etc

CustomResources (CRs)

Yes, Kubernetes allows you to define your own objects/resources (via yaml file) by first defining a CustomResourceDefinition (CRD) for your resource and then be able to create multiple CR resources. CRDs are to CRs what types are to values in programming languages :) These concepts represents the very foundation of Kube Operator pattern - which I'd urge you to look into if you want to dive in Kubernetes

There are many other types of Kubernetes resources like ReplicaSets, StatefulSets, DaemonSets, Secrets, ConfigMaps, Volumes, PersistentVolumeClaims etc that are beyond the scope of this article.

How do you create and then manage all these things? There are some simple ways:

Use the kubectl tool - a command line tool that you can use to communicate with your Kubernetes cluster. For instance, to create a POD you need:
1. A .yaml file that contains the POD specification
2. Example: kubectl apply -f ./my-pod.yaml
3. Monitor the PODs status: kubectl get pods
Use a programmatic way to communicate with the Kubernetes cluster. This can be done in various ways:
1. Use the Kube REST APIs
2. Use various SDKs available for various languages. Probably the best/complete language binding is for GoLang (guess why). However, there are also clients for Java, Scala, Python Rust and many other languages.

Why does all this matter for AI ?

Now, we are better prepared to answer how this plays a role for AI. If we look at machine learning and AI, no matter what algorithm and techniques you are using, you still need to make them operational. This means a few things:

You need computing resources to perform expensive ML/DL training, to perform hyper parameters optimization and model selection etc. These are extremely compute intensive tasks that require multiple CPUs, lots of RAM, and very often GPU or TPU resources. With a high demand for resources, you need an efficient way to schedule these workloads and run them in a controlled and secure way.
You need compute resources to make your machine learning operational and to monitor the performance of those models. If you deploy your machine learning model as a service, say you have a REST API to predict a certain outcome given a new set of features in real time, you need a system that provides you the ways to manage and scale your service on demand without human intervention.

In short, to make AI operational you need a good way to allocate hardware and software resources, as well as manage security in a coherent and flexible manner.

Models as services (aka online predictions)

Almost all big cloud providers (Google, Amazon, IBM, Azure) provide this function. The idea is simple: deploy a model and use an API (REST/GRPC) to make predictions in near real time. A simplistic architectural view of such service can be:

Fig 2

We highlight 2 cases here:

Multiple models co-located in the same Container so multiple models are kept and managed in memory
A single model exists per container (most commonly used pattern)

The proxy has the role of routing requests to the correct runtime based on the metadata kept (in many cases in ECTD database). It can also perform tasks like authentication (like OpenId Connect- token validation) and/or authorization.

We can also observe that we have multiple runtimes materialized as multiple PODs. This means that we can easily use Python-based, native (C/C++ Rust), JVM-based, R etc runtimes. This is actually important because the landscape of ML frameworks is quite vast. Various programming languages are used and these require different dependencies.

Models as jobs

Things seem simpler in this case because the idea is to:

Read data from a database
Predict each records
Write predictions to a database

Once the job is finalized, the cluster resources are released.

Models for streaming data

The flow here is similar with batch:

Listen for incoming events from a stream of data (i.e. Kafka, AMQP)
Predict each record
Emit a new event that contains the prediction to an output stream of data

Unlike a batch job, the process does not end by itself. It needs to be managed, as this is a continuous process. Therefore, in a way this is also similar to the "Models as service" case.

Conclusion

AI is much more than model training and model inference/scoring. AI applications use various machine learning techniques for solving problems. The users of those applications don't see and don't usually care what is behind the scenes, they only care about the decisions made by the application, decisions that look incredibly similar to those a human would take. The Kubernetes platform is becoming increasingly important for public / private clouds or enterprises from CI/CD perspective, scaling, monitoring, and operational costs - essentially making solutions available to consumers faster and more reliably.