I was working on a project called Distributed Hyperparameter Tuning at the Edge (DHPT), aimed at building a serverless framework for distributed hyperparameter tuning on resource-constrained edge devices using Fission. We used Jetson Nano, Jetson Orin and Raspberry Pi devices in a K3s cluster.
The goal was to run models like Random Forest, XGBoost and AutoEncoders with tuning algorithms such as Bayesian Optimization, Random Search, Grid Search, Tree-structured Parzen Estimator and Stochastic Hill Climbing on our framework, then compare metrics such as computation time, memory usage, CPU and GPU utilization, and R-squared against the same tuning performed sequentially.
AutoEncoders require GPU access, while Random Forest and XGBoost can run efficiently on CPUs. The Orin and Nano are equipped with GPUs, but the Raspberry Pis do not have one. When deployed as pods, the Kubernetes scheduler places each pod on any available node based on resource availability, which can land GPU-dependent pods on Raspberry Pis and cause deployment failures.
After brainstorming some crazy ideas like writing our own scheduler or re-architecting around a Flask-based service to distribute workloads, we finally stumbled upon the true savior: Node Affinity.
According to the official Kubernetes documentation (https://kubernetes.io/), Node Affinity is a feature that allows users to constrain which nodes a pod can be scheduled on, based on node labels.
It also offers two ways to declare scheduling rules:
- requiredDuringSchedulingIgnoredDuringExecution: a hard requirement; the scheduler places the pod only on nodes matching the rule.
- preferredDuringSchedulingIgnoredDuringExecution: a soft preference; the scheduler favors matching nodes but can fall back to others.
Since we need the AutoEncoder pods to run only on nodes with a GPU, we used the former (the hard requirement) for our implementation.
We manually labeled the Orin nodes as gpu: orin (as shown in the figure) and did the same for the Nano nodes, labeling them gpu: nano.
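For reference, labeling a node boils down to a one-line kubectl command per node; the node names below are placeholders, not our actual hostnames:

```bash
# Attach the gpu label to each Jetson node (replace the placeholders with real node names)
kubectl label nodes <orin-node-name> gpu=orin
kubectl label nodes <nano-node-name> gpu=nano

# Confirm the labels were applied
kubectl get nodes --show-labels
```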
With these labels in place, we updated our deployment configuration to include Node Affinity. As shown in the figure below, we specified that the AutoEncoder pods should be scheduled only on nodes with the gpu: orin or gpu: nano labels.
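The affinity stanza itself is short. Here is a minimal sketch of what goes into the pod spec, with everything unrelated to scheduling elided:

```yaml
# Pod spec excerpt: restrict scheduling to nodes labeled gpu=orin or gpu=nano
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu
          operator: In
          values:
          - orin
          - nano
```

The In operator matches either label value, so a single rule covers both GPU-equipped device types.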
All the pods came up and running as expected: the AutoEncoder pods were deployed on the Orin and Nano devices and kept off the Raspberry Pi nodes.
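Verifying the placement is as simple as checking which node each pod landed on:

```bash
# The NODE column shows where each pod was scheduled
kubectl get pods -o wide
```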
This is how Node Affinity helped us schedule GPU workloads efficiently, saving the day without adding complexity or requiring architecture changes.