
Multi-Cloud GPU Arbitrage: Routing Workloads Between Hyperscalers and Neoclouds

Don't lock into one vendor. Learn how to use an abstraction layer to route training and inference workloads to the cheapest available capacity across hyperscalers and neoclouds.


Key Takeaways

  • Relying on a single hyperscaler for GPU capacity guarantees you will overpay and eventually hit allocation limits.
  • Multi-cloud GPU arbitrage involves programmatically routing training and inference workloads to the cheapest, most available compute across both traditional hyperscalers (like GCP) and neoclouds (like CoreWeave or Lambda).
  • To execute this, you must aggressively abstract your infrastructure using Kubernetes (GKE, EKS) and standard container registries, treating compute as a pure commodity.
  • The hardest engineering challenge is not the compute routing, but managing data gravity and the hidden costs of egress fees between cloud providers.
  • Implementing a unified orchestration layer allows you to treat spot market preemptions as routing events rather than catastrophic failures.

If you are running a large-scale AI operation, your compute bill is likely the single largest line item in your budget, right after payroll. For the past decade, the standard enterprise playbook was simple: sign a massive commit with a single hyperscaler, take the volume discount, and build deeply into their proprietary ecosystem. You married Google Cloud Platform, or AWS, or Azure.

This strategy is breaking down. The demand for high-end accelerators like H100s and TPUs has created massive allocation bottlenecks. If you are locked into a single provider and they cannot fulfill your quota request, your engineering team stops moving. You are effectively paying a premium for the privilege of waiting in line.

We need to rethink compute. Compute is not a relationship; it is a commodity (though I understand that is a nuanced statement). The rise of specialized GPU neoclouds like CoreWeave and Lambda Labs, combined with the massive spot capacity occasionally available on hyperscalers, has created a fragmented market. If you are willing to abstract your workloads, you can exploit this fragmentation. You can build an architecture that routes your training jobs and inference endpoints to whoever has the cheapest, most available capacity at any given moment. This is multi-cloud GPU arbitrage.

In this walkthrough, we are going to look at how you architect an infrastructure capable of this kind of dynamic routing. We will discuss the required abstractions, the orchestration layer, and the terrifying reality of data egress fees. If you want a deeper look at the specific challenges of spot instances in this context, you should read my breakdown on Spot Market Arbitrage for AI.

Abstracting the Environment

The prerequisite for arbitrage is absolute portability. If your training script relies on a proprietary orchestration tool specific to one cloud provider, you cannot move it. If your inference server relies on a managed service that only exists in AWS, you are trapped.

You must standardize on Kubernetes. Kubernetes is the universal adapter for modern infrastructure. Whether you are spinning up a cluster on Google Kubernetes Engine (GKE), Elastic Kubernetes Service (EKS), or a bare-metal cluster on a neocloud, the API surface remains identical.

Your workloads must be fully containerized. Your training jobs must be packaged as Docker images. Your dependencies, your drivers, and your environment variables must be baked in or injected dynamically at runtime.

# A standard Kubernetes Job definition for a training run.
# Notice there is nothing provider-specific here. This can run on GKE or CoreWeave.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  template:
    spec:
      containers:
        - name: training-container
          image: gcr.io/my-project/training-image:v2.4
          resources:
            limits:
              nvidia.com/gpu: 8 # Requesting 8 GPUs
          env:
            - name: TRAINING_DATA_PATH
              value: 's3://my-universal-bucket/dataset/'
      restartPolicy: Never

When you achieve this level of abstraction, spinning up a workload on a new cloud provider is no longer a migration project; it is just an API call. You point your CI/CD pipeline at a different Kubernetes endpoint, and the exact same job starts running on different silicon.
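To make that concrete, here is a minimal sketch using the official Kubernetes Python client. The context names, and the assumption that your kubeconfig already holds credentials for each cluster, are illustrative:

# Submit the exact same Job manifest to whichever cluster the orchestrator selected.
# Assumes your kubeconfig already has a context per cluster; the names are illustrative.
from kubernetes import config, utils

def submit_job(context_name: str, manifest_path: str = "job.yaml") -> None:
    # Build an API client bound to the chosen cluster's kubeconfig context
    api_client = config.new_client_from_config(context=context_name)
    # Apply the same manifest, whether the cluster is GKE, EKS, or a neocloud
    utils.create_from_yaml(api_client, manifest_path)

# The same containerized job lands on different silicon with one call:
submit_job("gke-us-central1")
submit_job("coreweave-ord1")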

The Orchestration Layer

Once your workloads are portable, you need a brain to manage the routing. You need an orchestration layer that sits above the individual cloud providers. This orchestrator is constantly polling the APIs of GCP, CoreWeave, and Lambda Labs. It is checking two things: price and availability.

Let us say your team submits a batch inference job that requires 64 GPUs for four hours.

The orchestrator checks the spot pricing on GCP. It might be $2.00 per hour, but the availability is low, and the risk of preemption is high. It checks CoreWeave. The on-demand price is $2.50 per hour, but they have immediate availability. It checks Lambda. The price is $1.80 per hour, and they have the capacity.

The orchestrator dynamically provisions a Kubernetes cluster on Lambda, deploys the inference containers, executes the job, retrieves the results, and tears the cluster down. The engineering team that submitted the job never knows, and never cares, where the compute actually happened. They just get their results back faster and cheaper.
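Below is a minimal sketch of that routing decision in Python. The Quote structure and the numbers are made up for illustration; a real orchestrator would pull them from each provider's pricing and capacity APIs:

# A minimal sketch of the routing decision, with made-up provider quotes.
from dataclasses import dataclass

@dataclass
class Quote:
    provider: str
    price_per_gpu_hour: float
    gpus_available: int
    preemption_risk: float  # 0.0 (on-demand) up to 1.0 (very likely to be reclaimed)

def route(job_gpus: int, quotes: list[Quote], max_risk: float = 0.5) -> Quote:
    # Keep only offers with enough capacity and acceptable preemption risk
    viable = [q for q in quotes if q.gpus_available >= job_gpus and q.preemption_risk <= max_risk]
    if not viable:
        raise RuntimeError("No provider can satisfy this request right now")
    # Cheapest viable offer wins
    return min(viable, key=lambda q: q.price_per_gpu_hour)

quotes = [
    Quote("gcp-spot", 2.00, 32, 0.7),    # cheap, but risky and short on capacity
    Quote("coreweave", 2.50, 128, 0.0),  # immediately available, pricier
    Quote("lambda", 1.80, 64, 0.1),      # cheapest with enough capacity
]
print(route(job_gpus=64, quotes=quotes).provider)  # -> lambda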

Building this orchestrator from scratch used to require a robust custom control plane, typically built on tools like Crossplane or custom Terraform operators, and it is a notoriously difficult engineering problem.

Fortunately, an ecosystem of ISVs and open-source platforms has emerged specifically to solve this routing challenge. You no longer have to build the polling and provisioning engine yourself:

  • SkyPilot: An open-source framework (originally out of UC Berkeley) that abstracts away cloud infrastructure. It automatically finds the cheapest zone, region, or cloud for your job, manages spot instance retries, and supports autostop for idle resources.
  • dstack: A lightweight orchestration tool focused on simplifying AI/ML execution across various clouds and on-prem hardware without needing deep Kubernetes expertise.
  • Run.ai: An enterprise-grade platform focused on GPU orchestration and virtualization within Kubernetes, pooling resources across hybrid and multi-cloud environments.

Implementation Example: Routing with SkyPilot

To make this concrete, let’s look at how you implement multi-cloud arbitrage using SkyPilot. Instead of writing custom Terraform for AWS, GCP, and CoreWeave, you define your workload in a single declarative YAML file (task.yaml):

name: distributed-training-job

resources:
  accelerators: A100:8 # Requesting 8x A100 GPUs
  use_spot: true # Aggressively bid on spot instances to maximize savings

setup: |
  pip install -r requirements.txt

run: |
  python train.py --data s3://my-universal-bucket/dataset/

To execute this across your connected clouds, you simply run:

sky launch task.yaml

Behind the scenes, SkyPilot evaluates the real-time spot pricing and availability across all connected providers (AWS, GCP, Azure, Lambda Labs, etc.). It automatically provisions the cheapest available instances, configures the environment, syncs the code, runs the job, and tears down the infrastructure when the training completes. If a preemption occurs, the framework can automatically retry the job on the next cheapest available cloud, treating spot preemption as a standard routing event rather than a critical failure.

The Gravity of Data (And Egress Fees)

This is where the dream of multi-cloud arbitrage usually hits a brick wall. Compute is easy to move. Data is incredibly heavy, and moving it is financially punitive.

If you store a 50-terabyte dataset in a Google Cloud Storage (GCS) bucket, and you decide to route your training job to CoreWeave because the GPUs are cheaper, you have to pull that 50TB of data out of GCP. The egress fees you will pay to move that data across the public internet will instantly vaporize any savings you gained from the cheaper GPUs.

This is the hidden tax of the hyperscalers. They make it cheap to bring your data in, and exorbitantly expensive to take it out. For more on the specific networking bottlenecks this creates, see my post on Network Design for AI Workloads.

To solve this, you must decouple your storage from your compute.

Option 1: The Neutral Storage Hub

You host your massive datasets in a neutral colocation facility (like Equinix) or a specialized storage provider that does not charge punitive egress fees. You then establish dedicated, high-speed interconnects (like Google Cloud Interconnect or AWS Direct Connect) from this neutral hub to the various cloud providers. When a training job spins up on GCP, it pulls the data over the dedicated line. When it spins up on a neocloud, it pulls the data over their respective interconnect. You pay for the dedicated line, but you avoid the variable, unpredictable internet egress fees.

Option 2: Regional Caching

If setting up dedicated hardware is too heavy, you must architect an aggressive caching layer. If you decide to route a series of jobs to CoreWeave, you pay the egress fee to move the dataset once. You store it locally on the neocloud’s NVMe storage. You then run multiple training iterations against that cached dataset to amortize the initial egress cost. Your orchestrator must be smart enough to know where the data currently lives and factor that into the routing decision. If the data is already on GCP, a slightly more expensive GPU on GCP is mathematically cheaper than a cheaper GPU on CoreWeave plus the data transfer fee.
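Here is a rough sketch of that calculation, reusing the 64-GPU, four-hour job and the 50 TB dataset from earlier, and assuming roughly $0.08 per GB for internet egress (actual rates vary by provider, tier, and volume):

# A rough sketch of data-gravity-aware routing. The ~$0.08/GB egress rate is an assumption.
def effective_cost(gpu_price_per_hour: float, gpus: int, hours: float,
                   dataset_gb: float, egress_per_gb: float, data_is_local: bool) -> float:
    compute = gpu_price_per_hour * gpus * hours
    # One-time transfer cost if the dataset has to leave its current provider
    transfer = 0.0 if data_is_local else dataset_gb * egress_per_gb
    return compute + transfer

# 64 GPUs for 4 hours against a 50 TB dataset that currently lives in GCS
on_gcp = effective_cost(2.00, 64, 4, 51_200, 0.08, data_is_local=True)        # $512 total
on_neocloud = effective_cost(1.80, 64, 4, 51_200, 0.08, data_is_local=False)  # ~$4,557 total
print(on_gcp, on_neocloud)  # the "cheaper" GPU loses once the transfer is priced in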

Preemption as a Routing Event

When you are playing the arbitrage game, you are heavily reliant on spot instances. Spot instances are spare capacity that the cloud provider sells at a massive discount, with the caveat that they can take the GPUs back at any moment with only a few seconds of warning.

In a traditional architecture, a preemption is a failure. The job crashes, and an engineer gets paged.

In an arbitrage architecture, a preemption is just a routing event. You design your training loops with aggressive, highly frequent checkpointing. You save the model state to your distributed storage every few minutes.

When GCP sends a preemption signal to your GKE node, your orchestrator catches the signal. It gracefully shuts down the container, saving the final checkpoint. It then immediately scans the market. If GCP spot capacity is gone, it looks at CoreWeave. It provisions a new node, pulls the latest checkpoint, and resumes the training loop exactly where it left off.
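A minimal sketch of what this looks like inside the training loop, with stand-in functions for the actual training step and checkpoint save, and an illustrative checkpoint path:

# Catch the termination signal, flush a checkpoint, and exit cleanly so the
# orchestrator can resubmit the job on another provider.
import signal
import sys
import time

CHECKPOINT_PATH = "/mnt/shared/checkpoints/latest.pt"  # illustrative shared-storage path
preempted = False

def handle_preemption(signum, frame):
    # Kubernetes sends SIGTERM when a spot/preemptible node is being reclaimed
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, handle_preemption)

def train_one_step(step: int) -> None:
    time.sleep(0.1)  # stand-in for a real forward/backward pass

def save_checkpoint(path: str, step: int) -> None:
    print(f"saving checkpoint at step {step} to {path}")  # stand-in for a real save

start_step, total_steps = 0, 10_000  # resume logic would read start_step from the last checkpoint
for step in range(start_step, total_steps):
    train_one_step(step)
    if step % 200 == 0 or preempted:
        save_checkpoint(CHECKPOINT_PATH, step)  # frequent, cheap checkpoints
    if preempted:
        sys.exit(0)  # clean exit; the orchestrator reroutes and resumes from the checkpoint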

The system heals itself. The workload moves seamlessly across the fragmented market, chasing the cheapest compute, oblivious to the underlying hardware changes.

The Future of Compute Procurement

The era of the single-vendor lock-in for AI compute is ending. The margins are too tight, and the hardware scarcity is too severe.

By abstracting your workloads to Kubernetes, decoupling your storage to manage egress, and building an intelligent orchestration layer, you can turn the fragmented GPU market into a massive competitive advantage. You stop begging your account manager for allocation and start treating compute like the fluid, tradable commodity it is. It is a complex engineering challenge, but in today's market, it is the only way to scale without breaking the bank.
