
Getting the Most out of Your GPUs Using MIG

Understanding how to partition a single GPU into multiple isolated instances for cost-efficient AI workloads, with a deep dive into NVIDIA's MIG technology and the architectural differences between GKE and EKS.

Fractional GPUs in the Cloud

Introduction

The NVIDIA RTX PRO 6000 Blackwell is quite a beast: it delivers 125 TFLOPS of FP32 compute and a massive 96GB of GDDR7 memory. For training frontier models, you need every ounce of that power.

But what happens when your workload isn’t a frontier model?

  • What about a developer who needs a GPU for 5 minutes of unit testing?
  • What about running a 7B parameter Llama-3 model that only consumes 16GB of VRAM?
  • What about a microservice that runs inference on occasional user uploads?

In these scenarios, assigning a full RTX PRO 6000 is economically indefensible. You are utilizing perhaps 10-15% of the hardware while paying for 100% of it. The remaining 85-90% isn’t just idle; it’s wasted capital.

Fractional GPUs solve this problem by partitioning a single physical GPU into multiple, completely isolated virtual instances. Google Cloud’s G4 VMs support this through NVIDIA’s Multi-Instance GPU (MIG) technology.

This guide covers:

  1. The architecture of hardware-level GPU isolation.
  2. Correct implementation on Google Cloud G4 VMs (including critical RTX-specific steps).
  3. A comparative analysis of MIG on GKE vs. Amazon EKS.
  4. Real-world sizing strategies.

The Problem with Traditional GPU Sharing

Historically, if you wanted to share a GPU between two applications, you had to rely on Time-Slicing (Temporal Sharing).

The “Noisy Neighbor” Trap

Time-slicing works like a CPU scheduler: Process A runs for a few milliseconds, pauses, and then Process B runs. While this provides concurrency, it does not provide isolation (a sketch of how this mode is typically configured follows after the list below).

  • Memory Risk: Both processes share the same Global Memory address space. If Process A has a memory leak and causes an Out-Of-Memory (OOM) error, the entire GPU context crashes, killing Process B instantly.
  • Latency Jitter: If Process A initiates a heavy compute kernel, Process B has to wait in line. This creates unpredictable P99 latency spikes, which is unacceptable for real-time inference.
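For reference, this sharing mode is what you get when you oversubscribe GPUs in Kubernetes via the NVIDIA device plugin. The snippet below is a minimal sketch of that time-slicing configuration as the plugin documents it; the replica count is an arbitrary example, and note that it buys concurrency only, with none of the isolation discussed above.

# Sketch: NVIDIA k8s-device-plugin time-slicing config (assumed format, illustrative values)
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # advertise 4 schedulable "GPUs" per physical GPU; all share one memory space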

Hardware Partitioning

We don’t want to share the GPU; we want to slice it. We need the silicon to behave as if it were physically cut into smaller pieces, each with its own dedicated memory, cache, and compute cores. This is exactly what NVIDIA MIG delivers.

NVIDIA MIG Architecture Deep Dive

MIG is not software virtualization (like vGPU); it is a physical partitioning of the GPU’s internal resources.

When you configure MIG on an RTX PRO 6000, the hardware isolates the internal data paths to ensure total predictability.

Multi-Instance GPU (MIG) partitioning

1. Compute Isolation (SMs)

The GPU’s Streaming Multiprocessors (SMs) are hard-assigned to specific instances. If the GPU has 142 SMs and you create a partition using 25% of the device, roughly a quarter of those SMs (about 35) are dedicated to that partition. A workload running in Partition A physically cannot execute instructions on the SMs belonging to Partition B.

2. Memory Isolation

The 96GB of GDDR7 memory is addressed in contiguous blocks. Partition A might own addresses 0x0000 to 0x2000. It is physically impossible for it to read or write to Partition B’s memory range. This provides fault isolation—if Partition A crashes or segfaults, Partition B keeps running without a hiccup.

3. L2 Cache Partitioning

This is the “secret sauce” of MIG performance. The L2 cache is physically split. This prevents Cache Thrashing, where a memory-intensive job in one partition evicts the hot data of a job in another partition. This guarantees that your inference latency remains stable, regardless of what the other “neighbors” on the GPU are doing.

Supported Profiles

Unlike the A100, which supports up to 7 slices, the RTX PRO 6000 supports up to 4. The naming convention follows the pattern <compute-slices>g.<memory>gb, summarized in the table below (you can also query the available profiles directly, as shown after the table).

Profile  | Compute Share | Memory | Use Case
1g.24gb  | 1/4 (25%)     | 24 GB  | Light inference (Llama-7B), Dev, CI/CD
2g.48gb  | 2/4 (50%)     | 48 GB  | Medium training, Large inference (Llama-70B 4-bit)
3g.72gb  | 3/4 (75%)     | 72 GB  | Heavy workloads + sidecar monitor
4g.96gb  | 4/4 (100%)    | 96 GB  | Full GPU (MIG disabled)
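Before creating partitions, you can ask the driver which profiles it actually exposes on your hardware; the command below is standard nvidia-smi MIG tooling, though the profile IDs and SM counts it reports vary by GPU and driver version.

# List the GPU instance profiles supported on GPU 0
sudo nvidia-smi mig -i 0 -lgip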

Implementing MIG on Google Cloud G4 VMs

The Google Cloud G4 VMs, powered by RTX PRO 6000 Blackwell GPUs, support up to 4 MIG instances.

1. Enable MIG Mode. With the VM online and the NVIDIA driver installed, enable MIG at the driver level.

sudo nvidia-smi -i 0 -mig 1

# Reset the GPU to apply the change (avoids a full reboot)
sudo nvidia-smi -i 0 --gpu-reset
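It is worth confirming that the mode change took effect before moving on; the query below uses standard nvidia-smi fields and should print "Enabled".

# Confirm MIG mode is active on GPU 0
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv,noheader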

2. Create Instances (The Efficient Way). You create instances using “Profiles.” The RTX PRO 6000 uses profiles like 1g.24gb (1 compute slice, 24GB memory).

  • Note: Use the -C flag to create the Compute Instance automatically. Without this, you only create the “shell” (GPU Instance) and have to run a second command to fill it.
# Create four 24GB partitions atomically
sudo nvidia-smi mig -i 0 -cgi 1g.24gb,1g.24gb,1g.24gb,1g.24gb -C

# Verify the creation
nvidia-smi -L

Output:

GPU 0: NVIDIA RTX PRO 6000 (UUID: GPU-...)
  MIG 1g.24gb Device 0: (UUID: MIG-...)
  MIG 1g.24gb Device 1: (UUID: MIG-...)
  MIG 1g.24gb Device 2: (UUID: MIG-...)
  MIG 1g.24gb Device 3: (UUID: MIG-...)
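If you later need the full 96GB back as a single device, tear the partitions down in reverse order (compute instances, then GPU instances, then MIG mode); these are standard nvidia-smi MIG subcommands.

# Destroy compute instances, then GPU instances, then disable MIG mode
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi
sudo nvidia-smi -i 0 -mig 0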

GKE vs. EKS - A quick comparison

While MIG is an NVIDIA technology, how you consume it depends heavily on your cloud provider. This is where Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS) diverge significantly in philosophy.

The Core Difference: Mutable vs. Immutable Infrastructure

1. The Amazon EKS Approach (The “Operator” Pattern)

EKS treats MIG support as an add-on layer on top of a standard node. It follows a “Day 2 Operation” workflow.

  • Philosophy: “Here is a standard GPU node. You (the user) install software to reconfigure it.”
  • The Workflow:
  1. Provision a standard GPU node (e.g., p4d.24xlarge).
  2. Install the NVIDIA GPU Operator.
  3. Apply a Kubernetes label to the node: kubectl label node <node-name> nvidia.com/mig.config=all-1g.5gb.
  4. The Reboot Sequence: The GPU Operator detects the label, cordons the node (evicts pods), interacts with the driver to partition the GPU, and reboots the node.
  5. 10 minutes later, the node rejoins the cluster with MIG enabled (a rough sketch of this flow follows below).
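As a rough sketch, that workflow typically boils down to a Helm install plus a node label; the chart values shown here (such as mig.strategy=mixed) are assumptions about a common setup, not a prescriptive configuration.

# Install the NVIDIA GPU Operator (sketch; values vary by environment)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set mig.strategy=mixed

# Request a MIG layout; the Operator cordons, repartitions, and reboots the node
kubectl label node <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite
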
2. The Google GKE Approach (The “API” Pattern)

GKE treats MIG configuration as a first-class citizen of the Compute Engine API. It follows an “Immutable Infrastructure” workflow.

  • Philosophy: “Tell us what hardware configuration you want, and we will boot it that way.”
  • The Workflow:
  1. Create a Node Pool with a single flag: --accelerator=type=nvidia-rtx-pro-6000,count=1,gpu-partition-size=1g.24gb.
  2. Google’s control plane configures the silicon before the OS boots.
  3. The node joins the cluster instantly ready to accept workloads.

Comparative Analysis Table

Feature       | Google GKE                                   | Amazon EKS
Configuration | API-Driven: Defined in Node Pool config.     | Label-Driven: Defined via K8s labels + GPU Operator.
Startup Time  | Fast: Node boots ready-to-use.               | Slow: Node boots -> Operator acts -> Reboot -> Ready.
Auto-scaling  | Smooth: New nodes arrive ready for pods.     | Laggy: New nodes flap (up/down), causing scheduling delays.
Complexity    | Low: Managed by the GKE control plane.       | High: User manages GPU Operator versioning.
Flexibility   | Rigid: Defined at Node Pool creation.        | Flexible: Can re-label nodes to change profiles (requires drain).

GKE’s approach is a significant advantage for production environments that rely on the Cluster Autoscaler. In EKS, if traffic spikes and a new node is provisioned, that node spends its first 10 minutes performing the “reboot dance” to enable MIG. In GKE, the node arrives ready to serve traffic immediately.

MIG on GKE - Configuration Guide

Since GKE handles the heavy lifting, your configuration is simple.

1. Creating the Node Pool

You define the partition size at the infrastructure level.

gcloud container node-pools create mig-pool \
    --cluster=my-cluster \
    --machine-type=g4-standard-48 \
    --accelerator=type=nvidia-rtx-pro-6000,count=1,gpu-partition-size=1g.24gb \
    --num-nodes=3
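Once the pool is up, each node advertises the MIG slices as allocatable extended resources; a quick sanity check (the node name is a placeholder):

# The node should list nvidia.com/mig-1g.24gb under Capacity and Allocatable
kubectl describe node <node-name> | grep nvidia.com/mig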

2. Scheduling Workloads

Your Pods request the specific MIG slice as a resource limit.

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: model
    image: my-model:latest
    resources:
      limits:
        # Request exactly one 24GB slice
        nvidia.com/mig-1g.24gb: 1
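Applying the manifest and confirming placement is then plain kubectl; the filename is a placeholder for wherever you saved the spec above.

kubectl apply -f inference-server.yaml

# Verify the pod was scheduled and which node granted the slice
kubectl get pod inference-server -o wide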

Use Cases

1. The “Right-Sized” Inference Cluster

Scenario: You have a mix of models: a large heavy-duty LLM and several smaller BERT-based classifiers.

  • Strategy: Create two GKE Node Pools (see the sketch after this list).
  • pool-large: Uses 2g.48gb partitions for the LLMs.
  • pool-small: Uses 1g.24gb partitions for the classifiers.
  • Result: You achieve nearly 100% bin-packing efficiency. The classifiers don’t hog a full GPU, and the LLMs have the memory guarantees they need.
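A minimal sketch of those two pools, reusing the flags from the configuration guide above (cluster name and node counts are placeholders):

# Large slices for the LLMs
gcloud container node-pools create pool-large \
    --cluster=my-cluster \
    --machine-type=g4-standard-48 \
    --accelerator=type=nvidia-rtx-pro-6000,count=1,gpu-partition-size=2g.48gb \
    --num-nodes=2

# Small slices for the classifiers
gcloud container node-pools create pool-small \
    --cluster=my-cluster \
    --machine-type=g4-standard-48 \
    --accelerator=type=nvidia-rtx-pro-6000,count=1,gpu-partition-size=1g.24gb \
    --num-nodes=2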

2. CI/CD Pipeline Acceleration

Scenario: A nightly build runs 500 integration tests that need GPU access.

  • Strategy: Use MIG to run 4 tests in parallel per physical GPU.
  • Result: You cut your hardware costs by up to 75% and run the suite roughly 4x faster thanks to concurrency. Because of MIG’s fault isolation, if Test A crashes its GPU context, Test B running in the adjacent partition is unaffected.

3. Multi-Tenant SaaS

Scenario: You provide a platform where users can deploy their own custom models.

  • Strategy: Assign each user a dedicated MIG partition (a per-tenant quota sketch follows after this list).
  • Result: You can guarantee performance SLAs. User A cannot cause “noisy neighbor” latency spikes for User B because they are physically separated by the memory controller and L2 cache partitioning.
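One way to enforce the one-slice-per-tenant rule is a Kubernetes ResourceQuota in each tenant namespace; this is a sketch, the namespace name is a placeholder, and the resource name must match the slices your nodes actually expose.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-slice-quota
  namespace: tenant-a          # placeholder tenant namespace
spec:
  hard:
    # Each tenant may hold at most one 24GB MIG slice at a time
    requests.nvidia.com/mig-1g.24gb: "1"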

Limitations and Best Practices

While MIG is powerful, it is not a magic bullet. It does have some constraints:

  1. No Dynamic Resizing: You cannot change a partition from 24GB to 48GB on the fly. You must drain the node and recreate the instances (or in GKE, move the workload to a different Node Pool).
  2. No P2P / NVLink: MIG instances cannot communicate with each other directly. You cannot run distributed training (using NCCL) across multiple MIG slices on the same GPU. If you need multi-GPU training, disable MIG.
  3. Profiling is Different: Standard tools like nvidia-smi give confusing results in MIG mode. Use the NVIDIA DCGM Exporter in Kubernetes to get accurate per-instance metrics for observability (an install sketch follows after this list).
  4. Hardware Overhead: Enabling MIG consumes a small amount of memory for the management layer (a few hundred MBs), so a 24GB slice effectively gives you ~23GB of usable VRAM.
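A common way to get those per-instance metrics is the DCGM Exporter Helm chart; the sketch below assumes the upstream chart and its documented repository location, so adjust for your own monitoring stack.

# Install the DCGM Exporter for per-MIG-instance metrics (sketch; values vary)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace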

Conclusion

Fractional GPUs represent a paradigm shift in cloud economics. By moving from monolithic GPU allocation to granular, isolated partitioning, organizations can drastically increase utilization and reduce costs.

While both GKE and EKS support this technology, GKE’s implementation stands out for its operational simplicity. By abstracting the MIG configuration into the Compute Engine API, GKE eliminates the “reboot loops” and complex operator management required on other platforms, making it the premier choice for auto-scaling fractional GPU workloads.

For the RTX PRO 6000 Blackwell, the math is simple:

  • Without MIG: One expensive GPU serves one small workload.
  • With MIG: One expensive GPU serves four workloads with hardware-guaranteed isolation.

Next Steps: Audit your current GPU utilization. If your average VRAM usage is under 24GB per pod, create a test GKE Node Pool with gpu-partition-size=1g.24gb and migrate your dev workloads. The cost savings will likely pay for the engineering time in the first month.
