Engineering · 3 min read

Performance over Portability? Running Local LLMs on the Asus ProArt 13

Can a thin-and-light PC handle production-level LLMs? We benchmark the Asus ProArt 13’s RTX 4060 and Ryzen AI 9 NPU, and run straight into the 8GB VRAM bottleneck.

The dream of the “AI Resident” is a powerful model living on your laptop, fully air-gapped, helping you write code while you’re offline. While the M4 Max MacBook Pro gets all the headlines, the Asus ProArt 13 (PX13) represents a different philosophy: a compact Windows machine that blends NVIDIA’s CUDA ecosystem with AMD’s new “AI-first” silicon.

But on a device with 8GB of VRAM, that dream comes with a hard ceiling. You aren’t going to run Llama 3 70B with every layer resident on the GPU. To make this machine a functional tool, you have to play the quantization game and manage your memory budget with precision.

The 8GB VRAM Wall

If you’re coming from the Apple Silicon world, the biggest culture shock is the shift from “Unified Memory” to “Dedicated VRAM.”

On the Asus ProArt 13, your NVIDIA RTX 4060 has 8GB of GDDR6 memory. Period. The inference runtime can offload overflowing layers into system RAM (especially with the ProArt’s 32GB/64GB pool), but the performance penalty is massive. Once the model spills out of the GPU’s 8GB bucket and onto the system bus, your tokens per second (t/s) drop from “fluid reader” to “watching paint dry.”
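
To see why the wall sits where it does, here is a back-of-the-envelope budget check. It is only a sketch: the bits-per-weight figure approximates a Q4_K_M GGUF, and the KV-cache and runtime-overhead constants are rough assumptions, not measurements.

```python
# Rough VRAM budget check: does a quantized model fit in the RTX 4060's 8 GiB?
GIB = 1024**3

def fits_in_vram(params_b: float, bits_per_weight: float = 4.5,
                 kv_cache_gib: float = 1.0, overhead_gib: float = 0.6,
                 vram_gib: float = 8.0) -> bool:
    """Estimate whether weights + KV cache + runtime overhead fit in VRAM.

    bits_per_weight ~4.5 approximates a 4-bit GGUF (weights plus scales);
    the KV-cache and overhead values are ballpark assumptions.
    """
    weights_gib = params_b * 1e9 * bits_per_weight / 8 / GIB
    total = weights_gib + kv_cache_gib + overhead_gib
    print(f"{params_b:>4.0f}B model: ~{total:.1f} GiB needed "
          f"({'fits' if total <= vram_gib else 'spills to system RAM'})")
    return total <= vram_gib

for size in (8, 12, 14, 70):
    fits_in_vram(size)
```

Under those assumptions, the 8B and 12B models land comfortably inside 8 GiB, 14B is borderline, and 70B is hopeless without offloading, which matches the behaviour in the benchmarks below.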

The Hybrid Engine: Ryzen AI 9 + RTX 4060

The ProArt 13 isn’t just about the GPU. It carries the AMD Ryzen AI 9 HX 370, which introduces a dedicated NPU (Neural Processing Unit) capable of 50 TOPS.

In early 2026, the local LLM stack is starting to take advantage of this hybrid design:

  • RTX 4060 (CUDA): Handles the main weight of the LLM inference.
  • Ryzen NPU: Ideal for background tasks like speech-to-text (Whisper) or smaller embedding models, keeping the GPU free for the “reasoning” heavy lifting (see the sketch after this list).
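
Here is a minimal sketch of that split. It assumes the ollama Python client is installed, that you have exported a Whisper encoder to ONNX yourself (the file name below is a placeholder), and that ONNX Runtime on your machine exposes AMD’s VitisAI or the DirectML execution provider.

```python
# Hybrid split sketch: the RTX 4060 (via Ollama's CUDA backend) does the LLM
# reasoning while lighter models are pushed toward the NPU through ONNX Runtime.
import threading

import numpy as np
import onnxruntime as ort
import ollama  # pip install ollama

PREFERRED_PROVIDERS = ["VitisAIExecutionProvider",  # AMD NPU (Ryzen AI stack)
                       "DmlExecutionProvider",      # DirectML fallback
                       "CPUExecutionProvider"]


def transcribe_off_gpu(mel: np.ndarray) -> np.ndarray:
    """Run a (hypothetical) exported Whisper encoder on the best non-CUDA provider."""
    providers = [p for p in PREFERRED_PROVIDERS
                 if p in ort.get_available_providers()]
    session = ort.InferenceSession("whisper_encoder.onnx",  # placeholder path
                                   providers=providers)
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: mel})[0]


def ask_llm(prompt: str) -> str:
    """Keep the heavy reasoning on the GPU via Ollama."""
    reply = ollama.chat(model="llama3.1:8b-instruct-q4_K_M",  # example tag
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]


if __name__ == "__main__":
    # Dummy 30-second mel spectrogram (80 bins x 3000 frames), just to show the shape.
    mel = np.zeros((1, 80, 3000), dtype=np.float32)
    stt = threading.Thread(target=transcribe_off_gpu, args=(mel,))
    stt.start()  # speech-to-text runs off-GPU in the background
    print(ask_llm("Summarize what a KV cache does in one sentence."))
    stt.join()
```

The point is simply that the two workloads never compete for the same 8GB of VRAM.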

The Benchmarks: What Actually Runs?

On the ProArt 13 (RTX 4060 8GB), here is the reality of the 2026 model landscape using Ollama (a quick way to reproduce these numbers follows the list):

  • Llama 3.1 8B (Q4_K_M): ~45-50 tokens/second. (Blazing fast, perfect for coding assistants).
  • Mistral NeMo 12B (Q4_K_M): ~22 tokens/second. (The “Sweet Spot” for reasoning vs. speed).
  • DeepSeek R1 14B (Q4_0): ~15 tokens/second. (Functional, but starts to push the VRAM limits).
  • Llama 3 70B (Q4_K_M): ~1.5 tokens/second. (The “Wall.” Too much offloading to system RAM).
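
To sanity-check a tokens-per-second figure on your own hardware, the Ollama API reports token counts and timings with every response. The model tag below is an example Q4_K_M build; substitute whatever you have pulled locally.

```python
# Measuring tokens/second through the Ollama Python client.
import ollama

MODEL = "llama3.1:8b-instruct-q4_K_M"  # example tag, adjust to your local models

response = ollama.generate(
    model=MODEL,
    prompt="Write a Python function that parses an ISO 8601 timestamp.",
)

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
tps = response["eval_count"] / response["eval_duration"] * 1e9
print(f"{MODEL}: {tps:.1f} t/s, "
      f"{response['prompt_eval_count']} prompt tokens processed")
```

Running `ollama run <model> --verbose` in a terminal prints an equivalent “eval rate” without writing any code.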

For a solo developer, the 8B and 12B models aren’t just toys—they are instantaneous. 45 t/s is faster than you can think.

The Inference Stack: Ollama + WSL2 + CUDA

To get these numbers on the ProArt 13, you can’t just run a basic Python script. The “Gold Standard” stack for Windows in 2026 is Ollama running natively or via WSL2.

  1. CUDA 12.x: Essential for the RTX 4060.
  2. Flash Attention 2: Crucial for keeping context windows snappy without bloating VRAM usage.
  3. KV-Cache Quantization: By quantizing the KV cache to 4-bit or 8-bit, we can fit slightly larger models (like the 14B series) entirely within that 8GB VRAM envelope (see the config sketch below).
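
As a concrete example, recent Ollama builds expose the last two knobs through environment variables. The names and values below come from Ollama’s documentation at the time of writing; verify them against the version you have installed.

```python
# Launching the Ollama server with Flash Attention and a quantized KV cache.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "1"   # Flash Attention in the llama.cpp backend
env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"  # 8-bit KV cache; f16 / q8_0 / q4_0

# Runs `ollama serve` in the foreground with the tweaked environment.
subprocess.run(["ollama", "serve"], env=env, check=True)
```

Dropping from q8_0 to q4_0 roughly halves the cache footprint again, at some quality cost on very long contexts; that is typically the trade that squeezes a 14B model into the 8GB envelope.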

Conclusion: Is it “Pro” yet?

The Asus ProArt 13 is a masterclass in compromise. You lose the massive memory pool of an M4 Max, but you gain the NVIDIA CUDA ecosystem—the native tongue of AI research.

If your workflow involves fine-tuning smaller models, running blazingly fast 8B-14B assistants, or leveraging Windows-only creative tools alongside AI, the ProArt 13 is a superpower. Just don’t expect to run 70B behemoths on a plane. For this machine, the magic isn’t in the size of the model, but in the speed of the inference.
