You can plug a Thunderbolt eGPU into a Mac to play games and accelerate AI now

This article details how to connect a desktop NVIDIA RTX 5090 GPU to an M4 MacBook Air via Thunderbolt and use PCI passthrough into a Linux VM to enable gaming and AI inference. It covers the technical challenges, including the lack of macOS drivers, host kernel panics from PCI BAR mapping, and DMA issues, and it provides benchmarks.

Key points

  • A Thunderbolt dock connects the external GPU to the Mac; using it requires PCI passthrough into a Linux VM.
  • macOS on Apple Silicon lacks NVIDIA/AMD GPU drivers, so the setup relies on a Linux guest running under Hypervisor.framework.
  • Technical hurdles include host kernel panics from PCI BAR mapping and DMA configuration.
  • Benchmarks show good performance in 1080p gaming and AI inference, though below a native desktop.

Why it matters

This matters because a Thunderbolt eGPU can now be used with an Apple Silicon Mac for gaming and AI acceleration, by passing the GPU through to a Linux VM.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Strap a 600W GPU to your 22W CPU

Table of Contents

Never tell me the odds

What’s a Thunderbolt eGPU?

What about tinygrad?

The existing Linux driver

Engineering PCI Passthrough on macOS

PCI device basics

Mapping PCI BARs

DMA

DMA on Apple Silicon

apple-dma-pci

NVIDIA alignment quirk

Coalescing mappings

Other performance concerns

Scheduling

Total store ordering

Benchmarks

CPU comparison

Cyberpunk 2077

720p Low

1080p

4K

Takeaways

GravityMark

Shadow of the Tomb Raider

Horizon Zero Dawn Remastered

Doom (2016)

Can it run Crysis?

AI Inference

Qwen 3.6

Gemma 4

Can I run this?

Get notified

Conclusion

Follow-on

Credits

What if you could strap a full desktop GPU to your MacBook Air? Turns out, you can.

Just a quick FTC-required note: When you buy through my links, I may earn a commission.

Never tell me the odds

As much as I hate to admit it, step one in most of my projects now is to ask AI about it. Maybe it’ll tell me something I don’t know.

Fortunately, borderline-impractical is kind of my thing.

What’s a Thunderbolt eGPU?

Ok, so the plan is to plug a big PC gaming GPU, an NVIDIA RTX 5090, into my M4 MacBook Air. To do that, we plug it into a Thunderbolt dock which adapts PCIe to Thunderbolt, and we plug that into a USB-C port.

Thunderbolt tunnels PCIe over a USB-C cable, so from the computer’s perspective a Thunderbolt device really is a PCIe device, not a USB one. You get the equivalent of 4 PCIe lanes over a link of up to 40Gbps on Thunderbolt 4 (the PCIe tunnel itself tops out at roughly 32Gbps), with a small performance penalty for the tunneling. USB4 includes the same PCIe tunneling as an optional feature, so some non-Thunderbolt USB4 ports can do this too. You can use this to plug a GPU into a laptop with a compatible port.
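For a rough sense of scale, here is the back-of-the-envelope link math as a small C sketch. The inputs are assumptions rather than measurements: PCIe 3.0 signaling (8 GT/s per lane, 128b/130b encoding) for the tunneled x4 link, and PCIe 5.0 x16 (32 GT/s per lane) for the desktop slot an RTX 5090 would normally sit in. The function name is made up for this example.

```c
#include <assert.h>

/* Usable one-direction bandwidth of a PCIe link in Gbit/s.
 * gt_per_s is the raw signaling rate per lane (assumed here:
 * 8.0 for PCIe 3.0, 32.0 for PCIe 5.0). PCIe 3.0 and later use
 * 128b/130b line encoding, so 128 of every 130 bits carry data.
 * Packet and tunneling overhead are ignored. */
double pcie_link_gbps(double gt_per_s, int lanes)
{
    assert(gt_per_s > 0.0 && lanes > 0);
    return gt_per_s * lanes * 128.0 / 130.0;
}
```

Under those assumptions, pcie_link_gbps(8.0, 4) comes out to about 31.5 Gbit/s (roughly 3.9 GB/s), while pcie_link_gbps(32.0, 16) is about 504 Gbit/s (roughly 63 GB/s). So the tunneled link has around one sixteenth of the bandwidth of the desktop slot, before any tunneling overhead.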

Thunderbolt from the laptop plugs into the GPU dock. The GPU plugs into the monitor via DisplayPort. Shortly after this was taken, I broke this dock.

From the computer’s perspective, the device looks more or less like a slightly slower PCIe device, so you can usually use the same drivers you’d normally use for those devices. eGPUs work pretty much out of the box on Linux and Windows. It’s even possible to use one on a Raspberry Pi (albeit with OCuLink, not Thunderbolt).

The first hurdle is that macOS does not ship with drivers for NVIDIA or AMD GPUs on Apple Silicon.

What about tinygrad?

tinygrad recently released their own macOS eGPU drivers. It’s a whole new AI stack with its own open source driver pipeline for NVIDIA and AMD hardware.

Sadly, if your main objective is to run AI inference or play games, tinygrad probably isn’t the solution you’re looking for. This video by YouTuber Alex Ziskind shows that using an eGPU via tinygrad for inference is about 10 times slower than running native Metal inference directly on an M4 Pro without an eGPU. You can only use the tinygrad eGPU driver with the tinygrad stack, not for anything else. It also has very limited support for different AI models.

Getting NVIDIA PTX code running on the GPU is one thing. Writing a full general-purpose display driver that works with arbitrary software is a significantly harder problem. So for now, what can you actually do with an eGPU and a Mac?

The existing Linux driver

Linux can run on Apple Silicon Macs now. Regrettably, at this time, the Linux kernel does not support Thunderbolt on Apple Silicon (only internal devices and USB3). But…

You can run Linux in a 64-bit ARM VM on a macOS host. macOS supports Thunderbolt devices. Linux supports NVIDIA GPUs. Let’s put the pieces together and pass through the GPU into the Linux VM.

At a high level, we’re just going to put the GPU in the Linux VM. The VM is the same architecture as the Mac host (arm64), so guest code runs on the real CPU without instruction emulation and performance should be comparable to the host. Of course, the devil is in the details.

There is no driver for NVIDIA cards on ARM64 Windows. That’s why we use Linux.

For a quick video demo of the result, take a look:

In the rest of the post, I’ll go through the long and winding road of getting this to actually work. If you just want to see screenshots and benchmarks, you can probably skip to the benchmark section.

Engineering PCI Passthrough on macOS

PCI device basics

Let’s look at two things we need working for the VM to talk to the PCI device:

  • PCI BARs (Base Address Registers) - Each PCI device communicates through chunks of memory that the computer can read and write to. There’s basically a reserved region of memory on your computer for each device. Those memory regions have to be mirrored into the VM for PCI passthrough to work.

  • DMA (Direct Memory Access) - This is how the device can read and write information directly in/out of your computer’s memory. Instead of having the CPU burn cycles copying data to and from the device, the device can copy the data itself. For a GPU, it might be used to copy textures directly from the computer’s memory into its own video memory.
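To make the BAR side a little more concrete, here is a minimal sketch of how the flag bits in a raw BAR value are usually decoded. The bit layout follows the PCI specification (bit 3 is the same prefetchable bit that shows up as a 0x08 mask later in this post); the struct and function names are made up for illustration and are not part of any driver discussed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits in the low nibble of a PCI Base Address Register. */
#define BAR_IO_SPACE      0x1u  /* bit 0: 1 = I/O port BAR, 0 = memory BAR */
#define BAR_MEM_TYPE_MASK 0x6u  /* bits 2:1: 00 = 32-bit, 10 = 64-bit */
#define BAR_MEM_TYPE_64   0x4u
#define BAR_PREFETCHABLE  0x8u  /* bit 3: reads have no side effects */

/* Hypothetical result struct for this sketch. */
typedef struct {
    bool is_io;
    bool is_64bit;
    bool prefetchable;
    uint64_t base;  /* address with the flag bits masked off */
} bar_info;

/* Decode one BAR. `hi` is the next config-space dword: it holds the
 * upper half of the address for a 64-bit BAR and is ignored otherwise. */
bar_info decode_bar(uint32_t lo, uint32_t hi)
{
    bar_info info = {0};
    if (lo & BAR_IO_SPACE) {
        info.is_io = true;
        info.base = lo & ~0x3u;  /* I/O BARs only use bits 1:0 as flags */
        return info;
    }
    info.is_64bit = (lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
    info.prefetchable = (lo & BAR_PREFETCHABLE) != 0;
    info.base = lo & ~0xFu;
    if (info.is_64bit)
        info.base |= (uint64_t)hi << 32;
    return info;
}
```

For example, decode_bar(0x8000000C, 0x4) describes a 64-bit prefetchable memory BAR based at 0x480000000. Large GPU apertures are typically 64-bit prefetchable BARs like this one, though the exact layout depends on the card.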

Mapping PCI BARs

When QEMU starts a VM, it sets up the guest’s memory layout. For normal RAM, this boils down to a call to hvf_set_phys_mem() in QEMU, which uses the Hypervisor.framework method:

hv_vm_map(mem, guest_physical_address, size, HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC);

Next, we connect to the host PCIDriverKit driver and ask to map the memory from the PCI device into our process. (I’m leaving the driver-side code out for now, but it’s very similar boilerplate.)

// map BAR0 into the current process and set addr to the location
// where it was mapped
mach_vm_address_t addr = 0;
mach_vm_size_t size = 0;
IOConnectMapMemory64(driverConnection, 0, mach_task_self(), &addr, &size, kIOMapAnywhere);

Ok, so then we have addr, which now points to the BAR0 memory that we can access directly in our process. At this point you can just read and write stuff to it, like any other piece of memory.

volatile uint32_t *bar0 = (volatile uint32_t *)addr;
printf("BAR0[0] = 0x%x\n", bar0[0]);
// this would output: BAR0[0] = 0x1b2000a1
// which is a device-specific constant that describes my RTX 5090
//
// BAR0[0] is the BOOT_0 register. The fields break down as:
//   arch      = 0x1b → GB200 GPU family
//   impl      = 0x2  → GB202 die (RTX 5090)
//   major_rev = 0xa  → stepping A
//   minor_rev = 0x1  → revision 1 (together: stepping A1)
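The field breakdown above can be reproduced with a few shifts and masks. This is a minimal sketch: the bit positions (arch in bits 28:24, impl in 23:20, major_rev in 7:4, minor_rev in 3:0) are an assumption inferred from the example value 0x1b2000a1, and the struct and function names are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical container for the BOOT_0 fields described above. */
typedef struct {
    uint32_t arch;       /* assumed bits 28:24 */
    uint32_t impl;       /* assumed bits 23:20 */
    uint32_t major_rev;  /* assumed bits 7:4 */
    uint32_t minor_rev;  /* assumed bits 3:0 */
} boot0_fields;

/* Split a raw BOOT_0 value into its fields. */
boot0_fields decode_boot0(uint32_t boot0)
{
    boot0_fields f;
    f.arch      = (boot0 >> 24) & 0x1Fu;
    f.impl      = (boot0 >> 20) & 0xFu;
    f.major_rev = (boot0 >> 4) & 0xFu;
    f.minor_rev = boot0 & 0xFu;
    return f;
}
```

Calling decode_boot0(0x1b2000a1) yields arch 0x1b, impl 0x2, major_rev 0xa and minor_rev 0x1, matching the values listed above.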

Now we just make sure QEMU calls hvf_set_phys_mem() for our device memory, and we can map that into the guest. When guest code touches that mapping, it talks directly to the GPU with minimal host overhead. This is the best case for performance. At least, in theory.

In practice, as soon as the VM touched the PCI BAR memory, the host kernel crashed.

If you’ve never experienced this before, it’s disorienting. Your entire computer will hang, and because the trackpad feedback is controlled by software, suddenly the trackpad will no longer click. The dogs and cats in your neighborhood start howling. Pictures fall off the walls of your house. Eventually your computer will reboot, and you will be presented with this dialog.

Ok, so we can’t map device memory directly, but we have other tricks up our sleeve. We can trap every access to the memory, exit the guest back into QEMU, and have QEMU forward each read or write to the device. That keeps behavior correct, but it’s brutally slow. In many workloads the pain is elsewhere. Most of the performance-sensitive work is DMA, but some paths still care how fast you can push commands through the BAR.
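Here is a rough sketch of what that trap-and-forward path involves, written as plain C so it runs anywhere. It is not QEMU's implementation. The syndrome bit layout comes from the Arm architecture's definition of ESR for data aborts; the function name, the regs[] array standing in for guest registers, and the bounds handling are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR syndrome fields for a data abort, per the Arm architecture. */
#define ESR_EC(esr)       (((esr) >> 26) & 0x3Fu)  /* exception class */
#define ESR_EC_DABT_LOWER 0x24u                    /* data abort from the guest */
#define ESR_ISV(esr)      (((esr) >> 24) & 0x1u)   /* 1 = SAS/SRT/WnR are valid */
#define ESR_SAS(esr)      (((esr) >> 22) & 0x3u)   /* access is 1 << SAS bytes */
#define ESR_SRT(esr)      (((esr) >> 16) & 0x1Fu)  /* guest register number */
#define ESR_WNR(esr)      (((esr) >> 6) & 0x1u)    /* 1 = write, 0 = read */

/* Hypothetical helper: forward one trapped guest access to a host mapping
 * of the BAR. regs[] stands in for guest x0..x30; a real VMM would go
 * through its hypervisor API instead. Returns false if the exit is not a
 * decodable data abort or the access falls outside the BAR. */
bool emulate_mmio(uint64_t syndrome, uint64_t bar_offset,
                  volatile uint8_t *host_bar, uint64_t bar_size,
                  uint64_t regs[31])
{
    if (ESR_EC(syndrome) != ESR_EC_DABT_LOWER || !ESR_ISV(syndrome))
        return false;

    uint64_t size = 1ull << ESR_SAS(syndrome);
    unsigned reg = (unsigned)ESR_SRT(syndrome);
    if (bar_offset > bar_size || size > bar_size - bar_offset)
        return false;

    volatile void *p = host_bar + bar_offset;
    if (ESR_WNR(syndrome)) {
        uint64_t v = (reg == 31) ? 0 : regs[reg];  /* register 31 reads as zero */
        switch (size) {
        case 1:  *(volatile uint8_t  *)p = (uint8_t)v;  break;
        case 2:  *(volatile uint16_t *)p = (uint16_t)v; break;
        case 4:  *(volatile uint32_t *)p = (uint32_t)v; break;
        default: *(volatile uint64_t *)p = v;           break;
        }
    } else {
        uint64_t v;
        switch (size) {
        case 1:  v = *(volatile uint8_t  *)p; break;
        case 2:  v = *(volatile uint16_t *)p; break;
        case 4:  v = *(volatile uint32_t *)p; break;
        default: v = *(volatile uint64_t *)p; break;
        }
        if (reg != 31)
            regs[reg] = v;  /* zero-extended; sign-extending loads not handled */
    }
    return true;
}
```

Every one of these calls sits behind a full guest exit and re-entry, and the VMM also has to step the guest PC past the faulting instruction afterwards, which is why this path is so much slower than a direct mapping.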

I started preparing a bug report for Apple and wrote a small reproduction (well, AI-assisted) to demonstrate the issue:

/* Assumed header set, inferred from the calls used below. */
#include <Hypervisor/Hypervisor.h>
#include <IOKit/IOKitLib.h>
#include <libkern/OSCacheControl.h>
#include <mach/mach.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* VFIOGuestBarTouchResult and the VFIO_GUEST_BAR_TOUCH_* status codes
 * are defined elsewhere in the project (not shown). */

#define FAIL(code) do { result->status = (code); goto cleanup; } while (0)

#define HV_CHECK(expr, code) do { \
    if ((expr) != HV_SUCCESS) FAIL(code); \
} while (0)

#define PREFETCHABLE_MASK     0x08
#define SELECTOR_GET_BAR_INFO 10
#define GUEST_CODE_IPA        0x4000ULL
#define GUEST_BAR_IPA         0x10000000ULL

/* Guest program: load from the BAR address in x0, then exit to the host. */
static const uint32_t prog_read[] = {
    0xf9400001, /* ldr x1, [x0] */
    0xd4000002, /* hvc #0 */
    0xd4200000, /* brk #0 */
};

int vfio_guest_bar_touch_run(io_connect_t connection, uint8_t bar,
                             VFIOGuestBarTouchResult *result)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *code = NULL;
    bool vm_up = false, vcpu_up = false, bar_mapped = false;
    hv_vcpu_t vcpu = 0;
    hv_vcpu_exit_t *exit_info = NULL;
    mach_vm_address_t bar_addr = 0;
    mach_vm_size_t bar_size = 0;

    memset(result, 0, sizeof(*result));

    /* Ask the driver for the BAR type so we know if it is prefetchable. */
    uint64_t bar_in[1] = { bar };
    uint64_t bar_out[3] = {0};
    uint32_t bar_cnt = 3;
    if (IOConnectCallMethod(connection, SELECTOR_GET_BAR_INFO, bar_in, 1,
                            NULL, 0, bar_out, &bar_cnt, NULL, NULL) != KERN_SUCCESS) {
        FAIL(VFIO_GUEST_BAR_TOUCH_MAP_BAR_FAILED);
    }
    result->barType = (uint8_t)bar_out[2];

    IOOptionBits opts = kIOMapAnywhere;
    if (result->barType & PREFETCHABLE_MASK)
        opts |= kIOMapWriteCombineCache;

    /* Map the BAR into this process. */
    if (IOConnectMapMemory64(connection, 1u + bar, mach_task_self_,
                             &bar_addr, &bar_size, opts) != KERN_SUCCESS) {
        FAIL(VFIO_GUEST_BAR_TOUCH_MAP_BAR_FAILED);
    }
    bar_mapped = true;
    result->hostBARAddress = (uint64_t)bar_addr;
    result->mappedSize = (uint64_t)bar_size;

    if (page == 0 || (bar_size % page) != 0)
        FAIL(VFIO_GUEST_BAR_TOUCH_MAP_BAR_FAILED);

    /* One page of guest code. */
    if (posix_memalign(&code, page, page))
        FAIL(VFIO_GUEST_BAR_TOUCH_ALLOC_FAILED);

    memset(code, 0, page);
    memcpy(code, prog_read, sizeof(prog_read));
    sys_icache_invalidate(code, page);

    HV_CHECK(hv_vm_create(NULL), VFIO_GUEST_BAR_TOUCH_HV_VM_CREATE_FAILED);
    vm_up = true;

    HV_CHECK(hv_vm_map(code, GUEST_CODE_IPA, page,
                       HV_MEMORY_READ | HV_MEMORY_WRITE),
             VFIO_GUEST_BAR_TOUCH_HV_MAP_CODE_FAILED);

    /* Map the device BAR into the guest. */
    HV_CHECK(hv_vm_map((void *)(uintptr_t)bar_addr, GUEST_BAR_IPA, (size_t)bar_size,
                       HV_MEMORY_READ | HV_MEMORY_WRITE),
             VFIO_GUEST_BAR_TOUCH_HV_MAP_BAR_FAILED);

    HV_CHECK(hv_vcpu_create(&vcpu, &exit_info, NULL),
             VFIO_GUEST_BAR_TOUCH_HV_VCPU_CREATE_FAILED);
    vcpu_up = true;

    hv_vcpu_set_reg(vcpu, HV_REG_PC, GUEST_CODE_IPA);
    hv_vcpu_set_reg(vcpu, HV_REG_X0, GUEST_BAR_IPA);
    hv_vcpu_set_reg(vcpu, HV_REG_CPSR, 0x3c5);

    HV_CHECK(hv_vcpu_run(vcpu), VFIO_GUEST_BAR_TOUCH_HV_VCPU_RUN_FAILED);

    /* Record why the guest exited and what it read. */
    result->exitReason = exit_info->reason;
    result->syndrome = exit_info->exception.syndrome;
    result->virtualAddress = exit_info->exception.virtual_address;
    result->physicalAddress = exit_info->exception.physical_address;
    hv_vcpu_get_reg(vcpu, HV_REG_PC, &result->programCounter);
    hv_vcpu_get_reg(vcpu, HV_REG_X0, &result->x0);
    hv_vcpu_get_reg(vcpu, HV_REG_X1, &result->x1);

cleanup:
    if (vcpu_up) hv_vcpu_destroy(vcpu);
    if (vm_up) hv_vm_destroy();
    if (bar_mapped)
        IOConnectUnmapMemory64(connection, 1u + bar, mach_task_self_, bar_addr);
    free(code);
    return result->status;
}

In ~100 lines of C, you can spin up a VM, map the device BAR into the guest, and run code that touches it. I’m still not sure whether that was more frustrating or encouraging, but that version ran without crashing, while QEMU was still panicking the host. I was stumped for a while. Was it the guest page tables? Was the BAR colliding with guest RAM in some subtle way? Why were the dogs and cats still howling?

Eventually, in my desperation, I asked an AI coding assistant to compare my sample and QEMU. It immediately flagged that my mapping used HV_MEMORY_READ | HV_MEMORY_WRITE while QEMU used HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC. Alas, bested again by AI. Not even silly blog projects are safe anymore (mostly kidding).
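To spell out the difference, here is a minimal sketch of choosing mapping permissions by region type. It is not the QEMU patch: the constants are stand-ins defined locally so the example compiles anywhere (on macOS the real ones are HV_MEMORY_READ, HV_MEMORY_WRITE and HV_MEMORY_EXEC from Hypervisor.framework), and the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the Hypervisor.framework permission bits (assumed values,
 * defined here only so the sketch is self-contained). */
#define SKETCH_MEM_READ  (1ull << 0)
#define SKETCH_MEM_WRITE (1ull << 1)
#define SKETCH_MEM_EXEC  (1ull << 2)

/* Hypothetical helper: pick permissions for a guest mapping.
 * Guest RAM may hold code, so it gets EXEC. Device memory such as a
 * PCI BAR never holds code, so it is mapped read/write only. */
uint64_t guest_map_flags(bool is_device_memory, bool writable)
{
    uint64_t flags = SKETCH_MEM_READ;
    if (writable)
        flags |= SKETCH_MEM_WRITE;
    if (!is_device_memory)
        flags |= SKETCH_MEM_EXEC;
    return flags;
}
```

With this split, guest_map_flags(true, true) yields read and write only, which matches what the small reproduction used for the BAR, while ordinary guest RAM keeps all three bits.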

The workaround in QEMU was a small change:

diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
index 5f357c6d19..76cec4655b 100644
--- a/accel/hvf/hvf-all.c
+++ b/accel/hvf/hvf-all.c
@@ -114,7 +114,15 @@ static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
         return;
     }

-    flags = HV_MEMORY_READ | HV_MEMORY_EXEC | (writable ? HV_MEMORY_WRITE : 0);
+    flags = HV_MEMORY_READ | (writable ? HV_MEMORY_WRITE : 0);
+    /*
+     * Leave RAM-device/MMIO mappings RW-only: on macOS, accessing them throug

[truncated for AI cost control]