2026-06-20 22:16 UTCIn-site rewrite6 min readUpdated: 2026-06-21 23:31 UTC

Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)

A detailed guide on how to run AI inference on an AMD RX 580 using Vulkan, bypassing CUDA and ROCm, with benchmarks, dual-path architecture, and step-by-step setup for LLMs, image generation, audio transcription, and voice cloning.

SourceHacker News AIAuthor: aivisionslab

Notifications You must be signed in to change notification settings

Fork 0

Star 1

BranchesTags

Open more actions menu

Folders and files

NameName

Last commit message

Last commit date

Latest commit

History

20 Commits

content

docs

public

scripts

src

.env.example

.gitattributes

.gitignore

GITHUB_RELEASE_POST.md

LICENSE

README.md

convert-og.js

firebase.json

generate_ssg.js

index.html

metadata.json

migrate_to_md.js

package-lock.json

package.json

tsconfig.json

vite.config.ts

vulkan-diagnostic.bat

vulkan-diagnostic.sh

Repository files navigation

GPU from 2017. SOTA AI in 2026. No CUDA. No ROCm. No cloud. No excuses.

The Narrative

"Your RX 580 can't run AI. Buy a new GPU."

AMD dropped ROCm for Polaris/GCN4 in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. The mainstream AI stack gave up on this card.

We didn't.

By compiling llama.cpp and stable-diffusion.cpp from source with Vulkan support, the RX 580 runs real, useful AI inference in 2026 — locally, offline, privately. This repository is the complete technical record of how.

RX 580 8GB ──► Vulkan API ──► ggml engine ──► 17 tok/s LLM + 72s/image SD Xeon 2014 ──► WSL2 CPU ──► ComfyUI ──► FLUX 16GB + AnimateDiff

Table of Contents

Hardware

Benchmarks

Architecture: Dual-Path Stack

Critical: Two GGUF Formats for FLUX

What Failed (and Why)

Quick Start: LLM via Vulkan

Quick Start: Image Generation via Vulkan

FLUX Hybrid Setup

OpenWebUI + Docker Integration

whisper.cpp: Audio Transcription

Applio RVC: Voice Cloning

AnimateDiff: Video Generation

Linux Native: Ubuntu 26.04 LTS

Windows vs Linux Comparison

Troubleshooting

Automation Scripts

Community Timeline

Pushing the 35B Limit: Qwen3.5 MoE Hybrid Experiment

Repository Structure

Hardware

Component Spec

GPU AMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4)

CPU Intel Xeon E5-2690 v3 — 12c/24t · 3.5GHz (2014)

RAM 32GB DDR4 REG ECC Quad Channel

Storage NVMe 1TB — 1.7–3.5 GB/s read

OS Windows 10 Pro + WSL2 Ubuntu 22.04.5 / Ubuntu 26.04 LTS

AMD Driver 31.0.21924.61 (Amdnolk, Nov 2025)

Vulkan SDK 1.4.341.1

CMake 4.3.2

RX 580 2048SP note: The mining-variant with 2048 shader processors (vs the original 2304SP) performs identically through Vulkan. Both are Polaris/GCN4.

NVMe impact: Upgrading from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to ~30 seconds. Storage is as critical as compute.

Benchmarks (Real Logs)

Workload Model Backend Result

LLM inference Mistral 7B Q4_K_M RX 580 Vulkan 17–18 tok/s

LLM inference Qwen3 4B Q4_K_M RX 580 Vulkan (Linux) ~35 tok/s

LLM baseline Mistral 7B Q4_K_M Xeon CPU pure 3–5 tok/s

Image gen DreamShaper 8 (SD 1.5) RX 580 Vulkan ~72s / 512×512

Image gen flux1-schnell-q4_k GPU+CPU hybrid ~14 min @ 1024×1024

Image gen FLUX.1 fp8 (16GB) Xeon WSL2 CPU ~24 min

Audio transcription Whisper large-v3-turbo RX 580 Vulkan (Windows) 307s for 15min audio

Audio transcription Whisper large-v3-turbo RX 580 Vulkan (Linux) 23.58s for 106s audio

Video / AnimateDiff SD 1.5 pipeline Xeon WSL2 CPU ~141s/frame

Voice clone inference Applio RVC Xeon CPU (2h audio) ~30 min processing

Whisper on Linux (Mesa RADV) is absurdly faster than Windows — ~150× speedup over pure CPU. VRAM usage: only 1.6GB of 8GB available.

Architecture: Dual-Path Stack

The core insight of this project: not every workload fits in 8GB of VRAM. The solution is routing intelligently between GPU and CPU rather than forcing everything through one path.

OpenWebUI :3000 (Docker) │ ├──► llama-server :8081 ──► RX 580 Vulkan [llama.cpp] │ └── Ollama :11434 ──► CPU fallback │ └──► sd-server :7860 ──► RX 580 Vulkan [stable-diffusion.cpp] ├── SD 1.5 GGUF ──► 72s / image ✅ └── FLUX hybrid ──► ~14 min / image ✅

└──► ComfyUI :8188 ──► Xeon CPU WSL2 [heavy models > 8GB VRAM]

Path 1 — GPU Vulkan (RX 580): All LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.

Path 2 — CPU Xeon (WSL2): FLUX.1 16GB models, AnimateDiff video pipelines. Slow but stable. The 32GB ECC RAM acts as "virtual VRAM."

Critical: Two GGUF Formats for FLUX

⚠️ This trips up almost everyone.

Source Compatible with

city96 (HuggingFace) ComfyUI + ComfyUI-GGUF node only

leejet (HuggingFace) stable-diffusion.cpp / sd-server ✅

Using a city96 GGUF in sd-server returns:

[ERROR] main.cpp:92 - new_sd_ctx_t failed

Always download FLUX weights from: huggingface.co/leejet/FLUX.1-schnell-gguf

What Failed (and Why)

We documented every dead end. These aren't opinions — they're error logs.

Attempt Error Root Cause

DirectML + ComfyUI NotImplementedError: Cannot access storage of OpaqueTensorImpl DirectML wraps tensors in opaque objects that ComfyUI's attention backends can't read. Also: abandoned by Microsoft, last update Sep 2024.

ROCm on Polaris Kernel panics under load AMD officially dropped GCN4/Polaris in ROCm v5.x. No Windows support either.

OpenVINO + Forge ModuleNotFoundError: No module named 'ldm' Extension targets old A1111 architecture. Forge restructured ldm/sgm modules completely.

CPU-only + HDD ~19 min/image, 85s startup No GPU acceleration + mechanical I/O bottleneck. The HDD was the hidden killer.

torch-directml + Applio Version conflict torch-directml requires torch==2.4.1. Applio requires torch==2.7.1. Irreconcilable.

Full autopsy with logs: docs/what-failed.md

Quick Start: LLM via Vulkan (Windows)

Run these commands in Developer PowerShell for Visual Studio.

Clone and compile with Vulkan backend

cd E:\ git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j20

Validate GPU detection

cd build\bin\Release .\llama-cli.exe --list-devices

Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅

Start LLM server

.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" ` --host 0.0.0.0 --port 8081 --device Vulkan0

Verify it's using the GPU (not CPU):

log output during inference: ggml_vulkan: Found 1 Vulkan device(s) ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB 17.77 t/s ← RX 580 Vulkan ✅

If you see 3–5 t/s with no ggml_vulkan line — it's running on CPU. Check that --device Vulkan0 is present.

Quick Start: Image Generation via Vulkan

Clone with submodules (required for ggml dependency)

git clone --recursive https://github.com/leejet/stable-diffusion.cpp cd stable-diffusion.cpp mkdir build && cd build cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release cmake --build . --config Release -j20

Successful build log:

-- Found Vulkan: C:/VulkanSDK/1.4.341.1/Lib/vulkan-1.lib

[100%] Built target sd-server ✅

Start SD server (SD 1.5)

E: cd "E:\stable-diffusion.cpp\build\bin\Release" .\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ` -m "E:\models\dreamshaper8.gguf"

Server output confirms GPU:

ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB

Server listening on http://0.0.0.0:7860 ✅

Flag compatibility note: Older builds use --host / --port. Newer builds (master-600+) use --listen-ip / --listen-port. Run sd-server.exe --help to check which your build expects.

FLUX Hybrid Setup (GPU + CPU)

FLUX.1 Schnell requires ~16GB total. The strategy: put the diffusion model on VRAM, offload T5XXL and VAE to RAM.

Component File Allocation Size

Diffusion Model flux1-schnell-q4_k.gguf GPU (VRAM) ~6.5 GB

VAE ae.safetensors CPU (RAM) ~160 MB

CLIP L clip_l.safetensors GPU (VRAM) ~235 MB

T5XXL t5xxl_fp16.safetensors CPU (RAM) ~9.3 GB

sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^ --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^ --vae "E:\models\ae.safetensors" ^ --clip_l "E:\models\clip_l.safetensors" ^ --t5xxl "E:\models\t5xxl_fp16.safetensors" ^ --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

--vae-tiling is not optional — without it, VAE decode causes OOM and crashes the server. To save RAM: replace t5xxl_fp16 (~9.3GB) with t5xxl_fp8 (~5GB).

Timing per image (1024×1024):

Stage Time

T5XXL conditioning 11.49s

Sampling (4 steps) ~838s

VAE decode (9 tiles) 40.45s

Total ~14 min

Full memory architecture: docs/flux-setup.md

OpenWebUI + Docker Integration

docker run -d \ -p 3000:8080 \ --add-host=host.docker.internal:host-gateway \ -v open-webui:/app/backend/data \ --name open-webui \ --restart always \ ghcr.io/open-webui/open-webui:main

Connect LLM server:

Go to http://localhost:3000 → Admin Panel → Settings → Connections

Under OpenAI API, add:

URL: http://host.docker.internal:8081/v1

API Key: sk-local

Green badge = connected ✅

Connect image server:

Settings → Images → Engine: Automatic1111

URL: http://192.168.x.x:7860/ (use your local IP, not 127.0.0.1, with trailing slash)

Never use 127.0.0.1 for Docker connections — Docker runs in an isolated network and cannot reach the host's localhost. Use host.docker.internal for services, or your machine's LAN IP.

Windows Firewall fix (Docker subnet blocked by default):

Run as Administrator

New-NetFirewallRule -DisplayName "sd-server AIVisionsLab" ` -Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow

Full networking guide: docs/firewall-fix.md

whisper.cpp: Audio Transcription on RX 580

Vulkan-accelerated audio transcription. The large-v3-turbo model uses only 2.6GB of VRAM — plenty of headroom.

Compile (Developer PowerShell):

Activate MSVC environment first (required each session)

& "C:\Program Files (x86)\Microsoft Visual Studio\...\vcvars64.bat"

cd C:\ git clone https://github.com/ggml-org/whisper.cpp cd whisper.cpp cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF cmake --build build --config Release -j4

Download model:

Invoke-WebRequest ` -Uri "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin" ` -OutFile "models\ggml-large-v3-turbo.bin"

Transcribe (MP4 → TXT):

Step 1: Extract audio (Whisper requires WAV on Windows)

ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"

Step 2: Transcribe

.\build\bin\Release\whisper-cli.exe ` -m models\ggml-large-v3-turbo.bin ` -f "audio.wav" -l pt --output-txt

With translation to English:

.\build\bin\Release\whisper-cli.exe ` -m models\ggml-large-v3-turbo.bin ` -f "audio.wav" -l pt --translate --output-txt

Performance (15-min video, Windows):

Stage Time

Model load 4s

Mel spectrogram 1.2s

GPU encode 73s

Decode + batch 168s

Total 307s

VRAM used: 2.6GB of 8GB. CPU stays at ~5%.

⚠️ WSL2 does not expose the RX 580 to Vulkan — always use native Windows PowerShell for GPU transcription. ⚠️ --translate only outputs English. For other target languages, add a translation step after.

Applio RVC: Voice Cloning on AMD Windows

Full pipeline: Text → Balabolka (TTS) → WAV → Applio RVC (voice conversion) → final audio

Why this pipeline instead of pure TTS:

Aspect Pure XTTS Antônio Neural → Yuri RVC

Prosody Artificial Human (real actor)

Long texts Degrades Stable

Vocal identity Generic Cloned

Naturalness 60–70% 80–95%

Key findings for AMD Windows (2026):

DirectML acceleration is effectively dead — torch-directml requires torch==2.4.1 while Applio requires torch==2.7.1. The version conflict is irreconcilable. Use CPU mode — it works, just takes time.

Training speed on Xeon E5-2690 v3: ~6 min/epoch. 200 epochs = ~20 hours.

Critical gotchas:

NEVER set these — they silently break feature extraction:

set CUDA_VISIBLE_DEVICES=-1

set ROCM_VISIBLE_DEVICES=-1

They leave logs/project/extracted/ empty, training "succeeds" but produces no

[truncated for AI cost control]