AI News HubLIVE
In-site rewrite6 min read

Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)

A detailed guide on how to run AI inference on an AMD RX 580 using Vulkan, bypassing CUDA and ROCm, with benchmarks, dual-path architecture, and step-by-step setup for LLMs, image generation, audio transcription, and voice cloning.

SourceHacker News AIAuthor: aivisionslab

Notifications You must be signed in to change notification settings

Fork 0

Star 1

BranchesTags

Open more actions menu

Folders and files

NameName

Last commit message

Last commit date

Latest commit

History

20 Commits

20 Commits

content

content

docs

docs

public

public

scripts

scripts

src

src

.env.example

.env.example

.gitattributes

.gitattributes

.gitignore

.gitignore

GITHUB_RELEASE_POST.md

GITHUB_RELEASE_POST.md

LICENSE

LICENSE

README.md

README.md

convert-og.js

convert-og.js

firebase.json

firebase.json

generate_ssg.js

generate_ssg.js

index.html

index.html

metadata.json

metadata.json

migrate_to_md.js

migrate_to_md.js

package-lock.json

package-lock.json

package.json

package.json

tsconfig.json

tsconfig.json

vite.config.ts

vite.config.ts

vulkan-diagnostic.bat

vulkan-diagnostic.bat

vulkan-diagnostic.sh

vulkan-diagnostic.sh

Repository files navigation

GPU from 2017. SOTA AI in 2026. No CUDA. No ROCm. No cloud. No excuses.

The Narrative

"Your RX 580 can't run AI. Buy a new GPU."

AMD dropped ROCm for Polaris/GCN4 in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. The mainstream AI stack gave up on this card.

We didn't.

By compiling llama.cpp and stable-diffusion.cpp from source with Vulkan support, the RX 580 runs real, useful AI inference in 2026 — locally, offline, privately. This repository is the complete technical record of how.

RX 580 8GB ──► Vulkan API ──► ggml engine ──► 17 tok/s LLM + 72s/image SD Xeon 2014 ──► WSL2 CPU ──► ComfyUI ──► FLUX 16GB + AnimateDiff

Table of Contents

Hardware

Benchmarks

Architecture: Dual-Path Stack

Critical: Two GGUF Formats for FLUX

What Failed (and Why)

Quick Start: LLM via Vulkan

Quick Start: Image Generation via Vulkan

FLUX Hybrid Setup

OpenWebUI + Docker Integration

whisper.cpp: Audio Transcription

Applio RVC: Voice Cloning

AnimateDiff: Video Generation

Linux Native: Ubuntu 26.04 LTS

Windows vs Linux Comparison

Troubleshooting

Automation Scripts

Community Timeline

Pushing the 35B Limit: Qwen3.5 MoE Hybrid Experiment

Repository Structure

Hardware

Component Spec

GPU AMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4)

CPU Intel Xeon E5-2690 v3 — 12c/24t · 3.5GHz (2014)

RAM 32GB DDR4 REG ECC Quad Channel

Storage NVMe 1TB — 1.7–3.5 GB/s read

OS Windows 10 Pro + WSL2 Ubuntu 22.04.5 / Ubuntu 26.04 LTS

AMD Driver 31.0.21924.61 (Amdnolk, Nov 2025)

Vulkan SDK 1.4.341.1

CMake 4.3.2

RX 580 2048SP note: The mining-variant with 2048 shader processors (vs the original 2304SP) performs identically through Vulkan. Both are Polaris/GCN4.

NVMe impact: Upgrading from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to ~30 seconds. Storage is as critical as compute.

Benchmarks (Real Logs)

Workload Model Backend Result

LLM inference Mistral 7B Q4_K_M RX 580 Vulkan 17–18 tok/s

LLM inference Qwen3 4B Q4_K_M RX 580 Vulkan (Linux) ~35 tok/s

LLM baseline Mistral 7B Q4_K_M Xeon CPU pure 3–5 tok/s

Image gen DreamShaper 8 (SD 1.5) RX 580 Vulkan ~72s / 512×512

Image gen flux1-schnell-q4_k GPU+CPU hybrid ~14 min @ 1024×1024

Image gen FLUX.1 fp8 (16GB) Xeon WSL2 CPU ~24 min

Audio transcription Whisper large-v3-turbo RX 580 Vulkan (Windows) 307s for 15min audio

Audio transcription Whisper large-v3-turbo RX 580 Vulkan (Linux) 23.58s for 106s audio

Video / AnimateDiff SD 1.5 pipeline Xeon WSL2 CPU ~141s/frame

Voice clone inference Applio RVC Xeon CPU (2h audio) ~30 min processing

Whisper on Linux (Mesa RADV) is absurdly faster than Windows — ~150× speedup over pure CPU. VRAM usage: only 1.6GB of 8GB available.

Architecture: Dual-Path Stack

The core insight of this project: not every workload fits in 8GB of VRAM. The solution is routing intelligently between GPU and CPU rather than forcing everything through one path.

OpenWebUI :3000 (Docker) │ ├──► llama-server :8081 ──► RX 580 Vulkan [llama.cpp] │ └── Ollama :11434 ──► CPU fallback │ └──► sd-server :7860 ──► RX 580 Vulkan [stable-diffusion.cpp] ├── SD 1.5 GGUF ──► 72s / image ✅ └── FLUX hybrid ──► ~14 min / image ✅

└──► ComfyUI :8188 ──► Xeon CPU WSL2 [heavy models > 8GB VRAM]

Path 1 — GPU Vulkan (RX 580): All LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.

Path 2 — CPU Xeon (WSL2): FLUX.1 16GB models, AnimateDiff video pipelines. Slow but stable. The 32GB ECC RAM acts as "virtual VRAM."

Critical: Two GGUF Formats for FLUX

⚠️ This trips up almost everyone.

Source Compatible with

city96 (HuggingFace) ComfyUI + ComfyUI-GGUF node only

leejet (HuggingFace) stable-diffusion.cpp / sd-server ✅

Using a city96 GGUF in sd-server returns:

[ERROR] main.cpp:92 - new_sd_ctx_t failed

Always download FLUX weights from: huggingface.co/leejet/FLUX.1-schnell-gguf

What Failed (and Why)

We documented every dead end. These aren't opinions — they're error logs.

Attempt Error Root Cause

DirectML + ComfyUI NotImplementedError: Cannot access storage of OpaqueTensorImpl DirectML wraps tensors in opaque objects that ComfyUI's attention backends can't read. Also: abandoned by Microsoft, last update Sep 2024.

ROCm on Polaris Kernel panics under load AMD officially dropped GCN4/Polaris in ROCm v5.x. No Windows support either.

OpenVINO + Forge ModuleNotFoundError: No module named 'ldm' Extension targets old A1111 architecture. Forge restructured ldm/sgm modules completely.

CPU-only + HDD ~19 min/image, 85s startup No GPU acceleration + mechanical I/O bottleneck. The HDD was the hidden killer.

torch-directml + Applio Version conflict torch-directml requires torch==2.4.1. Applio requires torch==2.7.1. Irreconcilable.

Full autopsy with logs: docs/what-failed.md

Quick Start: LLM via Vulkan (Windows)

Run these commands in Developer PowerShell for Visual Studio.

Clone and compile with Vulkan backend

cd E:\ git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j20

Validate GPU detection

cd build\bin\Release .\llama-cli.exe --list-devices

Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅

Start LLM server

.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" ` --host 0.0.0.0 --port 8081 --device Vulkan0

Verify it's using the GPU (not CPU):

log output during inference: ggml_vulkan: Found 1 Vulkan device(s) ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB 17.77 t/s ← RX 580 Vulkan ✅

If you see 3–5 t/s with no ggml_vulkan line — it's running on CPU. Check that --device Vulkan0 is present.

Quick Start: Image Generation via Vulkan

Clone with submodules (required for ggml dependency)

git clone --recursive https://github.com/leejet/stable-diffusion.cpp cd stable-diffusion.cpp mkdir build && cd build cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release cmake --build . --config Release -j20

Successful build log:

-- Found Vulkan: C:/VulkanSDK/1.4.341.1/Lib/vulkan-1.lib

[100%] Built target sd-server ✅

Start SD server (SD 1.5)

E: cd "E:\stable-diffusion.cpp\build\bin\Release" .\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ` -m "E:\models\dreamshaper8.gguf"

Server output confirms GPU:

ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB

Server listening on http://0.0.0.0:7860 ✅

Flag compatibility note: Older builds use --host / --port. Newer builds (master-600+) use --listen-ip / --listen-port. Run sd-server.exe --help to check which your build expects.

FLUX Hybrid Setup (GPU + CPU)

FLUX.1 Schnell requires ~16GB total. The strategy: put the diffusion model on VRAM, offload T5XXL and VAE to RAM.

Component File Allocation Size

Diffusion Model flux1-schnell-q4_k.gguf GPU (VRAM) ~6.5 GB

VAE ae.safetensors CPU (RAM) ~160 MB

CLIP L clip_l.safetensors GPU (VRAM) ~235 MB

T5XXL t5xxl_fp16.safetensors CPU (RAM) ~9.3 GB

sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^ --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^ --vae "E:\models\ae.safetensors" ^ --clip_l "E:\models\clip_l.safetensors" ^ --t5xxl "E:\models\t5xxl_fp16.safetensors" ^ --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

--vae-tiling is not optional — without it, VAE decode causes OOM and crashes the server. To save RAM: replace t5xxl_fp16 (~9.3GB) with t5xxl_fp8 (~5GB).

Timing per image (1024×1024):

Stage Time

T5XXL conditioning 11.49s

Sampling (4 steps) ~838s

VAE decode (9 tiles) 40.45s

Total ~14 min

Full memory architecture: docs/flux-setup.md

OpenWebUI + Docker Integration

docker run -d \ -p 3000:8080 \ --add-host=host.docker.internal:host-gateway \ -v open-webui:/app/backend/data \ --name open-webui \ --restart always \ ghcr.io/open-webui/open-webui:main

Connect LLM server:

Go to http://localhost:3000 → Admin Panel → Settings → Connections

Under OpenAI API, add:

URL: http://host.docker.internal:8081/v1

API Key: sk-local

Green badge = connected ✅

Connect image server:

Settings → Images → Engine: Automatic1111

URL: http://192.168.x.x:7860/ (use your local IP, not 127.0.0.1, with trailing slash)

Never use 127.0.0.1 for Docker connections — Docker runs in an isolated network and cannot reach the host's localhost. Use host.docker.internal for services, or your machine's LAN IP.

Windows Firewall fix (Docker subnet blocked by default):

Run as Administrator

New-NetFirewallRule -DisplayName "sd-server AIVisionsLab" ` -Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow

Full networking guide: docs/firewall-fix.md

whisper.cpp: Audio Transcription on RX 580

Vulkan-accelerated audio transcription. The large-v3-turbo model uses only 2.6GB of VRAM — plenty of headroom.

Compile (Developer PowerShell):

Activate MSVC environment first (required each session)

& "C:\Program Files (x86)\Microsoft Visual Studio\...\vcvars64.bat"

cd C:\ git clone https://github.com/ggml-org/whisper.cpp cd whisper.cpp cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF cmake --build build --config Release -j4

Download model:

Invoke-WebRequest ` -Uri "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin" ` -OutFile "models\ggml-large-v3-turbo.bin"

Transcribe (MP4 → TXT):

Step 1: Extract audio (Whisper requires WAV on Windows)

ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"

Step 2: Transcribe

.\build\bin\Release\whisper-cli.exe ` -m models\ggml-large-v3-turbo.bin ` -f "audio.wav" -l pt --output-txt

With translation to English:

.\build\bin\Release\whisper-cli.exe ` -m models\ggml-large-v3-turbo.bin ` -f "audio.wav" -l pt --translate --output-txt

Performance (15-min video, Windows):

Stage Time

Model load 4s

Mel spectrogram 1.2s

GPU encode 73s

Decode + batch 168s

Total 307s

VRAM used: 2.6GB of 8GB. CPU stays at ~5%.

⚠️ WSL2 does not expose the RX 580 to Vulkan — always use native Windows PowerShell for GPU transcription. ⚠️ --translate only outputs English. For other target languages, add a translation step after.

Applio RVC: Voice Cloning on AMD Windows

Full pipeline: Text → Balabolka (TTS) → WAV → Applio RVC (voice conversion) → final audio

Why this pipeline instead of pure TTS:

Aspect Pure XTTS Antônio Neural → Yuri RVC

Prosody Artificial Human (real actor)

Long texts Degrades Stable

Vocal identity Generic Cloned

Naturalness 60–70% 80–95%

Key findings for AMD Windows (2026):

DirectML acceleration is effectively dead — torch-directml requires torch==2.4.1 while Applio requires torch==2.7.1. The version conflict is irreconcilable. Use CPU mode — it works, just takes time.

Training speed on Xeon E5-2690 v3: ~6 min/epoch. 200 epochs = ~20 hours.

Critical gotchas:

NEVER set these — they silently break feature extraction:

set CUDA_VISIBLE_DEVICES=-1

set ROCM_VISIBLE_DEVICES=-1

They leave logs/project/extracted/ empty, training "succeeds" but produces no

[truncated for AI cost control]