Tuning CPU-only Qwen3-30B inference with an IBM Quantum sampling loop
This project raises Qwen3-30B generation speed from about 0.09 to 14.03 tok/s on a 2017 MacBook Air by combining a human experimenter, Codex, llama.cpp, a local database, and IBM Quantum sampling. The QPU only samples candidate configurations; it does not run the model.
Article intelligence
Key points
- Runs Qwen3-30B on 2017 MacBook Air (8GB RAM, CPU-only)
- Hybrid quantum-classical optimization loop achieves 14.03 tok/s from 0.09 baseline
- A quality gate checks output coherence; the fastest speed-only run reached 16.53 tok/s but failed the gate and is not claimed
- IBM Quantum samples candidate configurations; local llama.cpp benchmarks them
Why it matters
This matters because it shows Qwen3-30B running at 14.03 tok/s on a 2017 MacBook Air with 8GB RAM and no GPU.
Technical impact
May affect model selection, inference cost, product capability, and evaluation benchmarks.
Quantum-enhanced autoresearch for high-performance, CPU-only Mixture-of-Experts LLM inference on legacy hardware.
This repository contains the benchmark harness, MCP-style tool boundary, experiment logs, paper draft, and IBM Quantum candidate-sampling workflow from the 2017 Intel MacBook Air Qwen3 MoE project.
Public pages:
GitHub: https://github.com/Shack870/qwen-air-qpu-mcp-lab
GitHub preprint release: https://github.com/Shack870/qwen-air-qpu-mcp-lab/releases/tag/v0.1-preprint
Hugging Face collection: https://huggingface.co/collections/Shack870/qwen-air-qpu-mcp-lab-6a174dd8d752afe40a429846
Hugging Face dataset artifacts: https://huggingface.co/datasets/Shack870/qwen-air-qpu-mcp-lab
Hugging Face interactive dashboard Space: https://huggingface.co/spaces/Shack870/qwen-air-qpu-dashboard
The short version:
Model: Qwen3-30B-A3B-Instruct-2507-GGUF, Q3_K_S 2.66bpw
Hardware: 2017 Intel MacBook Air, 8GB RAM, CPU-only
Context: 16,384 tokens
Starting point: about 0.09 generation tokens/sec
Classical systems optimization frontier: 6.49 generation tokens/sec
First IBM Quantum-informed breakthrough: 13.12 generation tokens/sec
Strict quality-gated record: 14.03 generation tokens/sec
Clean-room Codex-off check: 13.91 generation tokens/sec
Speed-only rejected lane: 16.53 generation tokens/sec, not claimed because output coherence failed
What Is Novel Here
This is not a claim that an IBM QPU ran Qwen. It did not.
The core contribution is the synchronized loop:
Human Experimenter sets the goal and constraints
-> Codex proposes, edits, runs, logs, and interprets experiments
-> the MacBook runs real llama.cpp inference and judges candidates
-> the local database scores the run frontier
-> compact candidate choices are compressed into QUBO form
-> IBM Quantum samples candidate bitstrings
-> Codex decodes those bitstrings into concrete llama.cpp configs
-> the MacBook tests them
-> the loop repeats
The QPU improves the research loop's candidate selection. The MacBook remains the judge. The model remains local. The result is a small hybrid quantum optimization lab for routed MoE inference.
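As a rough illustration of the "compressed into QUBO form" step, the sketch below one-hot encodes two knobs, adds a penalty that enforces exactly one option per knob, and finds the lowest-energy bitstring by brute force. The knob options and costs are placeholder values, not data from the repository, and the brute-force search is only a stand-in for the QPU sampler; the repository's actual QUBO builder may be structured differently.

```python
from itertools import product

# Hypothetical knobs and per-option costs (lower is better); not from the repo.
KNOBS = {
    "ubatch_size": {128: 0.4, 144: 0.1, 160: 0.3},
    "threads": {2: 0.5, 4: 0.2},
}
# Larger than any single cost, so picking exactly one option per knob always wins.
PENALTY = 2.0


def build_qubo(knobs, penalty):
    """Return (variables, q) where q[(i, j)] holds upper-triangular QUBO weights."""
    variables = [(knob, opt) for knob, opts in knobs.items() for opt in opts]
    q = {}
    for i, (knob_i, opt_i) in enumerate(variables):
        # Expanding penalty * (sum(x) - 1)^2 with x^2 == x gives -penalty per variable.
        q[(i, i)] = knobs[knob_i][opt_i] - penalty
        for j in range(i + 1, len(variables)):
            if variables[j][0] == knob_i:
                # ...and +2 * penalty for every pair of options within the same knob.
                q[(i, j)] = 2.0 * penalty
    return variables, q


def energy(bits, q):
    """QUBO energy of a 0/1 assignment."""
    return sum(weight * bits[i] * bits[j] for (i, j), weight in q.items())


def decode(bits, variables):
    """Map a bitstring to a config dict, or None if any knob is not one-hot."""
    config = {}
    for bit, (knob, opt) in zip(bits, variables):
        if bit:
            if knob in config:
                return None
            config[knob] = opt
    all_knobs = {knob for knob, _ in variables}
    return config if set(config) == all_knobs else None


variables, q = build_qubo(KNOBS, PENALTY)
best_bits = min(product((0, 1), repeat=len(variables)), key=lambda b: energy(b, q))
best_config = decode(best_bits, variables)
print(best_config)
```

With only five binary variables, exhaustive search is trivial; a sampler becomes interesting once the candidate space is too large to enumerate.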
See the paper draft:
Quantum-Enhanced Hyperparameter Tuning for High-Performance On-Device CPU-Only Inference of Mixture-of-Experts LLMs on Legacy Hardware
Generated preprint PDF
Repository Map
paper/ - paper draft, selected run snapshots, and generated SVG figures
paper/data/qpu_lab_public.sqlite - sanitized public SQLite benchmark and QPU job database
paper/data/public_runs.csv - sanitized public run log powering the Space dashboard
qpu_mcp_lab/ - benchmark harness, objective scorer, optimizer, QUBO builder, IBM Quantum adapter, and MCP-style server
huggingface/space/ - Gradio leaderboard and config explorer source
scripts/ - experiment drivers and reproducibility scripts
docs/REPRODUCIBILITY.md - validation protocol
docs/COMMUNITY_VALIDATION.md - guide for outside benchmark reports
docs/HUGGINGFACE_BLOG_DRAFT.md - draft article for the Hugging Face Blog editor
docs/PRESS_KIT.md - concise public launch material
docs/RESULTS.md - result narrative and milestone summary
SECURITY.md - secret handling and QPU guardrails
config.example.json - local config template
Requirements
This repo does not include model weights or a compiled llama-cli.
You need:
Python 3.11 or newer
a local llama-cli or compatible fork build
the ByteShape GGUF model file: Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf
optional IBM Quantum credentials for real QPU jobs
Reference local paths from the original lab:
~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli
~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf
Quick Start
git clone https://github.com/Shack870/qwen-air-qpu-mcp-lab.git
cd qwen-air-qpu-mcp-lab
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
cp config.example.json config.json
Edit config.json:
{
  "llama_bin": "~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli",
  "model_path": "~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf",
  "llama_repo": "~/src/ik_llama.cpp",
  "safe_memory_gb": 6.5,
  "default_backend": "local-simulator",
  "allow_real_qpu_jobs_by_default": false
}
You can also provide paths through environment variables:
export QPU_MCP_LAB_LLAMA_BIN="$HOME/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli"
export QPU_MCP_LAB_MODEL_PATH="$HOME/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf"
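For readers who want to see how a file-plus-environment setup like this can be resolved, here is a minimal sketch. The two environment variable names and the config keys come from this README; the `load_config` helper and the rule that environment variables override `config.json` are assumptions made for illustration, not a description of the harness internals.

```python
import json
import os
from pathlib import Path

# Env var names are from the README; the override order is an assumption.
ENV_OVERRIDES = {
    "llama_bin": "QPU_MCP_LAB_LLAMA_BIN",
    "model_path": "QPU_MCP_LAB_MODEL_PATH",
}
PATH_KEYS = ("llama_bin", "model_path", "llama_repo")


def load_config(config_file="config.json", environ=None):
    """Load config.json if present, apply env overrides, and expand ~ in paths."""
    environ = os.environ if environ is None else environ
    path = Path(config_file)
    config = json.loads(path.read_text()) if path.exists() else {}
    for key, env_name in ENV_OVERRIDES.items():
        if environ.get(env_name):
            config[key] = environ[env_name]
    for key in PATH_KEYS:
        if key in config:
            config[key] = str(Path(config[key]).expanduser())
    return config
```

Expanding `~` explicitly matters because JSON values are not processed by the shell, so a path such as `~/src/...` would otherwise be passed through literally.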
Validate the environment:
.venv/bin/python scripts/validate_environment.py
Initialize the database:
.venv/bin/python -m qpu_mcp_lab.cli init-db
Reproduce The Strict Record Lane
Run the record-family config:
.venv/bin/python -m qpu_mcp_lab.cli run --config-json '{
  "label": "strict_record_reproduction",
  "prompt": "user\nContinue this comma-separated list of Mars facts: red planet, thin atmosphere,\nassistant\n",
  "ctx_size": 16384,
  "batch_size": 2456,
  "ubatch_size": 144,
  "threads": 4,
  "threads_batch": 4,
  "cache_type_k": "q6_0",
  "cache_type_v": "q6_0",
  "flash_attn": true,
  "smart_expert_reduction": "3,1",
  "env_veclib_threads": 1,
  "env_omp_wait_policy": "ACTIVE",
  "env_omp_dynamic": "FALSE",
  "env_ser_cheap_ranges": "24:30",
  "env_ser_cheap_min": 2,
  "env_ser_cheap_thresh": 1.0,
  "n_predict": 128,
  "temp": 0.0,
  "ignore_eos": true,
  "no_display_prompt": true,
  "timeout_seconds": 420
}'
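Several keys in the run config resemble common llama.cpp command-line options. The sketch below shows one plausible way to turn a subset of them into an argument list. The mapping is an assumption: `build_argv` is a hypothetical helper, the short flags are the usual llama.cpp spellings, and flag forms differ between builds and forks (for example, whether `-fa` takes a value), so check `llama-cli --help` for your binary. Fork-specific keys such as `smart_expert_reduction` and the `env_*` entries are deliberately left out.

```python
# Usual llama.cpp flag spellings; verify against your build's --help output.
VALUE_FLAGS = {
    "ctx_size": "-c",
    "batch_size": "-b",
    "ubatch_size": "-ub",
    "threads": "-t",
    "threads_batch": "-tb",
    "cache_type_k": "-ctk",
    "cache_type_v": "-ctv",
    "n_predict": "-n",
    "temp": "--temp",
}
BOOL_FLAGS = {
    "flash_attn": "-fa",
    "ignore_eos": "--ignore-eos",
    "no_display_prompt": "--no-display-prompt",
}


def build_argv(llama_bin, model_path, config):
    """Translate a subset of run-config keys into a llama-cli argument list."""
    argv = [llama_bin, "-m", model_path, "-p", config["prompt"]]
    for key, flag in VALUE_FLAGS.items():
        if key in config:
            argv += [flag, str(config[key])]
    for key, flag in BOOL_FLAGS.items():
        if config.get(key):  # only emit switches that are set to true
            argv.append(flag)
    return argv
```

Building a list rather than a shell string avoids quoting problems with prompts that contain newlines or commas.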
Reference results from the original machine:
strict record: 14.03 tok/s
clean-room lane: 13.91 tok/s
first QPU-informed jump: 13.12 tok/s
classical frontier before QPU sampling: 6.49 tok/s
original proof-of-life baseline: about 0.09 tok/s
Exact repeats vary with thermals, page-cache state, context switches, and prompt shape. Report both throughput and output quality.
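To put the reference numbers in proportion, the short script below computes each lane's speedup relative to the baseline and to the classical frontier. It uses only the throughputs listed above; because the baseline is stated as "about 0.09", the baseline ratios should be read as approximate.

```python
# Reference throughputs (tok/s) as listed in this README.
RESULTS = {
    "strict record": 14.03,
    "clean-room lane": 13.91,
    "first QPU-informed jump": 13.12,
    "classical frontier": 6.49,
    "baseline": 0.09,
}


def speedups(results, reference):
    """Return each lane's throughput divided by the reference lane's throughput."""
    ref = results[reference]
    return {name: value / ref for name, value in results.items()}


vs_baseline = speedups(RESULTS, "baseline")
vs_classical = speedups(RESULTS, "classical frontier")
for name in RESULTS:
    print(f"{name:26s} {vs_baseline[name]:7.1f}x baseline  {vs_classical[name]:5.2f}x classical")
```

The strict record works out to roughly 2.16 times the classical frontier, which is the gain attributed to the QPU-informed part of the loop.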
Quality Gate
A speed result is not a quality result unless the output remains coherent.
The strict gate used short factual/code prompts such as:
What is the capital of Serbia?
What is the capital of Mars?
Write a compact Python function named is_prime that checks whether n is prime.
Known pattern:
broad speed-only expert reductions can produce high tokens/sec and broken text
the accepted record lane is lower than the fastest raw lane because it preserves coherence
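The README does not spell out the exact pass criteria, so the following is only a sketch of what a coherence gate could look like. It checks that an expected keyword appears (for example "Belgrade" for the Serbia prompt) and that the output is not dominated by repeated word trigrams. The function names, the trigram heuristic, and the 0.3 threshold are all assumptions.

```python
from collections import Counter


def repeated_trigram_ratio(text):
    """Fraction of word trigrams that are repeats; high values suggest looping output."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(trigrams)


def passes_gate(output, must_contain, max_repeat_ratio=0.3):
    """Hypothetical gate: expected keyword present and no heavy repetition."""
    has_keyword = must_contain.lower() in output.lower()
    return has_keyword and repeated_trigram_ratio(output) <= max_repeat_ratio
```

A gate like this catches the failure mode described above, where aggressive expert reduction yields fast but degenerate text, while staying cheap enough to run after every benchmark.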
IBM Quantum API Key Setup
Do not put IBM API keys in Git, config.json, .env, shell history, screenshots, paper drafts, logs, or chat messages.
Preferred macOS setup:
./scripts/store_ibm_key.sh
That script prompts for the key without echoing it and stores it in macOS Keychain under:
ibm_quantum_api_key
optional ibm_quantum_instance_crn
The harness reads credentials in this order:
IBM_QUANTUM_API_KEY, then Keychain service ibm_quantum_api_key
IBM_QUANTUM_INSTANCE, then Keychain service ibm_quantum_instance_crn
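The lookup order above can be sketched as follows. The environment variable and Keychain service names are the ones documented here; the helper functions are hypothetical, and the Keychain read uses the standard macOS `security find-generic-password` command rather than whatever the harness does internally.

```python
import os
import subprocess


def keychain_lookup(service):
    """Read a generic password from macOS Keychain; return None if unavailable."""
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", service, "-w"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # the security tool only exists on macOS
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def read_secret(env_name, service, environ=None, lookup=keychain_lookup):
    """Environment variable first, then Keychain, mirroring the documented order."""
    environ = os.environ if environ is None else environ
    return environ.get(env_name) or lookup(service)


def credential_status():
    """Report presence only, so secret values are never printed."""
    return {
        "api_key": read_secret("IBM_QUANTUM_API_KEY", "ibm_quantum_api_key") is not None,
        "instance": read_secret("IBM_QUANTUM_INSTANCE", "ibm_quantum_instance_crn") is not None,
    }
```

Passing `lookup` as a parameter keeps the function testable on machines without a Keychain.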
Temporary environment-variable setup also works:
export IBM_QUANTUM_API_KEY="paste-token-here"
export IBM_QUANTUM_INSTANCE="optional-instance-or-crn"
For safety, Keychain storage is preferred.
Check credential status without printing secrets:
.venv/bin/python -m qpu_mcp_lab.cli quantum-credentials
List available IBM backends:
.venv/bin/python -m qpu_mcp_lab.cli quantum-backends
Real QPU submission is guarded. The harness defaults to dry-run or local simulation unless the command includes --allow-real-qpu.
Example guarded workflow:
.venv/bin/python -m qpu_mcp_lab.cli build-qubo
.venv/bin/python -m qpu_mcp_lab.cli sweep-qaoa-angles --limit 5
.venv/bin/python -m qpu_mcp_lab.cli submit-micro-frontier \
  --backend ibm_fez \
  --shots 256 \
  --allow-real-qpu
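The guard described above amounts to an explicit opt-in. A minimal sketch of that pattern, assuming a hypothetical `resolve_backend` helper and a silent fallback to the `local-simulator` backend named in `config.example.json` (the real harness may refuse or dry-run instead):

```python
import argparse


def resolve_backend(requested, allow_real_qpu, default="local-simulator"):
    """Return the backend to use; real hardware needs the explicit opt-in."""
    if requested is None or requested == default:
        return default
    if not allow_real_qpu:
        return default  # fall back rather than spend QPU time by accident
    return requested


def parse_args(argv=None):
    """Parse the three options used in the guarded workflow example."""
    parser = argparse.ArgumentParser(description="Guarded submission sketch")
    parser.add_argument("--backend", default=None)
    parser.add_argument("--shots", type=int, default=256)
    parser.add_argument("--allow-real-qpu", action="store_true")
    return parser.parse_args(argv)


# Without --allow-real-qpu, a request for ibm_fez resolves to the simulator.
args = parse_args(["--backend", "ibm_fez", "--shots", "256"])
print(resolve_backend(args.backend, args.allow_real_qpu))
```

Making the safe path the default means a forgotten flag costs nothing, whereas a forgotten safety flag could consume paid QPU time.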
After an IBM job completes:
.venv/bin/python -m qpu_mcp_lab.cli quantum-jobs --limit 5
.venv/bin/python -m qpu_mcp_lab.cli job-result JOB_ID --refresh
.venv/bin/python -m qpu_mcp_lab.cli decode-job-candidates JOB_ID --top-k 12
The decoded candidates still need to be tested locally. The QPU suggests; the MacBook judges.
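As an illustration of what decoding might involve, the sketch below ranks sampled bitstrings by count and keeps the top k that map to a complete configuration under a one-hot encoding. The encoding, the `top_k_candidates` helper, and the sample counts are all hypothetical. Note also that Qiskit reports bitstrings with qubit 0 as the rightmost character, so a real decoder may need to reverse each string first; this sketch reads left to right.

```python
def top_k_candidates(counts, variables, k=12):
    """Rank sampled bitstrings by count and keep those that decode to a full config."""
    knobs = {knob for knob, _ in variables}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    candidates = []
    for bitstring, count in ranked:
        config = {}
        valid = True
        for bit, (knob, option) in zip(bitstring, variables):
            if bit == "1":
                if knob in config:  # two options chosen for the same knob
                    valid = False
                    break
                config[knob] = option
        if valid and set(config) == knobs:
            candidates.append((config, count))
        if len(candidates) == k:
            break
    return candidates


# Hypothetical encoding and made-up sample counts, for illustration only.
VARIABLES = [("threads", 2), ("threads", 4), ("ubatch_size", 128), ("ubatch_size", 144)]
COUNTS = {"0101": 90, "1100": 70, "1001": 60, "0000": 36}
for config, count in top_k_candidates(COUNTS, VARIABLES, k=2):
    print(count, config)
```

Bitstrings that violate the encoding are dropped rather than repaired, which keeps every surviving candidate directly testable on the local machine.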
Run The MCP Server
The local MCP-style server exposes narrow, auditable tools for Codex or other clients. It does not expose arbitrary shell access and it does not return secret values.
./scripts/run_mcp_server.sh
Representative tool categories:
bench_run_config
bench_get_best_runs
objective_score_run
optimizer_build_qubo
optimizer_propose_classical_candidates
quantum_credential_status
quantum_list_backends
quantum_submit_micro_frontier_job
quantum_decode_job_candidates
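The idea of a narrow tool boundary can be sketched with a plain allowlist: only registered names are callable, and handlers return status rather than secret values. This is not the MCP protocol and not the repository's server code; the `tool` decorator, `dispatch` function, and request shape are assumptions, and only the tool name is taken from the list above.

```python
import json

TOOLS = {}


def tool(name):
    """Register a function under an explicit tool name."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("quantum_credential_status")
def quantum_credential_status(environ):
    # Report presence only; the secret value never leaves this function.
    return {"api_key_present": bool(environ.get("IBM_QUANTUM_API_KEY"))}


def dispatch(request_json, environ):
    """Run an allowlisted tool by name; anything else is refused."""
    request = json.loads(request_json)
    handler = TOOLS.get(request.get("tool"))
    if handler is None:
        return {"error": "unknown tool"}
    return {"result": handler(environ)}
```

Because there is no generic "run command" entry in the registry, a client cannot reach the shell even if it asks for one.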
Paper Figures
Regenerate the SVG figures:
python3 paper/make_figures.py
Generated figures:
paper/figures/throughput_progression.svg
paper/figures/qpu_jump.svg
paper/figures/quality_boundary.svg
paper/figures/prompt_examples.svg
Safety And Publication Notes
Model weights are not included.
IBM secrets are not included.
config.json, .env, logs, SQLite WAL/SHM files, and local model files are ignored by Git.
Real IBM QPU use requires an explicit --allow-real-qpu flag.
Publish benchmark claims with command, output, quality gate, context length, page faults, swaps, and system state.
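One way to capture most of those reporting fields in a single record is sketched below. It runs a command, then reads page-fault and swap counters for child processes from Python's `resource` module, which is available on Unix-like systems including macOS. The `run_with_report` helper and the record layout are assumptions; the command shown is a trivial placeholder rather than a real llama-cli invocation.

```python
import json
import platform
import resource
import subprocess
import sys
import time


def run_with_report(argv, ctx_size, quality_gate_passed=None):
    """Run a command and collect the fields a benchmark report should include."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "command": argv,
        "output": proc.stdout,
        "returncode": proc.returncode,
        "elapsed_seconds": round(elapsed, 3),
        "context_length": ctx_size,
        "quality_gate_passed": quality_gate_passed,  # filled in by a separate check
        "major_page_faults": after.ru_majflt - before.ru_majflt,
        "minor_page_faults": after.ru_minflt - before.ru_minflt,
        "swaps": after.ru_nswap - before.ru_nswap,
        "system": {"platform": platform.platform(), "machine": platform.machine()},
    }


# Placeholder command; substitute the real llama-cli argument list.
report = run_with_report([sys.executable, "-c", "print('ok')"], ctx_size=16384)
print(json.dumps(report, indent=2))
```

Major page faults are worth recording on an 8GB machine because a model that does not fit in RAM is served partly from disk, which directly affects throughput.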
Inspiration
This project was shaped by:
Dan Woods' Flash-MoE work on SSD-backed MoE inference
Andrej Karpathy's autoresearch loop
ByteShape and Potato OS Raspberry Pi Qwen3-30B-A3B demonstrations
IBM Quantum and Qiskit Runtime candidate sampling
Codex/GPT-5 as the research loop collaborator and experiment agent
Citation
See CITATION.cff.
About
MCP and QPU optimization harness for Qwen3 MoE inference on legacy Mac hardware
License
MIT license
Releases 1
v0.1-preprint: Qwen Air QPU/MCP Lab
Latest
May 27, 2026
Languages
Python 82.5%
HTML 16.8%
Shell 0.7%