2026-06-30 14:21 UTC | In-site rewrite | 7 min read | Updated: 2026-06-30 14:28 UTC

The AI Model Accessibility Checker

AIMAC, an initiative by the GAAD Foundation in partnership with ServiceNow, evaluated 37 top AI models on web page accessibility across 28 categories. OpenAI's GPT 5.4 Mini and GPT 5.3 Codex tied for first with a median AIMAC debt of 0.00. Alibaba's Qwen and Z.ai's GLM 4.7 Flash also performed well. Low contrast text is the most common accessibility issue in AI-generated pages, appearing in 84.2% of pages.

Source: Hacker News AI | Author: MavisBacon

AIMAC: The AI Model Accessibility Checker

AI is writing more code than ever. But is it accessible to People with Disabilities?

We prompted the top AI models to build web pages across 28 categories and audited them for accessibility. We published every generated page, side by side, so you can see how different models tackled the same design challenge. We even measured emdash usage.

AIMAC is an initiative by the GAAD Foundation, in partnership with ServiceNow. Updated Jun 20, 2026.

Leaderboard

Legend: 🅿️ Pareto Optimal | 🏆 Lowest Debt

AIMAC Leaderboard showing model rankings by accessibility debt, violations, and cost

| Rank | Model | AIMAC Debt (lower = better) | Cost (in USD) |
|---|---|---|---|
| 1 🅿️ 🏆 | OpenAI: GPT 5.4 Mini | 0.00 [1] * | $0.95 |
| 2 | OpenAI: GPT 5.3 Codex | 0.00 [1] * | $3.02 |
| 3 | OpenAI: GPT 5.5 | 3.32 | $12.41 |
| 4 | OpenAI: GPT 5.5 Pro | 3.43 | $128.11 |
| 5 🅿️ | OpenAI: gpt oss 120b | 3.85 | $0.09 |
| 6 | Qwen: Qwen3.5 397B A17B | 4.09 | $0.76 |
| 7 | Z.ai: GLM 4.7 Flash | 4.19 | $0.10 |
| 8 | Google: Gemini 3.1 Pro Preview | 4.40 | $4.16 |
| 9 | MoonshotAI: Kimi K2.7 Code | 4.48 | $1.46 |
| 10 | Qwen: Qwen3 Coder Next | 4.54 | $0.27 |
| 11 | Anthropic: Claude Haiku 4.5 | 4.57 | $2.30 |
| 12 | Qwen: Qwen3.7 Plus | 4.64 | $0.61 |
| 13 | Qwen: Qwen3.6 Flash | 4.67 | $0.58 |
| 14 | Anthropic: Claude Opus 4.8 | 4.775 | $9.33 |
| 15 | NVIDIA: Nemotron 3 Super | 4.778 | $0.16 |
| 16 | DeepSeek: DeepSeek V4 Flash | 4.83 | $0.10 |
| 17 | Anthropic: Claude Fable 5 | 4.94 | $22.33 |
| 18 | Z.ai: GLM 5.2 | 4.95 | $2.01 |
| 19 | MiniMax: MiniMax M3 | 5.16 | $0.73 |
| 20 | DeepSeek: DeepSeek V4 Pro | 5.43 | $0.71 |
| 21 | Qwen: Qwen3 Coder Plus | 7.18 | $0.62 |
| 22 | Anthropic: Claude Opus 4.8 (Fast) | 7.46 | $19.52 |
| 23 | Mistral: Mistral Medium 3.5 | 8.16 | $1.65 |
| 24 | Amazon: Nova 2 Lite | 8.20 | $0.43 |
| 25 | Mistral: Mistral Large 3 2512 | 8.229 | $0.43 |
| 26 | Arcee AI: Trinity Large Thinking | 8.230 | $0.20 |
| 27 | Mistral: Codestral 2508 | 8.66 | $0.14 |
| 28 | Qwen: Qwen3.7 Max | 9.18 | $1.79 |
| 29 | Anthropic: Claude Sonnet 4.6 | 9.83 | $12.64 |
| 30 | Google: Gemma 4 31B | 10.30 | $0.17 |
| 31 🅿️ | Google: Gemma 4 26B A4B | 11.15 | $0.06 |
| 32 | Mistral: Mistral Small 4 | 12.92 | $0.17 |
| 33 | MoonshotAI: Kimi K2.6 | 14.77 | $2.60 |
| 34 | Kwaipilot: KAT Coder Pro V2 | 14.79 | $0.43 |
| 35 | Google: Gemini 3.5 Flash | 15.00 | $4.83 |
| 36 | xAI: Grok 4.3 | 15.03 | $0.56 |
| 37 | xAI: Grok Build 0.1 | 16.42 | $1.24 |
| Total | | | $237.68 |

[1] #1 and #2 tied with an AIMAC Debt of 0.00. Tiebreaker: #1 averaged fewer violations (0.91 vs 0.94).

GPT 5.3 Codex shows a median AIMAC Debt of 0.00. This means at least half of the 28 categories had zero accessibility violations, but some categories still had minor issues (20 total violations across all categories).
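To make the median and the tiebreaker concrete, here is a minimal sketch. The per-category scores are hypothetical (AIMAC's per-category data and its debt formula are not reproduced here); only the two tied rows use the figures from the footnote above.

```python
from statistics import median

# Hypothetical per-category debt scores for one model across 28 categories.
# These values are illustrative only; they are not AIMAC data.
category_debts = [0.0] * 16 + [0.4, 0.7, 1.1, 1.5, 2.0, 2.3,
                               2.8, 3.1, 3.6, 4.2, 5.0, 6.3]
assert len(category_debts) == 28

# With more than half of the values at zero, the median is zero
# even though the remaining categories still carry some debt.
median_debt = median(category_debts)

# Tiebreaker from the leaderboard footnote: the median debt is equal,
# so the lower average violation count ranks first.
tied = [
    {"model": "GPT 5.3 Codex", "median_debt": 0.00, "avg_violations": 0.94},
    {"model": "GPT 5.4 Mini", "median_debt": 0.00, "avg_violations": 0.91},
]
ranked = sorted(tied, key=lambda m: (m["median_debt"], m["avg_violations"]))

print(median_debt)
print([m["model"] for m in ranked])
```

Sorting on a `(median_debt, avg_violations)` tuple is one simple way to express "debt first, violations as tiebreaker"; the benchmark's own ranking code may differ.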

Deep Dive

Analysis

Introduction

95.9% of the top million websites fail basic accessibility checks, according to WebAIM, which has tracked the figure for seven years. After six years of marginal improvement, 2026 reversed the trend: errors per page jumped 10% to 56.1, and the failure rate climbed back to 95.9%.

AI is writing more of the world's code every day. Vibe Coding was the Collins Dictionary Word of the Year. If AI keeps writing code as poorly as the developers it learned from, nothing changes. But if it prioritizes accessibility, the web gets its first real chance to improve.

Our one ambitious goal

Ensure that AI models write accessible code by default.

Which Model is Best?

The Pareto Frontier

AIMAC Debt vs Cost

Choosing a model isn't simply about which model is most accessible. Some models are very expensive. Benchmarks commonly use Pareto Frontier analysis to compare models on quality and cost. Pareto optimal models (teal diamonds) are the efficient picks: from any of them, lowering the AIMAC Debt means paying more, and paying less means accepting a higher AIMAC Debt. A gold ring marks the lowest AIMAC Debt.
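As a rough illustration of the Pareto idea, the sketch below computes the frontier over a subset of leaderboard rows, using the published debt and cost figures. The `Entry` type and function names are illustrative assumptions, not part of AIMAC's tooling.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    model: str
    debt: float  # median AIMAC Debt, lower is better
    cost: float  # USD, lower is better

# A subset of leaderboard rows (debt and cost as published).
ENTRIES = [
    Entry("GPT 5.4 Mini", 0.00, 0.95),
    Entry("GPT 5.3 Codex", 0.00, 3.02),
    Entry("GPT 5.5 Pro", 3.43, 128.11),
    Entry("gpt oss 120b", 3.85, 0.09),
    Entry("GLM 4.7 Flash", 4.19, 0.10),
    Entry("Claude Opus 4.8", 4.775, 9.33),
    Entry("DeepSeek V4 Flash", 4.83, 0.10),
    Entry("Gemma 4 26B A4B", 11.15, 0.06),
    Entry("Grok Build 0.1", 16.42, 1.24),
]

def dominates(a: Entry, b: Entry) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return (a.debt <= b.debt and a.cost <= b.cost
            and (a.debt < b.debt or a.cost < b.cost))

def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates."""
    return [e for e in entries
            if not any(dominates(other, e) for other in entries)]

frontier = sorted(pareto_frontier(ENTRIES), key=lambda e: e.debt)
for e in frontier:
    print(f"{e.model}: debt {e.debt:.2f}, ${e.cost:.2f}")
```

On this subset the frontier is GPT 5.4 Mini, gpt oss 120b, and Gemma 4 26B A4B, the same three models the leaderboard flags with 🅿️. GPT 5.3 Codex drops out because GPT 5.4 Mini matches its debt at a lower cost.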

Top 3 Winners

OpenAI

Alibaba/Qwen

Z.ai

OpenAI dominates the top of the leaderboard. Two of their models achieve a median AIMAC Debt of 0.00: GPT 5.4 Mini (#1) and GPT 5.3 Codex (#2). A median of 0.00 means at least half of the 28 categories had zero violations, though a few categories still had minor issues. OpenAI holds all of the top five spots, with GPT 5.5 (#3), GPT 5.5 Pro (#4), and the open-weight gpt oss 120b (#5, for just $0.09) rounding out the group.

Alibaba/Qwen is the strongest non-OpenAI lab in this run. Qwen3.5 397B A17B ranks #6 with an AIMAC Debt of 4.09 for $0.76, and Qwen3 Coder Next ranks #10 at 4.54 for $0.27. Qwen also places Qwen3.7 Plus (#12) and Qwen3.6 Flash (#13) just outside the top ten.

Z.ai has the most interesting value result outside OpenAI and Qwen. GLM 4.7 Flash ranks #7 with a debt of 4.19 for $0.10. The broader Z.ai story is mixed: their flagship GLM 5.2 lands mid-pack at #18, but GLM 4.7 Flash still earns its place as a cheap, efficient coding model that newer flagship releases have not replaced.

Google's Model Mix

Google's Gemini 3 Pro Preview once finished dead last at #39 with an AIMAC Debt of 10.65. We were disappointed because Google has the tools and talent to do better.

Its replacement, Gemini 3.1 Pro Preview, now sits at #8 with an AIMAC Debt of 4.40. Google is no longer at the bottom. The rest of Google's roster is mixed: the open-weight Gemma 4 31B and Gemma 4 26B A4B are inexpensive but land near the lower third, and Gemini 3.5 Flash finishes at #35. Still, Gemini 3.1 Pro Preview is exactly the kind of progress we hoped this benchmark would encourage.

What About Claude?

Anthropic has real developer mindshare, and Claude is often the default choice when teams want strong coding help. In this benchmark, their best result is Claude Haiku 4.5 at rank #11 (4.57 for $2.30), followed by Claude Opus 4.8 at #14 (4.775 for $9.33).

That is real progress for Opus, but it is not the whole Claude story. Claude Opus 4.8 (Fast) drops to #22 (7.46 for $19.52), and Claude Sonnet 4.6 sits at #29 (9.83 for $12.64). Sonnet generates 1,186 accessibility violations, more than two and a half times the field average of 439. The same gap shows in Anthropic's frontend-design skill, where accessibility is barely mentioned while the guidance is overwhelmingly visual.

On June 9, 2026, Anthropic released Claude Fable 5, the public version of Mythos, a model it treats as so powerful that it is dangerous. The unfiltered version goes only to vetted cyberdefenders because, ostensibly, it finds software vulnerabilities that expert humans miss. Anthropic's CEO says models like it will soon write most of the world's code and do the work of most software engineers.

When it comes to accessibility, Fable ranks #17 of 37, the middle of the pack. Part of the reason is surely that Anthropic is not prioritizing accessibility. But if we are to believe the claims that AI is approaching Artificial General Intelligence (AGI) and will replace humans at all these jobs, shouldn't it be getting consistently better at accessibility? Or perhaps it's just marketing?

Anthropic is a Public Benefit Corporation whose stated values include acting "for the global good" and being good to their users, whom they define broadly as "anyone impacted by the technology we build." People with disabilities are impacted every time Claude generates inaccessible code. We hope Anthropic will bring the same energy they apply to AI safety to the accessibility of their models' output.

Beyond the Rankings

As AI models get better at visual design, they face a tradeoff between optimizing for beauty or for accessibility. You can achieve both, but it requires deliberate effort.

If you're picking a model for your workflow, start with a category page that matches your use case. Each category shows how 37 models handled the same prompt, alongside their AIMAC Debt. For example, visit the Sports category in grid mode to compare sports designs side by side.

If you already have a model in mind, use the model detail page to see what its pages look like across all 28 categories. Then drill down into the categories that match your needs. See GPT 5.5 Pro in grid mode for an example.

Value vs Premium

Closed vs Open (Source/Weights)

A closed model holds the top spot. GPT 5.4 Mini (proprietary) ranks #1, and closed OpenAI models fill spots #2 through #4. Then the open-weight tier arrives fast: OpenAI's gpt oss 120b ranks #5 for 9 cents, Qwen3.5 397B A17B ranks #6 for 76 cents, and GLM 4.7 Flash ranks #7 for 10 cents.

The upper-middle tier remains packed with value picks across both closed and open-weight releases. Qwen3 Coder Next ranks #10 for 27 cents, Qwen3.7 Plus ranks #12 for 61 cents, and Qwen3.6 Flash ranks #13 for 58 cents.

China-based labs remain a large part of the field. Qwen, Z.ai, DeepSeek, MiniMax, MoonshotAI, and Kwaipilot account for 14 of 37 models. Alibaba's Qwen line has six entries, and MoonshotAI's new Kimi K2.7 Code debuts at #9 for $1.46, the company's strongest result on this benchmark. Despite US chip restrictions, Chinese labs keep shipping.

DeepSeek made headlines in January 2025 when their R1 model, which claimed performance similar to Western models at a fraction of the cost, triggered a $600 billion single-day loss in Nvidia's market cap. Their new V4 endpoints are meaningfully better on AIMAC: DeepSeek V4 Flash ranks #16 with a debt of 4.83 for 10 cents, and DeepSeek V4 Pro ranks #20 with a debt of 5.43 for 71 cents. That is not top-tier accessibility, but it is no longer the near-bottom DeepSeek story from the previous roster.

Mistral still struggles on this benchmark. Medium 3.5 is their best result at #23 with a debt of 8.16, while Large 3 2512 ranks #25, Codestral 2508 ranks #27, and Mistral Small 4 ranks #32. Their models are affordable, but none crack the top 20 on AIMAC accessibility.

What Trips Models Up

Low contrast text dominates both AI-generated and human-built websites. Both columns report the share of pages with at least one instance of each issue. AIMAC uses axe-core across AI-generated pages; WebAIM uses WAVE across the top 1,000,000 home pages.

AIMAC: Top 6 issues

Share of pages with >= 1 error

Low contrast text: 84.2%

Empty links: 28.0%

Missing form labels: 26.3%

Empty buttons: 6.0%

Target size too small: 4.0%

Links distinguished only by color: 3.9%

WebAIM Million 2026: Top 6 issues

Share of home pages with >= 1 error

Low contrast text: 84%

Missing alt text: 53%

Missing form labels: 51%

Empty links: 46%

Empty buttons: 31%

Missing document language: 14%
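Since low contrast text tops both lists, it may help to see what a contrast check computes. The sketch below follows the WCAG 2.x relative-luminance and contrast-ratio definitions and the 4.5:1 AA threshold for normal-size text. It is a simplified illustration: the function names are assumptions, and real checkers such as axe-core also account for font size, opacity, and background images.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

AA_NORMAL_TEXT = 4.5  # minimum ratio for normal-size text at WCAG level AA

for fg in ("#000000", "#767676", "#777777"):
    ratio = contrast_ratio(fg, "#ffffff")
    verdict = "pass" if ratio >= AA_NORMAL_TEXT else "fail"
    print(f"{fg} on #ffffff: {ratio:.2f} ({verdict})")
```

The two grays differ by a single step per channel, yet one lands just above the threshold and the other just below it, which shows how easily a stylish light-gray palette can tip into a violation.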

Emdash Benchmark

We tracked emdash usage as a small writing-style signal, because punctuation can affect how text is read aloud.

Screen readers handle emdashes differently. Some announce "em dash" at every occurrence, others treat it as a pause, and some ignore it depending on voice and settings. Ricky Onsman explored the issue and found this is largely a screen reader behavior difference, not a content authoring bug.

We sanity-checked this with screen reader users, and none of our friends reported emdash-heavy text as a practical problem in day-to-day reading. So this turned out to be more interesting than impactful, and it does not affect AIMAC rankings.

Emdash usage varies widely. Across 37 models, counts range from 0 (Codestral 2508) to 754 (Claude Sonnet 4.6). Only one model uses zero emdashes.
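A count like this can be approximated in a few lines. The sketch below is an assumption about the general approach, not AIMAC's actual method: it counts the U+2014 character in page text, and the model names and page contents are hypothetical.

```python
EM_DASH = "\u2014"  # the em dash character

def count_emdashes(pages: dict[str, str]) -> int:
    """Total em dashes across all pages generated by one model."""
    return sum(text.count(EM_DASH) for text in pages.values())

# Hypothetical page text keyed by model, then by category.
pages_by_model = {
    "model-a": {"sports": "Scores \u2014 live \u2014 now", "news": "Top stories"},
    "model-b": {"sports": "Scores, live, now", "news": "Top stories"},
}

totals = {model: count_emdashes(pages) for model, pages in pages_by_model.items()}

# Fewest em dashes first, matching the leaderboard's ordering.
leaderboard = sorted(totals.items(), key=lambda kv: kv[1])
print(leaderboard)
```

A fuller version would likely also count HTML entities such as `&mdash;` in the generated markup; that detail is left out here.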

View emdash leaderboard (37 models)

Emdash usage leaderboard showing model names and emdash counts

Rows are listed in rank order, fewest emdashes first.

| Model | Lab | Emdash Count |
|---|---|---|
| Codestral 2508 | Mistral | 0 |
| Mistral Large 3 2512 | Mistral | 3 |
| Nova 2 Lite | Amazon | 3 |
| Mistral Medium 3.5 | Mistral | 4 |
| Qwen3.5 397B A17B | Qwen | 7 |
| Trinity Large Thinking | Arcee AI | 7 |
| GLM 4.7 Flash | Z.ai | 9 |
| Qwen3.7 Plus | Qwen | 9 |
| KAT Coder Pro V2 | Kwaipilot | 10 |
| Gemini 3.1 Pro Preview | Google | 17 |
| Qwen3 Coder Plus | Qwen | 18 |
| Gemma 4 26B A4B | Google | 20 |
| Grok 4.3 | xAI | 23 |
| Gemini 3.5 Flash | Google | 24 |
| Mistral Small 4 | Mistral | 26 |
| GPT 5.5 Pro | OpenAI | 35 |
| GPT 5.5 | OpenAI | 38 |
| Gemma 4 31B | Google | 40 |
| gpt oss 120b | OpenAI | 46 |
| DeepSeek V4 Pro | DeepSeek | 57 |
| Claude Haiku 4.5 | Anthropic | |

[truncated for AI cost control]