
Different models solve number-theory race problem

In an AI bot competition, participants computed the longest run of 1 bits in binary expansions of palindromic primes. DeepSeek V4-Pro won with 73 points, while ChatGPT and Grok failed to register due to misinterpretation of precomputation rules. Kimi benefited from a bug that accidentally gave correct answers in early rounds and won the final round.

Article intelligence

Engineers · Intermediate

Key points

  • DeepSeek won with 73 points, followed by Claude (60) and GLM (40).
  • ChatGPT and Grok were DNP because they precomputed before connecting and missed the 10-second registration window.
  • Kimi had an off-by-15 bug but coincidentally got correct results for early rounds and won round 10.
  • Most bots used precomputation strategies, but only those that connected first and precomputed in background succeeded.

Why it matters

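The outcome turned on protocol handling as much as on algorithm quality: ChatGPT and Grok shipped correct algorithms but never registered, while the top two bots, DeepSeek (73 points) and Claude (60), connected first and precomputed in the background.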

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

The seventeenth challenge is a number-theory race. The server picks a 1-indexed integer n and the bot must report the length of the longest contiguous block of 1 bits in the binary expansion of p(n), the n-th palindromic prime. The sequence starts 2, 3, 5, 7, 11, 101, 131, 151, 181, 191, … (OEIS A002385); the n-th element is fixed, so every round has exactly one correct answer.
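To make the task concrete, here is a minimal sketch of the quantity being asked for. The function names are illustrative and not taken from any bot, and the brute-force primality check is only meant for the first few terms of the sequence.

```python
def longest_one_run(p: int) -> int:
    """Length of the longest contiguous block of 1 bits in p's binary expansion."""
    best = run = 0
    while p:
        if p & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
        p >>= 1
    return best


def is_palindromic_prime(m: int) -> bool:
    """Brute-force check by trial division; only suitable for small m."""
    s = str(m)
    if m < 2 or s != s[::-1]:
        return False
    return all(m % d for d in range(2, int(m ** 0.5) + 1))


# First ten terms of the sequence and the answer a bot would report for each.
first_ten = [m for m in range(2, 200) if is_palindromic_prime(m)]
for index, p in enumerate(first_ten, start=1):
    print(index, p, bin(p)[2:], longest_one_run(p))
```

For example, p(10) = 191 is 10111111 in binary, so the answer for n = 10 would be 6.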

The format is 10 solo rounds played serially. Per-round n ranges from 5,000 to 1,000,000. Bots are not told the schedule in advance. Per-round ranking gives 10/7/5/3/1/0 points among correct submissions, tied by earliest submission timestamp. Wrong / timeout / malformed responses score zero. Per-round wall-clock budget: 30 seconds.
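The per-round scoring rule can be sketched as follows. This is an illustration of the rule as described above, not the server's code; `score_round` and the tuple layout are assumptions, and the bot named "Other" is hypothetical.

```python
POINTS = (10, 7, 5, 3, 1)  # 1st..5th among correct submissions; everyone else gets 0


def score_round(submissions, correct_k):
    """submissions: list of (bot, answer, timestamp_s); answer is None on timeout."""
    # Only correct answers are ranked, earliest submission first.
    ranked = sorted((ts, bot) for bot, answer, ts in submissions if answer == correct_k)
    scores = {bot: 0 for bot, _answer, _ts in submissions}
    for place, (_ts, bot) in enumerate(ranked):
        if place < len(POINTS):
            scores[bot] = POINTS[place]
    return scores


# R1 podium times from the per-round table, plus a hypothetical fast wrong answer.
r1 = [
    ("GLM", 3, 0.04),
    ("DeepSeek", 3, 0.06),
    ("Claude", 3, 0.08),
    ("Other", 2, 0.01),
]
print(score_round(r1, correct_k=3))
```

A fast wrong answer earns nothing, so speed only matters once the answer is right.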

The dominant strategy choice is whether to enumerate palprimes lazily (start a background thread, answer rounds as the list grows) or eagerly (compute the whole list of 1,000,000 palprimes before submitting anything). prompt.md §9 permits eager precomputation before the first ROUND line, written with light amortization in mind: register first, then warm a cache while idle. Seven of the nine bots in the field read it that way. Two read it maximally, as a license to bypass the 30-second per-round wall-clock by deferring sock.connect() until after a full precompute. Those two run into server.py: REGISTRATION_WINDOW = 10.0, a 10-second window for sending the BOTNAME line, and never register.

MiMo (V2.5-Pro) is DNF. Three consecutive generation attempts terminated with finish_reason=length, 65,532 to 65,540 reasoning tokens, zero output tokens. This is MiMo’s fourth straight challenge as a generation DNF.

ChatGPT (GPT 5.5) and Grok (Expert 4.20) are DNP. Both bots compile fine and implement correct algorithms. Each defers sock.connect() until after a full precompute of 1,000,000 palindromic primes, reading prompt.md §9 (“the bot may take any approach … including pre-computation before the first ROUND line arrives. The 30 s clock only starts at each ROUND line.”) maximally, as license to bypass the per-round wall-clock entirely. ChatGPT’s source comment names the intent: # Precompute before connecting so no ROUND clock is running yet. The server’s 10-second registration window, undocumented in the prompt but enforced in server.py, catches both bots inside that precompute. Neither ever registers, and they don’t appear in the tournament log.

The results

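| Rank | Bot | Pts | 1sts | Correct | Total t (correct rounds) |
|---|---|---|---|---|---|
| #1 | DeepSeek (V4-Pro) | 73 | 4 | 9/10 | 11.5 s |
| #2 | Claude (Opus 4.7) | 60 | 1 | 9/10 | 11.9 s |
| #3 | GLM (5.1) | 40 | 4 | 7/10 | 41.0 s |
| #4 | Muse (Spark) | 24 | 0 | 9/10 | 82.4 s |
| #5 | Gemini (Pro 3.1) | 20 | 0 | 8/10 | 50.4 s |
| #6 | Kimi (K2.6) | 18 | 1 | 4/10 | 15.3 s |
| #7 | Nemotron (3 Super) | 5 | 0 | 8/10 | 67.4 s |
| DNP | ChatGPT (GPT 5.5) | - | - | - | - |
| DNP | Grok (Expert 4.20) | - | - | - | - |
| DNF | MiMo (V2.5-Pro) | - | - | - | - |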

(Total t is summed only over rounds the bot answered correctly. DNP: did not play. DNF: did not finish. Per-round timings are taken from the server’s results.log file, which is kept local-only by repo policy; the relevant excerpts are inlined in the per-round positions table below and the bot-specific sections that follow.)

Per-round positions

| Round | n | Correct k | 1st | 2nd | 3rd |
|---|---|---|---|---|---|
| R1 | 5,000 | 3 | GLM (0.04s) | DeepSeek (0.06s) | Claude (0.08s) |
| R2 | 10,000 | 5 | GLM (0.04s) | DeepSeek (0.07s) | Claude (0.09s) |
| R3 | 20,000 | 4 | GLM (0.05s) | DeepSeek (0.09s) | Claude (0.10s) |
| R4 | 30,000 | 4 | GLM (0.06s) | Claude (0.09s) | DeepSeek (0.10s) |
| R5 | 50,000 | 4 | DeepSeek (0.07s) | Claude (0.08s) | Muse (7.06s) |
| R6 | 75,000 | 4 | DeepSeek (0.07s) | Claude (0.08s) | Gemini (8.61s) |
| R7 | 100,000 | 4 | DeepSeek (0.09s) | Claude (0.11s) | Kimi (0.14s) |
| R8 | 250,000 | 4 | DeepSeek (4.43s) | Claude (5.76s) | Gemini (16.98s) |
| R9 | 500,000 | 5 | Claude (5.49s) | DeepSeek (6.53s) | Muse (27.56s) |
| R10 | 1,000,000 | 6 | Kimi (0.04s) | - | - |

Round 10 has a single correct submission. Kimi answered in 43 ms; every other bot that played R10 either timed out or, in GLM’s case, submitted its ANSWER 1 fallback after its precompute deadline expired.
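As a cross-check, the podium columns alone are enough to recompute the totals of the two leaders, since DeepSeek and Claude sit in the top three of every round they answered. The sketch below hard-codes the podiums from the per-round table; totals for other bots would only be lower bounds, because 4th and 5th places are not listed.

```python
PODIUM_POINTS = (10, 7, 5)  # points for 1st, 2nd, 3rd

# (1st, 2nd, 3rd) per round, copied from the per-round positions table.
podiums = [
    ("GLM", "DeepSeek", "Claude"),       # R1
    ("GLM", "DeepSeek", "Claude"),       # R2
    ("GLM", "DeepSeek", "Claude"),       # R3
    ("GLM", "Claude", "DeepSeek"),       # R4
    ("DeepSeek", "Claude", "Muse"),      # R5
    ("DeepSeek", "Claude", "Gemini"),    # R6
    ("DeepSeek", "Claude", "Kimi"),      # R7
    ("DeepSeek", "Claude", "Gemini"),    # R8
    ("Claude", "DeepSeek", "Muse"),      # R9
    ("Kimi", None, None),                # R10
]


def podium_total(bot):
    """Sum of points the bot earned from podium finishes only."""
    return sum(
        pts
        for row in podiums
        for name, pts in zip(row, PODIUM_POINTS)
        if name == bot
    )


print("DeepSeek", podium_total("DeepSeek"))
print("Claude", podium_total("Claude"))
```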

The registration-window gap (ChatGPT and Grok DNP)

ChatGPT and Grok both wrote correct, working bots. Both use the same algorithm class: enumerate decimal palindromes by their left half (the only palprime construction that matters past 11, since every even-length palindrome ≥ 100 is divisible by 11), then test each candidate with deterministic Miller-Rabin. ChatGPT (~250 lines) parallelises the enumeration across a multiprocessing pool and stores the longest-1-run for each palprime in a typed array. Grok (~130 lines) runs single-threaded, using a small-trial-division filter (primes up to 97) before a 9-witness Miller-Rabin (witnesses = [2, 3, 5, 7, 11, 13, 17, 19, 23]). Both implementations are correct and produce the full 1,000,000-palprime list in roughly 100 seconds on a typical core. That’s fast enough to finish before the tournament ends, and far too slow to fit inside the 10-second registration window.
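The algorithm class described above can be sketched in a few lines. This is an illustration, not either bot's source: it mirrors odd-length palindromes from their left half and tests each with Miller-Rabin over the nine witnesses quoted for Grok. That witness set (the first nine primes) is known to be deterministic for inputs below roughly 3.8 × 10^18, which comfortably covers 15-digit palindromes.

```python
from itertools import count, islice

WITNESSES = (2, 3, 5, 7, 11, 13, 17, 19, 23)  # the 9-witness set quoted for Grok


def is_prime(n: int) -> bool:
    """Miller-Rabin over a fixed witness set, with the witnesses as a trial filter."""
    if n < 2:
        return False
    for p in WITNESSES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in WITNESSES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def palindromic_primes():
    """Yield palindromic primes in increasing order."""
    yield from (2, 3, 5, 7, 11)
    # Past 11 only odd lengths matter: a left half of h digits gives 2h - 1 digits.
    for half_len in count(2):
        for left in range(10 ** (half_len - 1), 10 ** half_len):
            s = str(left)
            candidate = int(s + s[-2::-1])  # mirror everything but the last digit
            if is_prime(candidate):
                yield candidate


print(list(islice(palindromic_primes(), 22)))
```

A production version would likely skip left halves whose first digit is even or 5, since those palindromes end in that digit and cannot be prime.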

The structural choice that cost them the tournament:

ChatGPT

    def main():
        botname = os.environ.get("BOTNAME")
        ...
        # Precompute before connecting so no ROUND clock is running yet.
        answers = precompute_answers(MAX_N)  # ← ~100 s
        with socket.create_connection((HOST, PORT)) as sock:
            sock.sendall(f"{botname}\n".encode("ascii"))
            ...

Grok

    def main():
        botname = os.environ.get('BOTNAME')
        ...
        pal_primes = generate_palindromic_primes(1000000)  # ← ~100 s, single-threaded
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.connect((HOST, PORT))
            sock.sendall(f"{botname}\n".encode('ascii'))

ChatGPT’s own comment, # Precompute before connecting so no ROUND clock is running yet, names the intent directly: bypass the 30-second per-round budget by doing the entire enumeration in unmeasured time before any networking. Grok’s structure is the same shape, just with no comment. The prompt’s §9 Notes do permit precomputation before the first ROUND line (The bot may take any approach to compute p(n) … including pre-computation before the first ROUND line arrives. The 30 s clock only starts at each ROUND line.), and the maximal reading is that the entire algorithm can go there. Seven other bots read §9 more conservatively, registering first and then precomputing on a background thread while idle, and stayed in the tournament.

The 10-second registration window in server.py was almost certainly there for a different reason. REGISTRATION_WINDOW = 10.0 is the “wait for all racers to be at the line, then fire the gun” mechanism: a tournament can’t proceed until the field is set, and 10 seconds is enough for normal bots to handshake. It was not designed as an anti-arbitrage check against the §9 clock-bypass strategy. But after the window closes, the server’s listening socket stays bound while the server runs rounds, and accept() is never called again. Late connects complete the kernel-level TCP handshake but never register with the application.
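The behaviour described above can be reproduced with a small registration loop. This is a sketch under assumptions, not the actual server.py: only the constant REGISTRATION_WINDOW = 10.0 comes from the article, while `register_bots` and the single-recv name read are hypothetical simplifications.

```python
import socket
import time

REGISTRATION_WINDOW = 10.0  # value the article quotes from server.py


def register_bots(listener, window=REGISTRATION_WINDOW):
    """Accept bots until `window` seconds have passed; return {botname: connection}."""
    deadline = time.monotonic() + window
    bots = {}
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        listener.settimeout(remaining)
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            break  # window elapsed with nobody waiting
        try:
            conn.settimeout(max(deadline - time.monotonic(), 0.01))
            # Assumption: the BOTNAME line arrives in one small packet.
            name = conn.recv(256).decode("ascii", "replace").strip()
        except OSError:  # includes socket.timeout
            name = ""
        if name:
            conn.settimeout(None)
            bots[name] = conn
        else:
            conn.close()
    # The sketch never calls accept() again after this point. The listener stays
    # bound, so a late connect() still completes the TCP handshake and then
    # waits in the listen backlog, unseen by the application.
    return bots
```

With this shape, a bot that connects after the window sees a successful `connect()` and a successful `sendall()`, then blocks on its first read, which matches what the article reports for ChatGPT and Grok.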

In execution: both bots are still in their precompute when the server’s registration loop exits at t=10 s. The server logs “7 bots registered.” and runs all 10 rounds. When ChatGPT eventually finishes its multiprocessing precompute (under a minute on a multi-core box), it calls socket.create_connection((HOST, PORT)). The kernel handshake succeeds, but the server has no accept() pending, so the connection sits in the listen backlog unread. ChatGPT then blocks reading for a ROUND line that will never come. When the tournament ends and the server closes the listening socket, ChatGPT’s read returns empty and the process exits. Grok’s single-threaded enumeration is slower (~100 s); it may or may not finish before the tournament’s ~250 s end, but in either case its sock.connect() lands after the registration window. It hits the same dead-listen-socket condition as ChatGPT.

Both bots are recorded as DNP. They were launched, ran the full tournament length, tried to game the per-round clock by deferring sock.connect(), and got caught by an unrelated check.

DeepSeek and Claude: connect first, fill in the background

DeepSeek (V4-Pro) and Claude (Opus 4.7) get the protocol right. Both connect immediately, then spawn a daemon thread that enumerates palprimes from index 1 upward into a shared list, with the round handler blocking until len(primes) >= n.

DeepSeek’s solver is the cleanest in the field:

    def precompute():
        while len(gen.primes) < 1_000_000:
            p = gen.next_prime()
            gen.primes.append(p)

    t = threading.Thread(target=precompute, daemon=True)
    t.start()

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(('localhost', 7474))
    sock.sendall((botname + '\n').encode('ascii'))
    ...
    if line.startswith('ROUND'):
        n = int(parts[2])
        pn = get_p(n)  # blocks until len(primes) >= n
        k = longest_one_run(pn)
        sock.sendall(f"ANSWER {k}\n".encode('ascii'))

get_p(n) blocks the round-handler thread until the background filler has produced enough palprimes. On R1 to R7 (n ≤ 100,000), the list is already there: DeepSeek answers in 0.06 to 0.11 s. On R8 (n = 250,000) it waits 4.4 s; on R9 (n = 500,000) it waits 6.5 s; on R10 (n = 1,000,000) it never gets there before the 30-second deadline. Banks 73 points and 4 first-place finishes through R9, then times out on R10. Total wall time on the 9 correct rounds is 11.5 seconds.

Claude (Opus 4.7) uses an almost identical pattern with a threading.Condition for backfill notification. Same algorithmic shape, same 9-of-10 record, behind DeepSeek on submission time in rounds 1 to 3 and 5 to 8, ahead in R4. Wins R9 outright (5.49 s vs DeepSeek’s 6.53 s). Total 60 points.
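The connect-first pattern with a threading.Condition can be sketched as below. This is not either bot's code: `BackgroundList` and `small_palprimes` are hypothetical names, and the tiny brute-force generator stands in for a real enumerator.

```python
import threading


class BackgroundList:
    """A list filled by a daemon thread; readers block until index n exists."""

    def __init__(self, source):
        self._items = []
        self._cond = threading.Condition()
        filler = threading.Thread(target=self._fill, args=(source,), daemon=True)
        filler.start()

    def _fill(self, source):
        for item in source:
            with self._cond:
                self._items.append(item)
                self._cond.notify_all()  # wake any round handler waiting on n

    def get(self, n, timeout=None):
        """Return the n-th item (1-indexed), or None if it is not ready in time."""
        with self._cond:
            ready = self._cond.wait_for(lambda: len(self._items) >= n, timeout)
            return self._items[n - 1] if ready else None


def small_palprimes(limit=1000):
    """Stand-in enumerator: brute-force palindromic primes below `limit`."""
    for m in range(2, limit):
        s = str(m)
        if s == s[::-1] and all(m % d for d in range(2, int(m ** 0.5) + 1)):
            yield m


primes = BackgroundList(small_palprimes())
print(primes.get(10, timeout=5))
```

In a bot, the socket would be opened and the BOTNAME line sent before or right after starting the filler, so registration never waits on the enumeration. Notifying on every append is simple but chatty; batching notifications would be a reasonable refinement at a million entries.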

Both bots use the same deterministic Miller-Rabin witness set.

GLM (5.1) waits on its palprime list while n > len(pp_list) and time.time() is still short of its deadline. When n > len(pp_list) and the 25-second deadline expires, GLM submits ANSWER 1 and moves on. R8 (n = 250,000), R9 (n = 500,000), and R10 (n = 1,000,000) all trigger this fallback. Three wrong answers in a row, all scored zero. GLM keeps its 4 first-place finishes from R1 to R4 and lands 3rd overall on 40 points.
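A deadline-with-fallback wait of the kind described for GLM might look like the sketch below. Only pp_list, time.time(), the 25-second figure, and the fallback answer of 1 come from the article; the function name, the polling interval, and the `answer_fn` parameter are assumptions.

```python
import time

FALLBACK_ANSWER = 1  # the guess the article says GLM submits on deadline expiry


def answer_or_fallback(n, pp_list, answer_fn, budget_s=25.0, poll_s=0.01):
    """Wait for pp_list to reach n entries; fall back once the budget is spent."""
    deadline = time.time() + budget_s
    # Poll until a background filler has produced enough entries or time runs out.
    while n > len(pp_list) and time.time() < deadline:
        time.sleep(poll_s)
    if n > len(pp_list):
        return FALLBACK_ANSWER
    return answer_fn(pp_list[n - 1])


def longest_one_run(p):
    """Longest block of 1 bits, via splitting the binary string on zeros."""
    return max(len(block) for block in bin(p)[2:].split("0"))


pp_list = [2, 3, 5, 7, 11]  # stand-in for a partially filled list
print(answer_or_fallback(4, pp_list, longest_one_run))
print(answer_or_fallback(10, pp_list, longest_one_run, budget_s=0.05))
```

Under the scoring rule, a wrong answer and a timeout both score zero, so the fallback does not cost points relative to waiting out the clock.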

Muse, Gemini, Nemotron: compute-per-round

Muse (Spark), Gemini (Pro 3.1), and Nemotron (3 Super) take the “compute from scratch on each ROUND” approach. The bots connect, register, then idle until a round arrives and enumerate palprimes inside the round handler. This works while the answer is in cache (each round’s enumeration extends the local list, and subsequent rounds with smaller n hit the cache), but their enumerators are slower than DeepSeek’s, so they fall behind as n grows.
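The compute-inside-the-handler approach amounts to a cache that is only extended when a round asks for an index it does not yet hold. A minimal sketch, with hypothetical names and a brute-force stand-in enumerator:

```python
class LazyPalprimeCache:
    """Extends its list on demand, inside the caller's (round handler's) time."""

    def __init__(self, generator):
        self._gen = generator
        self._cache = []

    def get(self, n):
        # All the enumeration cost for a new high-water mark lands in this call.
        while len(self._cache) < n:
            self._cache.append(next(self._gen))
        return self._cache[n - 1]

    def __len__(self):
        return len(self._cache)


def small_palprimes(limit=100_000):
    """Stand-in enumerator: brute-force palindromic primes below `limit`."""
    for m in range(2, limit):
        s = str(m)
        if s == s[::-1] and all(m % d for d in range(2, int(m ** 0.5) + 1)):
            yield m


cache = LazyPalprimeCache(small_palprimes())
print(cache.get(10))  # extends the cache to 10 entries
print(cache.get(5))   # served from the cache, no new work
```

The trade-off is that idle time between rounds goes unused, so each jump in n is paid for on the round clock.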

Muse finishes 4th overall with 9 correct rounds and 24 points; the points come mostly from 3rd, 4th, and 5th-place finishes from R5 onward, when its slower-but-correct enumerator finally completed. Gemini scores 20 points; Nemotron scores 5 after consistent 6th and 7th place finishes. None of these three bots wins a round.

All three time out on R10 (n = 1,000,000) with their palprime list still building.

Kimi: an off-by-15 bug, two coincidences, and a 10-point R10

Kimi (K2.6) is the strangest bot in the field. It uses multiprocessing.Pool to enumerate palindromes and test primality across CPU cores in parallel, while answering rounds from a shared list. Strictly speaking it is not purely connect-first: main() calls build_small(answers, TARGET) before opening the socket, which handles the small palprimes (lengths 1, 2, 5, 7, 9, 11) synchronously. That pre-connect phase finishes inside the 10-second registration window in practice, so Kimi registers in time. The heavy lifting (lengths 13 and 15, via the multiprocess pool) happens after the connect, in parallel with rounds. Same general shape as DeepSeek and Claude (connect, then fill in the background), with the seed phase done up front and the bulk work parallelised.
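The digit lengths listed for build_small skip 3. As a sanity check on the "off-by-15" in the section title, the sketch below counts palindromic primes per digit length by brute force. The helper names are illustrative and not from Kimi's source.

```python
def is_prime(m: int) -> bool:
    """Trial division; fine for the small inputs used here."""
    if m < 2:
        return False
    return all(m % d for d in range(2, int(m ** 0.5) + 1))


def palindromes_of_length(digits: int):
    """Yield all decimal palindromes with exactly `digits` digits, in order."""
    half_len = (digits + 1) // 2
    for left in range(10 ** (half_len - 1), 10 ** half_len):
        s = str(left)
        # Odd lengths share the middle digit; even lengths mirror the whole half.
        tail = s[-2::-1] if digits % 2 else s[::-1]
        yield int(s + tail)


counts = {
    digits: sum(1 for p in palindromes_of_length(digits) if is_prime(p))
    for digits in (1, 2, 3)
}
print(counts)
```

There are 15 three-digit palindromic primes (101 through 929), so a seeding routine that omitted them would shift every later index by 15, which would be consistent with the size of the bug.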

The bug is in build_small, which seeds the small-n entries before the multiprocess pool starts up:

    def build_small(answers, target):
        # length 1
        for p in (2, 3, 5, 7):
            answers.append(max_one_run(p))  # ✓ indices 0–3
        # length 2
        answers.append(max_one_run(11))  # ✓ index 4
        # lengths 3,5,7,9,11 (k=3..6)  # ← comment lies
        for k in range(3, 7):  # k = 3, 4, 5, 6
            start = POW10[k - 1]
            ...  # generates 2k-1 = 5, 7, 9, 11 digit palindromes

The variable k is the half-length of the construct

[truncated for AI cost control]