AI-assisted binary patching to fix an abandoned router's DHCP bug
An 8-byte, AI-assisted binary patch that stops EdgeOS dhcrelay3 from re-relaying already-relayed DHCP packets, an RFC 2131 violation.
~/blog $ cat edgeos-dhcrelay-binary-patch.md · Dax Kelson · June 29, 2026 · 12 min read
A centralized DHCP server on an ISP network I was working on was logging about 200 duplicate request packets a second. The cause was EdgeOS's relay daemon, dhcrelay3, re-relaying packets that had already been relayed once, a violation of RFC 2131. The network runs more than 45 EdgeOS routers and the relay paths are several levels deep, which is what turned a single client request into a steady flood at the center. I fixed it with an 8-byte patch to the shipped dhcrelay3 binary. After it was deployed, the duplicates stopped: the server went from roughly 200 a second to zero.
TL;DR: one wrong branch in do_relay4() re-relays already-relayed packets. Overwrite eight bytes so it jumps to the function's existing return instead. On the Octeon (big-endian) routers:
$ cp /usr/sbin/dhcrelay3 dhcrelay3.patched
$ printf '\x14\xc0\xff\x56\x8f\xbf\x00\x94' | \
    dd of=dhcrelay3.patched bs=1 seek=$((0xCF38)) count=8 conv=notrunc
Everything below is how those eight bytes were found.
Why DHCP needs a relay, and how it works
A device that does not yet have an IP address asks for one by broadcasting a DHCP DISCOVER. Broadcasts stay on the local subnet: they travel the local datalink and a router does not forward them. On a small network the DHCP server sits on that same subnet and answers directly. Large networks commonly centralize DHCP on a single server for easier management, often on its own subnet, so addressing is administered in one place rather than configured on every router. That server is not on the client's subnet, so the broadcast never reaches it by itself.
A DHCP relay bridges that gap. It runs on the local router, listens for the broadcast, and forwards the request to the configured server as an ordinary unicast packet. Before forwarding, it fills in the request's giaddr field, the "gateway IP address", with the address of the interface that received the request. That one field does two jobs: it tells the server which subnet the client is on, so the server leases an address from the right pool, and it gives the server somewhere to send the reply. The reply comes back to the relay, and the relay hands it to the client.
%%{init: {'theme':'base','fontFamily':'Open Sans, sans-serif','themeVariables':{'fontFamily':'Open Sans, sans-serif','fontSize':'16px','lineColor':'#00609A','edgeLabelBackground':'#FFFFFF','background':'#FFFFFF'},'flowchart':{'htmlLabels':true,'useMaxWidth':true,'nodeSpacing':55,'rankSpacing':70,'curve':'basis'}}}%%
flowchart LR
  C["client<br/>needs a lease, has no IP yet"]
  R["DHCP relay<br/>runs on the local router"]
  S["central DHCP server"]
  C -->|"1. broadcast DISCOVER, stays on the local subnet"| R
  R -->|"2. unicast to the server, giaddr = the local subnet"| S
  classDef infra fill:#EAF2FB,stroke:#0078C1,stroke-width:2px,color:#003C60
  classDef accent fill:#FBE3B3,stroke:#ED8C0C,stroke-width:2.5px,color:#5A3A00
  class C,S infra
  class R accent
  linkStyle default stroke:#00609A,color:#1F2937
Broadcasts (step 1) never cross the router, so the relay (amber) is the only thing that carries the request to a central server (step 2). The giaddr it stamps tells the server which subnet to lease from and where to send the reply.
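The stamping step can be sketched in a few lines. This is an illustration, not the dhcrelay3 code: the function name `stamp_giaddr` and the sample address are made up here, and the byte offsets are those of the standard BOOTP header (hops at byte 3, giaddr at bytes 24 to 27). It assumes the request arrives with giaddr still zero.

```python
import socket

HOPS_OFFSET = 3      # BOOTP header starts: op, htype, hlen, hops
GIADDR_OFFSET = 24   # follows xid, secs, flags, ciaddr, yiaddr, siaddr


def stamp_giaddr(packet: bytes, iface_ip: str) -> bytes:
    """Return a copy of a BOOTP request prepared for its first relay.

    Hypothetical helper: writes the receiving interface's address into
    giaddr and bumps the hops counter by one.
    """
    buf = bytearray(packet)
    buf[GIADDR_OFFSET:GIADDR_OFFSET + 4] = socket.inet_aton(iface_ip)
    buf[HOPS_OFFSET] += 1
    return bytes(buf)


# A fake 236-byte BOOTP header: op=1 (request), hops=0, all else zero.
request = bytes([1, 1, 6, 0]) + bytes(232)
relayed = stamp_giaddr(request, "10.20.30.1")  # example address, not from the network
print(socket.inet_ntoa(relayed[24:28]), relayed[3])
```

The server reads the same four bytes back to pick the pool and to address its reply.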
Once a relay has stamped giaddr, the packet is marked as already handled. RFC 2131 section 4.1.1 is explicit about what every other relay should then do with it: nothing. A request that already carries a giaddr has been relayed by another agent, so the right move is to leave it alone and let normal routing carry it to the server. EdgeOS's build only left a packet alone when its giaddr matched one of the relay's own addresses. Every other already-relayed packet it relayed a second time, which is where the loop starts.
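The difference between the two behaviours fits in two predicates. This is a model of the description above, not a transcription of either code base; the function names and the sample addresses are hypothetical.

```python
import socket

GIADDR_OFFSET = 24  # giaddr position in the BOOTP header


def giaddr_of(packet: bytes) -> str:
    return socket.inet_ntoa(packet[GIADDR_OFFSET:GIADDR_OFFSET + 4])


def should_relay_expected(packet: bytes) -> bool:
    """Relay only requests that no agent has stamped yet."""
    return giaddr_of(packet) == "0.0.0.0"


def should_relay_edgeos_like(packet: bytes, own_addrs: set) -> bool:
    """Model of the buggy behaviour: skip only when giaddr is one of ours."""
    return giaddr_of(packet) not in own_addrs


def make_request(giaddr: str) -> bytes:
    # Minimal fake BOOTP header with only giaddr filled in.
    return bytes(24) + socket.inet_aton(giaddr) + bytes(208)


own = {"10.2.0.1"}                    # this relay's addresses (made up)
fresh = make_request("0.0.0.0")       # straight from a client
by_other = make_request("10.1.0.1")   # already stamped by another relay
by_self = make_request("10.2.0.1")    # stamped by this relay

for name, pkt in [("fresh", fresh), ("by_other", by_other), ("by_self", by_self)]:
    print(name, should_relay_expected(pkt), should_relay_edgeos_like(pkt, own))
```

The two predicates agree on fresh requests and on the relay's own packets. They differ only on the middle case, a packet stamped by some other relay, and that is the case that feeds the loop.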
How one DISCOVER becomes a flood
The loop needs at least two relays in the path. The first relay, R1, does the right thing: it stamps giaddr and forwards the request once toward the server. The trouble starts at the next relay the packet transits. R2 should simply route that already-stamped packet onward, but instead it re-relays it, sending a second copy toward the server with hops bumped by one. Now two packets are in flight, and each transits the relays further up the path, every one of which re-relays every copy it sees. So the copies multiply at each affected hop instead of merely adding up: a copy made by R2 gets re-relayed by R3 and R4, and the copies R3 makes are themselves re-relayed again upstream. The only brake is the BOOTP hops field, which each re-relay increments; a packet is dropped once it reaches the cap of 16. That cap bounds how deep a re-relay chain can run, not how wide the fan-out gets, so one client DISCOVER can reach the server as many more than sixteen copies, all sharing a single transaction ID.
EdgeOS compounds it: its starter script puts every relay interface in both the receive and the forward role, so a router can pick up the copy it just sent and relay that one too. Stack more than 45 such routers several levels deep around a single server, and the multiplication adds up to the steady 200 duplicate requests a second the server was logging.
%%{init: {'theme':'base','fontFamily':'Open Sans, sans-serif','themeVariables':{'fontFamily':'Open Sans, sans-serif','fontSize':'14px','lineColor':'#00609A','edgeLabelBackground':'#FFFFFF','background':'#FFFFFF'},'flowchart':{'htmlLabels':true,'useMaxWidth':true,'nodeSpacing':30,'rankSpacing':42,'curve':'basis'}}}%%
flowchart TD
  C["client: 1 DISCOVER"]
  R1["relay R1 (correct)<br/>relays once · hops 1"]
  A["next affected relay re-relays<br/>hops 2"]
  B1["re-relayed again<br/>hops 3"]
  B2["re-relayed again<br/>hops 3"]
  D["… every copy is re-relayed by the next affected relay, until each branch reaches the hop cap (16)"]
  S["central DHCP server<br/>sees the same request many times over"]
  C --> R1
  R1 --> A
  A --> B1
  A --> B2
  B1 --> D
  B2 --> D
  D --> S
  classDef infra fill:#EAF2FB,stroke:#0078C1,stroke-width:2px,color:#003C60
  classDef deny fill:#FBE0E0,stroke:#C0392B,stroke-width:2px,color:#7A1C1C
  class C,R1,S infra
  class A,B1,B2,D deny
  linkStyle default stroke:#00609A,color:#1F2937
Each affected relay re-relays every copy it sees, so the duplicates multiply at each hop instead of adding up (the diagram shows the branching shape, not an exact factor). The hop cap of 16 bounds how deep a re-relay chain runs, not how wide it fans out. Across the network of more than 45 routers this totaled about 200 duplicate requests a second, and zero after the patch.
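A toy model makes the "multiply, not add" point concrete. It assumes a single straight chain of affected relays after R1, where each one lets the kernel route the packet it saw and also emits one re-relayed copy with hops bumped by one. Real topologies and the receive-and-forward interface behaviour are messier, so this shows the shape of the growth, not the measured 200 a second.

```python
HOP_CAP = 16  # re-relaying stops once a copy's hops field reaches the cap


def copies_at_server(affected_relays_after_first: int) -> int:
    """Count the copies of one DISCOVER that reach the server (toy model)."""
    packets = [1]  # hops values in flight; R1 relayed once, so hops = 1
    for _ in range(affected_relays_after_first):
        next_packets = []
        for hops in packets:
            next_packets.append(hops)          # normal routing carries it onward
            if hops < HOP_CAP:
                next_packets.append(hops + 1)  # the bug: an extra re-relayed copy
        packets = next_packets
    return len(packets)


for n in range(5):
    print(n, copies_at_server(n))
```

In this model the count doubles with each affected relay (1, 2, 4, 8, 16) until the hop cap starts pruning the deepest chains.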
A capture at the server shows the same request arriving again and again with a climbing hops count (note the two separate copies at hops 2, the multiplication starting):
.305226  In   giaddr=  hops=1  R1 makes the initial relay
.305256  Out  giaddr=  hops=1  normal routing
.305522  Out  giaddr=  hops=2  re-relayed by the next router (the bug)
.305765  Out  giaddr=  hops=2  picked up again on egress (the bug)
.305995  Out  giaddr=  hops=3  and on, multiplying until the hop cap
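Because every copy shares one transaction ID, counting duplicates comes down to grouping captured requests by the xid field (bytes 4 to 7 of the BOOTP header). A minimal sketch on synthetic payloads follows; the helper names and the xid values are invented for the example, and extracting the payloads from a real capture is left out.

```python
from collections import Counter


def duplicate_counts(payloads):
    """Map each transaction ID to the number of copies seen."""
    return Counter(int.from_bytes(p[4:8], "big") for p in payloads)


def fake_request(xid: int, hops: int) -> bytes:
    # op=1, htype=1, hlen=6, hops, then xid; the rest of the header is zero.
    return bytes([1, 1, 6, hops]) + xid.to_bytes(4, "big") + bytes(228)


seen = [
    fake_request(0xABCD1234, 1),
    fake_request(0xABCD1234, 2),
    fake_request(0xABCD1234, 2),
    fake_request(0xABCD1234, 3),
    fake_request(0x00000001, 1),
]
counts = duplicate_counts(seen)
for xid, n in counts.items():
    print(f"{xid:#010x} seen {n} time(s)")
```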
dhcrelay3 is open source, but the binary is frozen
dhcrelay3 is not closed-source mystery code. It is the relay agent from ISC DHCP, the long-running DHCP suite from the Internet Systems Consortium (ISC), which is open source and now end-of-life (ISC has moved on to its Kea successor). The fix is no secret either. In ISC's source the guard is a single line near the top of the do_relay4() function:
if (packet->giaddr.s_addr) return;
The difficulty is not knowing the fix, it is applying it. EdgeOS ships a stripped, cross-compiled MIPS build of an old ISC version, and there is no supported way to rebuild the daemon and reinstall it on these end-of-life routers. So the one-line change has to go into the binary that is already on the box.
You do not need to read MIPS assembly to follow the rest; the short version is that the compiled code runs a different check where that giaddr test belongs, and the patch swaps it. Here is how you find the spot. The function logs a message, "Dropping request received on %s", and that text is easy to locate in the binary; its address pins the surrounding code. Just before the log call sits the check the compiled daemon actually runs, an interface-flag test rather than a giaddr test:
cf30:  lw    v0,144(s2)       load this interface's flags
cf34:  andi  v0,v0,0x8        isolate the "downstream" bit
cf38:  beqz  v0,d224          if it is not set, branch away
cf3c:  lw    a0,-32712(gp)    (an unrelated load, overwritten later)
objdump with the right byte-order flag (-EB for the Cavium Octeon routers, -EL for the MediaTek MT7621 one) produces that listing. To trace a value across instructions by hand, an interactive reverse-engineering tool helps: radare2 (an open-source disassembler framework) or Ghidra (the open-source disassembler and decompiler the NSA released) both let you load the binary and follow a register through the code.
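The string search that starts the hunt needs no special tooling. The sketch below uses a hypothetical helper, `find_all`, to return every file offset where the log text occurs. It runs on a synthetic blob here; pointing it at the real binary is a one-line change, shown in the comment. Mapping the string's offset to the code that references it is still the disassembler's job.

```python
NEEDLE = b"Dropping request received on %s"


def find_all(data: bytes, needle: bytes):
    """Return every offset at which needle occurs in data."""
    offsets, start = [], 0
    while (pos := data.find(needle, start)) != -1:
        offsets.append(pos)
        start = pos + 1
    return offsets


# Synthetic stand-in for the binary. For the real file, something like:
#   data = open("dhcrelay3.patched", "rb").read()
data = bytes(100) + NEEDLE + bytes(50) + NEEDLE
for off in find_all(data, NEEDLE):
    print(hex(off))
```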
The patch: jump to an exit the function already has
The whole fix replaces that interface check with a giaddr test that bails out. In plain terms: if the packet already carries a giaddr, send it straight to the function's existing "return and do nothing more" exit, so it is never relayed a second time. Two instructions at one spot change:
cf38:  bnez  a2,cc94       giaddr not zero? jump to the exit at cc94
cf3c:  lw    ra,148(sp)    restore the return address
Three things make the change safe, and none of them require new code. First, the value the patched instruction tests, giaddr, is already sitting in a register (a2): the function loaded it earlier and nothing overwrites it before this point. Second, the jump target, 0xCC94, is not somewhere new; it is the function's existing exit sequence, the same return path other parts of the function already use, so reusing it shifts no addresses and adds nothing to the file. Third, the jump distance is arithmetic the assembler would have computed anyway: (0xCC94 - (0xCF38 + 4)) / 4 = -170, encoded into the instruction bytes.
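The offset arithmetic and the eight bytes can be checked by hand. The sketch below assumes the standard MIPS32 I-type layout (6-bit opcode, two 5-bit register fields, 16-bit immediate) and the usual register numbers (a2 = 6, sp = 29, ra = 31). `bnez a2` is the assembler's shorthand for `bne a2, zero`. The helper names are made up for the example.

```python
import struct

PATCH_ADDR = 0xCF38   # where the branch is written
EXIT_ADDR = 0xCC94    # the function's existing exit sequence
ZERO, A2, SP, RA = 0, 6, 29, 31  # MIPS register numbers


def branch_offset(branch_addr: int, target: int) -> int:
    """Word offset relative to the instruction after the branch."""
    delta = target - (branch_addr + 4)
    assert delta % 4 == 0, "MIPS branch targets are word-aligned"
    return delta // 4


def encode_i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


offset = branch_offset(PATCH_ADDR, EXIT_ADDR)
bnez = encode_i_type(0x05, A2, ZERO, offset)  # bne a2, zero, offset
lw_ra = encode_i_type(0x23, SP, RA, 148)      # lw ra, 148(sp)
patch_be = struct.pack(">II", bnez, lw_ra)    # big-endian, as on the Octeon

print(offset, patch_be.hex())  # -170 14c0ff568fbf0094
```

The result matches the bytes in the `printf` at the top: `14 c0 ff 56` is the branch and `8f bf 00 94` is the load in its delay slot.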
%%{init: {'theme':'base','fontFamily':'Open Sans, sans-serif','themeVariables':{'fontFamily':'Open Sans, sans-serif','fontSize':'16px','lineColor':'#00609A','edgeLabelBackground':'#FFFFFF','background':'#FFFFFF'},'flowchart':{'htmlLabels':true,'useMaxWidth':true,'nodeSpacing':50,'rankSpacing':56,'curve':'basis'}}}%%
flowchart TD
  L["earlier in do_relay4()<br/>giaddr is loaded into a register and is still set at the patch site"]
  L --> P
  P["the patched instruction<br/>at 0xCF38: is giaddr non-zero? if so, jump to the existing exit"]
  P -->|"giaddr = 0"| F["new request<br/>fall through: relay it once"]
  P -->|"giaddr ≠ 0"| E["existing exit (0xCC94)<br/>code the function already has: return without re-relaying"]
  E --> K["kernel routes it to the DHCP server, once"]
  classDef infra fill:#EAF2FB,stroke:#0078C1,stroke-width:2px,color:#003C60
  classDef accent fill:#FBE3B3,stroke:#ED8C0C,stroke-width:2.5px,color:#5A3A00
  class L,F,E,K infra
  class P accent
  linkStyle default stroke:#00609A,color:#1F2937
The patched instruction (amber) sends already-relayed packets to the exit at 0xCC94, a return the function already contains, so the fix adds no new code. New requests fall through and are relayed exactly once.
One MIPS quirk is worth a sentence, because it is the kind of thing that turns a tidy patch into a crash. On MIPS the instruction immediately after a jump always runs, taken or not (the "delay slot"). The instruction the patch puts there, lw ra,148(sp), restores the return address, which is exactly the first thing the 0xCC94 exit does anyway, so it is correct whether the jump is taken or the code falls through. The patched bytes line up with the existing return pattern instead of fighting it.
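Since a patch like this overwrites bytes blindly, it is worth confirming that the file really holds the instructions the listing shows before writing anything. This is a minimal verify-then-patch sketch, with a hypothetical `apply_patch` helper. `expected_old` is whatever your own disassembly shows at the patch site; it is left as a parameter here and not asserted for any particular firmware build.

```python
def apply_patch(image: bytes, offset: int, expected_old: bytes, new: bytes) -> bytes:
    """Return image with new written at offset, refusing on any surprise."""
    if len(expected_old) != len(new):
        raise ValueError("patch must not change the file length")
    found = image[offset:offset + len(new)]
    if found == new:
        return image  # already patched; nothing to do
    if found != expected_old:
        raise ValueError(f"unexpected bytes at {offset:#x}: {found.hex()}")
    return image[:offset] + new + image[offset + len(new):]


# Synthetic demonstration: an 8-byte placeholder stands in for the old code.
old = b"\xAA" * 8
new = bytes.fromhex("14c0ff568fbf0094")
image = bytes(16) + old + bytes(16)

patched = apply_patch(image, 16, old, new)
print(patched[16:24].hex(), len(patched) == len(image))
```

Running it twice is harmless, and a binary from a different firmware build fails loudly and is left untouched.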
The same logic ports to the other supported routers, but the bytes do not: a different processor needs a different encoding, not a different offset. The repo c
[truncated for AI cost control]