The IPv4 Parser AI Couldn't Have Written
This article presents a fast IPv4 parser using SWAR and bit tricks without SIMD, achieving high throughput. The author explains loading IP strings into u64 registers, validating ranges with dynamic masks, and handling dot separators via lookup tables. Benchmarks show 2.60 GB/s throughput on i7-7700K, outperforming many non-SIMD implementations.
The IPv4 Parser AI Couldn't Have Written
2026-06-20
SWAR Zig AI
I was reading Daniel Lemire’s blog, which I highly recommend; specifically: Parsing IP addresses quickly (portably, without SIMD magic).
Reading the title and given his track record, I braced myself for some sort of insane SWAR (SIMD Within A Register) parsing technique, maybe combined with obscure bit twiddling that I’m too stupid to understand without reading it 12 times, or something along those lines.
To my surprise, Lemire showcased a linear scan function mostly written by AI.
Although I agree with his conclusion that he shouldn’t retire just yet, I think we should avoid relying on AI to generate fast code altogether!!
Before diving into my claim, let’s firstly satisfy our hunger for SWAR and bit tricks!
A fast, non-SIMD IPv4 parser
The first thing I do when I approach such problems is to get a feel for the valid inputs. When I think about IPv4, three examples quickly come to mind:
192.168.1.1 255.255.255.255 0.0.0.0
To keep things simple, we are going to treat only the dotted decimal version!
From there, we start noticing some things:
The valid input range for each triplet is [0, 255].
The input length ranges from 7 to 15 characters.
Each string must have 3 dots, ‘.’
Reading input:
Since the input string will always fit into 16 bytes, we can smash it into 2 u64, loading bytes [0..8] in the first u64, which we will call head, and bytes [len - 8..len] in the second u64, which we will refer to as tail.
Unfortunately, with an address.len == 7, we will encounter two problems: the first read will access 1 byte over the input length (since it will try to read until byte 8), and the second read will access 1 byte before the start of our input string, given pointer math.
Luckily, the first problem is an acceptable quirk, and we can simply document it in the API contract, since almost always, strings will be null-terminated or part of a larger buffer, and in both cases, we get a safe 8-byte read!
As for the second problem, we have to adjust our subtraction such that, in case the input length is 7, we don’t overflow or read before the start of the string.
Doing so, the two reads become:
var head: u64 = undefined; @memcpy(std.mem.asBytes(&head), address[0..8]);
var tail: u64 = undefined; const len_is_7 = address.len == 7; const begin = if (len_is_7) 0 else address.len - 8; const end = begin + 8; @memcpy(std.mem.asBytes(&tail), address[begin..end]); tail 0x31, 0x39, 0x32 const input = 0x32_39_31;
// Subtract the ASCII offset (0x30) from all 3 bytes simultaneously // using a single wrapping subtraction (-%). const d = input -% 0x30_30_30; // = 0x32_39_31 - 0x30_30_30 = 0x02_09_01
// Now we need to multiply those values by their positional value // 1 * 100, 9 * 10, 2 * 1 -> * 0x_64_0A_01 const value = d *% 0x640A01;
Our result now resides in the third lane and could easily be extracted with (value & 0x00FF0000) >> 16. However, we can play smarty-pants and use a purpose-built x86_64 instruction introduced with the BMI2 extension: pext.
pext will accept two operands, an input and a mask, and will basically copy the input bits where the mask bits are 1 to the rightmost spot.
Using this, our extraction simply becomes: pext(value, 0xFF0000); the reason to involve such an obscure operation will be clearer later!
Since this function will try to parse everything, invalid triplets like "999" or even non-digit characters, we need a validation step…
Usually, SWAR validation is simply a range check to see if our processed characters fall nicely between 0x00 and 0x09. We can achieve this without looping by exploiting the most significant bit, 0x80, in each lane. If we add 0x76 to a valid digit like 0x09, it becomes 0x7F, leaving that highest bit unset. But the moment a value hits an invalid 0x0A or higher, adding 0x76 pushes it to 0x80 or above, immediately flipping the highest bit to 1.
There is a catch: if an input lane already had its highest bit set from the start (like a non-ASCII character), this addition might cause an overflow that wraps around and clears the bit, hiding the error. To prevent this, we combine the original subtracted register d with the addition result using a bitwise OR. This ensures that if the highest bit was set either before or after the addition, it stays set. Finally, masking with 0x808080 isolates these bits across all three bytes simultaneously. If the result is zero, every single character is a valid digit.
const err = ((d | (d +% 0x767676)) & 0x808080) == 0;
Our case, however, is a bit more convoluted because we need to enforce the strict IPv4 upper bound of 255. The maximum allowed limit for each digit depends entirely on the digits to its left.
Here is the schematic breakdown of these cross-lane dependencies:
Hundreds Lane: └── Allowed range: [0, 2]
Tens Lane: ├── If Hundreds is 0 or 1, Allowed range: [0, 9] └── If Hundreds is 2, Allowed range: [0, 5]
Ones Lane: ├── If Hundreds == 2 AND Tens == 5, Allowed range: [0, 5] └── Otherwise, Allowed range: [0, 9]
Our validation masks then are:
0x76767D = no 2 0x767A7D = 2 present but 5 is not 0x7A7A7D = 2 and 5 both present
To map these cascading boundaries without branching, we can dynamically morph the base mask (0x76767D) by computing precise 0x04 bit adjustments directly from the digits. If the ‘hundreds’ digit is "2", we add 0x04 to the ‘tens’ lane to shift its max bound from "9" to "5" (turning 0x76 into 0x7A). If both ‘hundreds’ is "2" and ‘tens’ is "5", we add 0x04 to the ‘ones’ lane.
const base_mask = 0x76767D; const get_25 = (d +% 0x7B7E) & 0x8080; // Compare the hundreds lane with the tens lane; if the 200 bit is set // then the 50 bit can be left set (if it is on) // Shift the final result to get 0x04 in the right position const adjust = (get_25 & ((get_25 > 40, 0xFF0000FFFF0000FF). The shift is necessary to get the interesting part at the start, since pdep will actually read right-to-left (eheh Little-Endian) to produce the ". 1. 1" layout above. The mask encodes which byte lanes should be populated and which should remain zeroed out. The zeroed lanes are important because a non-zero value in an inactive lane would cause the lane - 0x30 subtraction to trip our validity checks.
The remaining problem is: how do we find the right pdep mask for a given input’s dot configuration? We hash-map it!!
To build the index into that table, we need to locate the dots in the input. We start by XORing the input against 0x2E2E2E2E2E2E2E2E (where 0x2E is the ASCII code for '.'). Thanks to the XOR identity x ^ x == 0, lanes that contain a dot become 0x00. We then subtract 0x0101010101010101 from the result: any lane that was 0x00 will borrow from the lane above, producing 0xFF in that position. Finally, ANDing with 0x8080808080808080 isolates only those high bits, giving us a clean bitmask of the dot positions.
This dot-position mask is then multiplied by a magic constant to map it to a unique slot in a 32-entry lookup table (we only need 32 slots to cover all valid dot configurations). We need to guarantee that no two valid dot masks collide in the map, so a constant needs to be carefully brute-forced into existence :P
var index_high = tail ^ 0x2E2E2E2E2E2E2E2E; index_high -%= 0x0101010101010101; index_high &= 0x8080808080808080; index_high *%= 0x39EDDB77A53AC365; // Magic constant index_high >>= 59; // 32 slots are enough for all valid configurations
Note: Invalid characters may occasionally produce a valid-looking index. We simply let parseTwoTriplets’s existing checks catch those cases.
With the index in hand, we look up both the pdep mask for the lane layout and a shift amount to right-align the significant bytes before scattering them. Since the pdep masks already describe exactly which lanes are active, we can AND them with the parseTwoTriplets subtraction mask at the same time, disabling the inactive lanes so they don’t trip our checks.
Putting it all together:
const high_pdep = lut_high.pdep[index_high]; const high_shift = lut_high.shift[index_high]; const high = parseTwoTripletsHigh(pdep(tail >> @intCast(high_shift), high_pdep), high_pdep);
What do we do now?
Well, now we need to start loving our work the way my parents love me: conditionally, and only if I perform flawlessly.
We need to build a really really good test suite because we shouldn’t trust our work; this is outside the scope of the post, but it should really be a torture test!!
The reason is that even with the best ideas we will miss edge cases. For example, our loading routine treats 255.255.255 the same as 255.255.255.255, making it a valid input :)
A simple fix is to check that address.len - 3 == lut_high[high_index].active_lanes + lut_low[low_index].active_lanes (so we need to add an active lane count to the lookup table).
At this point, we squint super hard to our code and alternative ideas or inefficiencies will surface, for example look at our previous snippet:
var index_high = tail ^ 0x2E2E2E2E2E2E2E2E; index_high -%= 0x0101010101010101; index_high &= 0x8080808080808080; index_high *%= 0x39EDDB77A53AC365; // Magic constant index_high >>= 59; // 32 slots are enough for all valid configurations
The last 3 operations are basically a slightly more complex version of a pext, we gather the top bits, do some shuffling then push ’em to the right; At the cost of a bigger look up table, we could exchange this sequence for a single pext, the final version would be:
inline fn hash(data: u64) usize { var h = data ^ 0x2E2E2E2E2E2E2E2E; h -%= 0x0101010101010101; return @intCast(pext(h, 0x0080808080808000)); }
The same thing is applicable for our adjustments computation in our parseTwoTriplets functions, we can use a look up table instead:
inline fn parseTwoTripletsHigh(chunk: u64, active_lanes: u64) ParseResult { const d = chunk -% (0x3030302E3030302E & active_lanes);
const index = pext(d +% 0x007B7E00007B7E00, 0x0080800000808000); const adjust = adjust_lut_high[index];
const err = (d | (d +% 0x76767D7F76767D7F +% adjust)) & 0x8080808080808080; const compact = d *% 0x640A01;
return .{ .value = @intCast(pext(compact, 0xFF000000FF000000)), .err = err }; }
I’ve decided to keep both those changes because they resulted in increased throughput.
Other than squinting really hard, we copy-paste the algorithm to Compiler Explorer and check the assembly for waste while we fiddle with the code trying to remove it, llvm-mca could also help finding bottlenecks.
Those last steps need to be repeated until nothing more can be found and the algorithm passes all tests. The presented version of the algorithm went through that exact process at least 6 times.
Is that at least actually fast??
I’ve put up a benchmark that generates 1.5M addresses, parses them with our function, and here are the results that I got on my i7700k:
// Governor set to performance ❯ sudo taskset -c 3 chrt -f 99 ./benchmark info: Parsed 1500000 random IPs 64 times (1274.88MB) in 490992976 ns info: Average latency: 5.11 ns/ip info: Throughput: 2.60 GB/s
We can also repurpose Robert Graham’s benchmark. These are the results for the 1.5M addresses input size (I’ve removed the SSE version since I couldn’t get a valid checksum):
[ ] freq time cycl inst ipc brch miss l1d checksum [ ai ] 4.5-GHz 20.5-ns 92 153 1.7 38 1.8 0.3 [0x00000000] [ swar ] 4.5-GHz 24.4-ns 109 365 3.3 3 0.0 0.3 [0x00000000] [ from ] 4.5-GHz 31.0-ns 139 362 2.6 71 1.7 0.3 [0x00000000] [ dfa ] 4.5-GHz 35.7-ns 160 317 2.0 34 0.8 0.3 [0x00000000] [ fsm ] 4.5-GHz 36.5-ns 164 371 2.3 107 1.7 0.3 [0x00000000] [ fsm2 ] 4.5-GHz 30.9-ns 139 308 2.2 84 2.0 0.3 [0x00000000] [ bmi ] 4.5-GHz 7.1-ns 32 102 3.2 6 0.0 0.3 [0x00000000]
Note: This benchmark disfavors us since it passes a pointer to the start of the address and a maximum readable length, but not the actual length of the IP; we needed to find the end of the IP before calling our routine.
As both benchmarks show, our functio
[truncated for AI cost control]