excursions in kernel optimization


background

vector search is supposed to be bandwidth-bound.

the arithmetic intensity is low—each byte fetched from memory gets used once, maybe twice. you spend more time moving data than operating on it.

quantization breaks this. compress vectors hard enough and the system flips from chasing memory bandwidth to starving for compute. most vector search systems never escape the bandwidth trap.

turbopuffer did, per their latest blog post, and now their bottleneck is a tight loop counting bits.

i wanted to understand what that actually feels like. so i wrote the kernel for it.

the theory

RaBitQ is a quantization scheme for approximate nearest neighbor search. the full method involves normalization, random orthogonal transforms, and deriving an unbiased distance estimator with provable error bounds. i'll focus on what matters for the kernel.

the setup. each data vector gets quantized into a D-bit string $\bar{x}_b$—one bit per dimension, packed into u64s. the query vector gets quantized into 4-bit unsigned integers, then decomposed into four bit-planes $\bar{q}_u^{(j)}$, each also D bits. equation 22 of the paper gives the core computation:

$$\langle \bar{x}_b, \bar{q}_u \rangle = \sum_{j=0}^{3} 2^j \cdot \langle \bar{x}_b, \bar{q}_u^{(j)} \rangle$$

each term $\langle \bar{x}_b, \bar{q}_u^{(j)} \rangle$ is a binary inner product: POPCOUNT($\bar{x}_b$ AND $\bar{q}_u^{(j)}$). the whole thing becomes a weighted sum of popcounts—one per bit-plane, weighted by powers of two.
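a small sketch of equation 22 end to end, under one plausible bit layout (bit for dimension d lives at word d/64, bit d%64); the function names are mine, not the paper's:

```rust
// decompose 4-bit query codes into four D-bit planes: plane j holds bit j of every code.
fn decompose_bit_planes(q_codes: &[u8]) -> [Vec<u64>; 4] {
    let words = q_codes.len().div_ceil(64);
    let mut planes: [Vec<u64>; 4] = std::array::from_fn(|_| vec![0u64; words]);
    for (d, &code) in q_codes.iter().enumerate() {
        for j in 0..4 {
            if (code >> j) & 1 == 1 {
                planes[j][d / 64] |= 1u64 << (d % 64);
            }
        }
    }
    planes
}

// equation 22: sum over j of 2^j * popcount(data AND plane_j).
fn weighted_popcount_ip(data_bits: &[u64], planes: &[Vec<u64>; 4]) -> u32 {
    (0..4)
        .map(|j| {
            let dot: u32 = data_bits
                .iter()
                .zip(planes[j].iter())
                .map(|(d, q)| (d & q).count_ones())
                .sum();
            dot << j // weight by 2^j
        })
        .sum()
}
```

the identity is easy to check by hand: with query codes [5, 10, 15, 0, 7] and data bits set on dimensions {0, 2, 3}, the weighted popcount equals 5 + 15 + 0 = 20, the plain integer dot product.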

the estimator. the raw inner product isn't a distance yet. RaBitQ derives an unbiased estimator that corrects for quantization error:

$$\bar{x} \cdot \bar{q} = (A \times \text{ip} + B \times \text{popcount} + C) \times \text{inv\_norm}$$

where $A = 2\Delta/\sqrt{D}$ and $B = 2v_l/\sqrt{D}$ ($\Delta$ is the query's quantization step, $v_l$ its lower bound), $C = -(\Delta/\sqrt{D}) \cdot \text{query\_sum} - \sqrt{D} \cdot v_l$ collects the query-only terms (query_sum is the sum of the 4-bit query codes), ip is the weighted popcount from equation 22, popcount is the data vector's bit count, and inv_norm $= 1/\langle \bar{x}, \bar{o} \rangle$ is precomputed per data vector.

the key insight: all the expensive stuff—square roots, divisions, query-dependent terms—can be hoisted out of the inner loop. the hot path is pure integer: 4 ANDs, 4 POPCOUNTs, shifts, adds. floating-point only touches the ~1% of candidates that survive to reranking.
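that hoisting can be sketched directly; the struct and function names here are mine, but the coefficient formulas follow the estimator above and the scalar baseline below:

```rust
// per-query constants, computed once before scanning the batch
struct QueryConstants {
    a: f32, // scales the integer inner product
    b: f32, // scales the data vector's popcount
    c: f32, // query-only additive term
}

fn hoist_query_constants(delta: f32, v_l: f32, query_sum: u32, dim: usize) -> QueryConstants {
    // the only sqrt and division live here, outside the hot loop
    let sqrt_d = (dim as f32).sqrt();
    let inv_sqrt_d = 1.0 / sqrt_d;
    QueryConstants {
        a: 2.0 * delta * inv_sqrt_d,
        b: 2.0 * v_l * inv_sqrt_d,
        c: -(delta * inv_sqrt_d) * (query_sum as f32) - sqrt_d * v_l,
    }
}

// per-vector finish: a single multiply-add on the integer results
fn finish(k: &QueryConstants, ip: u32, popcount: u32, inv_norm: f32) -> f32 {
    (k.a * ip as f32 + k.b * popcount as f32 + k.c) * inv_norm
}
```

the hot loop produces only ip; finish runs once per candidate, and everything query-dependent is already folded into three floats.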

the kernel

scalar baseline

i started with a simple scalar implementation to serve as a constant point of reference:

pub fn compute_inner_product_estimator(
    data: &QuantizedDataVector,
    query: &QuantizedQuery,
    dim: usize,
) -> f32 {
    // Core: sum(j=0 to 3) 2^j * popcount(data AND bit_plane[j])
    let mut ip: u32 = 0;

    for j in 0..4 {
        let weight = 1u32 << j;
        let mut dot: u32 = 0;

        for (d, q) in data.bits.iter().zip(query.bit_planes[j].iter()) {
            dot += (d & q).count_ones();
        }

        ip += weight * dot;
    }

    // Apply the estimator: (A * ip + B * popcount + C) * inv_norm
    let sqrt_d = (dim as f32).sqrt();
    let inv_sqrt_d = 1.0 / sqrt_d;

    let x_bar_dot_q_bar = (2.0 * query.delta * inv_sqrt_d) * (ip as f32)
        + (2.0 * query.v_l * inv_sqrt_d) * (data.popcount as f32)
        - (query.delta * inv_sqrt_d) * (query.query_sum as f32)
        - sqrt_d * query.v_l;

    x_bar_dot_q_bar / data.o_bar_dot_o
}

scalar baseline, 1000 vectors @ dim1024 on c6i.large: 21.4 ns/vector. the compiler auto-vectorizes some of this, which is why it's not terrible, but we can do better.

AVX-512 single-threaded

the key unlock is AVX-512's VPOPCNTDQ instruction: a popcount across a full 512-bit register, 3-cycle latency, pipelined at one per cycle. turbopuffer's blog called out this exact instruction—AVX2 has no vector popcount, which is why they made the switch.

struct-of-arrays, not array-of-structs. the naive layout groups each vector with its metadata (popcount, inv_norm, etc.). since the hot path only touches the bits, the better layout separates them—all bits contiguous, metadata elsewhere. no point polluting cache lines with bytes you won't read.

AoS: [v0_bits | v0_meta] [v1_bits | v1_meta] ...
SoA: [v0_bits, v1_bits, ...] [v0_meta, v1_meta, ...]
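a minimal sketch of the SoA layout in Rust; field names and the dim-1024 word count are assumptions, not turbopuffer's code:

```rust
const WORDS_PER_VEC: usize = 1024 / 64; // dim 1024 -> 16 u64 words per vector

struct Batch {
    bits: Vec<u64>,      // hot: n * WORDS_PER_VEC words, contiguous
    popcounts: Vec<u32>, // cold: only read when finishing a candidate
    inv_norms: Vec<f32>, // cold: 1 / <x_bar, o_bar>, per vector
}

impl Batch {
    // the scan touches only this contiguous slice, never the metadata
    fn bits_of(&self, i: usize) -> &[u64] {
        &self.bits[i * WORDS_PER_VEC..(i + 1) * WORDS_PER_VEC]
    }
}
```

the scan walks `bits` front to back; `popcounts` and `inv_norms` stay out of the cache lines the hot loop streams through.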

query pinned in registers. the query's four bit-planes occupy 8 zmm registers for the entire batch. data streams through; query stays put. no reloading, no cache pressure from the query side.

integer-only hot path. the loop is pure integer: load, and, popcount, shift, add. the floating-point postprocessing only runs on candidates that need exact distances.

prefetch 16 vectors ahead. tuned empirically for L3 latency at this working set size. see graph below.

ns/vector
    |
6.0 | *  (no prefetch)
    |  \
5.5 |   \
    |    * 4
5.0 |     \
    |      * 8
    |       \
    |         *---*---*---*
4.5 |         12  16  24  32
    |              ^
    |            knee  
4.0 |            
    |
    +---------------------------
              prefetch distance (vectors)
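the prefetch scheme itself is small; this sketch uses the stable x86 intrinsic, with the hot-loop body stood in by a plain popcount (the 16-vector distance is the empirical knee from the sweep above):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

const WORDS_PER_VEC: usize = 1024 / 64; // dim 1024
const PREFETCH_DISTANCE: usize = 16;    // vectors ahead, tuned empirically

fn scan_popcounts(bits: &[u64]) -> Vec<u32> {
    let n = bits.len() / WORDS_PER_VEC;
    (0..n)
        .map(|i| {
            // issue the prefetch for vector i + 16 before touching vector i
            if i + PREFETCH_DISTANCE < n {
                #[cfg(target_arch = "x86_64")]
                unsafe {
                    let ahead = bits.as_ptr().add((i + PREFETCH_DISTANCE) * WORDS_PER_VEC);
                    _mm_prefetch::<_MM_HINT_T0>(ahead as *const i8);
                }
            }
            // stand-in for the real AND + VPOPCNTDQ kernel
            bits[i * WORDS_PER_VEC..(i + 1) * WORDS_PER_VEC]
                .iter()
                .map(|w| w.count_ones())
                .sum()
        })
        .collect()
}
```

the cfg guard keeps the sketch portable; on non-x86 targets the prefetch simply disappears and the loop still computes the same result.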

single-threaded result 50K vectors @ dim1024: 4.83 ns/vector, 207M vectors/sec on c6i.large

consistent across batch sizes from 1K to 100K—only 3% degradation.

the cache hierarchy is doing its job.

dead ends

multi-query batching. the theory: process multiple queries simultaneously, amortize data loads across them. the reality: 50K vectors × 128 bytes = 6.4MB fits comfortably in L3. there's no memory bandwidth to save—the data's already there.

unrolling for ILP. processing two vectors per iteration to hide latency. no improvement—already saturating the execution port that handles VPOPCNTDQ. more parallelism doesn't help when you're bottlenecked on a specific functional unit.

the dead ends were useful. they forced me to locate the actual bottleneck instead of pattern-matching optimizations.

multi-threading

first attempt: c6i.large, 2 vCPUs. parallelism made it slower—2 threads took 37% longer than 1. initially i blamed thread overhead. that wasn't quite right.

c6i.large has 2 hyperthreads on 1 physical core. they share the same VPOPCNTDQ execution port. i wasn't parallelizing—i was adding contention for a single functional unit.

second attempt: c6i.2xlarge, 4 physical cores. now each thread gets its own execution units. 4 threads hit 2.17 ns/vector—a 2.2× speedup over single-threaded. hyperthreading beyond physical core count hurt performance—classic compute-bound SIMD behavior.
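the partitioning is plain data parallelism: each thread owns a contiguous range of vectors. a sketch with scoped threads, again with a popcount standing in for the real kernel:

```rust
const WORDS_PER_VEC: usize = 1024 / 64; // dim 1024

fn parallel_scan(bits: &[u64], threads: usize) -> Vec<u32> {
    let n = bits.len() / WORDS_PER_VEC;
    let chunk = n.div_ceil(threads).max(1);
    let mut out = vec![0u32; n];
    // scoped threads can borrow bits and disjoint slices of out directly
    std::thread::scope(|s| {
        for (t, out_chunk) in out.chunks_mut(chunk).enumerate() {
            let start = t * chunk;
            s.spawn(move || {
                for (i, o) in out_chunk.iter_mut().enumerate() {
                    let v = &bits[(start + i) * WORDS_PER_VEC..(start + i + 1) * WORDS_PER_VEC];
                    // stand-in for the per-vector AVX-512 inner product
                    *o = v.iter().map(|w| w.count_ones()).sum();
                }
            });
        }
    });
    out
}
```

no shared mutable state, no synchronization in the loop: each thread writes only its own slice, so the speedup question is purely about which execution units and cache levels the threads share.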

third attempt: c6i.4xlarge, 8 physical cores. 8 threads hit 1.89 ns/vector. but so did 4 threads on the same machine: doubling the cores bought nothing. the bottleneck shifted.

at 1.89 ns/vector, the kernel streams 128 bytes / 1.89 ns ≈ 68 GB/s out of L3 in aggregate. we've hit practical L3 saturation. adding more threads doesn't help because we're no longer compute-bound; we're memory-bound again, just at a different level of the hierarchy. mesh interconnect contention and prefetch traffic mean practical saturation hits well before the spec-sheet limit.

i optimized compute so hard that memory became the bottleneck. the system came full circle.

results

50K vectors @dim1024

metric             value
time per vector    1.89 ns
throughput         529M vectors/sec
bottleneck         L3 bandwidth (~68 GB/s)
vs. scalar         ~11.3x

scalar (c6i.large):                21.4 ns    baseline
AVX-512 single (c6i.large):         4.8 ns    4.4x  (compute-bound)
+ 4 threads (c6i.2xlarge):          2.2 ns    9.7x  (compute-bound)
+ 4 threads (c6i.4xlarge):         1.89 ns   11.3x  (L3-bound) ← ceiling
+ 8 threads (c6i.4xlarge):         1.89 ns   11.3x  (no improvement)

code

https://github.com/arighosh05/rabitq-kernel