Research case study

Q8K128 for Phi-3 Mini: A Rust Quantization Experiment

Smaller blocks reduced qkv reconstruction error by 8.9%, but same-corpus PPL stayed neutral, so SIMD work was not justified.

A measured negative result: better local error, no model-level gain.
Tags: Rust · Candle · Phi-3 Mini 4K · Q8K128 · SafeTensors · WikiText-2 PPL
  • 8.9% lower weighted qkv RMSE than baseline Q8K
  • 32/32 qkv layers improved in local reconstruction metrics
  • +0.0018 PPL delta for all-qkv Q8K128 on the same corpus
  • 2.0 tok/s on the scalar all-qkv path, intentionally not optimized further
Scope of work: implemented the Q8K128 format, validated the binary layout, built the histogram gate, ran same-corpus PPL, and published the reproduction commands.
Experiment pipeline

1. Hypothesis: smaller scale domains may reduce local quantization error.
2. Format design: two 128-weight Q8 blocks represent each 256-weight span.
3. Strict packing: magic, dtype, length, shape, and block counts fail loudly.
4. Histogram screening: RMSE, MAE, max error, zero collapse, and spiky blocks.
5. PPL validation: same-corpus WikiText-2, ctx 2048, 30 scored chunks.
6. Decision: no SIMD work without a model-level quality signal.

1. Executive Summary

The baseline project already had a practical mixed quantization setup for Phi-3 Mini: attention, gate/up, and lm_head weights use Q8K, while MLP down-projections use Q4K. Q8K128 was tested as a small, controlled format change: keep Q8K arithmetic, but reduce each block from 256 weights to 128 weights so every scale covers a narrower local range.

The result is useful, but not in the way I hoped. Q8K128 improved local reconstruction error on every qkv layer, but it did not improve WikiText-2 perplexity. The all-qkv Q8K128 run was essentially neutral on quality and far slower because the first implementation deliberately used scalar Rust rather than SIMD.

  • Local win: 8.9% weighted qkv RMSE improvement after halving block width.
  • Broad signal: 32/32 qkv layers improved in reconstruction metrics.
  • PPL neutral: 7.2670 all-qkv Q8K128 PPL versus 7.2652 baseline.
  • Tradeoff: 2.0 tok/s on the scalar all-qkv path, so SIMD was paused.
Main conclusion

Q8K128 Variant A should not be optimized further yet. It improves reconstruction metrics, but the PPL result does not justify SIMD work or broader integration. The experiment is still valuable because it shows where reconstruction-error histograms are predictive and where they are not enough.

My role in this experiment

I designed and implemented the Rust Q8K128 block format, feature-gated quantizer route, packer and loader checks, scalar inference path, histogram workflow, benchmark run, and decision criteria for stopping before SIMD work.

Code and benchmark artifacts are in the public repository: github.com/artem1984A/nibble. The most relevant Rust files are quant_q8k_128.rs, quantize_q8k128.rs, pack_q8k_safetensors.rs, and quant_linear.rs.

2. Why Try Q8K128?

Q8K stores a block scale plus 256 signed 8-bit weights. This is already a high-quality format, but it has one structural weakness: one large outlier inside the block determines the scale for all 256 values. When most values are small and one value is large, the small values get fewer effective quantization levels.

Q8K baseline: one scale covers 256 weights, stored as [i8; 256]. Compact and already optimized in the baseline matmul path.

Q8K128 variant: two local scales cover the same 256 weights, stored as [i8; 128] + [i8; 128]. Slightly larger, but less exposed to one outlier dominating a full block.

This is a conservative idea. It does not change the quantization math, it does not add a second scale term inside a block, and it does not introduce a new 6-bit or 4-bit packing scheme. It only asks whether smaller local scale domains recover enough accuracy to matter.
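To make the outlier effect concrete, here is a toy illustration of the reasoning (a standalone sketch, not repository code): one large value in a 256-weight block forces every small weight onto a coarse grid, while splitting the block lets the outlier-free half use a much finer scale.

Sketch (not repository code) - one outlier versus two local scales
// Toy illustration: one outlier sets the scale for a whole block, so
// splitting the block helps the outlier-free half.
fn rmse_q8(block: &[f32]) -> f32 {
    let amax = block.iter().fold(0f32, |m, &v| m.max(v.abs()));
    let d = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let se: f32 = block
        .iter()
        .map(|&w| {
            let q = (w / d).round().clamp(-127.0, 127.0);
            (w - q * d).powi(2)
        })
        .sum();
    (se / block.len() as f32).sqrt()
}

fn main() {
    // 255 small weights plus one outlier in the first half.
    let mut w = vec![0.004f32; 256];
    w[0] = 1.0;
    let full = rmse_q8(&w); // one scale for all 256 weights
    let split = ((rmse_q8(&w[..128]).powi(2) + rmse_q8(&w[128..]).powi(2)) / 2.0).sqrt();
    println!("256-wide block RMSE {full:.6}, two 128-wide blocks RMSE {split:.6}");
}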

Why not start with SIMD?

The rule for this experiment was intentionally strict: first prove that PPL improves, then optimize. Writing a fast SIMD kernel before a quality win would make the code more complex without proving that the format deserves to stay.

3. Experiment Design

The implementation followed a small blast-radius plan. Q8K128 lives behind --features experimental-q8k128, and the quantizer can route only selected tensors into the new format through CANDLE_Q8K128_POLICY. This allowed several scopes to be tested without disrupting the baseline Q8K/Q4K pipeline.

Experiment flow

1. Original Phi-3 shards: SafeTensors BF16/F32 checkpoint.
2. Quantize by policy: Q8K128, Q8K, Q4K, or preserved F32.
3. Pack strictly: format keys plus metadata tensors.
4. Histogram gate: RMSE, MAE, max error, spiky blocks.
5. PPL check: WikiText-2, ctx 2048, 30 chunks.
Policy | Meaning | Reason to test
layer0-qkv | Only model.layers.0.self_attn.qkv_proj.weight uses Q8K128. | Smallest possible PPL smoke test for the strongest outlier signal.
qkv | All 32 qkv projections use Q8K128. | Tests whether the broad histogram improvement becomes a model-level quality win.
attn | All qkv and attention output projections use Q8K128. | Available for follow-up, but not justified after the all-qkv PPL result.
q8k / all | Replace all baseline Q8K-routed tensors, or every quantized target tensor. | Broader policies exist for completeness, but would be expensive without a PPL signal.
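A minimal sketch of the name-based routing the table above implies (hypothetical function; the repository reads CANDLE_Q8K128_POLICY and applies its own matching rules, which may differ in detail):

Sketch (not repository code) - policy-to-tensor routing
// Hypothetical illustration of how a policy string could select tensors for
// Q8K128; the real quantizer's matching logic may differ.
fn routes_to_q8k128(policy: &str, tensor_name: &str) -> bool {
    match policy {
        "layer0-qkv" => tensor_name == "model.layers.0.self_attn.qkv_proj.weight",
        "qkv" => tensor_name.contains("self_attn.qkv_proj.weight"),
        "attn" => tensor_name.contains("self_attn.qkv_proj.weight")
            || tensor_name.contains("self_attn.o_proj.weight"),
        // Broader policies (q8k, all) would widen the match further.
        _ => false,
    }
}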

4. Rust Block Layout

The core type is deliberately plain: a C-compatible block with one f32 scale, 128 signed quantized weights, and eight 16-value partial sums. The block derives bytemuck::Pod and Zeroable, so it can be written and read as raw bytes once the layout has been locked down.

quant_q8k_128.rs - C-compatible block layout
use bytemuck::{Pod, Zeroable};

pub const QK_Q8K_128: usize = 128; // weights covered by one scale

#[repr(C)]
#[derive(Clone, Copy, Pod, Zeroable)]
pub struct BlockQ8K128 {
    pub d: f32,                        // block scale
    pub qs: [i8; QK_Q8K_128],          // 128 signed 8-bit weights
    pub bsums: [i16; QK_Q8K_128 / 16], // eight 16-value partial sums
}

// 4 (d) + 128 (qs) + 16 (bsums) = 148 bytes, 4-byte aligned.
const _: () = assert!(std::mem::size_of::<BlockQ8K128>() == 148);
const _: () = assert!(std::mem::align_of::<BlockQ8K128>() == 4);

Why this matters: the layout assertions make raw SafeTensors bytes auditable before inference can touch them.

The size is 148 bytes: 4 bytes for d, 128 bytes for qs, and 16 bytes for eight i16 sums. Two Q8K128 blocks represent the same 256 weights that one Q8K block represents, so the storage cost increases slightly:

Format | Weights per scale | Bytes per 256 weights | Expected effect
Q8K | 256 | 292 | Compact and already fast in the baseline path.
Q8K128 | 128 | 296 | Two scales per 256 weights, slightly larger file, potentially lower local error.
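The byte counts in the table follow directly from the block layouts (my own arithmetic, assuming the baseline Q8K block stores one f32 scale, 256 i8 values, and 16 i16 partial sums, which matches the 292 bytes quoted above):

Sketch - per-256-weight storage arithmetic
// Back-of-the-envelope check of the per-256-weight storage cost in the table.
fn main() {
    let q8k = 4 + 256 + 2 * 16;          // one f32 scale + 256 i8 + 16 i16 bsums
    let q8k128 = 2 * (4 + 128 + 2 * 8);  // two 148-byte BlockQ8K128 blocks
    let overhead = 100.0 * (q8k128 - q8k) as f64 / q8k as f64;
    println!("Q8K: {q8k} B, Q8K128: {q8k128} B, overhead {overhead:.2}%");
}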

Quantize and dequantize

The quantization function is intentionally close to baseline Q8K: find absolute max, divide by 127, round into signed 8-bit integers, and precompute small group sums. That made the experiment easy to audit and easy to compare against the existing Q8K path.

quant_q8k_128.rs - row block encoding
pub fn from_float_row(src: &[f32]) -> Self {
    debug_assert_eq!(src.len(), Self::QK);
    let amax = src.iter().fold(0f32, |m, &v| m.max(v.abs()));
    let d = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let inv_d = 1.0 / d;

    let mut qs = [0i8; QK_Q8K_128];
    for (q, &w) in qs.iter_mut().zip(src.iter()) {
        *q = (w * inv_d).round().clamp(-127.0, 127.0) as i8;
    }

    let mut bsums = [0i16; QK_Q8K_128 / Self::BSUM_GROUP];
    for (idx, chunk) in qs.chunks_exact(Self::BSUM_GROUP).enumerate() {
        let sum: i32 = chunk.iter().map(|&q| q as i32).sum();
        bsums[idx] = sum as i16;
    }

    Self { d, qs, bsums }
}

Why this matters: the variant changes scale locality, not the basic Q8 arithmetic, so the experiment stays easy to compare.
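The reverse direction is what the histogram tool relies on. The repository's exact dequantize routine is not shown here, but with this layout it reduces to multiplying each stored i8 by the block scale; a minimal sketch with a hypothetical method name:

Sketch (not repository code) - block dequantization
impl BlockQ8K128 {
    /// Hypothetical inverse of from_float_row: every stored i8 value is scaled
    /// back by the single block scale. The real repository routine may differ.
    pub fn to_float_row(&self, dst: &mut [f32]) {
        debug_assert_eq!(dst.len(), QK_Q8K_128);
        for (out, &q) in dst.iter_mut().zip(self.qs.iter()) {
            *out = q as f32 * self.d;
        }
    }
}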

5. Strict File Format and Packing

Quantized inference has a failure mode that normal Rust code usually does not: a wrong binary layout can still deserialize and run, but produce subtly wrong logits. For this reason the writer, packer, and loader were made strict. Wrong magic, wrong dtype, wrong byte length, impossible shape, or mismatched block count should fail loudly.

  • Versioned header: Q8KHeader carries magic, version, dimensions, blocks-per-row, and dtype.
  • Compile-time layout assertions: the Q8K128 block must remain exactly 148 bytes and 4-byte aligned.
  • Single source per tensor: the packer rejects stale mixed-policy directories with multiple quantized files for one tensor.
  • Per-layer logging: inference logs whether each loaded tensor is Q8K128, Q8K, Q6K, Q4K, or preserved.

quantize_q8k128.rs - versioned writer header
let hdr = Q8KHeader {
    magic: MAGIC_Q8K_128,                    // file magic, re-checked on load
    version: HEADER_VERSION,                 // header schema version
    out: rows as u32,                        // output rows of the weight matrix
    k: k as u32,                             // input features per row
    blocks_per_row: (k / QK_Q8K_128) as u32, // 128-weight blocks per row
    dtype: DTYPE_Q8K_128,                    // marks the tensor as Q8K128
};

Why this matters: strict metadata turns silent binary drift into a load-time failure.
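The read side can mirror every field the writer sets. A sketch of the kind of checks involved (hypothetical function; the repository loader may structure this differently):

Sketch (not repository code) - read-side header validation
// Every field the writer sets is re-checked before any block bytes are
// interpreted; any mismatch becomes a load-time error.
fn check_header(hdr: &Q8KHeader) -> Result<(), String> {
    if hdr.magic != MAGIC_Q8K_128 {
        return Err("bad magic: not a Q8K128 file".into());
    }
    if hdr.version != HEADER_VERSION {
        return Err(format!("unsupported header version {}", hdr.version));
    }
    if hdr.dtype != DTYPE_Q8K_128 {
        return Err("dtype field does not mark Q8K128".into());
    }
    if hdr.k as usize % QK_Q8K_128 != 0 {
        return Err("k is not a multiple of the 128-weight block size".into());
    }
    if hdr.blocks_per_row as usize != hdr.k as usize / QK_Q8K_128 {
        return Err("blocks_per_row does not match k".into());
    }
    Ok(())
}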

The packer writes two SafeTensors entries for a Q8K128 tensor: name.q8k128 contains raw block bytes, and name.q8k128_meta stores [out, k]. Q8K128 is checked before Q8K during packing and loading, so a selected tensor gets the experimental format while untouched tensors continue to use the baseline.

Why this strictness matters

A block count error in a quantized model may not crash immediately. It can shift bytes, change the scale for later weights, and only appear as unexplained PPL drift. The safest implementation is one where malformed files cannot reach inference.

6. Loader and Scalar Matmul

The inference side extends the existing QuantBlocks enum with a feature-gated Q8K128 variant. The loader checks metadata dtype and raw byte length before casting bytes into BlockQ8K128. Once loaded, the existing linear layer dispatch can call the Q8K128 scalar matmul for those tensors only.
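A minimal sketch of that length-then-cast step, assuming the [out, k] metadata described earlier (hypothetical helper; the repository loader also checks the dtype marker and logs the format per tensor):

Sketch (not repository code) - byte-length check before casting blocks
// Verify that the raw SafeTensors payload holds exactly the expected number
// of 148-byte blocks before reinterpreting it; the cast fails cleanly if the
// buffer is misaligned or has a partial block.
fn cast_q8k128_blocks(raw: &[u8], out: usize, k: usize) -> Result<&[BlockQ8K128], String> {
    if k % QK_Q8K_128 != 0 {
        return Err(format!("k={k} is not a multiple of {QK_Q8K_128}"));
    }
    let expected_blocks = out * (k / QK_Q8K_128);
    let expected_bytes = expected_blocks * std::mem::size_of::<BlockQ8K128>();
    if raw.len() != expected_bytes {
        return Err(format!("expected {expected_bytes} bytes, got {}", raw.len()));
    }
    bytemuck::try_cast_slice(raw).map_err(|e| format!("cast failed: {e:?}"))
}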

quant_linear.rs - feature-gated matmul dispatch
match &self.blocks {
    QuantBlocks::Q8K(blocks) => matmul_q8k(...),
    QuantBlocks::Q4K(blocks) => matmul_q4k(...),
    #[cfg(feature = "experimental-q8k128")]
    QuantBlocks::Q8K128(blocks) => {
        quant_q8k_128::matmul_scalar((batch, k, out), x, blocks, y)?
    }
}

Why this matters: the experimental path stays isolated from the stable mixed Q8K/Q4K route.

This scalar matmul is intentionally boring: for each output row, iterate over 128-value blocks, accumulate q * x, then multiply by the block scale. That makes the result useful for PPL validation, but it is not competitive with optimized Q8K inference.
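A minimal sketch of that inner loop (hypothetical helper; quant_q8k_128::matmul_scalar in the repository also handles batching, shape checks, and the precomputed bsums):

Sketch (not repository code) - scalar per-row dot product
// One output element is the dot product of a weight row's 128-value blocks
// with the activation vector; accumulate q * x per block, then apply the
// block scale once.
fn dot_row_q8k128(blocks: &[BlockQ8K128], x: &[f32]) -> f32 {
    debug_assert_eq!(blocks.len() * QK_Q8K_128, x.len());
    let mut acc = 0.0f32;
    for (block, xs) in blocks.iter().zip(x.chunks_exact(QK_Q8K_128)) {
        let mut block_acc = 0.0f32;
        for (&q, &xv) in block.qs.iter().zip(xs.iter()) {
            block_acc += q as f32 * xv;
        }
        acc += block.d * block_acc;
    }
    acc
}

The full matmul repeats this dot product for every output row and batch element, which is why the unoptimized all-qkv path is so slow.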

A useful implementation boundary

The scalar path is not a final kernel. It is a measurement tool. If PPL had improved, the next step would be a dedicated SIMD kernel. Since PPL did not improve, the scalar implementation did its job and prevented unnecessary low-level work.

7. Histogram Gate

Before spending hours on PPL runs, the experiment used an offline histogram tool. It reads original SafeTensors weights, quantizes tensors in memory, dequantizes them back to F32, and writes per-layer reconstruction metrics: RMSE, MAE, max absolute error, relative error, zero-collapse counts, and spiky-block counts.
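A sketch of the core per-tensor error metrics that gate computes (hypothetical helper; the repository tool also reports relative error, zero-collapse counts, and spiky blocks per layer):

Sketch (not repository code) - reconstruction error metrics
// Compare an original weight slice against its quantize-then-dequantize
// reconstruction and return (RMSE, MAE, max abs error).
fn reconstruction_metrics(original: &[f32], reconstructed: &[f32]) -> (f64, f64, f64) {
    assert_eq!(original.len(), reconstructed.len());
    let mut se = 0.0f64;
    let mut ae = 0.0f64;
    let mut max_err = 0.0f64;
    for (&a, &b) in original.iter().zip(reconstructed.iter()) {
        let e = (a as f64 - b as f64).abs();
        se += e * e;
        ae += e;
        max_err = max_err.max(e);
    }
    let n = original.len() as f64;
    ((se / n).sqrt(), ae / n, max_err)
}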

The strongest signal appeared in qkv projections. Comparing Q8K128 against Q8K across all qkv layers showed a broad local improvement:

Metric | Q8K128 | Q8K | Change
Weighted qkv RMSE | 2.763032e-04 | 3.033601e-04 | 8.9191% lower
Layers improved | 32 | 0 | All qkv layers
Layer 0 qkv RMSE | 2.846937e-04 | 3.476770e-04 | 18.1154% lower
Histogram decision

This was enough to justify one all-qkv PPL run. The improvement was not isolated to layer 0, so testing CANDLE_Q8K128_POLICY=qkv was reasonable.

8. Perplexity Results

The PPL comparison used a recreated plain-text WikiText-2 test file with the same corpus for the current Q8K/Q4K baseline and Q8K128 runs. The file had 1,256,449 bytes and 337,885 tokens. The benchmark scored 30 chunks at context 2048, for 61,410 scored tokens.

Variant | Q8K128 scope | Mean NLL | PPL | Delta vs baseline | Throughput
Mixed Q8K/Q4K baseline | none | 1.983096 | 7.2652 | - | 16.9 tok/s
Q8K128 + Q8K/Q4K | layer 0 qkv only | 1.984034 | 7.2720 | +0.0068 | 14.2 tok/s
Q8K128 + Q8K/Q4K | all qkv projections | 1.983343 | 7.2670 | +0.0018 | 2.0 tok/s
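As a consistency check on the table (my own arithmetic, not benchmark output), the PPL column is the exponential of the mean per-token NLL:

Sketch - PPL from mean NLL
// Reproduce the PPL column from the mean NLL column: PPL = exp(mean NLL).
fn main() {
    for (name, nll) in [
        ("baseline", 1.983096f64),
        ("layer0-qkv Q8K128", 1.984034),
        ("all-qkv Q8K128", 1.983343),
    ] {
        println!("{name}: exp({nll}) = {:.4}", nll.exp());
    }
}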
Running perplexity — per scored chunk (WikiText-2, ctx 2048, 30 chunks)

Each point is the cumulative running PPL after scoring that chunk. The dashed grey line is the same-corpus Q8K/Q4K baseline (7.2652).

[Chart: running PPL per chunk for the Q8K128 experiments. X axis: chunk 1 to 30; Y axis: cumulative PPL, roughly 5.0 to 8.0. Series: baseline 7.2652 (dashed), layer0-qkv Q8K128 (PPL 7.2720, 14.2 tok/s), all-qkv Q8K128 (PPL 7.2670, 2.0 tok/s).]

The all-qkv result is close enough to baseline that it should be treated as neutral quality, not a win. More importantly, its perplexity never drops below the baseline, so there is no quality gain to trade against the slowdown. Since the scalar Q8K128 path was much slower, the practical decision is clear: do not optimize this variant until another experiment shows a real PPL gain.

Important reproducibility caveat

Earlier Q6K and permutation runs used an older /tmp/wt2.txt. The Q8K128 rows above should only be compared with the same-corpus Q8K/Q4K baseline shown in the table. The benchmark README in the repository records this explicitly.

What this proves

  • Q8K128 reduces local qkv reconstruction error across all measured qkv layers.
  • The histogram workflow is useful for deciding which PPL run is worth paying for.
  • The strict packer and loader checks make experimental binary formats safer to evaluate.

What it does not prove

  • It does not show a model-level quality win over the same-corpus Q8K/Q4K baseline.
  • It does not justify writing a dedicated SIMD kernel for this variant yet.
  • It does not mean smaller blocks everywhere are the next best quantization strategy.

9. Engineering Verdict

Q8K128 Variant A is a good example of a negative result that still improves the engineering process. The local metric behaved exactly as expected: smaller blocks reduce reconstruction error. But model-level perplexity did not improve. That means the current qkv reconstruction error was not the limiting factor for this mixed-quantized Phi-3 setup, or the improvement was too small and too local to affect next-token prediction.

Question | Answer
Did Q8K128 reduce reconstruction error? | Yes. Broad qkv RMSE improvement, including all 32 qkv layers.
Did it improve PPL? | No. Same-corpus PPL moved from 7.2652 to 7.2670 for all qkv.
Should SIMD be implemented for this variant? | No. The scalar path is slow, but quality does not justify kernel work.
Was the experiment worth doing? | Yes. It validated the histogram workflow and hardened the binary format pipeline.

What I would try next

The next useful direction is not simply "smaller Q8K blocks everywhere." I would first look for tensors where PPL is sensitive to the error, not just tensors where RMSE improves. Possible next experiments include activation-aware routing, selective preservation of a small number of outlier columns, or a format that handles outliers separately instead of only shrinking the block.

10. Reproduce the Run

The implementation lives in the phi3_standalone directory. Build with the experimental feature:

Reproduce - build experimental binaries
cd phi3_standalone
cargo build --release --features experimental-q8k128 \
  --bin quantize_q8k128 --bin pack_q8k_safetensors --bin perplexity

Quantize all qkv projections into Q8K128 while keeping the rest of the mixed Q8K/Q4K policy:

Reproduce - quantize and pack all qkv projections
SNAP=~/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/<rev>

CANDLE_Q8K128_POLICY=qkv \
  ./target/release/quantize_q8k128 "$SNAP/model-00001-of-00002.safetensors" ./quantized-q8k128-qkv
CANDLE_Q8K128_POLICY=qkv \
  ./target/release/quantize_q8k128 "$SNAP/model-00002-of-00002.safetensors" ./quantized-q8k128-qkv

./target/release/pack_q8k_safetensors \
  "$SNAP/model-00001-of-00002.safetensors" ./quantized-q8k128-qkv ./packed-shard1-q8k128-qkv.safetensors
./target/release/pack_q8k_safetensors \
  "$SNAP/model-00002-of-00002.safetensors" ./quantized-q8k128-qkv ./packed-shard2-q8k128-qkv.safetensors

Then run the PPL benchmark against the same corpus:

Reproduce - same-corpus PPL run
PHI3_PPL_CTX=2048 PHI3_PPL_MAX_CHUNKS=30 \
  ./target/release/perplexity \
    ./packed-shard1-q8k128-qkv.safetensors \
    ./packed-shard2-q8k128-qkv.safetensors \
    /tmp/wt2.txt \
  | tee ppl_results/wikitext2_q8k128-qkv_ctx2048_chunks30_$(date +%Y%m%d-%H%M%S).log

Source and Results

The repository contains the Rust implementation, the strict packer/loader logic, the histogram tooling, and the benchmark logs used for this article.

Open-source foundations

Thanks to the quantization ecosystem

This experiment was built on top of public engineering work from the Rust ML, GGML, and quantized-inference communities. Their repositories made it possible to focus on a careful Q8K128 test instead of rebuilding the whole stack.

Hugging Face Candle (Rust ML)

Thanks to the Candle team for the Rust tensor runtime and quantized building blocks that made this Phi-3 CPU inference pipeline practical to extend and inspect.

llama.cpp / GGML (quant formats)

Thanks to the llama.cpp and GGML teams for the block-quantization ideas and practical CPU inference work that shaped the Q8K/Q4K baseline for this project.
