📋 Table of Contents
- 1. Project Overview
- 2. Why Mix Two Quantization Formats?
- 3. Block Format Anatomy — Q8K vs Q4K
- 4. Layer Assignment Policy
- 5. The Three-Stage Pipeline
- 6. Stage 1 — Quantization
- 7. Stage 2 — Packing
- 8. Stage 3 — Inference & On-the-Fly Dequantization
- 9. Phi-3 Transformer Architecture
- 10. KV Cache & Rotary Position Embeddings
- 11. Conversational Loop & Sampling
- 12. Memory & Performance Numbers
- 13. Deployment & Usage
1. Project Overview
This project implements a complete end-to-end pipeline that compresses Microsoft Phi-3 Mini 4K-Instruct from its original BF16 representation into a mixed Q8K / Q4K quantized format, then runs it as a streaming conversational chatbot — entirely on CPU, with no GPU required.
Unlike uniform quantization (applying the same bit-width everywhere), this implementation assigns different precisions to different layers based on their sensitivity to quantization noise. The result: near-lossless quality at roughly half the original model size.
The current Rust implementation, benchmark logs, and follow-up experiments live in the
nibble GitHub repository,
primarily under phi3_standalone. That repository now contains the mixed Q8K/Q4K
pipeline, strict packer/loader code, perplexity tooling, and the later Q6K/Q8K128 experiments.
By mixing Q8K (3.5× compression, <0.01% error) for attention layers with Q4K (7.1× compression) for the noise-tolerant MLP down-projection, the pipeline achieves aggressive compression on layers that can handle it while preserving full fidelity where it matters most.
Core Components
- Mixed Quantization Engine: layer-aware Q8K/Q4K routing with optional block-wise column permutation for improved block scale utilization
- SafeTensors Packer: merges quantized blocks, F32 norms, and embeddings into a single portable file with auto-detection metadata
- Phi-3 Transformer: full 32-layer attention + MLP pipeline with RoPE, GQA, KV caching, and RMS normalization
- Conversational Chat: multi-turn streaming chat with sliding-window history, adaptive sampling, and Phi-3 chat template
2. Why Mix Two Quantization Formats?
Not all layers in a transformer contribute equally to output quality. Running everything at the same bit-width is either wasteful (8-bit everywhere when some layers could handle 4-bit) or destructive (4-bit everywhere when some layers cannot tolerate the noise).
Q8K for Attention & Gate/Up Projections
The query/key/value projections (qkv_proj), output projection
(o_proj), and the gated MLP projection (gate_up_proj)
operate close to the residual stream and directly shape the
attention scores and gated activations. Quantization errors in these layers
compound across all 32 blocks. Q8K provides ~3.5× compression with
RMSE typically below 1×10⁻⁴, making it the safe default.
Q4K for the MLP Down-Projection
The down-projection (down_proj) takes the post-activation,
post-gating MLP output and maps it back to the residual dimension.
By the time data reaches down_proj, the SiLU gating has already
killed most activations (many values are near zero), so the effective
signal-to-noise ratio is inherently lower. This makes down_proj
naturally more tolerant of quantization noise. Q4K achieves ~7.1×
compression here with acceptable RMSE (typically <1×10⁻³).
F32 for Norms & Embeddings
Layer norms contain only hidden_size = 3072 floats each (12 KB per layer),
so compressing them saves essentially nothing while risking numerical instability.
Embeddings are sparse look-up tables — block-level quantization would degrade
individual token representations unnecessarily.
| Format | Block Size (values) | Bytes / 256 Values | Compression | Bits / Weight |
|---|---|---|---|---|
| F32 (original) | — | 1024 | 1× | 32.0 |
| Q8K | 256 | 292 | 3.51× | 9.13 |
| Q4K | 256 | 144 | 7.11× | 4.50 |
3. Block Format Anatomy — Q8K vs Q4K
Both formats operate on blocks of 256 values
(QK_K = 256 in candle's GGML implementation). Per-block scaling
dramatically reduces error compared to per-tensor scaling, because the scale
adapts to the local magnitude range of each 256-value chunk.
Q8K Block (292 bytes → 256 weights)
Encoding algorithm:
- Find amax = max(|value|) across the 256 values in the block.
- Compute the scale: d = amax / 127.0.
- Each float w is stored as round(w / d), clamped to [-127, 127].
- 16 bsums entries pre-compute partial dot-product sums for the GGML matmul kernel.
Decoding: w_original ≈ qs[i] × d
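To make the round trip concrete, here is a minimal standalone sketch of the Q8K scheme for one 256-value block. This is plain Rust for illustration, not the candle/GGML kernel; the real BlockQ8K struct also stores the 16 bsums entries (each the sum of 16 consecutive quantized values) inside its 292 bytes.

// Minimal sketch of Q8K for one 256-value block (illustrative, not the candle/GGML kernel).
const QK_K: usize = 256;

fn q8k_quantize_block(block: &[f32; QK_K]) -> (f32, [i8; QK_K]) {
    // Scale is set by the largest magnitude in the block
    let amax = block.iter().fold(0f32, |m, &v| m.max(v.abs()));
    let d = amax / 127.0;
    let mut qs = [0i8; QK_K];
    if d > 0.0 {
        for (q, &w) in qs.iter_mut().zip(block.iter()) {
            *q = (w / d).round().clamp(-127.0, 127.0) as i8;
        }
    }
    (d, qs)
}

fn q8k_dequantize_block(d: f32, qs: &[i8; QK_K]) -> [f32; QK_K] {
    let mut out = [0f32; QK_K];
    for (o, &q) in out.iter_mut().zip(qs.iter()) {
        *o = q as f32 * d; // w_original ≈ qs[i] × d
    }
    out
}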
Q4K Block (144 bytes → 256 weights)
Encoding algorithm:
- Split 256 values into 8 sub-blocks of 32.
- Each sub-block gets its own (scale, min) quantization pair.
- Each weight is stored as a 4-bit unsigned integer (0–15): qs[i] = clamp(round((w - sub_min) / sub_scale), 0, 15).
- Two 4-bit values are packed into each byte of qs.
Decoding: w_original ≈ qs[i] × sub_scale + sub_min
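The Q4K idea, sketched the same way for a single 32-value sub-block. This assumes a plain (scale, min) affine code and my own nibble-packing order; the actual GGML BlockQ4K additionally compresses the eight scale/min pairs into 6-bit fields and lays out the nibbles differently.

// Simplified Q4K-style encode/decode for one 32-value sub-block (illustrative only).
fn q4_encode_subblock(sub: &[f32; 32]) -> (f32, f32, [u8; 16]) {
    let min = sub.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = sub.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 15.0;
    let mut packed = [0u8; 16];
    for (i, &w) in sub.iter().enumerate() {
        let q = if scale > 0.0 {
            ((w - min) / scale).round().clamp(0.0, 15.0) as u8
        } else {
            0
        };
        // Two 4-bit values per byte: even index → low nibble, odd index → high nibble
        packed[i / 2] |= q << ((i % 2) * 4);
    }
    (scale, min, packed)
}

fn q4_decode(scale: f32, min: f32, packed: &[u8; 16], i: usize) -> f32 {
    let q = (packed[i / 2] >> ((i % 2) * 4)) & 0x0F;
    q as f32 * scale + min // w_original ≈ qs[i] × sub_scale + sub_min
}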
A single weight row in qkv_proj is 3072 values wide. If you used one
scale for the entire row, an outlier at position 500 would determine the scale for
all 3072 values, wasting precision on the majority. With QK_K = 256,
you get 12 independent scales per row — each tuned to its local value range.
4. Layer Assignment Policy
For Phi-3 Mini (32 transformer layers, hidden_size=3072,
intermediate_size=8192):
| Tensor Pattern | Count | Format | Rationale |
|---|---|---|---|
| *.self_attn.qkv_proj.weight | 32 | Q8K | Fused QKV — central to attention quality |
| *.self_attn.o_proj.weight | 32 | Q8K | Attention output → residual stream |
| *.mlp.gate_up_proj.weight | 32 | Q8K | Fused gate+up, pre-activation signal |
| *.mlp.down_proj.weight | 32 | Q4K | Post-gating; naturally noise-tolerant |
| lm_head.weight | 1 | Q8K | Logits — errors affect token selection |
| model.embed_tokens.weight | 1 | F32 | Sparse look-up; no block benefit |
| *.layernorm.weight, model.norm | 65 | F32 | Tiny (12 KB each); high sensitivity |
Totals: 97 Q8K tensors + 32 Q4K tensors + ~66 F32 tensors
// Returns true if this weight should be quantized at all
fn is_target_weight(name: &str) -> bool {
name.ends_with(".weight")
&& !name.contains("embed_tokens")
&& !name.contains("norm")
}
// Returns true if this layer goes into Q4K (all others → Q8K)
fn is_q4k_layer(name: &str) -> bool {
name.contains("mlp.down_proj")
}
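Combining the two predicates gives the full three-way policy. The enum and helper below are illustrative only (they are not names from the repository):

// Illustrative: how the two predicates above compose into the final routing.
#[derive(Debug, PartialEq)]
enum WeightFormat { Q8K, Q4K, F32 }

fn format_for(name: &str) -> WeightFormat {
    if !is_target_weight(name) {
        WeightFormat::F32        // norms and embeddings stay in F32
    } else if is_q4k_layer(name) {
        WeightFormat::Q4K        // mlp.down_proj only
    } else {
        WeightFormat::Q8K        // every other quantized weight
    }
}

// e.g. format_for("model.layers.0.mlp.down_proj.weight")      == WeightFormat::Q4K
//      format_for("model.layers.0.self_attn.qkv_proj.weight") == WeightFormat::Q8K
//      format_for("model.embed_tokens.weight")                == WeightFormat::F32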
5. The Three-Stage Pipeline
End-to-End Quantization & Inference Pipeline
BF16 SafeTensors (~7.6 GB) → quantize_q8k.rs → Q8K + Q4K block files → pack_q8k_safetensors.rs → single packed .safetensors → loader.rs → model.rs → streaming chat
6. Stage 1 — Quantization
The quantizer reads the original HuggingFace model, converts weights to F32, routes each layer to Q8K or Q4K, and writes individual binary files with validation metrics.
fn quantize_rows_q8k(rows: usize, k: usize, data: &[f32]) -> Result<Vec<BlockQ8K>> {
let blocks_per_row = k / QK_K; // e.g. 3072/256 = 12 blocks per row
let mut blocks = vec![BlockQ8K::zeros(); rows * blocks_per_row];
for r in 0..rows {
let row = &data[r * k..(r + 1) * k]; // one output row
let dst = &mut blocks[r * blocks_per_row..(r + 1) * blocks_per_row]; // destination blocks for this row
BlockQ8K::from_float(row, dst)?; // GGML kernel: F32 → Q8K
}
Ok(blocks)
}
BlockQ8K::from_float is GGML's own kernel (via candle's k_quants
module). For each 256-value block it scans for amax, sets
d = amax / 127.0, rounds each value to i8, and computes the
16 bsums sub-group sums needed by the optimized matmul kernel.
fn quantize_rows_q4k(rows: usize, k: usize, data: &[f32]) -> Result<Vec<BlockQ4K>> {
let blocks_per_row = k / QK_K;
let mut blocks = vec![BlockQ4K::zeros(); rows * blocks_per_row];
for r in 0..rows {
let row = &data[r * k..(r + 1) * k];
let dst = &mut blocks[r * blocks_per_row..(r + 1) * blocks_per_row];
BlockQ4K::from_float(row, dst)?; // GGML kernel: F32 → Q4K
}
Ok(blocks)
}
Optional Column Permutation
An optional preprocessing step reorders the columns of a weight matrix before quantizing, placing the highest-magnitude columns at the front of each 256-value block. This improves quantization quality because the block scale is determined by the largest absolute value — when large and small values share a block, the small values lose precision.
fn build_block_wise_permutation(rows: usize, k: usize, data: &[f32]) -> Vec<usize> {
// Step 1: Compute L2 norm of each column across all rows
let mut col_norms = vec![0f64; k];
for r in 0..rows {
for c in 0..k {
let val = data[r * k + c] as f64;
col_norms[c] += val * val;
}
}
for norm in &mut col_norms { *norm = norm.sqrt(); }
// Step 2: Divide into blocks of 64 columns
let block_size = 64;
let mut global_perm = vec![0usize; k];
// Step 3: Sort within each block by descending L2 norm
for block_start in (0..k).step_by(block_size) {
let block_end = (block_start + block_size).min(k);
let mut indices: Vec<usize> = (block_start..block_end).collect();
indices.sort_by(|&a, &b|
col_norms[b].partial_cmp(&col_norms[a]).unwrap()
);
for (dst, &src) in indices.iter().enumerate() {
global_perm[block_start + dst] = src;
}
}
global_perm // perm[dst] = src
}
Columns never cross 64-column block boundaries. This preserves cache locality during
the matmul and limits the overhead of the per-inference permutation pass. For attention
layers, Q/K/V share the same permutation (computed from q_proj) to keep
attention scores consistent.
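Because the weights were quantized in permuted column order, each activation row has to be gathered into the same order before the matmul. Conceptually it is a single gather; the runtime does this inside apply_permutation_to_input, and the helper below is only a sketch of that step:

// Reorder one activation row so column dst of the permuted weight matrix meets
// the activation value it was quantized against: out[dst] = x[perm[dst]].
fn permute_activation_row(x: &[f32], perm: &[usize]) -> Vec<f32> {
    perm.iter().map(|&src| x[src]).collect()
}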
Validation Pipeline
After quantizing each layer, three validation metrics are computed by dequantizing
back to F32 via BlockQ8K::to_float (or BlockQ4K::to_float):
fn compute_quantization_error_detailed(
original: &[f32],
blocks: &[BlockQ8K],
rows: usize,
k: usize,
) -> Result<(f64, f64, f64)> {
let mut dequantized = vec![0f32; rows * k];
BlockQ8K::to_float(blocks, &mut dequantized)?;
let mut l2_error = 0f64;
let mut max_error = 0f64;
let mut relative_error_sum = 0f64;
for (&orig, &deq) in original.iter().zip(dequantized.iter()) {
let err = (orig - deq).abs() as f64;
l2_error += err * err;
if err > max_error { max_error = err; }
if orig.abs() > 1e-10 {
relative_error_sum += err / orig.abs() as f64;
}
}
let rmse = (l2_error / (rows * k) as f64).sqrt();
let mean_relative_error = relative_error_sum / (rows * k) as f64;
Ok((rmse, max_error, mean_relative_error))
}
7. Stage 2 — Packing
The packer merges all quantized block files, metadata, and unquantized F32 tensors
into a single .safetensors file. This simplifies deployment — the
inference binary needs only one (or two, for sharded models) file.
Q8K Packing Layout
Q4K Packing Layout
The header is stripped for Q4K because the SafeTensors format already encodes byte
length via its own shape metadata. The name.q4k_meta tensor provides
out and k so the loader can reconstruct the block layout.
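Given out and k, the loader can recover the expected block count and byte length directly, because every Q4K block covers 256 weights and occupies 144 bytes. A small sketch of that arithmetic (the function name is illustrative):

// Expected Q4K layout for a weight of shape [out, k]; each block covers QK_K = 256
// values and occupies 144 bytes.
fn q4k_layout(out: usize, k: usize) -> (usize, usize) {
    assert!(k % 256 == 0, "k must be a multiple of 256");
    let n_blocks = out * (k / 256); // blocks per row × number of rows
    let n_bytes = n_blocks * 144;   // raw byte length of the packed block data
    (n_blocks, n_bytes)
}

// down_proj example: out = 3072, k = 8192 → 98,304 blocks, 14,155,776 bytes (~13.5 MiB per layer)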
Auto-Detection at Load Time
let load_q = |name: &str| -> candle::Result<QuantLinear> {
let q4k_meta = format!("{}.q4k_meta", name);
// Try Q4K first (down_proj layers)
if st.names().iter().any(|n| **n == q4k_meta) {
return QuantLinear::load_q4k_from_packed_safetensors(&st, name);
}
// Fall back to Q8K
QuantLinear::load_from_packed_safetensors(&st, name)
};
The decision is made per-layer by probing for the presence of .q4k_meta.
Multi-shard loading searches both shard files in sequence until the tensor is found.
8. Stage 3 — Inference & On-the-Fly Dequantization
The QuantLinear struct is the central runtime data structure. One instance
exists for every quantized weight matrix. It dispatches to the correct GGML matmul
kernel based on whether the layer is Q8K or Q4K.
pub enum QuantBlocks {
Q8K(Vec<BlockQ8K>),
Q4K(Vec<BlockQ4K>),
}
pub struct QuantLinear {
pub(crate) blocks: QuantBlocks, // Q8K or Q4K block data
pub(crate) out: usize, // number of output features
pub(crate) k: usize, // number of input features
pub(crate) name: String, // tensor name (for diagnostics)
pub(crate) perm: Option<Vec<usize>>, // column permutation (or None)
}
The Forward Pass
fn forward_2d(&self, x2d: &Tensor, b: usize, k_in: usize) -> candle::Result<Tensor> {
let x_f32 = x2d.to_dtype(DType::F32)?.contiguous()?;
let mut x_vec = x_f32.flatten_all()?.to_vec1::<f32>()?;
// Apply column permutation if this layer was permutation-quantized
if let Some(ref perm) = self.perm {
x_vec = self.apply_permutation_to_input(&x_vec, b, perm);
}
let mut out_buf = vec![0f32; b * self.out];
match &self.blocks {
QuantBlocks::Q8K(blocks) => {
matmul::<BlockQ8K>((b, self.k, self.out), &x_vec, blocks, &mut out_buf)?
}
QuantBlocks::Q4K(blocks) => {
matmul::<BlockQ4K>((b, self.k, self.out), &x_vec, blocks, &mut out_buf)?
}
}
Tensor::from_vec(out_buf, (b, self.out), x2d.device())
}
The GGML matmul kernel processes one block at a time. For each block,
it reads the scale d and the 256 quantized values, computes
qs[i] × d as a temporary register value, and immediately
multiplies it into the accumulator. The dequantized value never touches
RAM as a full F32 matrix. Peak overhead per token: <24 KB across all
layers.
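A scalar sketch of that per-block accumulation for the Q8K case. The real kernel is vectorized and also quantizes the activations before the dot product; this version only shows why no dequantized F32 weight matrix is ever materialized:

// One output element = Σ over blocks of d * Σ qs[i] * x[i]. Weights are decoded in
// registers, block by block, and folded straight into the accumulator.
fn q8k_dot(blocks: &[(f32, [i8; 256])], x: &[f32]) -> f32 {
    let mut acc = 0f32;
    for (block_idx, (d, qs)) in blocks.iter().enumerate() {
        let xs = &x[block_idx * 256..(block_idx + 1) * 256];
        let mut block_sum = 0f32;
        for (&q, &xv) in qs.iter().zip(xs.iter()) {
            block_sum += q as f32 * xv;
        }
        acc += d * block_sum;
    }
    acc
}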
9. Phi-3 Transformer Architecture
The model implements the full Phi-3 Mini 4K-Instruct transformer: 32 identical blocks, each containing causal self-attention with RoPE and a gated MLP with SiLU activation. All computation runs at F32 precision, with quantized weights decoded on-the-fly during each matrix multiplication.
pub struct Block {
pub rms_1: Tensor, // input layernorm weights (F32)
pub attn: CausalSelfAttention, // Q/K/V proj + O proj (Q8K)
pub rms_2: Tensor, // post-attention layernorm (F32)
pub mlp: Mlp, // gate_up (Q8K) + down (Q4K)
}
impl Block {
pub fn forward(&self, x: &Tensor, index_pos: usize,
block_idx: usize, cache: &mut Cache) -> candle::Result<Tensor> {
let residual = x;
let x = rms_norm(x, &self.rms_1, 1e-5)?;
let x = (self.attn.forward(&x, index_pos, block_idx, cache)? + residual)?;
let residual = &x;
let x = rms_norm(&x, &self.rms_2, 1e-5)?;
let x = (self.mlp.forward(&x)? + residual)?;
Ok(x)
}
}
Gated MLP with Mixed Precision
pub struct Mlp {
pub gate_up_proj: QuantLinear, // Q8K — fused gate + up projection
pub down_proj: QuantLinear, // Q4K — down projection
pub hidden_size: usize, // must equal intermediate_size (8192) so the narrows below split the 16384-wide fused output
}
impl Mlp {
pub fn forward(&self, x: &Tensor) -> candle::Result<Tensor> {
let gate_up = self.gate_up_proj.forward(x)?; // Q8K matmul
let gate = gate_up.narrow(D::Minus1, 0, self.hidden_size)?;
let up = gate_up.narrow(D::Minus1, self.hidden_size, self.hidden_size)?;
let x = (silu(&gate)? * up)?; // SiLU gating
self.down_proj.forward(&x) // Q4K matmul
}
}
The SiLU gating between gate_up_proj (Q8K) and down_proj
(Q4K) acts as a natural smoother — zeroing out most activations and reducing
the effective dynamic range of the input to down_proj. This is precisely
why Q4K works well here despite its lower precision.
Causal Self-Attention with RoPE
pub fn forward(&self, x: &Tensor, index_pos: usize,
block_idx: usize, cache: &mut Cache) -> candle::Result<Tensor> {
let (b_sz, seq_len, hidden_size) = x.dims3()?;
// Fused QKV projection (single Q8K matmul)
let qkv = self.qkv_proj.forward(x)?;
// Split into Q, K, V
let q = qkv.narrow(D::Minus1, 0, q_size)?;
let k = qkv.narrow(D::Minus1, q_size, kv_size)?;
let mut v = qkv.narrow(D::Minus1, q_size + kv_size, kv_size)?;
// Apply Rotary Position Embeddings
let q = self.apply_rotary_emb(&q, index_pos, cache)?;
let mut k = self.apply_rotary_emb(&k, index_pos, cache)?;
// Concatenate with the KV cache (k and v are mutable so cached entries can be prepended)
if cache.use_kv_cache {
if let Some((cache_k, cache_v)) = &cache.kvs[block_idx] {
k = Tensor::cat(&[cache_k, &k], 2)?.contiguous()?;
v = Tensor::cat(&[cache_v, &v], 2)?.contiguous()?;
}
cache.kvs[block_idx] = Some((k.clone(), v.clone()));
}
// Scaled dot-product attention with causal mask
let att = (q.matmul(&k.t()?)? / (head_dim as f64).sqrt())?;
// ... causal masking + softmax + V projection ...
self.o_proj.forward(&y) // Q8K output projection
}
10. KV Cache & Rotary Position Embeddings
The Cache struct serves three purposes: storing past Key/Value tensors,
pre-computing RoPE sin/cos tables, and generating causal attention masks.
#[derive(Debug, Clone)]
pub struct Cache {
pub masks: HashMap<(usize, usize), Tensor>, // cached causal masks
pub use_kv_cache: bool,
pub kvs: Vec<Option<(Tensor, Tensor)>>, // 32 slots: one (K,V) per layer
pub cos: Tensor, // [4096, 48] RoPE cosine table
pub sin: Tensor, // [4096, 48] RoPE sine table
pub device: Device,
}
RoPE Pre-computation
Rotary Position Embeddings encode position by rotating query/key vectors in 2D subspaces. The frequency for dimension pair i at position m is:
θ_i(m) = m × 10000^(−2i/d)
The cache pre-computes cos(θ) and sin(θ) for all
4096 positions and all 48 dimension pairs at startup:
let theta = 10000.0f32;
let inv_freq: Vec<f32> = (0..head_dim)
.step_by(2)
.map(|i| 1.0 / theta.powf(i as f32 / head_dim as f32))
.collect();
let inv_freq_len = inv_freq.len(); // 48 dimension pairs for head_dim = 96
let inv_freq = Tensor::from_vec(inv_freq, (inv_freq_len,), device)?;
let t = Tensor::arange(0u32, max_seq_len as u32, device)?
.to_dtype(DType::F32)?
.reshape((max_seq_len, 1))?;
let freqs = t.matmul(&inv_freq.reshape((1, inv_freq.elem_count()))?)?;
let cos = freqs.cos()?.to_dtype(dtype)?; // [4096, 48]
let sin = freqs.sin()?.to_dtype(dtype)?; // [4096, 48]
At inference time, the model slices the relevant rows via
cache.cos.narrow(0, index_pos, seq_len) — no trigonometry
is computed during generation.
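For intuition, here is a scalar sketch of the rotation itself, using the interleaved (2i, 2i+1) pairing convention. The model's apply_rotary_emb works on tensors and reads the pre-computed cos/sin rows instead of calling sin/cos per token, and its exact pairing layout may differ, but the per-pair math is the same:

// Rotate each 2D pair of a query/key head vector by the angle for its dimension pair
// and position. Illustrative scalar RoPE; the per-pair angle matches θ_i(m) = m × 10000^(−2i/d).
fn rope_rotate(head: &mut [f32], pos: usize, theta: f32) {
    let d = head.len(); // e.g. head_dim = 96 → 48 pairs
    for i in 0..d / 2 {
        let freq = 1.0 / theta.powf(2.0 * i as f32 / d as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (x0, x1) = (head[2 * i], head[2 * i + 1]);
        head[2 * i] = x0 * cos - x1 * sin;
        head[2 * i + 1] = x0 * sin + x1 * cos;
    }
}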
Why the Cache Matters for Performance
num_layers × 2 × num_kv_heads × head_dim × 4 bytes
= 32 × 2 × 32 × 96 × 4 = 786,432 bytes ≈ 0.75 MB / token
At full context (4096 tokens): ~3.1 GB. The sliding window trimmer in
conversation.rs caps history at 3072 tokens to prevent exhausting RAM.
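The same sizing arithmetic as a tiny helper, handy when choosing the history cap (the constants are the Phi-3 Mini values from the formula above):

// F32 KV cache cost: one K and one V tensor per layer, per cached token.
fn kv_cache_bytes(tokens: usize) -> usize {
    let (num_layers, num_kv_heads, head_dim) = (32, 32, 96);
    tokens * num_layers * 2 * num_kv_heads * head_dim * 4 // ≈ 0.75 MB per token
}

// kv_cache_bytes(4096) = 3,221,225,472 bytes at full context; the 3072-token history
// cap trims this by roughly a quarter.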
11. Conversational Loop & Sampling
The main loop reads user input from stdin, formats it with Phi-3's chat template
(<|user|>\n...\n<|end|>\n<|assistant|>\n), tokenizes,
runs autoregressive generation, and decodes the output token-by-token with streaming.
Adaptive Sampling Strategy
Per-turn sampling parameters are chosen based on heuristic keyword detection:
let is_code_question = user_input.contains("code")
|| user_input.contains("example")
|| user_input.contains("function")
|| user_input.contains("loop");
let (sampling, repeat_pen, rep_last_n) = if is_code_question {
// Low temperature → deterministic, precise code
(Sampling::TopKThenTopP { k: 25, p: 0.95, temperature: 0.35 }, 1.3, 128)
} else {
// Higher temperature → natural conversation
(Sampling::TopKThenTopP { k: 50, p: 0.9, temperature: 0.6 }, 1.35, 256)
};
| Parameter | Code Mode | Normal Mode | Effect |
|---|---|---|---|
| k (top-k) | 25 | 50 | Vocab narrowed to top K tokens |
| p (nucleus) | 0.95 | 0.90 | Cumulative probability cutoff |
| temperature | 0.35 | 0.60 | Lower = more deterministic |
| repeat_pen | 1.30 | 1.35 | Penalty on recently generated tokens |
First Turn vs. Subsequent Turns
A crucial optimization: on the first turn, the full prompt (system + user message)
is fed through all 32 layers. On subsequent turns, only the new user tokens are
processed — the KV cache already holds all previous key/value projections.
This is why global_position keeps growing across turns, as the sketch below illustrates.
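A minimal sketch of that bookkeeping, assuming a model-level forward(tokens, index_pos, cache) shaped like Block::forward above; the names here are illustrative, not the exact main.rs code:

// Illustrative: each turn only forwards the new tokens; the KV cache supplies
// everything before `global_position`.
fn run_turn(
    model: &Phi3,
    cache: &mut Cache,
    global_position: &mut usize,
    new_tokens: &Tensor, // first turn: the full prompt; later turns: only the new chat turn
) -> candle::Result<Tensor> {
    // index_pos = how many tokens the cache already holds; RoPE angles start there
    let logits = model.forward(new_tokens, *global_position, cache)?;
    *global_position += new_tokens.dim(1)?; // the cache grew by the tokens just fed in
    Ok(logits)
}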
Sliding Window History
pub fn apply_sliding_window(&mut self, tokenizer: &Tokenizer) -> candle::Result<()> {
let tokens = tokenizer.encode(self.format_prompt(tokenizer)?, false)?
.get_ids().len();
if tokens <= self.max_history_tokens { return Ok(()); }
// Keep system prompt, then add recent messages from newest to oldest
let mut kept = vec![self.messages[0].clone()];
let mut current_tokens = /* system prompt token count */;
for msg in self.messages.iter().skip(1).rev() {
let msg_tokens = tokenizer.encode(msg.content.clone(), false)?
.get_ids().len();
if current_tokens + msg_tokens > self.max_history_tokens { break; }
kept.insert(1, msg.clone());
current_tokens += msg_tokens;
}
self.messages = kept;
Ok(())
}
12. Memory & Performance Numbers
Model File Sizes
| Format | File Size | Compression | Notes |
|---|---|---|---|
| Original BF16 | ~7.6 GB | 1× | HuggingFace download |
| Mixed Q8K/Q4K (this project) | ~4.1 GB | 1.85× | Deployed as artemr87/phi3-mixed |
| Pure GGUF Q4_K_M | ~2.2 GB | 3.45× | Smaller but lower quality |
Per-Layer Weight Memory
| Layer | Shape | Format | Memory |
|---|---|---|---|
| qkv_proj × 32 | [9216, 3072] | Q8K | ~1.0 GB |
| o_proj × 32 | [3072, 3072] | Q8K | ~346 MB |
| gate_up_proj × 32 | [16384, 3072] | Q8K | ~1.8 GB |
| down_proj × 32 | [3072, 8192] | Q4K | ~388 MB |
| lm_head | [32000, 3072] | Q8K | 337 MB |
| embed_tokens | [32000, 3072] | F32 | 375 MB |
Runtime Memory Growth
Inference Speed (Observed CPU Run)
| Run | Speed | Notes |
|---|---|---|
| Short interactive prompt | 9.6 tok/s | Baseline Q8K/Q4K shards; prompt: What is the size of the Sun? |
| Longer conversations | varies | KV cache and prompt length grow, so per-response speed is workload-dependent |
13. Deployment & Usage
Build from Source
The source is maintained in
github.com/artem1984A/nibble.
The Phi-3 mixed-quantization binaries are in the phi3_standalone crate.
git clone https://github.com/artem1984A/nibble.git
cd nibble/phi3_standalone
cargo build --release \
--bin quantize_q8k \
--bin pack_q8k_safetensors \
--bin phi3-mixed-quant
SNAP=~/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/<rev>
# Standard quantization:
./target/release/quantize_q8k \
"$SNAP/model-00001-of-00002.safetensors" \
./quantized/
./target/release/quantize_q8k \
"$SNAP/model-00002-of-00002.safetensors" \
./quantized/
# Optional experimental column permutation:
cargo build --release --features experimental-perm --bin quantize_q8k
CANDLE_Q8K_PERMUTE=1 ./target/release/quantize_q8k \
"$SNAP/model-00001-of-00002.safetensors" \
./quantized-perm/
CANDLE_Q8K_PERMUTE=1 ./target/release/quantize_q8k \
"$SNAP/model-00002-of-00002.safetensors" \
./quantized-perm/
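# Pack each quantized shard (blocks + F32 norms and embeddings) into one SafeTensors file: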
./target/release/pack_q8k_safetensors \
"$SNAP/model-00001-of-00002.safetensors" \
./quantized/ \
./packed-shard1.safetensors
./target/release/pack_q8k_safetensors \
"$SNAP/model-00002-of-00002.safetensors" \
./quantized/ \
./packed-shard2.safetensors
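# Run the interactive chat against both packed shards: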
./target/release/phi3-mixed-quant \
./packed-shard1.safetensors \
./packed-shard2.safetensors
Docker Deployment
docker pull artemr87/phi3-mixed:1.0.1
docker run -it --rm \
-v /path/to/phi3-mixed-shard1.safetensors:/app/shard1.safetensors:ro \
-v /path/to/phi3-mixed-shard2.safetensors:/app/shard2.safetensors:ro \
-v /path/to/tokenizer.json:/app/tokenizer.json:ro \
artemr87/phi3-mixed:1.0.1
Chat Commands
| Input | Effect |
|---|---|
| (any text) | Send a message; model streams a response |
| reset | Clear conversation history and KV cache |
| exit | Quit the program |
After each response, the system prints diagnostics:
[Pos: 329 | Cache: 329 tok (246.8 MB) | Speed: 9.6 t/s | Hist: 3 msgs]
File Reference
| File | Role |
|---|---|
| quantize_q8k.rs | Stage 1: converts BF16 weights to Q8K/Q4K block files |
| pack_q8k_safetensors.rs | Stage 2: merges quantized files + F32 into one SafeTensors |
| types.rs | Shared types: Q8KHeader, Phi3Config, conversions |
| quant_linear.rs | Runtime QuantLinear: on-the-fly matmul dispatch |
| loader.rs | Builds the Phi3 model from packed SafeTensors |
| model.rs | Transformer: attention, RoPE, MLP, RMS norm, blocks |
| cache.rs | KV cache, RoPE tables, causal mask generation |
| conversation.rs | Sliding-window history, Phi-3 chat template |
| main.rs | Entry point: CLI, model load, chat loop, streaming |
🙏 Acknowledgments
This project builds on the work of the open-source community:
- Hugging Face Candle — Minimalist ML framework for Rust with optimized GGML quantization kernels
- Microsoft Phi-3 Team — Compact yet powerful 3.8B instruction-tuned language model
- llama.cpp — Inspiration for Q8K/Q4K block quantization schemes and GGML format
- SafeTensors — Secure, zero-copy tensor serialization format
- Rust Language — Memory-safe systems programming without garbage collection
📚 Explore More
Dive deeper into the implementation, inspect the Rust source and benchmark logs, or read the follow-up Q8K128 experiment.