TinyLlama Q8K Quantization Engine - Technical Deep Dive

📋 Table of Contents

1. Project Overview
2. System Architecture
3. Q8K Quantization Deep Dive
4. Advanced Permutation Strategies
5. Attention Mechanism & RoPE
6. KV Cache Implementation
7. Conversational AI Features
8. Performance Benchmarks
9. Production Deployment
10. Try It Yourself

1. Project Overview

This project implements a production-grade quantization engine for TinyLlama-1.1B-Chat, reducing model size from ~5GB (FP16) to ~1.3GB (Q8K) while maintaining <0.1% mean relative error. Built with Rust and the Candle ML framework from Hugging Face.

8.7

Tokens/Second (CPU)

Model Compression

<0.1%

Accuracy Loss

1.3GB

Final Model Size

🎯 Key Achievement

Successfully deployed to production via Docker with an Angular 19 chat interface, achieving real-time inference on consumer CPUs without GPU acceleration.

2. System Architecture

End-to-End Inference Pipeline

Original Model
TinyLlama FP16

→

Quantization
Q8K + Permutation

→

SafeTensors
Packed Format

→

Inference
Rust Runtime

→

Output
8.7 tok/s

Core Components

⚙️

Quantization Engine

Multi-strategy Q8K quantizer with SVD-importance, QR pivot, and block-wise permutation support

🧠

Attention Mechanism

Optimized causal self-attention with RoPE (Rotary Position Embeddings) and GQA (Grouped Query Attention)

💾

KV Cache System

Efficient key-value caching with sliding window context management (up to 2048 tokens)

🔄

Conversational AI

Context-aware chat with automatic history trimming and template-based prompt formatting

3. Q8K Quantization Deep Dive

Q8K (8-bit K-quantization) is a block-based quantization scheme that divides weight matrices into 256-element blocks (QK_K = 256), applying per-block scale factors to maintain numerical precision.

Why Q8K Over Other Methods?

Method	Size	Speed (tok/s)	Accuracy	Notes
Q8K (Ours)	1.3GB	8.7	<0.1% loss	Best balance
Q4_K_M	645MB	5.8	~2% loss	Smaller, slower
FP16	5.1GB	3.2	100%	Full precision
INT8 (naive)	1.3GB	4.5	~5% loss	No block scaling

Quantization Algorithm

quantize_q8k.rs

fn quantize_rows_q8k(rows: usize, k: usize, data: &[f32]) -> Result<Vec<BlockQ8K>> {
    if k % QK_K != 0 {
        bail!("inner dim {k} not multiple of {QK_K}");
    }
    
    let blocks_per_row = k / QK_K;
    let mut blocks = vec![BlockQ8K::zeros(); rows * blocks_per_row];
    
    // Quantize each row into blocks
    for r in 0..rows {
        let row = &data[r * k..(r + 1) * k];
        let dst = &mut blocks[r * blocks_per_row..(r + 1) * blocks_per_row];
        BlockQ8K::from_float(row, dst);  // Candle's optimized quantization
    }
    
    Ok(blocks)
}

Validation Pipeline

The quantizer implements a 3-tier validation system to ensure quality:

Validation Metrics

fn compute_quantization_error_detailed(
    original: &[f32],
    blocks: &[BlockQ8K],
    rows: usize,
    k: usize,
) -> Result<(f64, f64, f64)> {
    // Dequantize back to FP32
    let mut dequantized = vec![0f32; rows * k];
    BlockQ8K::to_float(blocks, &mut dequantized);
    
    let mut l2_error = 0f64;
    let mut max_error = 0f64;
    let mut relative_error_sum = 0f64;
    
    for (i, (&orig, &deq)) in original.iter().zip(dequantized.iter()).enumerate() {
        let abs_err = (orig - deq).abs() as f64;
        let sq_err = abs_err * abs_err;
        
        l2_error += sq_err;
        max_error = max_error.max(abs_err);
        
        if orig.abs() > 1e-10 {
            relative_error_sum += abs_err / orig.abs() as f64;
        }
    }
    
    let rmse = (l2_error / (rows * k) as f64).sqrt();
    let mean_relative_error = relative_error_sum / (rows * k) as f64;
    
    Ok((rmse, max_error, mean_relative_error))
}

⚠️ Quality Threshold

Layers with max_error > 1e-2 or relative_error > 0.01 trigger warnings. This typically indicates the need for advanced permutation strategies.

4. Advanced Permutation Strategies

Column permutation reorders weight matrix columns to cluster similar magnitudes, significantly improving quantization efficiency. We implement three strategies:

4.1 SVD-Importance Ranking

Inspired by Singular Value Decomposition, this method ranks columns by statistical variance (a proxy for singular values). High-variance columns are quantized first for optimal precision.

SVD-Based Permutation

fn svd_importance_permutation(rows: usize, k: usize, data: &[f32]) -> Result<Vec<usize>> {
    // Compute column statistics
    let mut col_means: Vec<f64> = vec![0.0; k];
    let mut col_vars: Vec<f64> = vec![0.0; k];
    
    // First pass: compute means
    for r in 0..rows {
        let row = &data[r * k..(r + 1) * k];
        for (j, &v) in row.iter().enumerate() {
            col_means[j] += v as f64;
        }
    }
    for mean in &mut col_means {
        *mean /= rows as f64;
    }
    
    // Second pass: compute variances
    for r in 0..rows {
        let row = &data[r * k..(r + 1) * k];
        for (j, &v) in row.iter().enumerate() {
            let diff = v as f64 - col_means[j];
            col_vars[j] += diff * diff;
        }
    }
    
    // Sort by descending variance (high variance = high importance)
    let mut idx: Vec<usize> = (0..k).collect();
    idx.sort_by(|&a, &b| {
        col_vars[b].partial_cmp(&col_vars[a]).unwrap_or(std::cmp::Ordering::Equal)
    });
    
    println!("    SVD-importance: sorted by column variance");
    Ok(idx)
}

4.2 QR Pivot Strategy

Implements a simplified Householder QR decomposition with column pivoting. This orthogonalizes columns, reducing redundancy and improving block-wise quantization accuracy.

💡 Adaptive QR Steps

To balance speed and quality, QR steps scale with matrix size:

k ≤ 64: Full QR (100%)
k ≤ 256: 87.5% QR
k ≤ 2048: 50% QR
k > 2048: 25% QR (fallback to L2 norm for remaining)

QR Pivot Implementation

fn qr_pivot_permutation(rows: usize, k: usize, data: &[f32]) -> Result<Vec<usize>> {
    let mut perm: Vec<usize> = (0..k).collect();
    
    // Adaptive QR steps
    let qr_steps = match k {
        k if k <= 64 => k,
        k if k <= 256 => (k * 7) / 8,
        k if k <= 512 => (k * 3) / 4,
        k if k <= 1024 => (k * 2) / 3,
        k if k <= 2048 => (k / 2),
        _ => (k / 4).max(256).min(512),
    };
    
    // Compute column norms for pivoting
    let mut col_norms: Vec<f64> = vec![0.0; k];
    for r in 0..rows {
        let row = &data[r * k..(r + 1) * k];
        for (j, &v) in row.iter().enumerate() {
            col_norms[j] += (v as f64) * (v as f64);
        }
    }
    for norm in &mut col_norms {
        *norm = norm.sqrt();
    }
    
    // QR column pivoting (Householder-inspired)
    for step in 0..qr_steps.min(rows).min(k) {
        let mut max_norm = col_norms[step];
        let mut max_idx = step;
        
        for j in (step + 1)..k {
            if col_norms[j] > max_norm {
                max_norm = col_norms[j];
                max_idx = j;
            }
        }
        
        if max_idx != step {
            perm.swap(step, max_idx);
            col_norms.swap(step, max_idx);
        }
        
        // Update remaining norms (simplified orthogonalization)
        if col_norms[step] > 1e-10 {
            for j in (step + 1)..k {
                col_norms[j] *= 0.99; // Decay factor
            }
        }
    }
    
    Ok(perm)
}

4.3 Block-Wise Permutation

For large matrices (k > 256), block-wise permutation divides columns into 64-element blocks and sorts within each block independently. This maintains cache locality while improving quantization.

Block-Wise Strategy

fn build_block_wise_permutation(rows: usize, k: usize, data: &[f32]) -> Vec<usize> {
    const BLOCK_SIZE: usize = 64; // QK_K/4 for locality
    let num_blocks = k / BLOCK_SIZE;
    
    if k % BLOCK_SIZE != 0 {
        return build_column_permutation(&column_l2_norms(rows, k, data));
    }
    
    let mut global_perm = vec![0usize; k];
    let col_norms = column_l2_norms(rows, k, data);
    
    for block_idx in 0..num_blocks {
        let block_start = block_idx * BLOCK_SIZE;
        let block_end = block_start + BLOCK_SIZE;
        
        let block_norms = &col_norms[block_start..block_end];
        let mut local_idx: Vec<usize> = (0..BLOCK_SIZE).collect();
        
        // Sort by descending norm within block
        local_idx.sort_by(|&a, &b| {
            block_norms[b].partial_cmp(&block_norms[a])
                .unwrap_or(std::cmp::Ordering::Equal)
        });
        
        // Map local to global indices
        for i in 0..BLOCK_SIZE {
            global_perm[block_start + i] = block_start + local_idx[i];
        }
    }
    
    global_perm
}

5. Attention Mechanism & RoPE

The model implements Grouped Query Attention (GQA) with Rotary Position Embeddings (RoPE), providing efficient long-context support.

5.1 Causal Self-Attention

llama-q8k.rs

struct CausalSelfAttention {
    q_proj: QuantLinear,  // Query projection (quantized)
    k_proj: QuantLinear,  // Key projection
    v_proj: QuantLinear,  // Value projection
    o_proj: QuantLinear,  // Output projection
    num_attention_heads: usize,      // 32 for TinyLlama
    num_key_value_heads: usize,      // 4 for GQA
    head_dim: usize,                 // 64 (2048 / 32)
}

impl CausalSelfAttention {
    fn forward(
        &self,
        x: &Tensor,
        index_pos: usize,
        block_idx: usize,
        cache: &mut Cache,
    ) -> candle::Result<Tensor> {
        let (b_sz, seq_len, hidden_size) = x.dims3()?;
        
        // Project to Q, K, V
        let q = self.q_proj.forward(x)?;
        let k = self.k_proj.forward(x)?;
        let v = self.v_proj.forward(x)?;

        // Reshape for multi-head attention
        let q = q.reshape((b_sz, seq_len, self.num_attention_heads, self.head_dim))?
            .transpose(1, 2)?;
        let k = k.reshape((b_sz, seq_len, self.num_key_value_heads, self.head_dim))?
            .transpose(1, 2)?;
        let mut v = v.reshape((b_sz, seq_len, self.num_key_value_heads, self.head_dim))?
            .transpose(1, 2)?;

        // Apply RoPE
        let q = self.apply_rotary_emb(&q, index_pos, cache)?;
        let mut k = self.apply_rotary_emb(&k, index_pos, cache)?;

        // KV cache management
        if cache.use_kv_cache {
            if let Some((cache_k, cache_v)) = &cache.kvs[block_idx] {
                k = Tensor::cat(&[cache_k, &k], 2)?;
                v = Tensor::cat(&[cache_v, &v], 2)?;
            }
            cache.kvs[block_idx] = Some((k.clone(), v.clone()));
        }

        // Repeat K, V for GQA (4 KV heads -> 32 Q heads)
        let k = self.repeat_kv(k)?;
        let v = self.repeat_kv(v)?;

        // Compute attention scores (scaled dot-product)
        let att = (q.matmul(&k.t()?)? / (self.head_dim as f64).sqrt())?;
        
        // Apply causal mask (prevent attending to future tokens)
        let att = if seq_len > 1 {
            let mask = cache.mask_query_kv(seq_len, k.dims()[2])?;
            masked_fill(&att, &mask, f32::NEG_INFINITY)?
        } else {
            att
        };

        // Softmax + weighted sum
        let att = candle_nn::ops::softmax_last_dim(&att)?;
        let y = att.matmul(&v)?;
        
        // Reshape and project output
        let y = y.transpose(1, 2)?.reshape(&[b_sz, seq_len, hidden_size])?;
        self.o_proj.forward(&y)
    }
}

5.2 Rotary Position Embeddings (RoPE)

RoPE encodes positional information by rotating query/key vectors in complex space, enabling the model to handle sequences up to 2048 tokens efficiently.

RoPE Implementation

fn apply_rotary_emb(&self, x: &Tensor, index_pos: usize, cache: &Cache) -> candle::Result<Tensor> {
    let (_b_sz, _n_head, seq_len, _head_dim) = x.dims4()?;
    
    // Extract precomputed cos/sin tables
    let cos = cache.cos.narrow(0, index_pos, seq_len)?;
    let sin = cache.sin.narrow(0, index_pos, seq_len)?;
    
    // Apply rotation (Candle's optimized implementation)
    candle_nn::rotary_emb::rope(x, &cos, &sin)
}

🔢 RoPE Precomputation

Cosine and sine tables are precomputed during cache initialization using theta = 10000.0. This amortizes computation cost across all inference steps.

6. KV Cache Implementation

Key-Value (KV) caching stores computed key/value tensors from previous tokens, avoiding recomputation during autoregressive generation. This provides a ~10x speedup for multi-token sequences.

Cache Structure

#[derive(Debug, Clone)]
struct Cache {
    masks: HashMap<(usize, usize), Tensor>,  // Precomputed causal masks
    use_kv_cache: bool,                      // Enable/disable caching
    kvs: Vec<Option<(Tensor, Tensor)>>,      // Per-layer K, V tensors
    cos: Tensor,                             // RoPE cosine table (2048 x 32)
    sin: Tensor,                             // RoPE sine table (2048 x 32)
    device: Device,
}

impl Cache {
    fn new(
        use_kv_cache: bool,
        dtype: DType,
        max_seq_len: usize,
        head_dim: usize,
        num_layers: usize,
        device: &Device,
    ) -> candle::Result<Self> {
        // Precompute RoPE frequencies
        let theta = 10000.0f32;
        let inv_freq: Vec<f32> = (0..head_dim)
            .step_by(2)
            .map(|i| 1.0 / theta.powf(i as f32 / head_dim as f32))
            .collect();
        
        let inv_freq = Tensor::from_vec(inv_freq, (head_dim / 2,), device)?;
        let t = Tensor::arange(0u32, max_seq_len as u32, device)?
            .to_dtype(DType::F32)?
            .reshape((max_seq_len, 1))?;
        
        let freqs = t.matmul(&inv_freq.reshape((1, head_dim / 2))?)?;
        let cos = freqs.cos()?.to_dtype(dtype)?;
        let sin = freqs.sin()?.to_dtype(dtype)?;

        Ok(Self {
            masks: HashMap::new(),
            use_kv_cache,
            kvs: vec![None; num_layers],
            device: device.clone(),
            cos,
            sin,
        })
    }

    fn estimate_tokens(&self) -> usize {
        self.kvs.iter()
            .filter_map(|kv| kv.as_ref())
            .map(|(k, _)| k.dims()[2])
            .max()
            .unwrap_or(0)
    }

    fn memory_mb(&self) -> f32 {
        let mut total_bytes = 0;
        for kv in &self.kvs {
            if let Some((k, v)) = kv {
                total_bytes += k.elem_count() * 4;  // F32 = 4 bytes
                total_bytes += v.elem_count() * 4;
            }
        }
        total_bytes as f32 / (1024.0 * 1024.0)
    }

    fn reset_for_new_turn(&mut self) {
        self.kvs = vec![None; self.kvs.len()];
        self.masks.clear();
    }
}

Cache Memory Management

For TinyLlama (22 layers, 4 KV heads, 64 head dim):

Per-token memory: 22 layers × 4 heads × 64 dim × 2 (K+V) × 4 bytes = 45KB/token
At 256 tokens: ~11.5 MB
At 1536 tokens (max context): ~69 MB

7. Conversational AI Features

The model implements a sliding window context manager that automatically trims conversation history when approaching the 2048-token limit.

Conversation Management

struct Conversation {
    messages: Vec<Message>,
    max_history_tokens: usize,  // 1536 tokens (leaves 512 for generation)
    system_prompt: String,
}

impl Conversation {
    fn apply_sliding_window(&mut self, tokenizer: &Tokenizer) -> candle::Result<()> {
        let full_prompt = self.format_prompt(tokenizer)?;
        let tokens = tokenizer.encode(full_prompt, false)?.get_ids().len();

        if tokens <= self.max_history_tokens {
            return Ok(());
        }
        
        // Keep system prompt + most recent messages
        let mut kept_messages = vec![self.messages[0].clone()];
        let mut current_tokens = tokenizer
            .encode(self.system_prompt.clone(), false)?
            .get_ids().len();

        for msg in self.messages.iter().skip(1).rev() {
            let msg_tokens = tokenizer.encode(msg.content.clone(), false)?
                .get_ids().len();

            if current_tokens + msg_tokens > self.max_history_tokens {
                break;
            }

            kept_messages.insert(1, msg.clone());
            current_tokens += msg_tokens;
        }

        let removed = self.messages.len() - kept_messages.len();
        println!("Context trimmed: kept {} msgs, removed {} old msgs", 
                 kept_messages.len(), removed);
        
        self.messages = kept_messages;
        Ok(())
    }
}

Prompt Formatting

TinyLlama uses the ChatML format with special tokens:

ChatML Template

<|system|>
You are a helpful AI assistant. You provide concise, accurate answers.</s>
<|user|>
What is Q8K quantization?</s>
<|assistant|>
Q8K is a block-based 8-bit quantization method...</s>
<|user|>
How does it compare to Q4?</s>
<|assistant|>

8. Performance Benchmarks

Inference Speed

Metric	Cold Start	Warm (with KV cache)	Notes
First token latency	450ms	115ms	Prompt encoding + first forward pass
Subsequent tokens	N/A	115ms/token	8.7 tokens/second
Cache memory (256 tok)	0 MB	11.5 MB	Per-layer K, V tensors
Total speedup (KV cache)	~10x for multi-token generation		Amortized over sequence

Quantization Quality Metrics

Layer Type	RMSE	Max Error	Mean Relative Error
Attention Q/K/V	2.4e-4	8.1e-3	0.06%
Attention Output	1.8e-4	6.2e-3	0.05%
MLP Gate/Up	3.2e-4	9.4e-3	0.08%
MLP Down	2.1e-4	7.5e-3	0.06%

✅ Quality Guarantee

All layers meet the <1% relative error threshold. The quantizer automatically flags problematic layers and recommends permutation strategies.

System Resource Usage

~2.5GB

Peak RAM (model + cache)

1 Core

CPU Utilization (100%)

0.0%

GPU Usage (CPU-only)

~150ms

P95 Latency (single token)

9. Production Deployment

Docker Configuration

Dockerfile (Multi-stage Build)

# Stage 1: Build Rust inference engine
FROM rust:1.75-slim as rust-builder
WORKDIR /build
COPY candle-examples/Cargo.toml candle-examples/Cargo.lock ./
COPY candle-examples/examples/llama-q8k.rs ./examples/
RUN cargo build --release --example llama-q8k

# Stage 2: Build Angular frontend
FROM node:20-alpine as angular-builder
WORKDIR /app
COPY angular-chat/package*.json ./
RUN npm ci
COPY angular-chat/ ./
RUN npm run build -- --configuration production

# Stage 3: Production runtime
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y \
    ca-certificates \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY --from=rust-builder /build/target/release/examples/llama-q8k ./
COPY --from=angular-builder /app/dist/angular-chat ./static
COPY model-q8k-packed.safetensors ./
COPY tokenizer.json ./

EXPOSE 8080
CMD ["./llama-q8k", "model-q8k-packed.safetensors"]

Deployment Options

🐳

Docker Standalone

Single container with Rust backend + Angular frontend. Ideal for quick deployment.

docker run -p 8080:8080 artemr87/tinyllama-q8k:latest

☸️

Kubernetes

Horizontal scaling with HPA based on CPU utilization (target: 70%).

🌐

Reverse Proxy

Nginx frontend with WebSocket support for real-time chat streaming.

Optimization Flags

Cargo.toml (Release Profile)

[profile.release]
opt-level = 3              # Maximum optimization
lto = "fat"               # Link-time optimization
codegen-units = 1         # Single codegen unit for best optimization
panic = "abort"           # Smaller binary, faster execution
strip = true              # Remove debug symbols

[dependencies]
candle-core = { version = "0.3.0", features = ["mkl"] }  # Intel MKL for BLAS
candle-nn = "0.3.0"
candle-transformers = "0.3.0"
tokenizers = "0.15.0"

🚀 Performance Tip

Using Intel MKL instead of OpenBLAS provides ~15% speedup on x86_64 CPUs. For ARM (Apple Silicon), use Accelerate framework.

10. Try It Yourself

🎮 Interactive Demo

Experience the quantized TinyLlama model in action with our production-ready Docker container. Runs efficiently on consumer CPUs without GPU requirements.

🚀 Launch Live Demo

Note: Live demo uses WebAssembly. For full features, use Docker deployment below.

🐳 Docker Deployment (Recommended)

The quantized model is available as a production-ready Docker container (5GB, includes model + runtime). This is currently the primary deployment method while GitHub source code release is under review.

📦 Docker Hub Repository

Image: artemr87/tinyllama-q8k:latest
Size: 5 GB (includes quantized model + Rust runtime)
Last Updated: 13 days ago
Tags Available: latest, 2.0.3, 2.0.2, 1.0.0

Quick Start (Interactive Mode)

Run with Docker

# Pull and run the latest version
docker run -it --rm artemr87/tinyllama-q8k:latest

# Sample output:
# 🔧 Loading model...
# 🤖 TinyLlama Q8K Conversational AI
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Model: TinyLlama-1.1B-Chat (Q8K quantized)
# Max context: 2048 tokens | History: 1536 tokens
# Commands: 'exit' to quit | 'reset' to clear history
#
# 💬 You: _

Advanced Usage

Custom Configuration

# Run with specific tag version
docker run -it --rm artemr87/tinyllama-q8k:2.0.3

# Run in background with custom port mapping
docker run -d -p 8080:8080 --name tinyllama-chat \
  artemr87/tinyllama-q8k:latest

# Run with resource limits (recommended for production)
docker run -it --rm \
  --memory="4g" \
  --cpus="2" \
  artemr87/tinyllama-q8k:latest

# Access logs from background container
docker logs -f tinyllama-chat

# Stop background container
docker stop tinyllama-chat

Docker Compose (Production Setup)

docker-compose.yml

version: '3.8'

services:
  tinyllama:
    image: artemr87/tinyllama-q8k:latest
    container_name: tinyllama-q8k
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      - RUST_LOG=info
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

# Usage:
# docker-compose up -d          # Start in background
# docker-compose logs -f        # View logs
# docker-compose down           # Stop and remove

What's Included in the Container?

📦

Quantized Model

TinyLlama-1.1B-Chat pre-quantized to Q8K format (~1.3GB)

⚙️

Rust Runtime

Optimized inference engine with KV cache and RoPE attention

🔧

Tokenizer

Pre-configured TinyLlama tokenizer (32K vocabulary)

⚙️ System Requirements

Docker: 20.10+ (Docker Engine or Docker Desktop)
RAM: 4GB+ recommended (2GB minimum)
CPU: x86_64 (Intel/AMD) or ARM64 (Apple Silicon)
Disk: 6GB free space (5GB image + cache)
OS: Linux, macOS, Windows 10/11 with WSL2

🔧 Build from Source (Coming Soon)

The complete source code and build toolchain are currently under review for public release on GitHub. Once approved, you'll be able to:

Clone the repository and build locally
Customize quantization strategies (SVD, QR pivot, block-wise)
Experiment with different model architectures
Contribute improvements via pull requests

📢 GitHub Release Status

Repository: ryzhov-artem/candle_quant (private, under review)
Expected Release: Q1 2026
Will Include: Full source code, quantization tools, training scripts, documentation

Follow the Docker Hub repository for updates on GitHub release timing.

Preview: Local Build Process (Future)

Future GitHub Workflow

# Step 1: Clone repository (when public)
git clone https://github.com/ryzhov-artem/candle_quant.git
cd candle_quant

# Step 2: Download base model
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --local-dir ./models/tinyllama

# Step 3: Quantize with custom strategy
CANDLE_Q8K_PERMUTE=true cargo run --release \
  -p tensor-tools --bin quantize_q8k \
  ./models/tinyllama/model.safetensors \
  ./output/quantized-q8k

# Step 4: Pack into single file
cargo run --release -p tensor-tools --bin pack_q8k_safetensors \
  ./models/tinyllama/model.safetensors \
  ./output/quantized-q8k \
  ./model-q8k-packed.safetensors

# Step 5: Run inference
cargo run --release --example llama-q8k \
  ./model-q8k-packed.safetensors

✅ Current Recommendation

For immediate deployment and testing, use the Docker image. It's production-ready, fully tested, and receives regular updates. Source code will be available once the review process completes.

🙏 Acknowledgments

This project would not be possible without the incredible work of the open-source community:

Hugging Face Candle — Minimalist ML framework for Rust with blazing-fast tensor operations
TinyLlama Team — Open-source 1.1B parameter LLM trained on 3 trillion tokens
llama.cpp — Inspiration for Q8K quantization scheme and GGUF format
SafeTensors — Secure tensor serialization format
Rust Language — Memory-safe systems programming without garbage collection

Special thanks to the Candle maintainers for their responsive support and excellent documentation.

📚 Explore More

Dive deeper into the implementation, contribute improvements, or deploy your own quantized models.

View Source Code Docker Hub More Articles Contact Me