Model Quantization from First Principles

Jun 19, 2026

Model quantization is the process of replacing high-precision numerical representations, commonly $32$-bit floating point, with lower-precision formats such as $16$-bit floating point, $8$-bit integers, or $4$-bit integers to reduce memory, bandwidth, and compute cost during neural network inference. At its core, quantization is not a trick specific to neural networks; it is an application of numerical approximation: map a continuous or high-resolution set of real values into a finite set of discrete values, then compute with those discrete values as efficiently as possible.

A neural network layer computes transformations such as:

y = Wx + b

where $W$ is a matrix of weights, $x$ is an activation, and $b$ is a bias term. In full precision, $W$, $x$, and $b$ are often stored as floating-point numbers. Quantization asks: can we approximate $W$ and $x$ using integers while keeping the output $y$ close enough for the task?

The most common production scheme is affine quantization, which maps a real value $r$ to an integer value $q$ through a scale $s$ and zero point $z$:

q = \operatorname{round}\left(\frac{r}{s}\right) + z

and reconstructs an approximate real value as:

\hat{r} = s(q - z)

The central idea is therefore simple: choose $s$ and $z$ so that important real values fit into a small integer range, such as $[-128,127]$ for signed $8$-bit integers or $[0,255]$ for unsigned $8$-bit integers.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference - Foundational paper describing integer-only neural network inference and quantization-aware training methods.
Quantization in Digital Signal Processing - Background on mapping continuous or high-resolution values to a finite set of discrete values.
TensorFlow Lite 8-bit Quantization Specification - Specification describing scales, zero points, signed integer ranges, per-axis quantization, and operator constraints.

Understanding int8 Neural Network Quantization

First-Principles Mental Model

Quantization is controlled information loss. The goal is not to make every number exact; it is to preserve the model’s input-output behavior while reducing storage, memory bandwidth, and arithmetic cost.

Why Quantization Works

Neural networks are often robust to small numerical perturbations because many learned representations are distributed across many parameters and activations. This means that replacing a value $r$ with a nearby approximation $\hat{r}$ may not significantly change the final prediction if the induced quantization error remains small relative to the model’s margins and layer sensitivities.

For a uniform quantizer, the real line is divided into equal-width intervals. If values are clipped to a representable interval $[r_{\min}, r_{\max}]$ and encoded with $N$ integer levels, the step size is approximately:

s = \frac{r_{\max} - r_{\min}}{N - 1}

For $8$-bit unsigned quantization, $N = 256$; for signed $8$-bit quantization, there are also $256$ distinct integer codes. Smaller $s$ means finer resolution but a narrower representable range. Larger $s$ covers a wider range but increases rounding error. Quantization is therefore a trade-off between clipping error and rounding error.
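
The trade-off can be made concrete with a small experiment. The sketch below is illustrative only: the helper `quantize_dequantize`, the synthetic data, and the three clipping bounds are assumptions chosen for the demonstration, not part of any particular framework.

```python
import numpy as np

def quantize_dequantize(x, r_min, r_max, num_levels=256):
    """Uniformly quantize x onto num_levels codes spanning [r_min, r_max], then reconstruct."""
    s = (r_max - r_min) / (num_levels - 1)    # step size, as in the formula above
    clipped = np.clip(x, r_min, r_max)        # values outside the range saturate
    codes = np.round((clipped - r_min) / s)   # integer code in [0, num_levels - 1]
    return r_min + s * codes, s

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
x[:5] = [25.0, -30.0, 40.0, -22.0, 35.0]      # a few artificial outliers

for bound in (1.0, 4.0, 40.0):
    x_hat, s = quantize_dequantize(x, -bound, bound)
    inside = np.abs(x) <= bound
    # Error on in-range values is pure rounding; error on the rest is clipping.
    rounding_mse = np.mean((x[inside] - x_hat[inside]) ** 2)
    clipping_mse = np.mean((x[~inside] - x_hat[~inside]) ** 2) if (~inside).any() else 0.0
    total_mse = np.mean((x - x_hat) ** 2)
    print(f"bound={bound:5.1f}  s={s:.5f}  rounding={rounding_mse:.2e}  "
          f"clipping={clipping_mse:.2e}  total={total_mse:.2e}")
```

A narrow bound gives a small step but saturates many values; a bound wide enough to cover every outlier removes clipping but makes every in-range value coarser.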

A typical scalar quantization pipeline is:

r \rightarrow q = \operatorname{clip}\left(\operatorname{round}\left(\frac{r}{s}\right) + z, q_{\min}, q_{\max}\right) \rightarrow \hat{r} = s(q - z)

The clipping operation ensures that $q$ remains inside the valid integer range, such as $q_{\min}=-128$ and $q_{\max}=127$ for signed $8$-bit tensors.

Concept	Mathematical Role	Practical Effect
Scale $s$	Determines spacing between representable real values	Smaller $s$ improves precision but narrows dynamic range
Zero point $z$	Ensures real zero is exactly representable	Important for padding and efficient integer arithmetic
Clipping bounds	Define $r_{\min}$ and $r_{\max}$	Prevent overflow but may saturate outliers
Bit width $b$	Gives approximately $2^b$ codes	Lower $b$ saves memory but increases error

Relative Storage Cost by Numeric Format

Lower bit widths reduce parameter storage approximately in proportion to the number of bits used per value; practical savings also depend on packing, metadata, kernels, and hardware support.
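
As a rough back-of-the-envelope check, raw weight storage can be computed directly from the parameter count and bit width. The parameter count below is a hypothetical value chosen for illustration, and the calculation deliberately ignores scales, zero points, and packing overhead.

```python
def weight_storage_bytes(num_params, bits_per_value):
    """Raw storage for num_params values, ignoring metadata and packing overhead."""
    return num_params * bits_per_value / 8

num_params = 7_000_000_000  # hypothetical model size, for illustration only

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = weight_storage_bytes(num_params, bits) / 2**30
    print(f"{name}: {gib:.1f} GiB")
```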

The Core Equation: Real Matrix Multiplication with Integer Arithmetic

Consider a linear layer:

y = Wx

If both $W$ and $x$ are quantized, then:

W \approx s_W(q_W - z_W)

x \approx s_x(q_x - z_x)

Substituting into the matrix multiplication gives:

y \approx s_W s_x (q_W - z_W)(q_x - z_x)

The expensive part can now be expressed as integer multiply-accumulate operations, often accumulating into $32$-bit integers before rescaling to the next layer’s output format. This is the foundation of integer-only inference, a major reason quantized inference can be faster on supported CPUs, DSPs, NPUs, and mobile accelerators.

For a dot product of length $n$:

y = \sum_{i=1}^{n} w_i x_i

the quantized approximation is:

y \approx s_W s_x \sum_{i=1}^{n} (q_{w_i} - z_W)(q_{x_i} - z_x)

In practice, optimized libraries avoid unnecessary per-element subtraction by algebraically expanding the terms and precomputing sums where possible.
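
Expanding the product gives $\sum_i q_{w_i} q_{x_i} - z_x \sum_i q_{w_i} - z_W \sum_i q_{x_i} + n z_W z_x$. The sketch below checks that identity numerically; the function names, scales, and zero points are illustrative assumptions, and real kernels organize this work quite differently.

```python
import numpy as np

def int_dot_direct(q_w, q_x, z_w, z_x):
    """Sum of (q_w - z_w) * (q_x - z_x), subtracting zero points per element."""
    w = q_w.astype(np.int32) - z_w
    x = q_x.astype(np.int32) - z_x
    return int(np.sum(w * x))

def int_dot_expanded(q_w, q_x, z_w, z_x):
    """Same value, expanded so the zero-point terms become plain sums."""
    w = q_w.astype(np.int32)   # widen to 32 bits before accumulating
    x = q_x.astype(np.int32)
    n = w.size
    return int(np.dot(w, x) - z_x * w.sum() - z_w * x.sum() + n * z_w * z_x)

rng = np.random.default_rng(0)
q_w = rng.integers(-128, 128, size=64, dtype=np.int8)
q_x = rng.integers(-128, 128, size=64, dtype=np.int8)
z_w, z_x = 0, -5        # illustrative zero points
s_w, s_x = 0.02, 0.05   # illustrative scales

acc = int_dot_expanded(q_w, q_x, z_w, z_x)
assert acc == int_dot_direct(q_w, q_x, z_w, z_x)

# Rescale the integer accumulator back to an approximate real-valued result.
y_approx = s_w * s_x * acc
print(acc, y_approx)
```

In the expanded form, the sum over weight codes depends only on the weights, so it can be computed once ahead of time, and with symmetric weights ($z_W = 0$) two of the four terms vanish.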

Zero Is Special

A quantization scheme should represent real zero exactly because zero padding, sparse values, and common neural-network operations depend on preserving zero behavior.

Affine Quantization from Scratch

Step 1: Select the target numeric format. For signed $8$-bit quantization, a common range is $q_{\min}=-128$ and $q_{\max}=127$; for unsigned $8$-bit quantization, it is commonly $q_{\min}=0$ and $q_{\max}=255$.
Step 2: Estimate $r_{\min}$ and $r_{\max}$ from the tensor being quantized. For weights, this can be computed directly from trained parameters; for activations, it is usually estimated from representative calibration data.
Step 3: Compute the scale as $s = \frac{r_{\max}-r_{\min}}{q_{\max}-q_{\min}}$. The scale is the real-value distance represented by one integer step.
Step 4: Compute the zero point as $z = q_{\min} - \operatorname{round}\left(\frac{r_{\min}}{s}\right)$, then clamp $z$ into $[q_{\min}, q_{\max}]$. Because $z$ is an integer, real zero maps exactly to the code $z$ whenever $[r_{\min}, r_{\max}]$ contains zero, so the range is usually widened to include it.
Step 5: For each real value $r$, compute $q = \operatorname{clip}\left(\operatorname{round}\left(\frac{r}{s}\right)+z, q_{\min}, q_{\max}\right)$.
Step 6: Recover an approximate value with $\hat{r}=s(q-z)$. In optimized inference, many operations avoid full dequantization and instead rescale integer accumulators between layers.

PyTorch Quantization Documentation - Framework documentation covering observers, calibration, dynamic quantization, static quantization, and quantization-aware training.
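
As a worked example of the six steps, assume a tensor with $r_{\min}=-1$ and $r_{\max}=3$ quantized to signed $8$-bit codes. These values are chosen purely for illustration.

```latex
\begin{aligned}
s &= \frac{r_{\max}-r_{\min}}{q_{\max}-q_{\min}} = \frac{3-(-1)}{127-(-128)} = \frac{4}{255} \approx 0.01569 \\
z &= q_{\min} - \operatorname{round}\left(\frac{r_{\min}}{s}\right) = -128 - \operatorname{round}(-63.75) = -128 + 64 = -64 \\
r = 0:&\quad q = \operatorname{round}(0) + z = -64, \quad \hat{r} = s(-64 + 64) = 0 \\
r = 1:&\quad q = \operatorname{round}(63.75) - 64 = 0, \quad \hat{r} = \tfrac{4}{255}(0 + 64) \approx 1.00392 \\
r = 5:&\quad q = \operatorname{clip}(319 - 64, -128, 127) = 127, \quad \hat{r} = \tfrac{4}{255}(127 + 64) \approx 2.99608
\end{aligned}
```

Real zero is reconstructed exactly, $r=1$ picks up a rounding error of about $0.0039$ (below $s/2 \approx 0.0078$), and $r=5$ lies outside the range and saturates at the largest code.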

Symmetric vs Asymmetric Quantization

Two common forms of uniform quantization are symmetric quantization and asymmetric quantization.

In symmetric quantization, the real range is centered around zero, and the zero point is usually fixed to $0$ for signed integer formats:

\hat{r} = s q

This is especially common for weights because trained weights are often roughly centered around zero, and using $z=0$ simplifies arithmetic.

In asymmetric quantization, the range need not be centered around zero:

\hat{r} = s(q-z)

This is useful for activations after functions such as ReLU, where values may be mostly nonnegative and the distribution is shifted away from zero.

Scheme	Formula	Typical Use	Advantage	Trade-off
Symmetric	$\hat{r}=sq$	Weights	Simpler arithmetic	May waste codes if distribution is shifted
Asymmetric	$\hat{r}=s(q-z)$	Activations	Better fit for shifted ranges	Extra zero-point handling
Per-tensor	One $s,z$ for whole tensor	Simple deployment	Low metadata overhead	Sensitive to channel outliers
Per-channel	Separate $s,z$ per output channel	Convolution and linear weights	Better accuracy for uneven channels	More metadata and kernel complexity

Per-channel quantization often improves accuracy for weights because different output channels can have very different ranges. TensorFlow Lite’s quantization specification, for example, supports per-axis quantization for certain weight tensors and distinguishes activation and weight constraints.
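
The effect of granularity is easy to see on a synthetic weight matrix whose rows have very different magnitudes. This is a minimal sketch: `symmetric_quantize` is a hypothetical helper, rows stand in for output channels, and the magnitudes are made up for the demonstration.

```python
import numpy as np

def symmetric_quantize(w, axis=None, qmax=127):
    """Symmetric int8 quantization. axis=None: one scale; axis=1: one scale per row."""
    max_abs = np.max(np.abs(w), axis=axis, keepdims=axis is not None)
    scale = np.maximum(max_abs, 1e-12) / qmax   # guard against an all-zero channel
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
channel_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]])
w = rng.normal(size=(4, 64)) * channel_magnitudes   # rows act as output channels

q_t, s_t = symmetric_quantize(w)           # per-tensor: a single scale
q_c, s_c = symmetric_quantize(w, axis=1)   # per-channel: one scale per row

# Mean squared reconstruction error for each row under both schemes.
err_tensor = np.mean((w - q_t * s_t) ** 2, axis=1)
err_channel = np.mean((w - q_c * s_c) ** 2, axis=1)
for i in range(w.shape[0]):
    print(f"row {i}: per-tensor={err_tensor[i]:.2e}  per-channel={err_channel[i]:.2e}")
```

With a single scale, the largest row dictates the step size and the smallest rows collapse toward zero; per-row scales avoid this at the cost of storing one scale per channel.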

Symmetric quantization works best when values are distributed around zero. A common weight formula is $q=\operatorname{round}(r/s)$ with $\hat{r}=sq$, which reduces arithmetic overhead because the zero point is fixed at $0$.

Calibration: Estimating Activation Ranges

Weights are known after training, but activations depend on input data. Calibration runs sample inputs through the model to observe activation ranges before finalizing quantization parameters.

For post-training static quantization, a representative dataset is passed through the model while observers record tensor statistics such as minimum and maximum values or histograms. These statistics determine scales and zero points for activation tensors.

A naive min-max calibration strategy uses:

r_{\min} = \min(x)

r_{\max} = \max(x)

over observed calibration activations. However, outliers can make $r_{\max}-r_{\min}$ very large, increasing $s$ and reducing precision for the majority of values. More advanced calibration strategies may use percentiles or histogram-based criteria to balance clipping error and rounding error.
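
The sketch below contrasts min-max calibration with a simple percentile rule on synthetic activations containing one large outlier. The helper names and the $0.1$ and $99.9$ percentile cut-offs are assumptions for illustration; frameworks implement their own observers and criteria.

```python
import numpy as np

def minmax_range(x):
    """Range that covers every observed value, including outliers."""
    return float(np.min(x)), float(np.max(x))

def percentile_range(x, lower=0.1, upper=99.9):
    """Range that deliberately clips the most extreme values on each side."""
    lo, hi = np.percentile(x, [lower, upper])
    return float(lo), float(hi)

def scale_for(r_min, r_max, q_min=-128, q_max=127):
    """Step size implied by a real range and an integer code range."""
    return (r_max - r_min) / (q_max - q_min)

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=10_000))   # stand-in for nonnegative activations
acts[0] = 100.0                          # a single extreme outlier

s_minmax = scale_for(*minmax_range(acts))
s_pct = scale_for(*percentile_range(acts))
print(f"min-max scale:    {s_minmax:.4f}")
print(f"percentile scale: {s_pct:.4f}")
```

The percentile range yields a much smaller step for the bulk of the values, at the price of saturating the outlier.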

Post-Training Static Quantization Workflow

Step 1: Use a converged floating-point model as the baseline. Quantization modifies numeric representation but does not necessarily retrain the model.
Step 2: Combine patterns such as convolution, batch normalization, and ReLU where the framework supports it. Fusion can reduce memory traffic and improve quantized kernel efficiency.
Step 3: Attach observer modules to collect weight and activation statistics during calibration. PyTorch quantization workflows use observers to determine quantization parameters.
Step 4: Feed inputs that approximate deployment data. Calibration quality matters because activation ranges should reflect real inference conditions.
Step 5: Replace eligible floating-point operations with quantized equivalents using the collected scales and zero points.
Step 6: Compare task metrics against the floating-point baseline. If accuracy loss is excessive, inspect sensitive layers, outliers, calibration data, and whether quantization-aware training is needed.
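
The observe-then-convert part of this workflow can be sketched without any framework. The class `RunningRangeObserver` below is hypothetical and is not PyTorch's observer API; it only illustrates how statistics gathered over calibration batches become a scale and zero point.

```python
import numpy as np

class RunningRangeObserver:
    """Hypothetical observer: tracks the running min and max over all batches seen."""

    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")

    def observe(self, x):
        """Update the running range with one calibration batch."""
        self.r_min = min(self.r_min, float(np.min(x)))
        self.r_max = max(self.r_max, float(np.max(x)))

    def qparams(self, q_min=-128, q_max=127):
        """Turn the observed range into (scale, zero_point)."""
        r_min = min(self.r_min, 0.0)   # widen the range so it contains zero
        r_max = max(self.r_max, 0.0)
        scale = (r_max - r_min) / (q_max - q_min)
        if scale == 0.0:               # nothing but zeros observed
            scale = 1.0
        zero_point = int(np.clip(q_min - round(r_min / scale), q_min, q_max))
        return scale, zero_point

rng = np.random.default_rng(0)
observer = RunningRangeObserver()
for _ in range(8):   # eight synthetic calibration batches
    batch = np.maximum(rng.normal(size=(32, 16)), 0.0)   # stand-in for ReLU outputs
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(scale, zero_point)
```

Because the synthetic activations are nonnegative, the observed minimum is zero and the zero point lands at the bottom of the signed code range.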

Calibration Data Should Match Deployment

A small but representative calibration set is often more useful than a large mismatched one. Quantized activation ranges are only as good as the data used to estimate them.

Dynamic, Static, and Quantization-Aware Training

There are three major deployment patterns: dynamic quantization, static quantization, and quantization-aware training.

Dynamic quantization quantizes weights in advance but computes activation scales at runtime. This is commonly useful for models dominated by linear layers, such as recurrent networks and transformer components, because it reduces weight memory while avoiding calibration of every activation tensor.

Static quantization quantizes both weights and activations before deployment using calibration. It can provide stronger speedups on hardware with integer kernels because both operands in major matrix multiplications can be low precision.

Quantization-aware training, or QAT, simulates quantization during training by inserting fake-quantization operations into the forward pass while maintaining trainable parameters in floating point. This allows the model to adapt to quantization noise and often improves accuracy when low-bit post-training quantization is too lossy.

Method	Weights	Activations	Training Required?	Typical Benefit
Dynamic quantization	Quantized before inference	Quantized at runtime	No	Simple memory reduction and CPU speedups for linear-heavy models
Static quantization	Quantized before inference	Quantized from calibration	No	Efficient integer inference on supported kernels
Quantization-aware training	Simulated during training, exported later	Simulated during training, exported later	Yes	Better accuracy under aggressive quantization
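
The fake-quantization operation used in QAT is a quantize step followed immediately by a dequantize step, so the tensor stays in floating point but only takes values on the quantization grid. A minimal forward-pass sketch follows; the function name and the example scale are assumptions, and gradient handling during training is not shown.

```python
import numpy as np

def fake_quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Quantize then dequantize, keeping the result in floating point."""
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return scale * (q - zero_point)

x = np.array([-1.0, 0.004, 0.5, 10.0])
x_fq = fake_quantize(x, scale=0.01, zero_point=0)
print(x_fq)   # values snapped to the grid; 10.0 saturates at the top code
```

Downstream layers see the rounded and clipped values during training, which is how the model is exposed to quantization noise before export.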

Practical Quantization Roadmap

Stage 1: Baseline Evaluation. Measure the floating-point model’s accuracy, latency, memory, and throughput before applying quantization.
Stage 2: Post-Training Dynamic Quantization. Try weight-focused dynamic quantization first when the model is dominated by linear layers and deployment simplicity is important.
Stage 3: Post-Training Static Quantization. Use calibration to quantize both weights and activations when target hardware has efficient integer kernels.
Stage 4: Layerwise Diagnosis. Identify layers with high error, outlier channels, or unstable activation ranges.
Stage 5: Quantization-Aware Training. Fine-tune with simulated quantization when post-training approaches do not meet accuracy requirements.
Stage 6: Hardware-Specific Deployment. Validate that the exported model uses kernels supported by the target runtime, such as mobile, edge, server CPU, GPU, NPU, or DSP backends.

Quantization Error: A First-Principles View

Quantization replaces $r$ with $\hat{r}$, so the scalar error is:

e = r - \hat{r}

For an ideal uniform rounding quantizer without clipping, the absolute rounding error is bounded by half a step:

|e| \leq \frac{s}{2}

This bound explains why smaller scale values improve precision. However, reducing $s$ while keeping the same bit width shrinks the representable range and can increase clipping. Thus total error can be understood as:

\text{total error} = \text{rounding error} + \text{clipping error}

In neural networks, error compounds through layers. If one layer’s output is badly quantized, the next layer receives a distorted input distribution. This is why activation quantization is often harder than weight-only quantization: activations vary by input, layer, batch, and deployment distribution.

A useful layerwise approximation compares floating-point and quantized outputs:

\Delta y = y_{\text{float}} - y_{\text{quant}}

Large $\|\Delta y\|$ at a layer indicates that the layer may need higher precision, per-channel quantization, better calibration, clipping adjustment, or QAT.
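
A layerwise comparison of this kind can be sketched for a single linear layer. The helper `quant_dequant`, the layer shape, and the random data are assumptions for illustration; a real diagnosis would run the actual model on representative inputs.

```python
import numpy as np

def quant_dequant(t, q_min=-128, q_max=127):
    """Affine quantize-dequantize of a tensor, with the range widened to include zero."""
    r_min = min(float(t.min()), 0.0)
    r_max = max(float(t.max()), 0.0)
    if r_max == r_min:   # tensor is all zeros, nothing to quantize
        return t.copy()
    s = (r_max - r_min) / (q_max - q_min)
    z = int(np.clip(q_min - round(r_min / s), q_min, q_max))
    q = np.clip(np.round(t / s) + z, q_min, q_max)
    return s * (q - z)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))   # synthetic layer weights
x = rng.normal(size=32)         # synthetic input activation

y_float = W @ x
y_quant = quant_dequant(W) @ quant_dequant(x)   # simulated int8 weights and activations

delta = y_float - y_quant
rel_error = np.linalg.norm(delta) / np.linalg.norm(y_float)
print(f"relative output error: {rel_error:.4f}")
```

Tracking a relative error like this per layer makes it easier to spot which layers dominate the overall deviation.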

Common Failure Modes and Fixes

Weight-Only Quantization and Large Language Models

For large language models, weight-only quantization is widely used because model parameters consume large amounts of memory, and reducing weight precision can substantially reduce memory footprint. For example, moving from $16$-bit weights to $4$-bit weights can reduce raw weight storage by approximately $4\times$, ignoring scale metadata and packing overhead.

In transformer inference, memory bandwidth is often a major bottleneck, especially when serving large models with many parameters. Weight-only quantization reduces the amount of weight data read from memory while often keeping activations and accumulations in higher precision for stability.

Modern LLM quantization methods frequently use grouped quantization. Instead of one scale for an entire tensor or one scale per channel, a scale may be assigned to a small group of weights:

W_g \approx s_g q_g

where $g$ indexes a group. Smaller groups improve local accuracy but increase metadata overhead because more scales must be stored.
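
A minimal sketch of grouped symmetric quantization follows. The function names, the group size of $32$, and the assumption of $16$-bit scales are illustrative choices; codes are stored in `int8` here rather than packed two per byte as a real $4$-bit kernel would do.

```python
import numpy as np

def quantize_grouped(w, group_size=32, bits=4):
    """Symmetric quantization with one scale per contiguous group of weights."""
    qmax = 2 ** (bits - 1) - 1             # 7 for 4-bit symmetric codes
    groups = w.reshape(-1, group_size)     # requires w.size % group_size == 0
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.maximum(max_abs, 1e-12) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_grouped(q, scales, shape):
    """Reconstruct W_g = s_g * q_g and restore the original shape."""
    return (q * scales).reshape(shape)

def effective_bits(bits, group_size, scale_bits=16):
    """Average bits per weight once each group's scale is counted."""
    return bits + scale_bits / group_size

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 128))

for group_size in (128, 32):
    q, scales = quantize_grouped(w, group_size=group_size)
    w_hat = dequantize_grouped(q, scales, w.shape)
    mse = np.mean((w - w_hat) ** 2)
    print(f"group_size={group_size}: mse={mse:.4e}, "
          f"bits/weight={effective_bits(4, group_size):.3f}")
```

The effective bit count makes the metadata cost explicit: with $16$-bit scales, a group size of $32$ costs $4.5$ bits per weight, while a group size of $128$ costs $4.125$.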

Some advanced methods, such as GPTQ, use second-order approximations to quantize weights while compensating for quantization error layer by layer. Other approaches, such as activation-aware weight quantization, identify salient weights based on activation statistics and protect them during quantization.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Research paper on accurate post-training weight quantization for large transformer models.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - Research paper describing activation-aware protection of salient weights during low-bit LLM quantization.

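A minimal NumPy reference for affine quantization and dequantization of a single tensor:

import numpy as np

def quantize_affine(x, qmin=-128, qmax=127):
    """Affine-quantize array x to integer codes in [qmin, qmax]."""
    # Widen the range to include 0 so that real zero is exactly representable.
    rmin = min(float(np.min(x)), 0.0)
    rmax = max(float(np.max(x)), 0.0)
    # Use a signed or unsigned 8-bit dtype to match the requested code range.
    dtype = np.int8 if qmin < 0 else np.uint8

    if rmax == rmin:
        # Only reachable when every element is 0; code 0 then reconstructs exactly.
        scale = 1.0
        zero_point = 0
        q = np.zeros_like(x, dtype=dtype)
        return q, scale, zero_point

    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = qmin - round(rmin / scale)
    zero_point = int(np.clip(zero_point, qmin, qmax))

    # q = clip(round(r / s) + z, qmin, qmax), matching the scalar pipeline.
    q = np.round(x / scale) + zero_point
    q = np.clip(q, qmin, qmax).astype(dtype)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Reconstruct approximate real values: r_hat = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)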

Quantized Size Does Not Guarantee Quantized Speed

A model can be smaller after quantization but not faster if operators fall back to floating-point kernels, if dequantization is inserted too often, or if the target hardware lacks efficient support for the chosen low-precision format.

Practical Design Choices

A quantization strategy is a set of engineering decisions, not a single formula. The main choices are:

Bit width: $8$-bit quantization is often a strong accuracy-efficiency trade-off for many neural networks, while $4$-bit quantization is more aggressive and usually needs more careful methods.
Granularity: Per-tensor quantization has low overhead; per-channel or grouped quantization improves accuracy when distributions differ across channels or groups.
Symmetry: Symmetric quantization simplifies arithmetic; asymmetric quantization better handles shifted distributions.
Calibration method: Min-max calibration is simple; histogram or percentile methods may better handle outliers.
Training involvement: Post-training quantization is simpler; QAT can recover accuracy by exposing the model to simulated quantization noise during optimization.
Hardware target: The best quantization scheme depends on what the deployment runtime accelerates efficiently.

Key Quantization Concepts

Scale: The real-value step size represented by moving one integer code. In affine quantization, $\hat{r}=s(q-z)$.

Evaluation Checklist

A rigorous quantization evaluation should compare the quantized model against the floating-point baseline across multiple dimensions:

Evaluation Axis	Question	Diagnostic Signal
Accuracy	Does the task metric remain acceptable?	Top-$1$ accuracy, F1, BLEU, perplexity, or domain metric
Latency	Is inference actually faster?	End-to-end latency on target hardware
Throughput	Can more requests be served per second?	Tokens per second, images per second, queries per second
Memory	Is model storage or runtime memory reduced?	File size, resident memory, activation memory
Numerical stability	Are certain layers causing large deviations?	Layerwise $\|\Delta y\|$, saturation rate, clipping frequency
Portability	Does the runtime support the chosen operators?	Kernel coverage and fallback logs

The most important principle is to evaluate on the deployment path, not only in a development notebook. Quantization changes numerical formats, but real-world performance depends on graph conversion, operator fusion, memory layout, kernel availability, and hardware execution.
