LLM Architecture Gallery | Sebastian Raschka, PhD
#LLMs


A comprehensive visual reference of large language model architectures, featuring detailed diagrams and specifications for dozens of open-weight models, from Llama 3 to Ling.

Last updated: March 14, 2026

This page collects architecture figures and fact sheets from The Big LLM Architecture Comparison and A Dream of Spring for Open-Weight LLMs. It focuses on the architecture panels only. Click a figure to enlarge it and use the model title to jump to the corresponding article section.

If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue here: Architecture Gallery issue tracker.



Llama 3 8B

View in article | From scratch | config.json | Tech report

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

SCALE 8B parameters
DATE 2024-04-18
DECODER TYPE Dense
ATTENTION GQA with RoPE
KEY DETAIL Pre-norm baseline; wider than OLMo 2 at a similar scale.
RELATED CONCEPTS GQA
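As a quick illustration of what GQA buys over classic multi-head attention, the arithmetic below uses head counts from Llama 3 8B's published config (32 query heads, 8 KV heads, head dimension 128); it is a back-of-the-envelope sketch, not a full attention implementation.

```python
# KV-cache arithmetic for grouped-query attention (GQA).
# Head counts follow Llama 3 8B's config: 32 query heads share 8 KV heads.
n_heads, n_kv_heads, head_dim = 32, 8, 128

group_size = n_heads // n_kv_heads             # query heads per KV head
mha_kv_per_token = 2 * n_heads * head_dim      # keys + values under full MHA
gqa_kv_per_token = 2 * n_kv_heads * head_dim   # keys + values under GQA

print(group_size)                              # 4 query heads share each KV head
print(mha_kv_per_token // gqa_kv_per_token)    # KV cache shrinks 4x per token
```

The quality cost of this sharing is small in practice, which is why GQA has become the default for dense open-weight models at this scale.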


OLMo 2 7B

View in article | config.json | Tech report

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

SCALE 7B parameters
DATE 2024-11-25
DECODER TYPE Dense
ATTENTION MHA with QK-Norm
KEY DETAIL Uses inside-residual post-norm instead of the usual pre-norm layout.
RELATED CONCEPTS QK-Norm MHA
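A minimal sketch of what QK-Norm does: RMS-normalize queries and keys before the dot product, so attention logits stay bounded regardless of activation scale. The learned gain of RMSNorm is omitted and all sizes are arbitrary stand-ins.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without the learned gain: scale each vector to unit RMS.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

head_dim = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((4, head_dim)) * 10.0  # deliberately large activations
k = rng.standard_normal((4, head_dim)) * 10.0

logits_raw = q @ k.T                           # grows with activation scale
logits_norm = rms_norm(q) @ rms_norm(k).T      # |logit| <= head_dim (Cauchy-Schwarz)
```

Bounding the logits this way is the stability argument behind QK-Norm: softmax inputs can no longer blow up as activations drift during training.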


DeepSeek V3

View in article | config.json | Tech report

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

SCALE 671B total, 37B active
DATE 2024-12-26
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
RELATED CONCEPTS MLA MoE
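The shared-expert idea from the card above can be sketched in a few lines: top-k routed experts plus one expert that every token always passes through. Expert count, top-k, and widths are illustrative stand-ins, not DeepSeek's real configuration.

```python
import numpy as np

# Toy DeepSeek-V3-style MoE layer: top-k routed experts + one shared expert.
n_experts, top_k, d = 8, 2, 16
rng = np.random.default_rng(0)
routed_experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
shared_expert = rng.standard_normal((d, d)) / np.sqrt(d)
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router
    top = np.argsort(scores)[-top_k:]          # indices of the top-k experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # softmax over the chosen experts
    routed = sum(wi * (x @ routed_experts[i]) for wi, i in zip(w, top))
    return routed + x @ shared_expert          # shared expert is always on

y = moe_forward(rng.standard_normal(d))
```

The shared expert gives every token a guaranteed dense path, which is one way to keep routing instabilities from hurting common-pattern tokens.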


DeepSeek R1

View in article | config.json | Tech report

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

SCALE 671B total, 37B active
DATE 2025-01-20
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.
RELATED CONCEPTS MLA MoE


Gemma 3 27B

View in article | From scratch | config.json | Tech report

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

SCALE 27B parameters
DATE 2025-03-11
DECODER TYPE Dense
ATTENTION GQA with QK-Norm and 5:1 sliding-window/global attention
KEY DETAIL Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
RELATED CONCEPTS QK-Norm GQA SWA
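The 5:1 ratio above translates into a simple per-layer schedule; the sketch below builds one, with a 12-layer depth chosen purely for display (Gemma 3 27B is much deeper).

```python
# Layer-type schedule for Gemma 3's 5:1 pattern: five sliding-window layers
# for every global-attention layer.
def attention_schedule(n_layers, local_per_global=5):
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding"
            for i in range(n_layers)]

schedule = attention_schedule(12)
print(schedule)   # five "sliding" entries before each "global"
```

Because only the global layers keep a full-length KV cache, this kind of schedule cuts long-context memory roughly in proportion to the local/global ratio.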


Mistral Small 3.1 24B

View in article | config.json | Tech report

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

SCALE 24B parameters
DATE 2025-03-18
DECODER TYPE Dense
ATTENTION Standard GQA
KEY DETAIL Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.
RELATED CONCEPTS GQA SWA


Llama 4 Maverick

View in article | config.json | Tech report

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

SCALE 400B total, 17B active
DATE 2025-04-05
DECODER TYPE Sparse MoE
ATTENTION GQA
KEY DETAIL Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.
RELATED CONCEPTS GQA MoE


Qwen3 235B-A22B

View in article | From scratch | config.json | Tech report

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

SCALE 235B total, 22B active
DATE 2025-04-28
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL High-capacity MoE design optimized for serving efficiency without a shared expert.
RELATED CONCEPTS QK-Norm GQA MoE


Qwen3 32B

View in article | From scratch | config.json | Tech report

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

SCALE 32B parameters
DATE 2025-04-28
DECODER TYPE Dense
ATTENTION GQA with QK-Norm
KEY DETAIL Reference dense Qwen stack with QK-Norm and 8 KV heads.
RELATED CONCEPTS QK-Norm GQA


Qwen3 4B

View in article | From scratch | config.json | Tech report

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

SCALE 4B parameters
DATE 2025-04-28
DECODER TYPE Dense
ATTENTION GQA with QK-Norm
KEY DETAIL Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.
RELATED CONCEPTS QK-Norm GQA


Qwen3 8B

View in article | From scratch | config.json | Tech report

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

SCALE 8B parameters
DATE 2025-04-28
DECODER TYPE Dense
ATTENTION GQA with QK-Norm
KEY DETAIL Reference Qwen3 dense stack with QK-Norm and 8 KV heads.
RELATED CONCEPTS QK-Norm GQA


SmolLM3 3B

View in article | config.json | Tech report

Compact dense model that experiments with leaving out positional encodings in selected layers.

SCALE 3B parameters
DATE 2025-06-19
DECODER TYPE Dense
ATTENTION GQA with periodic NoPE layers
KEY DETAIL Every fourth layer omits RoPE to test a NoPE-style cadence.
RELATED CONCEPTS NoPE GQA
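The NoPE cadence from the card is easy to state as code: every fourth layer skips rotary embeddings. The 4-layer period comes from the card above; the 8-layer depth is only for illustration.

```python
# SmolLM3-style NoPE cadence: every fourth layer runs without RoPE.
def uses_rope(layer_idx, nope_every=4):
    return (layer_idx + 1) % nope_every != 0

rope_mask = [uses_rope(i) for i in range(8)]
print(rope_mask)   # layers 4 and 8 (1-indexed) omit rotary embeddings
```
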


Kimi K2

View in article | config.json | Tech report

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

SCALE 1T total, 32B active
DATE 2025-07-10
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL More experts and fewer MLA heads than DeepSeek V3.
RELATED CONCEPTS MLA MoE


GLM-4.5 355B

View in article | config.json | Tech report

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

SCALE 355B total, 32B active
DATE 2025-07-28
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Starts with three dense layers before MoE routing and keeps a shared expert.
RELATED CONCEPTS QK-Norm GQA MoE


GPT-OSS 120B

View in article | config.json | Tech report

The larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

SCALE 120B total, 5.1B active
DATE 2025-08-04
DECODER TYPE Sparse MoE
ATTENTION GQA with alternating sliding-window and global layers
KEY DETAIL Shared architectural template scaled up for OpenAI's flagship open-weight release.
RELATED CONCEPTS GQA SWA MoE


GPT-OSS 20B

View in article | config.json | Tech report

OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.

SCALE 20B total, 3.6B active
DATE 2025-08-04
DECODER TYPE Sparse MoE
ATTENTION GQA with alternating sliding-window and global layers
KEY DETAIL Wider and shallower than Qwen3, with attention bias and sink mechanisms.
RELATED CONCEPTS GQA SWA MoE


Grok 2.5 270B

View in article | config.json | Tech report

Rare production-model release that shows an older MoE style with fewer, larger experts.

SCALE 270B parameters
DATE 2025-08-22
DECODER TYPE Sparse MoE
ATTENTION GQA
KEY DETAIL Adds an always-on SwiGLU path that effectively behaves like a shared expert.
RELATED CONCEPTS GQA MoE


Qwen3 Next 80B-A3B

View in article | config.json | Tech report

Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.

SCALE 80B total, 3B active
DATE 2025-09-09
DECODER TYPE Sparse hybrid
ATTENTION 3:1 Gated DeltaNet and Gated Attention
KEY DETAIL Adds many more experts, a shared expert, and a native 262k context.
RELATED CONCEPTS MoE Gated Attention Gated DeltaNet


MiniMax M2 230B

View in article | config.json | Tech report

MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.

SCALE 230B total, 10B active
DATE 2025-10-23
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm and partial RoPE
KEY DETAIL Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.
RELATED CONCEPTS QK-Norm GQA MoE
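Partial RoPE, mentioned in the card above, rotates only a fraction of each head's dimensions and leaves the rest position-free. The sketch below uses the NeoX-style half-rotation; the 0.5 rotary fraction is illustrative, not MiniMax M2's actual setting.

```python
import numpy as np

def partial_rope(x, pos, rotary_frac=0.5, base=10000.0):
    # Rotate only the first `rotary_frac` of the dims; leave the tail as-is.
    d = x.shape[-1]
    d_rot = int(d * rotary_frac)
    half = d_rot // 2
    freqs = pos / base ** (np.arange(half) / half)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:half], x[half:d_rot]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[d_rot:]])  # tail dims carry no position info

x = np.ones(8)
y = partial_rope(x, pos=3)
```

Keeping some dimensions unrotated gives each head a position-independent channel, which is one motivation for mixing RoPE with NoPE-like behavior inside a single layer.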


Kimi Linear 48B-A3B

View in article | config.json | Tech report

Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.

SCALE 48B total, 3B active
DATE 2025-10-30
DECODER TYPE Sparse hybrid
ATTENTION 3:1 Kimi Delta Attention and MLA
KEY DETAIL Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.
RELATED CONCEPTS NoPE MLA Gated DeltaNet


OLMo 3 32B

View in article | From scratch | config.json | Tech report

Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.

SCALE 32B parameters
DATE 2025-11-20
DECODER TYPE Dense
ATTENTION GQA with QK-Norm and 3:1 sliding-window/global attention
KEY DETAIL Keeps post-norm while scaling width and applying YaRN only on global layers.
RELATED CONCEPTS QK-Norm GQA SWA


OLMo 3 7B

View in article | From scratch | config.json | Tech report

New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.

SCALE 7B parameters
DATE 2025-11-20
DECODER TYPE Dense
ATTENTION MHA with QK-Norm and 3:1 sliding-window/global attention
KEY DETAIL Retains post-norm, keeps MHA, and applies YaRN only on global layers.
RELATED CONCEPTS QK-Norm MHA SWA


DeepSeek V3.2

View in article | config.json | Tech report

DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.

SCALE 671B total, 37B active
DATE 2025-12-01
DECODER TYPE Sparse MoE
ATTENTION MLA with DeepSeek Sparse Attention
KEY DETAIL An evolutionary update focused on efficiency rather than a new base layout.
RELATED CONCEPTS MLA MoE DeepSeek Sparse Attention


Mistral 3 Large

View in article | params.json | Tech report

Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.

SCALE 673B total, 41B active
DATE 2025-12-02
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.
RELATED CONCEPTS MLA MoE


Nemotron 3 Nano 30B-A3B

View in article | config.json | Tech report

NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.

SCALE 30B total, 3B active
DATE 2025-12-04
DECODER TYPE Hybrid MoE
ATTENTION Mostly Mamba-2 with a few GQA layers
KEY DETAIL Interleaves Mamba-2 and MoE blocks, using attention only sparingly.
RELATED CONCEPTS GQA MoE


Xiaomi MiMo-V2-Flash 309B

View in article | config.json | Tech report

Large MoE model that pushes sliding-window attention harder than most contemporaries.

SCALE 309B total, 15B active
DATE 2025-12-16
DECODER TYPE Sparse MoE
ATTENTION 5:1 sliding-window/global attention
KEY DETAIL Uses an unusually small 128-token local window plus multi-token prediction.
RELATED CONCEPTS SWA MoE


GLM-4.7 355B

View in article | config.json | Tech report

Immediate predecessor to GLM-5 that stays close to the older GLM-4.5 style, before the shift to MLA and sparse attention.

SCALE 355B total, 32B active
DATE 2025-12-22
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.
RELATED CONCEPTS QK-Norm GQA MLA MoE


Arcee AI Trinity Large 400B

View in article | config.json | Tech report

Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.

SCALE 400B total, 13B active
DATE 2026-01-27
DECODER TYPE Sparse MoE
ATTENTION GQA with gated attention and 3:1 sliding-window/global attention
KEY DETAIL Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.
RELATED CONCEPTS QK-Norm NoPE GQA SWA MoE Gated Attention


GLM-5 744B

View in article | config.json | Tech report

Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.

SCALE 744B total, 40B active
DATE 2026-02-11
DECODER TYPE Sparse MoE
ATTENTION MLA with DeepSeek Sparse Attention
KEY DETAIL Bigger than GLM-4.7, with more experts and fewer layers.
RELATED CONCEPTS MLA MoE DeepSeek Sparse Attention


Nemotron 3 Super 120B-A12B

View in article | config.json | Tech report

The Super variant scales up Nano and adds both latent experts and native speculative decoding support.

SCALE 120B total, 12B active
DATE 2026-03-11
DECODER TYPE Hybrid MoE
ATTENTION Mostly Mamba-2 with a few GQA layers
KEY DETAIL Adds latent-space MoE and shared-weight MTP for fast inference.
RELATED CONCEPTS GQA LatentMoE MoE


Step 3.5 Flash 196B

View in article | config.json | Tech report

Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.

SCALE 196B total, 11B active
DATE 2026-02-01
DECODER TYPE Sparse MoE
ATTENTION GQA with 3:1 sliding-window attention
KEY DETAIL Uses MTP-3 during both training and inference for unusually high throughput.
RELATED CONCEPTS GQA SWA MoE


Nanbeige 4.1 3B

View in article | config.json | Tech report

Small on-device oriented model that stays close to Llama 3.2 while nudging the scaling choices.

SCALE 3B parameters
DATE 2026-02-10
DECODER TYPE Dense
ATTENTION GQA
KEY DETAIL Llama-like stack without tying input embeddings to the output layer.
RELATED CONCEPTS GQA


MiniMax M2.5 230B

View in article | config.json | Tech report

Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.

SCALE 230B total, 10B active
DATE 2026-02-12
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.
RELATED CONCEPTS QK-Norm GQA SWA MoE


Tiny Aya 3.35B

View in article | From scratch | config.json | Tech report

Compact multilingual model from Cohere with a rare parallel transformer block.

SCALE 3.35B parameters
DATE 2026-02-13
DECODER TYPE Dense
ATTENTION GQA with 3:1 sliding-window attention
KEY DETAIL Runs attention and the MLP in parallel while mixing RoPE with NoPE.
RELATED CONCEPTS NoPE GQA SWA
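The parallel transformer block mentioned above differs from the standard layout only in wiring: attention and the MLP read the same normalized input and are summed into one residual, instead of running one after the other. The sketch below uses random linear stand-ins for both sub-blocks; only the wiring is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_attn = rng.standard_normal((d, d)) * 0.1   # stand-in for self-attention
W_mlp = rng.standard_normal((d, d)) * 0.1    # stand-in for the MLP

def norm(x):
    # RMSNorm without the learned gain.
    return x / np.sqrt(np.mean(x * x) + 1e-6)

def sequential_block(x):
    x = x + norm(x) @ W_attn      # attention sub-block, own residual add
    return x + norm(x) @ W_mlp    # MLP sub-block, own residual add

def parallel_block(x):
    h = norm(x)                   # one shared normalization
    return x + h @ W_attn + h @ W_mlp   # both branches added at once

x = rng.standard_normal(d)
```

The parallel form saves one normalization and lets the two matmuls overlap, at the cost of the MLP no longer seeing the attention output within the same block.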


Ling 2.5 1T

View in article | config.json | Tech report

Trillion-parameter long-context model that uses Lightning Attention in place of DeltaNet.

SCALE 1T total, 63B active
DATE 2026-02-15
DECODER TYPE Sparse hybrid
ATTENTION Lightning Attention plus MLA
KEY DETAIL Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.
RELATED CONCEPTS MLA Gated DeltaNet


Qwen3.5 397B

View in article | From scratch | config.json | Tech report

Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.

SCALE 397B total, 17B active
DATE 2026-02-16
DECODER TYPE Sparse hybrid
ATTENTION 3:1 Gated DeltaNet and Gated Attention
KEY DETAIL Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.
RELATED CONCEPTS MoE Gated Attention Gated DeltaNet


Sarvam 105B

View in article | config.json | Tech report

Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.

SCALE 105B total
DATE 2026-03-03
DECODER TYPE Sparse MoE
ATTENTION MLA with KV LayerNorm and NoPE + RoPE
KEY DETAIL Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.
RELATED CONCEPTS NoPE GQA MLA MoE


Sarvam 30B

View in article | config.json | Tech report

Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.

SCALE 30B total
DATE 2026-03-03
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.
RELATED CONCEPTS QK-Norm GQA MoE


SOURCE ARTICLE: The Big LLM Architecture Comparison

The original comparison article that walks through the architecture figures in context and explains the key design choices across dense, MoE, MLA, and hybrid decoder families. Read article

SOURCE ARTICLE: A Dream of Spring for Open-Weight LLMs

Follow-up article covering the additional open-weight architecture releases from early 2026, including the newer MiniMax, Qwen, Ling, and Sarvam families. Read article


This gallery serves as a visual reference for understanding the evolution of LLM architectures, from classic dense transformers to modern hybrid designs incorporating mixture-of-experts, multi-head latent attention, and state-space models. Each figure provides a snapshot of the architectural choices that define these models' capabilities and trade-offs.
