A comprehensive visual reference of large language model architectures, featuring detailed diagrams and specifications for over 50 open-weight models from Llama to Ling.
Last updated: March 14, 2026
This page collects architecture figures and fact sheets from The Big LLM Architecture Comparison and A Dream of Spring for Open-Weight LLMs. It focuses on the architecture panels only. Click a figure to enlarge it and use the model title to jump to the corresponding article section.
If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue here: Architecture Gallery issue tracker.

Llama 3 8B
View in article | From scratch | config.json | Tech report
Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.
SCALE 8B parameters
DATE 2024-04-18
DECODER TYPE Dense
ATTENTION GQA with RoPE
KEY DETAIL Pre-norm baseline; wider than OLMo 2 at a similar scale.
RELATED CONCEPTS GQA
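The GQA tag refers to grouped-query attention, in which several query heads share a single key/value head, shrinking the KV cache by the group factor (Llama 3 8B uses 32 query heads over 8 KV heads). A minimal NumPy sketch of the head sharing, with toy tensor shapes chosen for illustration:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Each group of query heads reuses one KV head, so the KV cache
    # stores n_kv_heads instead of n_q_heads key/value tensors.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention; with `n_kv_heads == 1` it becomes multi-query attention.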
OLMo 2 7B
View in article | config.json | Tech report
Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.
SCALE 7B parameters
DATE 2024-11-25
DECODER TYPE Dense
ATTENTION MHA with QK-Norm
KEY DETAIL Uses inside-residual post-norm instead of the usual pre-norm layout.
RELATED CONCEPTS QK-Norm MHA
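QK-Norm, tagged above, applies RMSNorm to queries and keys before the dot product, which bounds the attention logits and helps training stability. A minimal single-head sketch with hypothetical dimensions (OLMo 2's real implementation operates per head inside the attention block):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm over the head dimension (last axis), no learned scale here.
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def qk_norm_scores(q, k):
    # Normalizing q and k caps each logit at sqrt(d), regardless of
    # how large the pre-norm activations grow during training.
    q, k = rms_norm(q), rms_norm(k)
    return q @ k.T / np.sqrt(q.shape[-1])
```

After normalization every row of `q` and `k` has RMS 1, so each scaled logit is bounded by the square root of the head dimension.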
DeepSeek V3
View in article | config.json | Tech report
DeepSeek's flagship template kicked off the recent wave of large open MoE models.
SCALE 671B total, 37B active
DATE 2024-12-26
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
RELATED CONCEPTS MLA MoE
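The shared-expert idea in the key detail can be sketched as follows: a router picks the top-k experts per token, while one shared expert runs for every token unconditionally. This is a schematic NumPy illustration with hypothetical expert counts and plain linear experts, not DeepSeek's implementation:

```python
import numpy as np

def moe_with_shared_expert(x, router_w, experts, shared_expert, top_k=2):
    # x: (d,) one token's activation; router_w: (n_experts, d)
    logits = router_w @ x
    top = np.argsort(logits)[-top_k:]            # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                         # softmax over selected experts only
    routed = sum(g * experts[i](x) for g, i in zip(gates, top))
    # The shared expert is always active, capturing common knowledge
    # so routed experts can specialize.
    return routed + shared_expert(x)
```

Only `top_k` routed experts (plus the shared one) execute per token, which is how a 671B-parameter model keeps an active path of just 37B.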
DeepSeek R1
View in article | config.json | Tech report
Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.
SCALE 671B total, 37B active
DATE 2025-01-20
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.
RELATED CONCEPTS MLA MoE
Gemma 3 27B
View in article | From scratch | config.json | Tech report
Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.
SCALE 27B parameters
DATE 2025-03-11
DECODER TYPE Dense
ATTENTION GQA with QK-Norm and 5:1 sliding-window/global attention
KEY DETAIL Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
RELATED CONCEPTS QK-Norm GQA SWA
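The 5:1 sliding-window/global cadence can be illustrated with attention masks: five consecutive layers attend only within a fixed local window, then one layer attends globally. A sketch assuming a causal decoder and Gemma 3's published 1024-token local window:

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window=1024, global_every=6):
    # Causal mask for one layer. Every global_every-th layer attends
    # globally; the rest see only the most recent `window` tokens,
    # giving the 5:1 local/global cadence.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if (layer_idx + 1) % global_every == 0:
        return causal
    return causal & (i - j < window)
```

Local layers keep the KV cache bounded by the window size, while the occasional global layer preserves long-range information flow.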
Mistral Small 3.1 24B
View in article | config.json | Tech report
Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.
SCALE 24B parameters
DATE 2025-03-18
DECODER TYPE Dense
ATTENTION Standard GQA
KEY DETAIL Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.
RELATED CONCEPTS GQA SWA
Llama 4 Maverick
View in article | config.json | Tech report
Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.
SCALE 400B total, 17B active
DATE 2025-04-05
DECODER TYPE Sparse MoE
ATTENTION GQA
KEY DETAIL Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.
RELATED CONCEPTS GQA MoE
Qwen3 235B-A22B
View in article | From scratch | config.json | Tech report
Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.
SCALE 235B total, 22B active
DATE 2025-04-28
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL High-capacity MoE design optimized for serving efficiency without a shared expert.
RELATED CONCEPTS QK-Norm GQA MoE
Qwen3 32B
View in article | From scratch | config.json | Tech report
Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.
SCALE 32B parameters
DATE 2025-04-28
DECODER TYPE Dense
ATTENTION GQA with QK-Norm
KEY DETAIL Reference dense Qwen stack with QK-Norm and 8 KV heads.
RELATED CONCEPTS QK-Norm GQA
Qwen3 4B
View in article | From scratch | config.json | Tech report
Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.
SCALE 4B parameters
DATE 2025-04-28
DECODER TYPE Dense
ATTENTION GQA with QK-Norm
KEY DETAIL Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.
RELATED CONCEPTS QK-Norm GQA
Qwen3 8B
View in article | From scratch | config.json | Tech report
Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.
SCALE 8B parameters
DATE 2025-04-28
DECODER TYPE Dense
ATTENTION GQA with QK-Norm
KEY DETAIL Reference Qwen3 dense stack with QK-Norm and 8 KV heads.
RELATED CONCEPTS QK-Norm GQA
SmolLM3 3B
View in article | config.json | Tech report
Compact dense model that experiments with leaving out positional encodings in selected layers.
SCALE 3B parameters
DATE 2025-06-19
DECODER TYPE Dense
ATTENTION GQA with periodic NoPE layers
KEY DETAIL Every fourth layer omits RoPE to test a NoPE-style cadence.
RELATED CONCEPTS NoPE GQA
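The every-fourth-layer NoPE cadence amounts to conditionally skipping the rotary transform. A sketch using the common half-split RoPE convention, with hypothetical shapes and the default 10000 base (SmolLM3's exact rotary parameters may differ):

```python
import numpy as np

def rotate_half(x):
    d = x.shape[-1] // 2
    return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

def maybe_rope(x, layer_idx, base=10000.0, nope_every=4):
    # NoPE cadence: layers 3, 7, 11, ... return q/k untouched,
    # relying on the causal mask alone for positional signal.
    if (layer_idx + 1) % nope_every == 0:
        return x
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = np.outer(np.arange(seq), inv_freq)          # (seq, d/2)
    cos = np.concatenate([np.cos(angles)] * 2, axis=-1)
    sin = np.concatenate([np.sin(angles)] * 2, axis=-1)
    return x * cos + rotate_half(x) * sin
```

RoPE is a per-pair 2D rotation, so the rotated vectors keep their norms; only their relative angles encode position.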
Kimi K2
View in article | config.json | Tech report
Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.
SCALE 1T total, 32B active
DATE 2025-07-10
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL More experts and fewer MLA heads than DeepSeek V3.
RELATED CONCEPTS MLA MoE
GLM-4.5 355B
View in article | config.json | Tech report
Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.
SCALE 355B total, 32B active
DATE 2025-07-28
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Starts with three dense layers before MoE routing and keeps a shared expert.
RELATED CONCEPTS QK-Norm GQA MoE
GPT-OSS 120B
View in article | config.json | Tech report
Larger gpt-oss variant that keeps the same alternating-attention recipe as the 20B model.
SCALE 120B total, 5.1B active
DATE 2025-08-04
DECODER TYPE Sparse MoE
ATTENTION GQA with alternating sliding-window and global layers
KEY DETAIL Shared architectural template scaled up for OpenAI's flagship open-weight release.
RELATED CONCEPTS GQA SWA MoE
GPT-OSS 20B
View in article | config.json | Tech report
OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.
SCALE 20B total, 3.6B active
DATE 2025-08-04
DECODER TYPE Sparse MoE
ATTENTION GQA with alternating sliding-window and global layers
KEY DETAIL Wider and shallower than Qwen3, with attention bias and sink mechanisms.
RELATED CONCEPTS GQA SWA MoE
Grok 2.5 270B
View in article | config.json | Tech report
Rare production-model release that shows an older MoE style with fewer, larger experts.
SCALE 270B parameters
DATE 2025-08-22
DECODER TYPE Sparse MoE
ATTENTION GQA
KEY DETAIL Adds an always-on SwiGLU path that effectively behaves like a shared expert.
RELATED CONCEPTS GQA MoE
Qwen3 Next 80B-A3B
View in article | config.json | Tech report
Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.
SCALE 80B total, 3B active
DATE 2025-09-09
DECODER TYPE Sparse hybrid
ATTENTION 3:1 Gated DeltaNet and Gated Attention
KEY DETAIL Adds many more experts, a shared expert, and a native 262k context.
RELATED CONCEPTS MoE Gated Attention Gated DeltaNet
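Gated DeltaNet's exact parameterization is beyond a fact sheet, but the underlying idea — a fixed-size state matrix that is decayed by a forget gate and written with an error-correcting delta-rule update — can be sketched schematically. This is an illustrative simplification with scalar gates, not Qwen's implementation:

```python
import numpy as np

def gated_delta_rule(qs, ks, vs, betas, alphas):
    # qs, ks: (T, d_k); vs: (T, d_v); betas, alphas: per-step scalars.
    # State S is (d_v, d_k), so memory is constant in sequence length.
    d_k, d_v = ks.shape[1], vs.shape[1]
    S = np.zeros((d_v, d_k))
    outs = []
    for q, k, v, beta, alpha in zip(qs, ks, vs, betas, alphas):
        S = alpha * S                              # gated decay of old memory
        S = S + beta * np.outer(v - S @ k, k)      # delta-rule write (error-correcting)
        outs.append(S @ q)                         # linear-attention readout
    return np.array(outs)
```

Because the recurrence carries only `S` forward, these layers avoid the growing KV cache of full attention, which is why the hybrid keeps just one full-attention layer per three DeltaNet layers.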
MiniMax M2 230B
View in article | config.json | Tech report
MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.
SCALE 230B total, 10B active
DATE 2025-10-23
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm and partial RoPE
KEY DETAIL Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.
RELATED CONCEPTS QK-Norm GQA MoE
Kimi Linear 48B-A3B
View in article | config.json | Tech report
Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.
SCALE 48B total, 3B active
DATE 2025-10-30
DECODER TYPE Sparse hybrid
ATTENTION 3:1 Kimi Delta Attention and MLA
KEY DETAIL Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.
RELATED CONCEPTS NoPE MLA Gated DeltaNet
OLMo 3 32B
View in article | From scratch | config.json | Tech report
Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.
SCALE 32B parameters
DATE 2025-11-20
DECODER TYPE Dense
ATTENTION GQA with QK-Norm and 3:1 sliding-window/global attention
KEY DETAIL Keeps post-norm while scaling width and applying YaRN only on global layers.
RELATED CONCEPTS QK-Norm GQA SWA
OLMo 3 7B
View in article | From scratch | config.json | Tech report
New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.
SCALE 7B parameters
DATE 2025-11-20
DECODER TYPE Dense
ATTENTION MHA with QK-Norm and 3:1 sliding-window/global attention
KEY DETAIL Retains post-norm, keeps MHA, and applies YaRN only on global layers.
RELATED CONCEPTS QK-Norm MHA SWA
DeepSeek V3.2
View in article | config.json | Tech report
DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.
SCALE 671B total, 37B active
DATE 2025-12-01
DECODER TYPE Sparse MoE
ATTENTION MLA with DeepSeek Sparse Attention
KEY DETAIL An evolutionary update focused on efficiency rather than a new base layout.
RELATED CONCEPTS MLA MoE DeepSeek Sparse Attention
Mistral 3 Large
View in article | params.json | Tech report
Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.
SCALE 673B total, 41B active
DATE 2025-12-02
DECODER TYPE Sparse MoE
ATTENTION MLA
KEY DETAIL Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.
RELATED CONCEPTS MLA MoE
Nemotron 3 Nano 30B-A3B
View in article | config.json | Tech report
NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.
SCALE 30B total, 3B active
DATE 2025-12-04
DECODER TYPE Hybrid MoE
ATTENTION Mostly Mamba-2 with a few GQA layers
KEY DETAIL Interleaves Mamba-2 and MoE blocks, using attention only sparingly.
RELATED CONCEPTS GQA MoE
Xiaomi MiMo-V2-Flash 309B
View in article | config.json | Tech report
Large MoE model that pushes sliding-window attention harder than most contemporaries.
SCALE 309B total, 15B active
DATE 2025-12-16
DECODER TYPE Sparse MoE
ATTENTION 5:1 sliding-window/global attention
KEY DETAIL Uses an unusually small 128-token local window plus multi-token prediction.
RELATED CONCEPTS SWA MoE
GLM-4.7 355B
View in article | config.json | Tech report
Immediate GLM predecessor that stays closer to the older GLM-4.5 style before the MLA shift.
SCALE 355B total, 32B active
DATE 2025-12-22
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.
RELATED CONCEPTS QK-Norm GQA MLA MoE
Arcee AI Trinity Large 400B
View in article | config.json | Tech report
Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.
SCALE 400B total, 13B active
DATE 2026-01-27
DECODER TYPE Sparse MoE
ATTENTION GQA with gated attention and 3:1 sliding-window/global attention
KEY DETAIL Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.
RELATED CONCEPTS QK-Norm NoPE GQA SWA MoE Gated Attention
GLM-5 744B
View in article | config.json | Tech report
Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.
SCALE 744B total, 40B active
DATE 2026-02-11
DECODER TYPE Sparse MoE
ATTENTION MLA with DeepSeek Sparse Attention
KEY DETAIL Bigger than GLM-4.7, with more experts and fewer layers.
RELATED CONCEPTS MLA MoE DeepSeek Sparse Attention
Nemotron 3 Super 120B-A12B
View in article | config.json | Tech report
Super variant that scales up Nano and adds both latent experts and native speculative decoding support.
SCALE 120B total, 12B active
DATE 2026-03-11
DECODER TYPE Hybrid MoE
ATTENTION Mostly Mamba-2 with a few GQA layers
KEY DETAIL Adds latent-space MoE and shared-weight MTP for fast inference.
RELATED CONCEPTS GQA LatentMoE MoE
Step 3.5 Flash 196B
View in article | config.json | Tech report
Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.
SCALE 196B total, 11B active
DATE 2026-02-01
DECODER TYPE Sparse MoE
ATTENTION GQA with 3:1 sliding-window/global attention
KEY DETAIL Uses MTP-3 during both training and inference for unusually high throughput.
RELATED CONCEPTS GQA SWA MoE
Nanbeige 4.1 3B
View in article | config.json | Tech report
Small model aimed at on-device use that stays close to Llama 3.2 while nudging the scaling choices.
SCALE 3B parameters
DATE 2026-02-10
DECODER TYPE Dense
ATTENTION GQA
KEY DETAIL Llama-like stack without tying input embeddings to the output layer.
RELATED CONCEPTS GQA
MiniMax M2.5 230B
View in article | config.json | Tech report
Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.
SCALE 230B total, 10B active
DATE 2026-02-12
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.
RELATED CONCEPTS QK-Norm GQA SWA MoE
Tiny Aya 3.35B
View in article | From scratch | config.json | Tech report
Compact multilingual model from Cohere with a rare parallel transformer block.
SCALE 3.35B parameters
DATE 2026-02-13
DECODER TYPE Dense
ATTENTION GQA with 3:1 sliding-window/global attention
KEY DETAIL Runs attention and the MLP in parallel while mixing RoPE with NoPE.
RELATED CONCEPTS NoPE GQA SWA
Ling 2.5 1T
View in article | config.json | Tech report
Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.
SCALE 1T total, 63B active
DATE 2026-02-15
DECODER TYPE Sparse hybrid
ATTENTION Lightning Attention plus MLA
KEY DETAIL Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.
RELATED CONCEPTS MLA Lightning Attention MoE
Qwen3.5 397B
View in article | From scratch | config.json | Tech report
Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.
SCALE 397B total, 17B active
DATE 2026-02-16
DECODER TYPE Sparse hybrid
ATTENTION 3:1 Gated DeltaNet and Gated Attention
KEY DETAIL Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.
RELATED CONCEPTS MoE Gated Attention Gated DeltaNet
Sarvam 105B
View in article | config.json | Tech report
Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.
SCALE 105B total
DATE 2026-03-03
DECODER TYPE Sparse MoE
ATTENTION MLA with KV LayerNorm and NoPE + RoPE
KEY DETAIL Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.
RELATED CONCEPTS NoPE GQA MLA MoE
Sarvam 30B
View in article | config.json | Tech report
Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.
SCALE 30B total
DATE 2026-03-03
DECODER TYPE Sparse MoE
ATTENTION GQA with QK-Norm
KEY DETAIL Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.
RELATED CONCEPTS QK-Norm GQA MoE
SOURCE ARTICLE: The Big LLM Architecture Comparison
The original comparison article that walks through the architecture figures in context and explains the key design choices across dense, MoE, MLA, and hybrid decoder families. Read article
SOURCE ARTICLE: A Dream of Spring for Open-Weight LLMs
Follow-up article covering the additional open-weight architecture releases from early 2026, including the newer MiniMax, Qwen, Ling, and Sarvam families. Read article
This gallery serves as a visual reference for understanding the evolution of LLM architectures, from classic dense transformers to modern hybrid designs incorporating mixture-of-experts, multi-head latent attention, and state-space models. Each figure provides a snapshot of the architectural choices that define these models' capabilities and trade-offs.
