Your ORM is the bottleneck: A 23× Throughput Gap on SQLite

A 9.4-hour benchmark of SQLAlchemy ORM vs raw sqlite3 executemany reveals that ORM overhead dominates SQLite performance—PRAGMA tuning moves throughput by ≤26%, while bypassing the ORM yields 23× higher write throughput. The ORM, not the database, is the ceiling.

SQLite is having a moment. Rails 8 ships with it as a first-class production database. Expensify runs 4 million queries per second on a single SQLite file. Cloudflare D1 serves edge queries with 8 ms p99 reads. So when I sat down to tune my SQLite deployment, I assumed the database would be the bottleneck.

I was wrong.

Over 9.4 hours of continuous benchmarking—230 million rows, 11 configurations, 37 datapoints—I discovered something stark: on a standard SQLAlchemy ORM workload, SQLite was never the limiting factor. The ORM was. Always.

The headline numbers

Path	10M rows	50M rows	p99 latency
SQLAlchemy ORM	3,696 r/s (45 min)	3,682 r/s (3.8 hrs)	1,492 ms
Raw `executemany`	87,893 r/s (1.9 min)	65,742 r/s (12.7 min)	478 ms
Speedup	23.8×	17.9×	3.1×

Zero errors across all runs. Every config. Every scale.

This isn’t a synthetic benchmark. It’s a production-like ingestion pipeline: UUID primary key, 4-column upsert conflict target, 11 columns (~500 bytes per row), streaming generator, NVMe storage, Linux 6.8, Python 3.11, SQLAlchemy 2.0 sync mode.

The ORM path used bulk_save_objects and insert().on_conflict_do_update(). The raw path used sqlite3.executemany via the DBAPI connection—no SQLAlchemy involved.

Why does the ORM cost 23× throughput? Because every row goes through 7 layers of Python indirection: object instantiation, attribute instrumentation, identity map management, SQL compilation, parameter binding, statement caching, and transaction bookkeeping. Each layer adds latency. On 10 million rows, that’s 43 minutes of CPU you never get back.

ORM stack: 7 layers per row

What I actually benchmarked

Hardware: 8-core/16-thread CPU, 23.2 GB RAM, Samsung 990 EVO Plus 1TB NVMe, Linux 6.8.0, no Docker.

Schema: One table, 11 columns, UUID PK, indexed (tenant_id, entity_id, sub_entity_id, bucket_index) for upsert conflict resolution.

Data generation: Streaming generator yielding deterministic chunks. Constant memory regardless of scale. No accumulation. This validates that memory usage reflects true overhead, not data buildup.

Two harnesses, two questions:

Config sweep: 11 SQLite configurations × 10M rows → Does PRAGMA tuning matter?
Before/after: ORM vs raw at 10M and 50M → How much does the ORM cost?

The config sweep: 11 configurations, 110M rows

Config	Chunk size	sync	cache	pool	mmap	Throughput (r/s)	p99 latency
chunk_1000	1,000	NORMAL	-64K	5	256MB	3,821	313 ms
baseline	5,000	NORMAL	-4K	5	0	3,802	1,418 ms
chunk_5000	5,000	NORMAL	-64K	5	256MB	3,673	1,500 ms
chunk_10000	10,000	NORMAL	-64K	5	256MB	3,621	2,887 ms
optimized	10,000	NORMAL	-64K	5	256MB	3,576	2,982 ms
pool_5	10,000	NORMAL	-64K	5	256MB	3,574	2,960 ms
pool_8	10,000	NORMAL	-64K	8	256MB	3,573	2,957 ms
pool_3	10,000	NORMAL	-64K	3	256MB	3,445	3,200 ms
chunk_25000	25,000	NORMAL	-64K	5	256MB	3,262	8,490 ms
aggressive	25,000	OFF	-64K	5	256MB	3,243	8,536 ms
chunk_50000	50,000	NORMAL	-64K	5	256MB	3,045	20,508 ms

Three observations:

Spread is only 26% (3,821 vs 3,045). From sync=OFF to NORMAL, 1K to 50K chunks, 3 to 8 pool connections—none of it moves the needle meaningfully. ORM overhead dominates I/O savings. As Forward Email found, sync=OFF can be slower than NORMAL because WAL machinery is different, not absent.
Pool size barely matters. pool_3, pool_5, pool_8: 3,445 / 3,574 / 3,573 r/s. SQLite is single-writer regardless of pool size. SQLAlchemy’s QueuePool just serializes writes at the application layer—functionally identical to the single-writer architecture every production SQLite deployment converges on.
Chunk size is the only knob that matters for latency. p99 scales linearly with chunk size: 313 ms at 1K → 20,508 ms at 50K (66× worse). Throughput only drops ~20%. Each chunk is one transaction. Larger transactions hold the write lock longer, block WAL checkpoints, allocate more memory. A 20-second transaction means every other writer waits 20 seconds.

Raw executemany stack: 2 layers per row

Scaling behavior: ORM vs raw

Method	Scale	Throughput	Duration	Peak RSS
ORM	10M	3,696 r/s	45.1 min	177 MB
ORM	50M	3,682 r/s	226.3 min	177 MB
Raw	10M	87,893 r/s	1.9 min	155 MB
Raw	50M	65,742 r/s	12.7 min	188 MB

ORM throughput is flat across 10M and 50M rows: 3,696 → 3,682 r/s. I/O doesn’t get a vote because the ORM is busy. The ORM’s per-row overhead is so high that SQLite’s actual I/O scaling curve never shows up.

Raw throughput degrades with scale: 87,893 → 65,742 r/s (25% drop). The database file grew from 3.8 GB to 19.5 GB, B-tree depth increased, WAL checkpoints got more expensive. This is SQLite’s actual I/O scaling curve—and it’s completely invisible behind the ORM ceiling.

Single-writer plus multi-reader

Per-segment throughput: ORM never changes

Config	0→3M	3M→5M	5M→10M	Δ (3M→10M)
baseline	3,760	3,762	3,844	+1.1%
chunk_1000	3,871	3,828	3,789	-1.3%
chunk_50000	2,966	3,076	3,081	+2.7%

No config moved more than ±3.6% from 3M to 10M. Every single one is flat. ORM overhead at row one equals ORM overhead at row ten million.

This is not how databases normally behave. Most B-tree workloads degrade at scale: tree gets deeper, fewer pages fit in cache, fsync takes longer. None of that mattered here. The ORM was so far from being I/O-bound that even SQLite’s I/O improvements showed up positively.

p99 latency is flat too. Whatever your p99 looks like at 3M, that’s what it looks like at 10M. Your latency budget is set by your chunk size, full stop.

What does not move the needle when using an ORM

PRAGMA	Effect on ORM throughput	Why
`synchronous = OFF` vs `NORMAL`	< 5%	ORM overhead dominates I/O savings
`cache_size` (4MB vs 64MB)	< 3%	B-tree lookups cheap vs Python object creation
`mmap_size` (0 vs 256MB)	< 2%	Reads are fast, writes are ORM-bound
`pool_size` (3 vs 8)	< 4%	Single-writer means pool size is irrelevant for writes

If you skipped to this table, here is the summary: nothing on the left changes the right. The ORM is the ceiling.

When to throw the ORM out

The ORM is fine for most things: CRUD, reads, validation, relationship traversal, normal application work. The fast path is for the hot bulk routes.

Use this decision tree:

Scenario	ORM throughput	Raw throughput	What to do
CRUD, < 1K rows	Good enough	Premature	ORM
Bulk load 10K–1M rows	2–5 min per million	7 sec per million	Raw `executemany`
Bulk load 1M–100M rows	45 min per 10M	1.9 min per 10M	Raw `executemany`
Sustained > 1K r/s ingest	Ceiling at 3.8K	Headroom to 88K	Raw `executemany`
Upsert-heavy workloads	Works but slow	18× faster	Raw `executemany`

The hybrid repository pattern

I now use this in production: ORM for correctness, raw SQL for throughput—both in the same repository, same session, same tables.

#SQLite #ORM #SQLAlchemy #Performance #Bulk Ingestion