Choosing the Right Database: Relational, Document, Key‑Value, Graph, Time‑Series, and Vector
#Backend

Backend Reporter
8 min read

A pragmatic comparison of six major database families, focusing on scalability, consistency guarantees, and API patterns so architects can match storage technology to real‑world access patterns.

When a system grows from a prototype to a production service, the storage layer becomes the first bottleneck you notice. The choice between a relational engine, a document store, a key‑value cache, a graph database, a time‑series system, or a vector index is rarely about “which one is newer”. It is about what guarantees you need, how you plan to scale, and what API shape the rest of your code expects.


1. Relational Databases (PostgreSQL, MySQL, SQL Server)

Core guarantees

  • ACID transactions – atomicity, consistency, isolation, durability are enforced by the engine.
  • Strong consistency – a read sees the latest committed write unless you explicitly use a weaker isolation level.

Scaling story

  • Vertical scaling works well for moderate traffic; adding CPU, RAM, or faster storage yields predictable improvements up to the limits of a single machine.
  • Horizontal scaling requires sharding or logical partitioning. Sharding introduces cross‑shard joins, which the engine cannot resolve automatically. Tools like Citus (PostgreSQL) or Vitess (MySQL) provide middleware, but they add operational complexity.

API pattern

  • SQL – a declarative language that expresses joins, aggregations, and window functions in a single statement. Most ORMs (e.g., TypeORM, Hibernate) generate SQL and map rows to objects, but they also hide the cost of large joins.
  • Prepared statements and parameter binding keep the API safe from injection attacks.
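The transaction and parameter-binding patterns above can be sketched with Python's built-in sqlite3 module, used here as a stand-in for any relational engine; the table and column names are illustrative, and the same placeholder style carries over to PostgreSQL or MySQL drivers:

```python
import sqlite3

# In-memory SQLite as a stand-in for a relational engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)"
)

with conn:  # opens a transaction; commits on success, rolls back on error
    conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("alice", 100))
    conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("bob", 50))

# Parameter binding keeps user input out of the SQL text entirely,
# which is what makes the API safe from injection.
row = conn.execute(
    "SELECT balance FROM accounts WHERE owner = ?", ("alice",)
).fetchone()
print(row[0])  # 100
```

The `with conn:` block gives you atomicity for the two inserts: if either statement fails, neither row is committed.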

When it fits

  • Financial ledgers, ERP, CRM – any domain where data integrity outweighs raw write throughput.
  • Applications that need complex ad‑hoc reporting across many tables.

2. Document Databases (MongoDB, Couchbase)

Core guarantees

  • Eventual consistency by default; strong consistency can be requested per‑operation (e.g., majority write concern in MongoDB).
  • Atomicity at the document level – a single JSON‑like document is updated atomically.

Scaling story

  • Native sharding – a query router (mongos in MongoDB) hashes the shard key and forwards each request to the responsible node. Adding nodes triggers automatic chunk rebalancing.
  • Write throughput scales horizontally to hundreds of thousands of ops/sec when the shard key distributes load evenly and documents are self‑contained.

API pattern

  • Driver‑level CRUD – insertOne, find, updateMany. The API mirrors the JSON document model, so the data you send is the data you store.
  • Aggregation pipeline – a series of stages ($match, $group, $lookup) that process documents server‑side. It replaces many SQL GROUP BY patterns but can become hard to read for deep pipelines.
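Because a pipeline is just an ordered list of stage documents, it can be built as plain data before being handed to the driver. A minimal sketch, with illustrative field names (status, category, price); the resulting list would be passed to collection.aggregate(pipeline) with any MongoDB driver:

```python
# A pipeline roughly equivalent to SQL's WHERE + GROUP BY + HAVING + ORDER BY.
pipeline = [
    {"$match": {"status": "active"}},          # WHERE status = 'active'
    {"$group": {                               # GROUP BY category
        "_id": "$category",
        "avg_price": {"$avg": "$price"},
        "count": {"$sum": 1},
    }},
    {"$match": {"count": {"$gt": 10}}},        # HAVING count > 10
    {"$sort": {"avg_price": -1}},              # ORDER BY avg_price DESC
]
```

Keeping pipelines as data also makes them easy to unit-test and compose, which mitigates the readability problem of deep pipelines.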

When it fits

  • Content management systems, product catalogs, and any service where the shape of a record evolves over time.
  • Micro‑services that own a bounded context and prefer a self‑contained data model.

3. Key‑Value Stores (Redis, DynamoDB, Riak)

Core guarantees

  • Strong consistency in single‑node Redis; DynamoDB offers configurable consistency per request.
  • No transactional guarantees beyond single‑key operations – multi‑key transactions are optional and often slower.

Scaling story

  • In‑memory data (Redis) gives sub‑millisecond latency; horizontal scaling is achieved with clustering and hash slots.
  • Provisioned throughput (DynamoDB) lets you pre‑allocate read/write capacity units; auto‑scaling adjusts limits based on usage patterns.

API pattern

  • Simple get/set – GET key, SET key value. Extensions include sorted sets (ZADD, ZRANGE) and streams (XADD).
  • Command pipelining – batch multiple commands to reduce round‑trip latency.
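To show the shape of the sorted-set API without a live server, here is a toy in-process model of ZADD / ZREVRANGE semantics for a leaderboard; member names are illustrative, and a real deployment would issue these as pipelined commands through a client such as redis-py:

```python
# Toy in-process model of Redis sorted-set semantics for a leaderboard.
scores = {}

def zadd(member, score):
    scores[member] = score          # ZADD leaderboard score member

def zrevrange(start, stop):
    # ZREVRANGE: members ranked by score, highest first, inclusive stop.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[start:stop + 1]

zadd("alice", 120)
zadd("bob", 95)
zadd("carol", 150)
print(zrevrange(0, 1))  # ['carol', 'alice'] — top two players
```

The data model really is this flat: a member-to-score map with ranked reads, which is why latency, not query power, is the metric that matters.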

When it fits

  • Session storage, caching layers, real‑time leaderboards, rate‑limiting counters.
  • Scenarios where the data model is a flat map and latency is the primary metric.

4. Graph Databases (Neo4j, Amazon Neptune)

Core guarantees

  • Transactional ACID (Neo4j) or read‑after‑write consistency (Neptune) for traversals.
  • Write locking is scoped to the nodes and relationships touched by a query, which keeps write latency low even on densely connected data.

Scaling story

  • Vertical scaling works for dense graphs; horizontal scaling is possible via sharding by vertex ID, but cross‑shard traversals become expensive.
  • Native graph indexes (e.g., label‑based indexes) accelerate pattern matching without full scans.

API pattern

  • Cypher (Neo4j) or Gremlin (Neptune) – declarative graph query languages that express patterns like MATCH (a)-[:FRIEND]->(b) RETURN a, b.
  • Traversal APIs – programmatic depth‑first or breadth‑first walks for custom algorithms.
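The programmatic-traversal idea can be sketched as a breadth-first walk over an in-memory adjacency map; this is the hand-rolled equivalent of a variable-length Cypher pattern like MATCH (a)-[:FRIEND*1..2]->(b), with illustrative node names:

```python
from collections import deque

# Friend edges as an adjacency map; a graph DB stores these natively.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def within_hops(start, max_hops):
    """Return every node reachable from start in at most max_hops edges."""
    seen, reachable = {start}, set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in friends.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reachable.add(nxt)
                frontier.append((nxt, depth + 1))
    return reachable

print(sorted(within_hops("alice", 2)))  # ['bob', 'carol', 'dave', 'erin']
```

In a relational store this walk would need one self-join per hop; a graph engine follows the stored adjacency directly, which is the whole point of the model.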

When it fits

  • Social networks, recommendation engines, fraud detection – any domain where relationships are first‑class citizens.
  • Queries that require variable‑length path exploration.

5. Time‑Series Databases (InfluxDB, TimescaleDB, VictoriaMetrics)

Core guarantees

  • Append‑only writes – immutable data points reduce lock contention.
  • Retention policies enforce automatic data expiration, keeping storage costs predictable.

Scaling story

  • Chunked storage – data is partitioned by time intervals; queries that span recent intervals hit hot shards, older data lives on cheaper cold storage.
  • Down‑sampling pipelines aggregate high‑resolution data into lower‑resolution summaries.

API pattern

  • Line protocol (InfluxDB) – a compact text format: measurement,tag1=val1 field1=123 1627846260000000000 (the trailing timestamp is interpreted as nanoseconds unless a coarser precision is configured).
  • SQL extensions (TimescaleDB) – regular PostgreSQL with time‑bucket functions (time_bucket('5 minutes', ts)).
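The line protocol is simple enough to format by hand; a minimal sketch (field escaping and string quoting omitted, measurement and tag names illustrative):

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one point as: measurement,tags fields timestamp.
    Timestamp is in nanoseconds, InfluxDB's default precision."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol("cpu", {"host": "web1"}, {"usage": 0.64},
                        1627846260000000000)
print(line)  # cpu,host=web1 usage=0.64 1627846260000000000
```

Note the append-only shape: each point is a self-contained immutable record, which is what lets the engine batch writes and avoid lock contention.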

When it fits

  • Monitoring dashboards, IoT telemetry, financial tick data – workloads that ingest millions of rows per second and query recent windows frequently.

6. Vector Databases (Pinecone, Milvus, Qdrant)

Core guarantees

  • Approximate nearest‑neighbor (ANN) search with configurable recall vs. latency trade‑offs.
  • Consistency models vary – Pinecone offers strong consistency per namespace; Milvus provides eventual consistency across replicas.

Scaling story

  • Partitioned vector indexes – vectors are sharded by hash of their ID; each shard builds its own ANN index (e.g., HNSW, IVF‑PQ).
  • Hybrid storage – hot vectors reside in RAM, cold vectors on SSD; the system streams chunks as needed.

API pattern

  • Upsert – upsert(collection_name, [{id: 'doc1', vector: [...]}, …]).
  • Query – query(collection_name, {vector: [...], top_k: 10, filter: {category: 'news'}}).
  • Metadata filters let you combine ANN search with traditional attribute constraints.
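What the ANN index approximates is an exact top-k similarity search; a brute-force sketch with cosine similarity makes the query semantics concrete (ids and vectors are illustrative — real engines replace the linear scan with an HNSW or IVF-PQ index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# A tiny in-memory "index" of id -> embedding.
index = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 1.0, 0.2],
    "doc3": [0.8, 0.3, 0.1],
}

def query(vector, top_k):
    """Exact top-k by similarity — what ANN trades for speed at scale."""
    ranked = sorted(index, key=lambda i: cosine(index[i], vector), reverse=True)
    return ranked[:top_k]

print(query([1.0, 0.0, 0.0], top_k=2))  # ['doc1', 'doc3']
```

The recall-vs-latency knob mentioned above is exactly the degree to which the ANN structure is allowed to skip candidates this brute-force scan would have examined.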

When it fits

  • Semantic search, recommendation based on embeddings, anomaly detection on high‑dimensional signals.
  • Applications that already generate vector representations (e.g., BERT embeddings) and need low‑latency similarity lookup.

7. Mapping the Decision Matrix

| Requirement | Best fit | Consistency | Scaling model |
| --- | --- | --- | --- |
| Strong transactional guarantees across many tables | Relational | ACID (strong) | Vertical + optional sharding |
| Flexible schema, rapid iteration | Document | Configurable (eventual by default) | Native sharding |
| Sub‑millisecond key lookups, simple data | Key‑Value | Strong (single‑key) | Cluster hash slots |
| Traversal of many‑to‑many relationships | Graph | ACID or read‑after‑write | Mostly vertical |
| High‑rate ingest of timestamped metrics | Time‑Series | Append‑only, eventual | Time‑bucket partitions |
| Nearest‑neighbor search on embeddings | Vector | Configurable (often eventual) | Partitioned ANN indexes |

The table is not exhaustive, but it highlights the trade‑off you make when you pick a storage engine: you gain performance for a specific access pattern while giving up generality in another.


8. API Design Tips Across All Stores

  1. Encapsulate the driver behind a repository layer. This isolates the rest of the codebase from vendor‑specific quirks and makes swapping implementations easier.
  2. Prefer idempotent operations. In distributed environments, retries are common; using INSERT … ON CONFLICT DO UPDATE (SQL) or upsert (MongoDB) avoids duplicate records.
  3. Leverage bulk endpoints. Most stores provide a bulkWrite or batchPut API that reduces network overhead.
  4. Add explicit version fields (e.g., row_version or etag). Optimistic concurrency control works even when the underlying DB only offers eventual consistency.
  5. Monitor latency per operation type. A sudden increase in SELECT … JOIN latency often signals a hot shard or missing index, while a spike in GET latency for a key‑value store may indicate memory pressure.
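Tips 2 and 4 combine naturally: an idempotent upsert that also maintains an explicit version column. A minimal sketch using Python's sqlite3 (whose ON CONFLICT clause mirrors PostgreSQL's); table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles (user_id TEXT PRIMARY KEY, name TEXT, row_version INTEGER)"
)

def upsert_profile(user_id, name):
    # Safe to retry: replaying the same write never creates a duplicate row.
    # row_version counts writes, so readers doing optimistic concurrency
    # can detect that the row changed underneath them.
    with conn:
        conn.execute(
            """INSERT INTO profiles (user_id, name, row_version) VALUES (?, ?, 1)
               ON CONFLICT(user_id) DO UPDATE
               SET name = excluded.name, row_version = row_version + 1""",
            (user_id, name),
        )

upsert_profile("u1", "Alice")
upsert_profile("u1", "Alice")   # retry after a network timeout
count, version = conn.execute(
    "SELECT COUNT(*), MAX(row_version) FROM profiles"
).fetchone()
print(count, version)  # 1 2
```

The retry leaves exactly one row, and the incremented version makes the replay visible to anyone comparing versions before a conditional update.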

9. Real‑World Failure Stories (What Went Wrong)

  • A fintech startup started with a single PostgreSQL instance, then added a sharding layer without redesigning their join logic. Cross‑shard queries became a nightly nightmare, leading to a costly migration to a document store where joins were unnecessary.
  • An IoT platform stored raw sensor streams in Redis for speed, but never defined a TTL. Memory usage grew linearly until the cluster crashed, forcing a redesign that moved historic data to InfluxDB and kept only the last hour in Redis.
  • A recommendation engine tried to store embeddings in DynamoDB with strong consistency. The write latency grew beyond 100 ms, breaking the user‑experience SLA. Switching to a purpose‑built vector DB reduced query latency to 8 ms and allowed configurable consistency.

These anecdotes illustrate why understanding the consistency‑scalability‑API triangle is essential before committing to a technology.


10. Getting Started Quickly

  • Relational: Deploy a managed PostgreSQL instance on your cloud provider; use the official PostgreSQL docs for connection strings and migrations.
  • Document: Try the free tier of MongoDB Atlas; the driver for Node.js offers insertOne and aggregate out of the box.
  • Key‑Value: Spin up a Redis instance with Docker (docker run -p 6379:6379 redis:7) and experiment with SET/GET and sorted sets.
  • Graph: Follow the Neo4j quick‑start guide at neo4j.com/developer. The Cypher query language is easy to learn for anyone familiar with SQL.
  • Time‑Series: Install TimescaleDB as a PostgreSQL extension (CREATE EXTENSION IF NOT EXISTS timescaledb;). Use time_bucket in regular SQL queries.
  • Vector: Sign up for a free Pinecone index at pinecone.io; the Python SDK shows how to upsert vectors and run similarity queries.

11. Bottom Line

Choosing a database is not a one‑time decision; it is an ongoing negotiation between data correctness, throughput, and developer ergonomics. By mapping your workload’s dominant access pattern to the guarantees each store provides, you can avoid the costly re‑architectures that many startups encounter after their first scaling sprint.

If you need a deeper dive into any of the six families, the original AI Study Room post contains full code samples and benchmark tables.
