
Serving Files Over HTTP: Synchronous Threads, Epoll, and io_uring Compared

Tech Essays Reporter

The article walks through three implementations of a simple HTTP file server—blocking thread‑per‑request, event‑driven epoll, and modern io_uring—showing the shared code, core differences, and the trade‑offs each model presents for network and disk I/O on Linux.


Thesis

A tiny HTTP file server is an ideal laboratory for contrasting three Linux I/O strategies. By starting with a naïve thread‑per‑request design, then refactoring to an epoll‑based event loop, and finally embracing io_uring’s submission‑completion ring, we can see how each approach handles concurrency, system‑call overhead, and the fundamental asymmetry between network sockets (which can be polled) and regular files (which cannot). The progression illustrates why io_uring is rapidly becoming the default for high‑performance servers, while also exposing the practical realities that keep epoll and even plain blocking I/O relevant.


1. Shared Foundations

All three servers, synchronous and asynchronous alike, share a small set of utilities:

  • listen_socket() – creates a non‑blocking listening socket.
  • parse_http_get() – extracts the request path from a GET line, falling back to /index.html.
  • mime_for() – maps common extensions to MIME types.
  • build_ok_headers() / build_404() – format minimal HTTP responses.

These helpers live in common.h and keep the three servers focused on I/O strategy rather than request parsing.
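
To make the helpers concrete, here is a minimal sketch of what parse_http_get() and mime_for() might look like. The signatures, return conventions, and the list of extensions are assumptions for illustration; the original common.h is not shown in the article and may differ.

```c
#include <stddef.h>
#include <string.h>

/* Assumed signature: copies the request path into out (at most cap bytes,
 * including the terminator). Returns 0 on success, -1 if the line is not a
 * usable GET request or the path does not fit. */
int parse_http_get(const char *req, char *out, size_t cap)
{
    if (strncmp(req, "GET ", 4) != 0)
        return -1;                          /* only GET is handled */
    const char *start = req + 4;
    size_t len = strcspn(start, " \r\n");   /* path ends at space or line end */
    if (len == 0 || start[0] != '/')
        return -1;
    if (len == 1) {                         /* bare "/" falls back to the index page */
        start = "/index.html";
        len = strlen(start);
    }
    if (len + 1 > cap)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Maps a few common extensions to MIME types; unknown ones get a generic type. */
const char *mime_for(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL) return "application/octet-stream";
    if (strcmp(dot, ".html") == 0) return "text/html";
    if (strcmp(dot, ".css") == 0) return "text/css";
    if (strcmp(dot, ".js") == 0) return "text/javascript";
    if (strcmp(dot, ".png") == 0) return "image/png";
    if (strcmp(dot, ".txt") == 0) return "text/plain";
    return "application/octet-stream";
}
```

This sketch does not reject ".." path segments or decode percent-escapes, both of which a server exposed to untrusted clients would need.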


2. Synchronous Thread‑Per‑Request Server

Core Argument

The simplest way to serve many clients is to spawn a detached thread for each accepted connection. The thread reads the request, opens the file, streams it with read()/write(), and then exits.

Evidence & Code Walk‑through

  • main() creates a listening socket and loops on accept(). Every new descriptor is handed to pthread_create() which runs serve().
  • serve() calls parse_request() (a blocking read() loop) and then send_response() which performs a blocking open(), a header write, and a while(read()/write()) copy.
  • A tiny helper write_all() guarantees that partial writes are retried until the buffer is exhausted.
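
The write_all() helper mentioned above could be sketched as follows. The return convention (0 on success, -1 on error) and the EINTR retry are assumptions; the article only states that partial writes are retried until the buffer is exhausted.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keeps calling write() until all len bytes are out.
 * Returns 0 on success, -1 on error (errno is left set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        p += n;                 /* skip the bytes the kernel accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

On a blocking socket this loop simply sleeps inside write() whenever the send buffer is full, which is exactly the behaviour the thread-per-request model relies on.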

Implications

  • Simplicity – the code mirrors the textbook “one thread per client” model; no special kernel interfaces are required.
  • Scalability limits – each thread consumes stack space and scheduling overhead. Under high concurrency the scheduler spends a growing share of CPU time on context switches, and the process can run into thread limits such as RLIMIT_NPROC or kernel.threads-max.
  • Disk I/O is still blocking – the read() of the file blocks the thread that issued it. Other connections keep running on their own threads, but every in‑flight disk read ties up a whole thread for its duration.

3. Epoll‑Based Event Loop

Core Argument

Epoll lets a single thread monitor many sockets for readiness, eliminating the thread‑per‑connection explosion. However, epoll only works for pollable descriptors; regular files cannot be added to an epoll set.

Evidence & Code Walk‑through

  • The listening socket is set O_NONBLOCK and added to an epoll instance with EPOLLIN.
  • epoll_step() calls epoll_wait() and distinguishes two cases:
    1. NULL user data – the event is the listening socket; accept all pending connections, make each new socket non‑blocking, allocate a struct conn, and register it for EPOLLIN.
    2. Pointer to struct conn – the event belongs to a client connection; on_readable() accumulates request bytes until \r\n\r\n is seen, then switches the interest to EPOLLOUT.
  • on_writable() writes any pending response headers, then reads from the file descriptor (still blocking) and writes the body until the socket would block.

Implications

  • Network I/O becomes non‑blocking – the server can handle thousands of sockets with a single thread.
  • Disk I/O remains synchronous – because regular files cannot be polled, the code still performs a blocking read() inside on_writable(). To avoid blocking the event loop, a separate thread pool would be required, re‑introducing complexity.
  • State machine complexity – each connection now carries its own read/write offsets, and the epoll loop must carefully toggle interest flags.

4. io_uring‑Based Server

Core Argument

io_uring unifies network and disk I/O under a single asynchronous interface. By submitting batches of operations to a ring buffer shared with the kernel, a single thread can drive both sockets and files without blocking on any individual accept, read, or write; it waits only for the next completion.

Evidence & Code Walk‑through

  • io_uring_queue_init() creates a submission/completion ring.
  • A multishot accept (io_uring_prep_multishot_accept) is submitted once; the kernel repeatedly generates accept completions, each carrying a new client fd.
  • Each completion carries a cb_ctx structure with a callback pointer and opaque user data. The main loop iterates over completions, invoking the appropriate callback (on_accept, on_recv, on_write_headers, on_read_file, on_write_file, on_close).
  • The callbacks chain the logical steps:
    • on_accept → schedule recv.
    • on_recv → accumulate request, then call start_response.
    • start_response → prepare headers, open the file, schedule a send of the headers.
    • on_write_headers → once headers are flushed, schedule a read from the file.
    • on_read_file → schedule a send of the file chunk.
    • on_write_file → loop back to another read until EOF, then close.
  • All file reads (io_uring_prep_read) are truly asynchronous; the kernel may perform the I/O on a worker thread, but the application never blocks.
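
The callback dispatch described above can be illustrated without touching the kernel. The sketch below is a simulation under stated assumptions: the cb_ctx field names, cb_ctx_new(), dispatch_completion(), and the count_accepts() callback are all hypothetical, and the res and flags arguments stand in for the cqe->res and cqe->flags fields a real loop would read after io_uring_wait_cqe(), with the context recovered via io_uring_cqe_get_data(). It also shows the lifetime rule for multishot operations: the context is freed only when IORING_CQE_F_MORE is not set.

```c
#include <stdint.h>
#include <stdlib.h>

/* Same value as IORING_CQE_F_MORE in <linux/io_uring.h>; defined here only
 * so the sketch compiles without kernel headers. */
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)
#endif

struct cb_ctx {
    void (*cb)(int res, void *user);  /* invoked when the operation completes */
    void *user;                       /* opaque per-operation state */
};

/* Allocates the context that would be attached to an SQE as user data. */
struct cb_ctx *cb_ctx_new(void (*cb)(int, void *), void *user)
{
    struct cb_ctx *ctx = malloc(sizeof *ctx);
    if (ctx != NULL) {
        ctx->cb = cb;
        ctx->user = user;
    }
    return ctx;
}

/* Handles one completion. Returns 1 if ctx was freed, 0 if it stays alive
 * because the kernel will post further completions for the same request. */
int dispatch_completion(struct cb_ctx *ctx, int res, uint32_t flags)
{
    ctx->cb(res, ctx->user);
    if (flags & IORING_CQE_F_MORE)
        return 0;        /* multishot: more completions will reuse ctx */
    free(ctx);
    return 1;
}

/* Example callback: counts successful accepts (res >= 0 is a new client fd). */
void count_accepts(int res, void *user)
{
    if (res >= 0)
        ++*(int *)user;
}
```

Keeping the free inside the dispatcher puts the lifetime decision in one place, so individual callbacks such as on_accept or on_recv do not need to know whether their request was multishot.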

Implications

  • Universal async I/O – sockets and regular files are handled through the same mechanism, removing the need for a separate thread pool.
  • Reduced syscall overhead – up to the queue depth (commonly 256 or more) submissions are flushed by a single io_uring_submit() call, which issues one io_uring_enter syscall, cutting per‑operation cost dramatically.
  • Error and EOF handling – every completion carries a result code; if a read returns 0 (EOF) or a negative errno value, the corresponding callback can close the connection immediately.
  • Complexity shift – the mental model moves from explicit state machines to a callback‑driven pipeline. Managing the lifetime of cb_ctx objects and ensuring they are freed only when IORING_CQE_F_MORE is not set adds subtle bookkeeping.
  • Ring‑buffer limits – if io_uring_get_sqe() returns NULL the submission queue is full. The article notes that a production server would need a fallback queue or a retry loop; the sample code omits this for brevity.

5. Comparative Implications

| Aspect | Thread‑per‑request | epoll | io_uring |
| --- | --- | --- | --- |
| Concurrency model | One OS thread per client | Single thread multiplexing sockets | Single thread multiplexing sockets and files |
| System‑call cost | accept, read, write, open per request (blocking) | epoll_wait + non‑blocking read/write (still many syscalls) | Batch io_uring_submit + completions, far fewer syscalls |
| Disk I/O handling | Blocking read on the worker thread | Must offload to a thread pool or accept blocking reads | Native asynchronous reads via the kernel |
| Memory overhead | Stack per thread (often 1 MiB) | Minimal per‑connection state (struct conn) | Similar per‑connection state; plus ring buffers (few KiB) |
| Scalability ceiling | Limited by thread limits and scheduler overhead | Scales to tens of thousands of sockets, but disk I/O becomes bottleneck | Scales best for high‑concurrency workloads that involve both network and storage |
| Implementation complexity | Low – straightforward procedural code | Moderate – state machine, edge‑trigger handling | |
| Portability | Works on any POSIX system | Linux‑specific, but widely available | |
| Future‑proofing | Hard to evolve without major redesign | Epoll + thread pool could approximate io_uring performance for disk I/O | Already handles both domains; kernel may still use worker threads internally |

When to Choose Which?

  • Low traffic, simple deployments – the blocking version is acceptable; its clarity outweighs performance concerns.
  • High connection count, disk‑light workloads – epoll gives excellent network throughput with a single thread, and a modest thread pool can handle disk reads.
  • Heavy storage traffic, high concurrency – io_uring shines because it eliminates the need for a separate thread pool and reduces syscall overhead, delivering better latency and throughput.

6. Counter‑Perspectives

  • Kernel maturity – io_uring is relatively new; older kernels lack some features (e.g., multishot accept, fixed buffers). Deployments on legacy distributions may need to fall back to epoll.
  • Debugging difficulty – the callback‑centric flow can be harder to trace than a linear state machine. Tools such as trace-cmd or perf become essential.
  • Library support – many high‑level frameworks still expose epoll‑style APIs (e.g., libevent, libuv). Integrating io_uring may require additional wrappers or waiting for ecosystem adoption.
  • Resource contention – io_uring’s internal worker threads still compete for CPU with the application thread; in CPU‑bound scenarios the theoretical advantage narrows.

7. Closing Thoughts

By walking through a concrete file server, the article demonstrates that the choice of I/O primitive is not merely a performance tweak but a redesign of how an application thinks about work. Synchronous code is easy to read but does not scale; epoll introduces an event‑driven mindset that solves network concurrency but leaves disk I/O as a lingering blocker; io_uring unifies the model, allowing truly asynchronous handling of both sockets and files while dramatically cutting syscall overhead. The trade‑offs—complexity, kernel version requirements, and debugging ergonomics—mean that the “best” solution depends on workload characteristics and operational constraints. Nevertheless, for any new Linux service that expects moderate to high concurrency and non‑trivial disk access, io_uring presents a compelling default that future‑proofs the codebase.



Phil Eaton is the founder of The Consensus and a former Postgres contributor. He maintains the Software Internals Discord and co‑runs NYC Systems.
