The Five-Year Quest for Recursion in lychee: A Study in Concurrency Challenges

The journey to implement recursion in lychee, a Rust-based link checker, reveals fundamental challenges in concurrent system design that transcend any single programming language or implementation approach.

The story of recursion in lychee is less about a specific feature and more about the inherent difficulties of building concurrent systems that can self-reference. What began as a seemingly simple request—to add a --recursive flag to follow links within a domain—has become a five-year odyssey through the complexities of distributed termination detection, circular dependencies, and backpressure management.

The Deceptive Simplicity of Recursion

When @styfle opened issue #78 in December 2020, the request appeared straightforward: add recursion to lychee, a fast, async link checker written in Rust. The tool had already gained significant traction, with around 40k GitHub repositories depending on it, including major players like Google, AWS, Microsoft, and Cloudflare using it to check links in their documentation. The expectation was that adding recursion would be "an honest day's work."

Yet five years, four serious implementation attempts, and several abandoned pull requests later, recursion remains unshipped. The issue has become lychee's "white whale," and the reasons why reveal fundamental truths about concurrent system design that apply across programming languages and paradigms.

The Architectural Challenge

To understand why recursion proved so elusive, one must examine lychee's architecture. The original design followed a simple, unidirectional pipeline: input URLs flow through link extraction, to link checking, to output formatting. The program terminates when the input stream is exhausted.

lychee's initial architecture

Recursion fundamentally breaks this model by requiring a feedback loop: responses must generate new inputs that flow back through the same pipeline. This creates a cycle in what was designed as a directed acyclic graph (DAG), and cycles in channel-based systems are where the dragons live.

Attempt 1: The Fragile Counter (2021)

The first implementation attempted to solve the problem by manually counting requests. After processing a response, the system would extract links and push them back into the request channel while maintaining a running count of total expected requests versus completed requests.

The approach failed due to a fundamental fragility in distributed counting. New links were discovered asynchronously, so the total request count could be incremented after the main loop had already decided to exit. This led to either hanging (if the count was too high) or premature termination (if too low). As the author noted, "The counting logic got gnarlier" with every edge case—cached responses, failed requests, empty pages—all requiring special handling that further complicated the already delicate counting mechanism.

Attempt 2: Channel-Based Feedback Loop (2022)

The second attempt abandoned manual counting in favor of feeding discovered URLs back through a channel connected to the collector. This created a feedback loop where the collector would read from an input channel and produce a stream of requests, while the recursion handler would read responses and send new inputs back to the collector's channel.

This approach failed with a classic channel deadlock. For the collector's stream to end, the input channel needed to close. For the channel to close, all senders needed to be dropped. But the recursion handler held a sender to push discovered URLs, creating a circular dependency where each stage needed to hold something to keep the cycle alive.

The implementation also suffered from a 30% performance regression in the non-recursive case—a significant cost in a Rust project where "you don't pay for what you don't use" is practically a moral position.

Attempt 3: Semaphores and Spawning (2022)

The third attempt abandoned channels entirely, using semaphores to manage concurrency and tokio::spawn for each unit of work. The prototype was conceptually clean but quickly became complicated in practice. The link checker needed to share client config, cache, progress bar, stats, and other state across spawned tasks, requiring extensive wrapping in Arc<RwLock<State>>.

This approach revealed a deeper truth: semaphores solve concurrency limits, not termination detection. With spawned tasks, there's no built-in way to know when all tasks—including recursively spawned ones—have finished. The team would need to reinvent the counter from Attempt 1, but now spread across an unbounded number of spawned tasks.

Attempt 4: The Atomic Counter (2025)

In early 2025, community contributor @gwennlbh revisited the channel-based approach with an Arc<AtomicUsize> counter. This elegant solution tracked remaining work—incrementing when new requests were sent and decrementing when responses were processed, breaking out when the count hit zero.

This implementation actually worked on real websites and represented the most functional attempt yet. However, it ultimately failed from multiple directions:

Channel Backpressure Deadlock: When recursion discovered many links, the response handler blocked trying to send into a full request channel, causing a classic backpressure deadlock.
Duplicate Requests: The same URL could be discovered by multiple pages and sent into the channel before any were cached, leading to redundant work.
Counter Fragility: The atomic counter suffered from the same fundamental issues as the manual counter in Attempt 1, with relaxed memory ordering potentially causing the counter to read zero before work was actually complete.
Architectural Changes: Adding subsequent_uris to the Response type required touching nearly every file that builds or consumes a Response, creating a leaky abstraction.

The Fundamental Challenges

These repeated failures revealed several fundamental challenges that transcend any specific implementation:

Distributed Termination Detection: In a recursive pipeline, the input stream is never truly exhausted because every response might create new inputs. Detecting when the system has reached quiescence—where nothing is in progress and nothing new will be generated—is a known problem in distributed systems with classic solutions like Dijkstra–Scholten and token passing that don't map well onto Tokio's channel-based world.
Channel Cycles: Channel-based systems deadlock when cycles exist because channels use "all senders dropped" as their termination signal, and in a cycle that condition is never met on its own.
Backpressure Management: Bounded channels provide natural backpressure, but in recursive systems, the response handler needs to send into the request channel. If that channel is full, the response handler blocks, preventing responses from being consumed and request slots from freeing up.
Deduplication Races: Concurrent checking means multiple pages can hold the same link. Without synchronization, several tasks can discover the same URL and submit it before any can mark it "seen."

The Turning Point: Indirect Progress

Despite the failed attempts, significant progress was made on the system that ultimately made recursion possible, though this work wasn't originally intended to support recursion:

Per-Host Rate Limiting (PR #1929): Recursion without rate limiting is dangerous, as @gwennlbh discovered when they accidentally DDoS'd their own WiFi router. The HostPool introduced per-host request queues with configurable rate limits, delays, and concurrent-request caps.
WaitGroup Primitive (PR #2046): This was the breakthrough for termination detection. WaitGroup provides a mechanism for waiting on a dynamic set of tasks that can themselves spawn more tasks. It consists of a WaitGroup that fires when all work is done, and cloneable WaitGuards held by each task. When the last guard is dropped, the waiter completes.
Unified Request Handling (PR #2100): This unified input URL fetching with the link checker's HostPool, ensuring recursively discovered pages use the same client configuration as everything else.
Sitemap Support (PR #2062): While not a replacement for true recursion, sitemap parsing unblocks many use cases by allowing lychee to discover every page on a site without crawling recursively.

The Rust Question

A natural question is how much of these difficulties were specific to Rust. The author estimates about 30%:

The termination problem, cycle problem, and backpressure problem are inherent to any concurrent recursive crawler. What Rust added was friction at the implementation level: Ownership and Send bounds make it harder to share state across spawned tasks. In Go, you capture variables in a goroutine closure and move on; in Rust, everything in async-land wants to be Arc-wrapped and Send + 'static.

However, Rust also prevented many issues through the compiler and type system, making wrong approaches fail loudly and the right approach more solid.

The Path Forward

With these foundational pieces in place, recursion is finally within reach. The hard parts—knowing when to stop, avoiding deadlocks, preventing server flooding—are already solved by work that was never about recursion originally.

The remaining implementation is remarkably simple: when a checked page is on an allowed domain and under the depth limit, grab its content from cache, extract links, and send them back through the same pipeline. The author notes:

lychee goes weeeee...

Recursion becomes a by-product of good architecture, not a special case bolted onto a pipeline that was never built for it.

Lessons Learned

This five-year journey offers several valuable insights:

Some Features Are Systemic: Recursion couldn't be "bolted on" to lychee's existing architecture; it required fundamental changes to how the system operated.
Progress Isn't Always Linear: The most important code written for a feature is often the code that never mentions the feature at all, as foundational components enable the capability indirectly.
Community Contributions Face Unique Challenges: Outside contributors face friction from build environment differences, conflicts with moving targets, and the cognitive load of complex async codebases.
Problem Recognition Evolves: What began as "it's hard" evolved into a precise understanding of distributed termination detection, channel cycles, and backpressure management—concepts with established names and solutions in distributed systems.

Conclusion

Did the team fail? After five years and four attempts, it's tempting to say yes. But writing out the full story changes that perspective. Every attempt hit fundamental challenges in concurrent system design that transcend any specific implementation or language.

The real story is one of progress through architectural evolution. The components that finally make recursion possible—WaitGroup, HostPool, unified request handling—were developed to solve other problems entirely. This is a common pattern in software development: the most important enablers of future features are often built without those features in mind.

As the author reflects:

Sometimes the most important code you write for a feature is the code that never mentions the feature at all. So no, I don't think we failed. We made progress by stumbling into the right direction.

The journey to recursion in lychee demonstrates that even seemingly simple features can touch the deepest complexities of system design. It serves as a reminder that some problems require not just implementation effort, but conceptual evolution—understanding the problem space well enough to recognize when the foundations are ready to support what once seemed impossible.