Building a Personal News Digest Bot: What a Simple Aggregator Teaches About Backend Architecture

A computer science student got tired of to-do list clones and built a daily tech news digest that aggregates Hacker News, Dev.to, and NewsAPI, summarizes the dense stuff with an LLM, and emails the result every morning. The project is small, but the design questions it forces (rate limits, deduplication, failure handling, scheduling) are exactly the ones backend engineers wrestle with at scale.

There is a well-worn path for people learning to code: build a Twitter clone, a Pokédex, a to-do list. These projects teach syntax and component wiring, and they fill a portfolio's first few rows. But they share a flaw that becomes obvious once you have done two or three of them. Nobody uses them, not even the person who built them. The feedback loop that makes engineering interesting, the one where a real user hits a real edge case, never closes.

Raissa Cavalcanti, a CS student focusing on backend and data engineering, hit that wall and built something different instead. Her Tech News Digest is a Python script that runs every morning, pulls the day's most relevant posts from several tech sources, optionally summarizes the dense ones with an LLM, and delivers a clean HTML email. The problem it solves is mundane: staying current without drowning in open browser tabs. The architecture it requires is anything but.

The problem: pull versus push

The usual way to stay informed is a pull model. You open Hacker News, Dev.to, and a handful of news portals, and you scan. The cost is your attention, and the infinite feed is designed to extract as much of it as possible. The digest flips this to a push model. The system does the scanning on a schedule, applies a filter, and pushes a fixed-size summary to you. You trade the freshness and serendipity of browsing for a bounded, predictable cost in time.

That trade-off is the whole point, and it is the same one that shows up in real distributed systems all the time. Polling versus webhooks. Cron-driven batch jobs versus streaming. Materialized digests versus live queries. The student-sized version of this problem has the same shape as the production one, which is exactly why it is worth building.

Solution approach: four stages, each with a sharp edge

The digest breaks into four stages: aggregation, summarization, formatting, and scheduling. On the surface each is a few lines of code. Look closer and each one hides a decision that production systems spend real engineering budget on.

Aggregation across heterogeneous sources

The script talks to multiple APIs: the Hacker News Firebase API, the Dev.to API, and NewsAPI. The first thing you learn here is that no two APIs agree on anything. Hacker News gives you item IDs you have to fetch individually, so a single "top stories" view is one request for the ID list plus N requests for the items. Dev.to returns articles with tags and reactions in one shot. NewsAPI paginates and rate-limits aggressively on the free tier.
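One way to tame that disagreement is to convert every source's payload into a single internal shape as early as possible. The sketch below assumes only the `title` and `url` fields (plus `id` for Hacker News items); the `Story` class and the `from_*` function names are hypothetical and not taken from the original project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    """Hypothetical common shape shared by every source."""
    source: str
    title: str
    url: str


def from_hacker_news(item: dict) -> Story:
    # Text-only posts (for example "Ask HN") carry no "url" field,
    # so fall back to the discussion page for that item id.
    fallback = f"https://news.ycombinator.com/item?id={item['id']}"
    return Story("hackernews", item["title"], item.get("url") or fallback)


def from_devto(article: dict) -> Story:
    return Story("devto", article["title"], article["url"])


def from_newsapi(article: dict) -> Story:
    return Story("newsapi", article["title"], article["url"])
```

Everything downstream (dedupe, summarization, formatting) then only needs to understand `Story`, not three different JSON layouts.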

This is where a beginner project quietly becomes a systems problem. If you fire all your Hacker News item requests in a tight loop, you risk being throttled and, at minimum, put needless load on the endpoint. The fix is the same pattern you would reach for at scale: bound your concurrency, retry failures with backoff, and cache item lookups so a story that appears two days running is not fetched twice. The naive version works on day one and breaks the first time a source is slow or returns a 429. Designing for that failure, rather than discovering it in production, is the lesson.
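A minimal sketch of those three ideas together might look like the following. It is not the project's actual code: `fetch` is a hypothetical callable that performs one item request and raises on failure, the `cache` dict stands in for whatever persistent cache survives between runs, and the worker count and delays are arbitrary defaults.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fetch_with_backoff(fetch: Callable[[int], dict], item_id: int,
                       retries: int = 3, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch(item_id), retrying with exponential backoff on any exception."""
    for attempt in range(retries + 1):
        try:
            return fetch(item_id)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s with the defaults
    raise AssertionError("unreachable")


def fetch_items(ids: Iterable[int], fetch: Callable[[int], dict],
                cache: Dict[int, dict], max_workers: int = 4,
                **backoff) -> List[dict]:
    """Fetch items with at most max_workers requests in flight, skipping cached ids."""
    ids = list(ids)
    # dict.fromkeys removes duplicate ids while preserving order.
    missing = [i for i in dict.fromkeys(ids) if i not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda i: fetch_with_backoff(fetch, i, **backoff), missing)
        for item_id, item in zip(missing, results):
            cache[item_id] = item
    return [cache[i] for i in ids]
```

The thread pool size is the concurrency bound: no matter how long the ID list is, only `max_workers` requests run at once. Passing `sleep` as a parameter keeps the retry logic testable without real delays.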

Deduplication is the other aggregation problem nobody warns you about. The same launch, outage, or release gets posted to all three sources within hours. Without a dedupe step keyed on normalized URL or title similarity, your "clean" digest is three copies of the same story. Even a simple canonical-URL hash gets you most of the way, and it forces you to think about what identity means for a piece of content, a question that gets genuinely hard once you involve URL shorteners and tracking parameters.
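The canonical-URL hash can be sketched with the standard library alone. The normalization rules here are assumptions chosen for illustration (force `https`, drop `www.`, the port, the fragment, a trailing slash, and a small set of tracking parameters); a real digest would tune that list, and resolving URL shorteners would need an extra network step that this sketch does not attempt.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend or trim for your own sources.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "ref"}


def canonical_url(url: str) -> str:
    """Reduce a URL to a normalized form so trivially different links compare equal."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()  # hostname drops any port
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_KEYS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Scheme is forced to https and the fragment is discarded.
    return urlunsplit(("https", host, path, urlencode(sorted(query)), ""))


def url_key(url: str) -> str:
    """Stable identity for a story: SHA-256 of its canonical URL."""
    return hashlib.sha256(canonical_url(url).encode("utf-8")).hexdigest()


def dedupe(stories):
    """Keep the first story seen for each canonical URL (stories are dicts with 'url')."""
    seen, unique = set(), []
    for story in stories:
        key = url_key(story["url"])
        if key not in seen:
            seen.add(key)
            unique.append(story)
    return unique
```

Under these rules, `http://www.Example.com/post/?utm_source=hn&id=7#top` and `https://example.com/post?id=7` both reduce to `https://example.com/post?id=7`. Two different articles about the same launch still slip through, which is where title similarity would have to take over.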

Summarization with an LLM, used carefully

For the denser articles, the project routes text through Google's Gemini to produce short, bulleted summaries. This is the most tempting stage to over-engineer and the easiest to get subtly wrong. An LLM call is a network call to a service with its own latency tail, its own rate limits, and its own failure modes, and it costs money per token. If the summarizer is on the critical path and the API is down, does the whole digest fail, or does it degrade to sending the raw headlines?

The right answer is to degrade. Summarization should be an enhancement, not a dependency. Wrap each call in a timeout and fall back to the original excerpt on error, and the digest still ships every morning even when the model API is having a bad day. That is graceful degradation in miniature, the same choice of availability over completeness that production systems make: you decide that a slightly less polished email delivered on time beats a perfect one that never arrives.
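The timeout-and-fallback wrapper can be sketched without committing to any particular LLM client. Here `summarize` is a hypothetical callable that wraps whatever model API is in use, and the ten-second default is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_or_fallback(summarize, text, excerpt, timeout_s=10.0):
    """Return summarize(text), or the original excerpt if it fails or takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(summarize, text)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and any error raised by the summarizer itself.
        return excerpt
    finally:
        # Do not block on a hung call. The worker thread is not killed, so the
        # underlying client should also set its own request timeout.
        pool.shutdown(wait=False)
```

The important property is that the function never raises: the caller always gets some text to put in the email, and the summarizer's health only affects how polished that text is.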

Formatting and delivery

Stage three wraps everything in a minimal HTML template and sends it. Email rendering is its own swamp (inline styles, ancient clients, broken flexbox), but the deeper point is that the output format is part of the contract. A digest that is hard to skim defeats its own purpose. Keeping the template boring and legible is a feature, not a shortcut.
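A boring template can be as small as the sketch below. The layout, colors, and the `render_digest` name are illustrative assumptions rather than the project's actual template; the two habits worth copying are inline styles on a plain table and escaping every value that came from an external API.

```python
from html import escape


def render_digest(stories, date_label):
    """Render stories (dicts with title, url, optional summary) as one HTML email body."""
    rows = []
    for story in stories:
        # Escape everything that came from an external API before it touches markup.
        title = escape(story["title"])
        url = escape(story["url"], quote=True)
        summary = escape(story.get("summary", ""))
        rows.append(
            '<tr><td style="padding:8px 0;font-family:Arial,sans-serif;">'
            f'<a href="{url}" style="color:#1a0dab;">{title}</a>'
            f'<br><span style="color:#555555;font-size:13px;">{summary}</span>'
            "</td></tr>"
        )
    return (
        "<html><body>"
        f'<h2 style="font-family:Arial,sans-serif;">Tech digest for {escape(date_label)}</h2>'
        '<table role="presentation" width="100%" cellpadding="0" cellspacing="0">'
        + "".join(rows)
        + "</table></body></html>"
    )
```

Tables and inline styles look dated, but they are the subset of HTML that older mail clients tend to render predictably.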

Scheduling and the illusion of "it just runs"

The script is triggered by a cron job or a cloud scheduler so it runs unattended at the start of each day. This is where personal projects meet the reliability questions that define operations work. What happens when a run fails at 6 a.m. and you are asleep? Do you get an empty inbox and assume there was no news, or do you get an alert? Where do the logs go? If the machine reboots, does the schedule survive?

A cron line is one line. Running something on a schedule that you can trust is a discipline. The gap between those two is most of what site reliability engineering is about, and you feel it the first time you build something you actually depend on.
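One small step toward a schedule you can trust is to make failure loud. The sketch below wraps the whole run so that an exception is logged with a traceback, reported through an alert channel, and reflected in the exit code. `job` and `alert` are hypothetical callables (for example, the digest pipeline and a function that emails or messages you), and the paths in the crontab comment are placeholders.

```python
import logging

# Hypothetical crontab entry for a 6 a.m. daily run (paths are placeholders):
#   0 6 * * * /usr/bin/python3 /path/to/digest.py >> /path/to/digest.log 2>&1


def run_job(job, alert, logger):
    """Run job(); on any exception, log it, send an alert, and return a nonzero code."""
    try:
        count = job()  # assumed to return the number of stories sent
    except Exception as exc:
        logger.exception("digest run failed")  # includes the traceback
        alert(f"Digest run failed: {exc!r}")
        return 1
    logger.info("digest sent with %d stories", count)
    return 0
```

A script's entry point would end with something like `sys.exit(run_job(main, send_alert, logging.getLogger("digest")))`, so the scheduler also sees the failure. This does not cover the case where the machine never starts the job at all; catching that needs a check that lives outside the machine.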

Trade-offs and where this goes next

The honest version of this project is that it is not finished, and the unfinished parts are the interesting ones. Right now the filtering is coarse, the dedup is probably URL-based, and state likely lives in a flat file or local store. Each of those is fine at a scale of one user and one machine. Each of them is also a fork in the road toward a real system.

State is the first thing that breaks if you grow it. The moment you want to remember which stories you already sent, track read versus unread, or let more than one person subscribe with their own topic filters, you need a datastore with a real schema and durability guarantees. A document model fits this kind of semi-structured, source-varied content well, since a Hacker News item and a NewsAPI article do not share a fixed shape. That is the point where managed databases with built-in sharding and failover stop being overkill and start being the path of least resistance, letting you spend your effort on the script's logic rather than on operational toil.
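The first step away from a flat file can be sketched locally before any managed service is involved. In this sketch SQLite from the standard library stands in for the datastore, and a JSON `payload` column imitates the document model by keeping each source's raw, differently shaped record. The table and function names are hypothetical, and `url_key` is assumed to be whatever stable identity the dedupe step produces.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; pass a file path for state that survives restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS sent_stories (
               url_key TEXT PRIMARY KEY,
               source  TEXT NOT NULL,
               sent_on TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    return conn


def mark_sent(conn, url_key, source, sent_on, raw):
    """Record a story as sent. Returns True if it was new, False if already recorded."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO sent_stories VALUES (?, ?, ?, ?)",
            (url_key, source, sent_on, json.dumps(raw)),
        )
    return cur.rowcount == 1


def already_sent(conn, url_key):
    row = conn.execute(
        "SELECT 1 FROM sent_stories WHERE url_key = ?", (url_key,)
    ).fetchone()
    return row is not None
```

The primary key makes "did I already send this?" a question the datastore answers atomically, which is the property that matters once more than one process or subscriber is involved.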

Personalization is the second fork. A single shared filter is a content problem. Per-user topic preferences with relevance ranking is a recommendation problem, and once you are scoring articles per user you are into the territory of embeddings and vector similarity. That is a large jump in complexity, and the right call for a personal tool is almost always to not take it until the simpler version proves the idea is worth scaling.
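To make the size of that jump concrete, the ranking step itself is short once embeddings exist. The sketch below uses hand-written toy vectors; producing real embeddings for articles and for a user's interests is the expensive part and is not shown. The `embedding` key and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_for_user(user_vec, articles):
    """Sort articles (dicts with an 'embedding' list) by similarity to the user's vector."""
    return sorted(
        articles,
        key=lambda article: cosine(user_vec, article["embedding"]),
        reverse=True,
    )
```

The arithmetic is the easy part. The complexity the paragraph warns about lives in generating, storing, and refreshing the vectors for every article and every user.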

The broader pattern worth taking away is this. The best small projects are not scaled-down versions of impressive systems. They are honest versions of real problems, and real problems come with rate limits, partial failures, schema mismatches, and scheduling that has to survive a reboot. A to-do list clone hides all of that behind a happy path. A digest bot that lands in your inbox every morning cannot, because you will notice the morning it does not. That noticing, the closed feedback loop between a system and someone who actually relies on it, is the part of engineering no tutorial can hand you. You have to build something you need.

If you are stuck in the tutorial loop, the prompt that breaks you out is not "what impressive thing can I clone." It is "what small, recurring annoyance in my own week could I automate." The annoyance guarantees a real user, and the real user guarantees the edge cases that turn a coding exercise into engineering.

#system-design #personal-projects #api-aggregation #python #learning