A video discovery platform tackled the challenge of duplicate video entries caused by fragmented URLs through a systematic canonicalization pipeline. By separating URL identity from display strings and enforcing deduplication at the database level, the system improved scalability and consistency while avoiding costly manual fixes.
How a Video Discovery Platform Solved Duplicate URLs with a Canonicalization Pipeline
URL fragmentation is a silent killer for discovery platforms. A single YouTube video might appear under dozens of URLs—through tracking parameters, shortlinks, or regional variants—leading to near-duplicate content that fragments search rankings and clutters user interfaces. This article details how DailyWatch, a free video discovery platform, built a canonicalization pipeline to normalize these URLs into a single stable identity, ensuring scalability and consistency.
The Problem: Identity vs. String
The core issue isn't string manipulation—it's identity. URLs are lossy encodings of a video's true identity (platform + video ID). Treating URLs as strings leads to duplicates because https://youtu.be/abc123 and https://www.youtube.com/watch?v=abc123 are semantically identical but structurally different. Early attempts to clean URLs by stripping parameters or normalizing hosts failed because they ignored the underlying identity.
Key Insight: Separate Normalization from Deduplication
The solution split the process into two distinct concerns:
- Normalization: Convert raw URLs into a standardized string format for display and storage. This includes collapsing subdomains (e.g., m.youtube.com → youtube.com), standardizing schemes (always https), and filtering out tracking parameters.
- Identity Extraction: Derive a canonical key (e.g., youtube:abc123) from the normalized URL. This key becomes the deduplication anchor, independent of URL formatting.
This separation ensures that even if YouTube changes its URL structure, the deduplication logic remains intact because it relies on the video's identity, not the URL string.
The Pipeline: Four Stages
1. Normalize the URL String
The normalizer processes URLs through a deterministic sequence of steps:
- Scheme enforcement: Add https:// to bare hosts for consistent parsing.
- Host normalization: Collapse subdomains (e.g., m.youtube.com → youtube.com) and standardize domain names.
- Query filtering: Remove globally useless parameters (e.g., utm_source) while preserving platform-specific ones (e.g., v for YouTube video IDs).
- Path cleaning: Decode and re-encode paths, and collapse duplicate slashes.
The result is a clean, human-readable URL like https://www.youtube.com/watch?v=abc123 that’s safe to display or link to.
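The four normalization steps can be sketched with Python's standard library. This is a minimal illustration, not DailyWatch's actual code: the function name normalize_url, the TRACKING_PARAMS set (the article names only utm_source), and the HOST_ALIASES table are all assumptions. The alias table targets www.youtube.com so that the output matches the example URL above.

```python
import re
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit, urlunsplit

# Assumed lists: the article names only utm_source and m.youtube.com explicitly.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
HOST_ALIASES = {"youtube.com": "www.youtube.com", "m.youtube.com": "www.youtube.com"}

_HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def normalize_url(raw: str) -> str:
    """Return a cleaned https URL string (hypothetical helper)."""
    raw = raw.strip()
    # Scheme enforcement: prefix bare hosts so urlsplit sees a network location.
    if not _HAS_SCHEME.match(raw):
        raw = "https://" + raw
    parts = urlsplit(raw)

    # Host normalization: lowercase, then collapse known aliases.
    # Ports and userinfo are dropped in this sketch.
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)

    # Query filtering: drop tracking parameters, keep everything else in order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]

    # Path cleaning: decode, collapse duplicate slashes, re-encode.
    path = re.sub(r"/{2,}", "/", unquote(parts.path)) or "/"

    # The scheme is always https and the fragment is discarded.
    return urlunsplit(("https", host, quote(path, safe="/"), urlencode(kept), ""))
```

For example, normalize_url("m.youtube.com/watch?v=abc123&utm_source=feed") returns https://www.youtube.com/watch?v=abc123. Because every step is a pure function of the input string, the same raw URL always normalizes to the same result.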
2. Extract a Stable Identity
The identity extractor maps normalized URLs to a platform-specific key. For YouTube, this involves parsing v=abc123 from the query string or extracting IDs from paths like /shorts/abc123. The extractor uses strict pattern matching to validate IDs (for YouTube, 11-character strings made up of letters, digits, hyphens, and underscores) so that malformed keys cannot corrupt data.
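A minimal sketch of such an extractor for YouTube follows. The function name extract_identity and the host set are assumptions; the three URL shapes handled (watch?v=, /shorts/, and youtu.be) are the ones the article mentions. Note that the article's abc123 is a shortened placeholder, so strict validation would reject it; the usage note uses an ID of the right length instead.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# YouTube video IDs: 11 characters from letters, digits, hyphen, underscore.
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")
# Assumed host set for this sketch.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
_SHORTS_PATH = re.compile(r"^/shorts/([^/]+)/?$")


def extract_identity(url: str) -> Optional[str]:
    """Return a key such as 'youtube:<id>', or None if no valid ID is found."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # Shortlink form: the ID is the first path segment.
        candidate = parts.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parts.path == "/watch":
            candidate = parse_qs(parts.query).get("v", [None])[0]
        else:
            match = _SHORTS_PATH.match(parts.path)
            if match:
                candidate = match.group(1)

    # Strict validation: reject truncated or otherwise malformed IDs
    # instead of storing a key that could merge unrelated videos.
    if candidate and YOUTUBE_ID.fullmatch(candidate):
        return f"youtube:{candidate}"
    return None
```

With this sketch, https://youtu.be/dQw4w9WgXcQ and https://www.youtube.com/watch?v=dQw4w9WgXcQ both map to youtube:dQw4w9WgXcQ, while a truncated ID yields None so the caller can reject or quarantine the URL.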
3. Build a Canonical URL
Once the identity is established, the canonical URL is regenerated from it. For YouTube, this is always https://www.youtube.com/watch?v=abc123. This step ensures SEO consistency and avoids redirect chains by using the platform’s official canonical format.
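Regenerating the canonical URL from the key, rather than from the incoming string, can be as small as a template lookup. The sketch below assumes the platform:id key format shown earlier; the names CANONICAL_TEMPLATES and canonical_url are hypothetical.

```python
# Assumed mapping from platform name to its canonical URL template.
CANONICAL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={video_id}",
}


def canonical_url(identity: str) -> str:
    """Build the canonical URL for a key like 'youtube:abc123'."""
    platform, sep, video_id = identity.partition(":")
    if not sep or not video_id:
        raise ValueError(f"malformed identity key: {identity!r}")
    try:
        template = CANONICAL_TEMPLATES[platform]
    except KeyError:
        raise ValueError(f"unknown platform: {platform!r}") from None
    return template.format(video_id=video_id)
```

Because the output depends only on the key, every URL variant that resolves to youtube:abc123 produces the same https://www.youtube.com/watch?v=abc123, regardless of how the original link was formatted.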
4. Deduplicate at the Database Level
The identity key is enforced as a unique constraint in the database. SQLite’s ON CONFLICT clause handles concurrent ingests idempotently: if a video’s identity already exists, the system updates the last_seen timestamp instead of creating duplicates. This approach scales efficiently, as deduplication is offloaded to the database layer.
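The upsert can be sketched with Python's built-in sqlite3 module. The table and column names here are assumptions apart from last_seen, and the use of max() to keep the timestamp from moving backward on out-of-order ingests is a design choice of this sketch rather than something the article states. SQLite's upsert syntax requires version 3.24 or later.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a UNIQUE constraint on the identity key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS videos (
            identity      TEXT NOT NULL UNIQUE,  -- e.g. 'youtube:abc123'
            canonical_url TEXT NOT NULL,
            first_seen    INTEGER NOT NULL,
            last_seen     INTEGER NOT NULL
        )
        """
    )
    return conn


def ingest(conn: sqlite3.Connection, identity: str, url: str, seen_at: int) -> None:
    """Insert a video, or bump last_seen if the identity already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO videos (identity, canonical_url, first_seen, last_seen)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(identity) DO UPDATE
            SET last_seen = max(last_seen, excluded.last_seen)
            """,
            (identity, url, seen_at, seen_at),
        )
```

Calling ingest twice with the same identity leaves one row whose first_seen is unchanged and whose last_seen reflects the later sighting. Since the constraint lives in the database, two workers ingesting the same video concurrently cannot both insert it.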
Trade-offs and Practical Considerations
Cost of Complexity
Building this pipeline required upfront effort to separate identity from string operations. However, the long-term benefits—reduced storage bloat, faster search, and fewer manual deduplication jobs—outweigh the initial complexity.
Handling Redirects
Shortlinks (e.g., bit.ly) and consent redirects pose a challenge because they obscure the true URL. The system resolves these via a bounded HTTP HEAD request with strict guardrails: only follow redirects for known shorteners, limit hop counts, and cache resolutions to avoid network overhead.
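The three guardrails (an allowlist of shorteners, a hop limit, and a cache) might look like the sketch below. The allowlist contents beyond bit.ly, the MAX_HOPS value, the timeout, and all function names are assumptions. The network call is isolated in head_location so the resolution loop can be exercised with a stub.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional
from urllib.parse import urljoin, urlsplit

KNOWN_SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}  # assumed allowlist
MAX_HOPS = 3  # assumed hop limit


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # a 3xx response then surfaces as HTTPError


def head_location(url: str, timeout: float = 3.0) -> Optional[str]:
    """Send one HEAD request and return the redirect target, if any."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout):
            return None  # non-redirect response: nothing to follow
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location")
        if err.code in (301, 302, 303, 307, 308) and location:
            return urljoin(url, location)
        return None
    except OSError:  # DNS failure, timeout, refused connection
        return None


def resolve(
    url: str,
    fetch: Callable[[str], Optional[str]] = head_location,
    cache: Optional[Dict[str, str]] = None,
) -> str:
    """Follow redirects only while the current host is a known shortener."""
    cache = {} if cache is None else cache
    if url in cache:
        return cache[url]
    current = url
    for _ in range(MAX_HOPS):
        host = (urlsplit(current).hostname or "").lower()
        if host not in KNOWN_SHORTENERS:
            break  # never probe arbitrary hosts
        target = fetch(current)
        if target is None:
            break
        current = target
    cache[url] = current
    return current
```

URLs on hosts outside the allowlist are returned unchanged without any network traffic, and a shortener chain longer than MAX_HOPS is cut off rather than followed indefinitely. In production the cache would likely be persistent rather than an in-memory dict.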
Validation is Critical
Malformed IDs (e.g., truncated YouTube IDs) can silently merge unrelated videos. The extractor’s strict validation ensures only valid keys are stored, preventing data corruption.
Operational Impact
The pipeline delivered improvements in three areas:
- Search dedup: FTS5 indexing on a deduplicated table eliminated duplicate search results.
- Cache efficiency: Cloudflare and LiteSpeed caches keyed off canonical URLs reduced redundant page loads.
- Trending accuracy: last_seen timestamps aggregated across all sources improved trending algorithms.
Lessons Learned
- Separate identity from string: Treating URLs as strings leads to fragile systems.
- Validate identities rigorously: A single bad key can corrupt data at scale.
- Deduplicate at the database layer: Application-level checks risk race conditions.