How a Video Discovery Platform Solved Duplicate URLs with a Canonicalization Pipeline
#DevOps

Backend Reporter

A video discovery platform tackled the challenge of duplicate video entries caused by fragmented URLs through a systematic canonicalization pipeline. By separating URL identity from display strings and enforcing deduplication at the database level, the system improved scalability and consistency while avoiding costly manual fixes.

URL fragmentation is a silent killer for discovery platforms. A single YouTube video might appear under dozens of URLs—through tracking parameters, shortlinks, or regional variants—leading to near-duplicate content that fragments search rankings and clutters user interfaces. This article details how DailyWatch, a free video discovery platform, built a canonicalization pipeline to normalize these URLs into a single stable identity, ensuring scalability and consistency.

The Problem: Identity vs. String

The core issue isn't string manipulation—it's identity. URLs are lossy encodings of a video's true identity (platform + video ID). Treating URLs as strings leads to duplicates because https://youtu.be/abc123 and https://www.youtube.com/watch?v=abc123 are semantically identical but structurally different. Early attempts to clean URLs by stripping parameters or normalizing hosts failed because they ignored the underlying identity.

Key Insight: Separate Normalization from Deduplication

The solution split the process into two distinct stages:

  1. Normalization: Convert raw URLs into a standardized string format for display and storage. This includes collapsing subdomains (e.g., m.youtube.com → youtube.com), standardizing schemes (always https), and filtering out tracking parameters.
  2. Identity Extraction: Derive a canonical key (e.g., youtube:abc123) from the normalized URL. This key becomes the deduplication anchor, independent of URL formatting.

This separation ensures that even if YouTube changes its URL structure, the deduplication logic remains intact because it relies on the video's identity, not the URL string.

The Pipeline: Four Stages

1. Normalize the URL String

The normalizer processes URLs through a deterministic sequence of steps:

  • Scheme enforcement: Add https:// to bare hosts for consistent parsing.
  • Host normalization: Collapse subdomains (e.g., m.youtube.com → youtube.com) and standardize domain names.
  • Query filtering: Remove globally useless parameters (e.g., utm_source) while preserving platform-specific ones (e.g., v for YouTube video IDs).
  • Path cleaning: Decode and re-encode paths to normalize percent-encoding, and collapse duplicate slashes.

The result is a clean, human-readable URL like https://www.youtube.com/watch?v=abc123 that’s safe to display or link to.
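The four steps above can be sketched with Python's standard library. This is a minimal illustration, not DailyWatch's actual code: the names `normalize_url`, `TRACKING_PARAMS`, and `HOST_ALIASES` are hypothetical, the parameter and host lists are assumed, and the alias table targets www.youtube.com so the output matches the example URL shown above.

```python
import re
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit, urlunsplit

# Assumed lists for illustration; the article does not enumerate them.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
HOST_ALIASES = {"m.youtube.com": "www.youtube.com", "youtube.com": "www.youtube.com"}


def normalize_url(raw: str) -> str:
    """Return a cleaned https URL string suitable for display or storage."""
    raw = raw.strip()
    # Scheme enforcement: give bare hosts a scheme so urlsplit finds the host.
    if "://" not in raw:
        raw = "https://" + raw
    parts = urlsplit(raw)

    # Host normalization: lowercase, then map known aliases to one host.
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)

    # Query filtering: drop tracking parameters, keep everything else (e.g. v).
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]

    # Path cleaning: decode, collapse repeated slashes, then re-encode.
    path = re.sub(r"/{2,}", "/", unquote(parts.path))
    path = quote(path, safe="/")

    # Always emit https and drop the fragment.
    return urlunsplit(("https", host, path, urlencode(query), ""))
```

With these assumed tables, `normalize_url("m.youtube.com/watch?v=abc123&utm_source=x")` returns `https://www.youtube.com/watch?v=abc123`. The function only rewrites the string; it makes no claim yet about which video the URL refers to.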

2. Extract a Stable Identity

The identity extractor maps normalized URLs to a platform-specific key. For YouTube, this involves parsing v=abc123 from the query string or extracting IDs from paths like /shorts/abc123 (abc123 is a shortened placeholder here). The extractor uses strict pattern matching to validate IDs (e.g., YouTube IDs are 11 characters drawn from letters, digits, hyphens, and underscores) so that malformed keys cannot cause data corruption.
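As a rough sketch of this stage, the function below derives a `youtube:<id>` key from a URL and rejects anything that does not look like a valid ID. The name `extract_identity`, the host set, and the exact path rules are assumptions for illustration; only the `v` parameter, the `/shorts/` path, and the youtu.be shortlink come from the article.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# YouTube video IDs: 11 characters from letters, digits, hyphen, underscore.
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")
# Assumed host set for this sketch.
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}


def extract_identity(url: str) -> Optional[str]:
    """Return a key such as 'youtube:<id>', or None if no valid ID is found."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # Shortlink form: the ID is the first path segment.
        candidate = parts.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parts.path == "/watch":
            # Watch form: the ID is the v query parameter.
            candidate = parse_qs(parts.query).get("v", [None])[0]
        else:
            # Shorts form: /shorts/<id>
            match = re.fullmatch(r"/shorts/([^/]+)/?", parts.path)
            candidate = match.group(1) if match else None

    # Strict validation: refuse truncated or otherwise malformed IDs.
    if candidate and YOUTUBE_ID.fullmatch(candidate):
        return f"youtube:{candidate}"
    return None
```

Under these rules the watch, shortlink, and shorts forms of the same video all produce the same key, while a six-character placeholder such as abc123 is rejected rather than stored.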

3. Build a Canonical URL

Once the identity is established, the canonical URL is regenerated from it. For YouTube, this is always https://www.youtube.com/watch?v=abc123. This step ensures SEO consistency and avoids redirect chains by using the platform’s official canonical format.
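Regenerating the URL from the key can be as small as a template lookup. The sketch below assumes the `platform:id` key format described earlier; `canonical_url` and `CANONICAL_TEMPLATES` are hypothetical names, and only the YouTube template is taken from the article.

```python
# One template per platform; only the YouTube entry comes from the article.
CANONICAL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={id}",
}


def canonical_url(identity: str) -> str:
    """Rebuild the canonical URL from an identity key like 'youtube:<id>'."""
    platform, sep, video_id = identity.partition(":")
    template = CANONICAL_TEMPLATES.get(platform)
    if not sep or not video_id or template is None:
        raise ValueError(f"unsupported identity key: {identity!r}")
    return template.format(id=video_id)
```

Because the URL is built from the key rather than from the incoming string, every ingest path ends up emitting the same link for the same video.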

4. Deduplicate at the Database Level

The identity key is enforced as a unique constraint in the database. SQLite’s upsert syntax (INSERT ... ON CONFLICT ... DO UPDATE) makes ingests idempotent: if a video’s identity already exists, the system updates the last_seen timestamp instead of creating a duplicate row. Because the uniqueness check and the write happen in a single statement, two concurrent ingesters cannot both insert the same video, and the application needs no separate check-then-insert step.
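A minimal sketch of this constraint using Python's built-in sqlite3 module follows. The table layout, the column names other than last_seen, and the helpers `open_db` and `ingest` are assumptions for illustration. The upsert form of ON CONFLICT requires SQLite 3.24 or newer.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the (assumed) videos table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS videos (
            identity      TEXT    NOT NULL UNIQUE,  -- e.g. 'youtube:<id>'
            canonical_url TEXT    NOT NULL,
            first_seen    INTEGER NOT NULL,
            last_seen     INTEGER NOT NULL
        )
        """
    )
    return conn


def ingest(conn: sqlite3.Connection, identity: str, url: str, seen_at: int) -> None:
    """Insert a video, or refresh last_seen if its identity already exists."""
    conn.execute(
        """
        INSERT INTO videos (identity, canonical_url, first_seen, last_seen)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(identity) DO UPDATE SET last_seen = excluded.last_seen
        """,
        (identity, url, seen_at, seen_at),
    )
    conn.commit()
```

Calling `ingest` twice with the same identity leaves one row: first_seen keeps the original timestamp and last_seen takes the newer one.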

Trade-offs and Practical Considerations

Cost of Complexity

Building this pipeline required upfront effort to separate identity from string operations. However, the long-term benefits—reduced storage bloat, faster search, and fewer manual deduplication jobs—outweigh the initial complexity.

Handling Redirects

Shortlinks (e.g., bit.ly) and consent redirects pose a challenge because they obscure the true URL. The system resolves these via a bounded HTTP HEAD request with strict guardrails: only follow redirects for known shorteners, limit hop counts, and cache resolutions to avoid network overhead.
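The three guardrails (allowlist, hop limit, cache) might look like the sketch below. It is an illustration under stated assumptions: `KNOWN_SHORTENERS`, `MAX_HOPS`, `head_location`, and `resolve` are hypothetical names, the shortener list and hop limit are made up, and the cache is a plain dictionary. The fetch step is passed in as a parameter so it can be replaced with a stub in tests.

```python
import http.client
from typing import Callable, Dict, Optional
from urllib.parse import urljoin, urlsplit

KNOWN_SHORTENERS = {"bit.ly", "t.co"}  # assumed allowlist
MAX_HOPS = 3                           # assumed hop limit


def head_location(url: str, timeout: float = 3.0) -> Optional[str]:
    """Send one HEAD request and return the redirect target, if any."""
    parts = urlsplit(url)
    conn_cls = (
        http.client.HTTPSConnection
        if parts.scheme == "https"
        else http.client.HTTPConnection
    )
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)  # http.client never follows redirects itself
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in (301, 302, 303, 307, 308) and location:
            return urljoin(url, location)  # handle relative Location values
        return None
    finally:
        conn.close()


def resolve(
    url: str,
    cache: Dict[str, str],
    fetch: Callable[[str], Optional[str]] = head_location,
    max_hops: int = MAX_HOPS,
) -> str:
    """Follow redirects only through known shorteners, at most max_hops times."""
    if url in cache:
        return cache[url]
    current = url
    for _ in range(max_hops):
        host = (urlsplit(current).hostname or "").lower()
        if host not in KNOWN_SHORTENERS:
            break  # not a shortener: stop and keep this URL
        target = fetch(current)
        if target is None:
            break  # no redirect offered
        current = target
    # If the hop limit is reached, the last URL seen is returned as-is.
    cache[url] = current
    return current
```

URLs on hosts outside the allowlist are returned untouched without any network call, which keeps the common case free of network overhead.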

Validation is Critical

Malformed IDs (e.g., truncated YouTube IDs) can silently merge unrelated videos. The extractor’s strict validation ensures only valid keys are stored, preventing data corruption.

Operational Impact

The pipeline improved three areas:

  • Search dedup: FTS5 indexing on a deduplicated table eliminated duplicate search results.
  • Cache efficiency: Cloudflare and LiteSpeed caches keyed off canonical URLs reduced redundant page loads.
  • Trending accuracy: last_seen timestamps aggregated across all sources improved trending algorithms.

Lessons Learned

  1. Separate identity from string: Treating URLs as strings leads to fragile systems.
  2. Validate identities rigorously: A single bad key can corrupt data at scale.
  3. Deduplicate at the database layer: Application-level checks risk race conditions.
