Deduplicating Video URLs at Scale with a PHP Canonicalization Pipeline

A duplicated video catalog is usually not a URL problem. It is an identity problem hiding behind strings.

Problem

A content aggregator can ingest the same video many times without any single component looking obviously broken. One region sees a youtube.com/watch?v=... URL. Another receives a youtu.be/... share link. A mobile endpoint returns m.youtube.com. An embed scraper stores /embed/.... A shorts feed stores /shorts/.... Add tracking parameters, reordered query strings, and short-link redirects, and the database now has several rows for one piece of content.

That failure starts as a string comparison bug, but it becomes a systems problem quickly. If the primary uniqueness constraint is the raw URL, every syntactically different representation becomes a separate fact. Search indexes return duplicates. Trending jobs count the same video multiple times. Regional availability becomes inaccurate. Sitemaps emit duplicate content. Cache keys fragment. Moderation and enrichment jobs repeat work.

The key lesson is that a URL is often a locator, not an identity. For video platforms, the identity is usually a platform-specific stable ID. YouTube has an eleven-character video ID. Vimeo and Dailymotion have their own URL conventions. Unknown providers may not expose a clean identity, so the fallback has to be conservative.

Solution Approach

A practical pipeline separates canonicalization into two stages: URL normalization and identity extraction.

URL normalization cleans the string while preserving meaning. It lowercases and aliases hosts, removes known tracking parameters, normalizes paths, sorts query parameters, and emits a deterministic URL. PHP gives you the basic pieces through parse_url, parse_str, and http_build_query. The job is not to guess intent. The job is to remove noise you understand and keep everything else.

Identity extraction then asks a different question: what object does this URL identify? For YouTube, /watch?v=dQw4w9WgXcQ, /embed/dQw4w9WgXcQ, /shorts/dQw4w9WgXcQ, and youtu.be/dQw4w9WgXcQ all resolve to the same canonical identity: youtube:dQw4w9WgXcQ. This is the value that belongs in a uniqueness constraint. The normalized URL is still useful for display, auditing, and outbound links, but it should not be the deduplication key.

A typical PHP implementation can keep these responsibilities small. One class normalizes URLs. Another resolves platform identity. A storage class writes by canonical key. That division matters because URL formats change. When YouTube adds a new path shape, the identity resolver changes. When a new tracking parameter appears, the normalizer changes. Those should not be the same test surface.

The database design should reflect the same model. A videos table can use canonical_key as the primary key, with columns for platform, video_id, title, and timestamps. A separate video_regions table can store (canonical_key, region) as a composite primary key. That turns regional presence into an explicit many-to-many relationship instead of a side effect of whichever URL variant arrived first.

SQLite is enough for this pattern if the workload fits the host. ON CONFLICT gives you an idempotent write path: insert the canonical video if it is new, otherwise update fields such as last_seen and title. FTS5 can index the canonical rows rather than every raw ingest event. In PHP, PDO keeps the write path straightforward and portable.

The important property is convergence. If eight regional crons ingest the same video, the database should end with one videos row and eight video_regions rows. The order should not matter. Retrying yesterday's feed should not create new identities. Running the same cron twice should be boring.

Redirects need different treatment. A short URL cannot be fully canonicalized without an HTTP request, and putting network calls on the hot ingest path is how a slow upstream turns into a backed-up cron queue. Resolve redirects out of band with tight timeouts, capped hops, and a cache table such as url_redirects(short, resolved, resolved_at). PHP's cURL can do this, but the operational rule is more important than the library choice: fail soft, cache aggressively, and never let redirect resolution block the rest of ingestion.

Backfills are the dangerous part. Once duplicates already exist, the migration must group rows by canonical key, choose a survivor, union the region memberships, update dependent indexes, and delete duplicates inside a transaction. A dry run against a database copy should print counts before it writes. The cost of a bad canonical key is not a failed deploy, it is deleted or merged content that may be difficult to reconstruct.

Trade-offs

The first trade-off is between aggressive deduplication and false merges. If the resolver treats too many URL shapes as equivalent, distinct content can collapse into one row. That is worse than a duplicate because it corrupts identity. Platform-specific validation helps. For YouTube, requiring the documented eleven-character ID pattern is a simple guardrail. For unknown providers, hashing the normalized URL is safer than pretending to understand their identity model.

The second trade-off is between completeness and ingest latency. Resolving every redirect inline would catch more duplicates sooner, but it imports external failure into the write path. A cached asynchronous resolver means some duplicates may survive until the next cycle, but the ingest system remains predictable. Systems that process feeds repeatedly can usually accept eventual cleanup. Systems that require immediate uniqueness may need a queue, a worker pool, and stricter backpressure controls.

The third trade-off is consistency. A single SQLite database with transactional upserts gives strong local consistency for one host. If the same design moves to distributed storage, the guarantees change. A relational database with a unique index on canonical_key still gives a clear conflict point. A distributed datastore may need conditional writes, compare-and-set semantics, or a dedicated identity service. Without that central conflict boundary, two workers can both believe they created the canonical record.

The API pattern follows from the data model. Ingest APIs should accept raw source data, but internal write APIs should operate on resolved identity. That means the boundary looks like ingest(rawUrl, region, metadata), then internally becomes normalize, resolveIdentity, and upsertCanonicalVideo. Downstream systems should read by canonical_key, not by raw URL. This keeps search, ranking, enrichment, moderation, and sitemap generation aligned around the same identity.

There is also an observability angle. Canonicalization should produce metrics: normalization failures, identity resolution failures, duplicate collapse count, redirect cache hit rate, and provider-specific fallback rates. A sudden increase in generic hashed identities may mean a platform changed its URL format. A spike in redirect timeouts may mean a short-link provider is slow. These are data quality signals, not just parser errors.

The broader pattern is common in distributed systems: normalize inputs at the edge, extract a stable identity, enforce uniqueness at the storage boundary, and make writes idempotent. The code can be plain PHP and SQLite. The hard part is deciding where identity lives and refusing to let raw external strings become primary keys by accident.

A video catalog that fans out across regions will always see repeated facts. The system should treat repetition as normal input, not as an exceptional condition. Canonical keys, platform-aware resolvers, transactional upserts, and explicit region membership turn duplicate-prone ingestion into a convergent pipeline. That is the difference between a feed that grows with content and a feed that grows with every spelling of the same URL.

#canonicalization #deduplication #URL normalization #Database Design #backend-scaling

Deduplicating Video URLs at Scale with a PHP Canonicalization Pipeline

Problem

Solution Approach

Trade-offs

Comments