As organizations adopt multi-engine lakehouses built on Apache Iceberg, inconsistent identifier-casing rules across Spark, Trino, Flink, and Snowflake cause silent failures: tables exist in metadata yet are invisible to certain query engines. Maintaining portability requires organization-wide naming conventions and cross-engine validation.
The promise of the lakehouse architecture—a unified data layer accessible by diverse compute engines—has hit a subtle but pervasive snag: identifier resolution. While open table formats like Apache Iceberg standardize how data and metadata are stored on disk, they do not standardize how database engines interpret and normalize object names (databases, schemas, tables, columns). This leaves teams grappling with a 'Tower of Babel' effect where the same logical table appears differently—or not at all—depending on which engine is querying it.
Consider a simple scenario: a data engineer creates a table in Spark using CREATE TABLE analytics.UserEvents (userId STRING, eventTime TIMESTAMP). Spark, by default, preserves the exact casing provided and stores the table name as UserEvents in the Iceberg catalog. When a business analyst later tries to query this table from Trino using SELECT * FROM analytics.userevents, the query fails. Trino normalizes all identifiers to lowercase during parsing, so it looks for userevents in a catalog that actually contains UserEvents. Even if the catalog performed case-insensitive lookups (which many don't), column-level access would still fail: SELECT UserId FROM UserEvents would not find a column stored as userId in Iceberg metadata when queried from an engine with strict case sensitivity.
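The mismatch can be sketched as a tiny simulation. The one-line normalization rules below are simplified illustrations of the behavior described above, not the engines' actual parsers:

```python
# Simplified sketch: the same identifier misses a stored name when two
# engines normalize it differently. Illustrative rules only.

def spark_normalize(identifier: str) -> str:
    # Spark (by default) preserves the casing the user supplied.
    return identifier

def trino_normalize(identifier: str) -> str:
    # Trino lowercases unquoted identifiers during parsing.
    return identifier.lower()

# Name as stored at CREATE TABLE time from Spark:
stored_name = spark_normalize("UserEvents")      # "UserEvents"

# Name Trino looks up for SELECT * FROM analytics.userevents:
lookup_name = trino_normalize("UserEvents")      # "userevents"

catalog = {stored_name}                          # case-sensitive catalog entries
print(lookup_name in catalog)                    # False: table is "invisible"
```

The same two-function mismatch explains the column-level failure: userId survives the Spark write intact, but a case-normalizing reader looks for userid.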
This isn't theoretical. In practice, teams observe:
- Tables created with mixed-case names in Spark becoming invisible to Trino
- Column-level queries failing in Flink despite working in Spark due to casing mismatches
- Snowflake's CASE_INSENSITIVE mode masking table-level issues while exposing column-level problems
- AWS Glue Data Catalog rejecting PascalCase table names outright, forcing schema rewrites
The root cause lies in divergent philosophies across the stack. Engines like Spark, Flink, and DuckDB preserve identifier casing as provided, while Trino and Snowflake (via catalog-linked databases, CLD) normalize to lowercase. Catalogs add another layer: Apache Polaris maintains case-sensitive matching, AWS Glue forces lowercase, and Unity Catalog standardizes to lowercase. When user intent, engine behavior, and catalog semantics don't align, the result is metadata that accurately reflects stored names but fails to resolve queries as expected.
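This layering can be modeled as two normalization passes, one per layer. The per-layer rules below are simplified placeholders for the behaviors just listed; real behavior varies with engine version and configuration:

```python
# Sketch: an identifier resolves only if it survives both the engine's
# and the catalog's normalization and still matches the stored name.
# Rules are illustrative approximations of the behaviors described above.
ENGINE_RULES = {
    "spark": lambda s: s,      # preserves casing as provided
    "flink": lambda s: s,      # preserves casing as provided
    "trino": str.lower,        # lowercases unquoted identifiers
}
CATALOG_RULES = {
    "polaris": lambda s: s,    # case-sensitive matching
    "glue": str.lower,         # forces lowercase
    "unity": str.lower,        # standardizes to lowercase
}

def resolves(user_name: str, engine: str, catalog: str, stored_name: str) -> bool:
    """Does the user's identifier reach the stored name through both layers?"""
    return CATALOG_RULES[catalog](ENGINE_RULES[engine](user_name)) == stored_name
```

Under this model, resolves("UserEvents", "trino", "polaris", "UserEvents") is False, while an all-lowercase name resolves through every engine/catalog pair, which is exactly the argument for snake_case made below.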

Figure 1. Multi-engine lakehouse setup showcasing identifier issues across popular database engines.
The impact extends beyond simple query failures. In multi-engine pipelines where Spark writes data and Trino reads it for analytics, a casing mismatch can make entire datasets appear empty, triggering false alerts or missed SLAs. For real-time systems using Flink, column-level casing errors can cause silent data corruption when downstream systems expect specific field names. These issues are particularly insidious because they don't produce syntax errors—they manifest as empty result sets or incorrect data, making them difficult to trace in complex workflows.
Addressing this requires treating identifier naming as a first-class data contract, not an engine-specific detail. The most reliable approach is adopting a strict, organization-wide naming convention. Limiting all identifiers to lowercase with underscores (snake_case) works universally because:
- It aligns with Trino's lowercase normalization
- It avoids case-sensitivity pitfalls in Flink and Spark
- It satisfies catalogs like AWS Glue that enforce lowercase
- It eliminates ambiguity in column references
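A convention is only useful if it is enforced. A minimal check is a few lines of Python; the regex and length limit below are illustrative defaults, not a standard:

```python
import re

# Minimal naming-convention check: lowercase snake_case identifiers that
# start with a letter. The 255-character cap is an illustrative default.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def validate_identifier(name: str) -> bool:
    return bool(SNAKE_CASE.fullmatch(name)) and len(name) <= 255

print(validate_identifier("user_events"))   # True
print(validate_identifier("UserEvents"))    # False
```

Running a check like this in code review or CI keeps non-conforming names out of DDL before any engine ever sees them.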
However, simply mandating snake_case isn't sufficient. Teams must validate behavior across their entire stack. A lightweight CI job that:
- Creates a table via the primary ingestion engine (e.g., Spark)
- Verifies discoverability and queryability from every other engine (Trino, Flink, etc.)
- Checks both table-level and column-level access
can catch normalization surprises before they reach production. For example, testing should confirm that a table named user_events created in Spark resolves under that same name from both Trino and Flink, and that columns like event_timestamp resolve correctly in every engine.
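The CI job above can be sketched as a single probe loop. The query_fns callables are hypothetical stand-ins for real Spark/Trino/Flink connections (each would run something like SELECT <column> FROM <table> LIMIT 1 and report success); here a simulated "Trino-like" probe that only sees lowercase names demonstrates the mechanics:

```python
# Sketch of a cross-engine CI check. `query_fns` maps engine name to a
# probe callable(table, column) -> bool; column=None probes the table itself.
# Real probes would issue queries over actual engine connections.

def validate_table_across_engines(table, columns, query_fns):
    failures = []
    for engine, run_probe in query_fns.items():
        if not run_probe(table, None):       # table-level discoverability
            failures.append((engine, table, None))
        for col in columns:                  # column-level resolution
            if not run_probe(table, col):
                failures.append((engine, table, col))
    return failures

# Simulated probes: Spark accepts anything; "Trino" only resolves lowercase.
probes = {
    "spark": lambda t, c: True,
    "trino": lambda t, c: t == t.lower() and (c is None or c == c.lower()),
}
print(validate_table_across_engines("user_events", ["event_timestamp"], probes))  # []
```

An empty failure list gates the deployment; any tuple in the list names exactly which engine cannot see which table or column.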
Configuration tuning offers additional levers:
- In Spark: set spark.sql.caseSensitive=false to make column resolution case-insensitive
- For Snowflake CLD: ensure CATALOG_CASE_SENSITIVITY=CASE_INSENSITIVE
- In Trino when using Polaris: enable iceberg.rest-catalog.case-insensitive-name-matching=true to build a lowercase-to-original mapping
- Avoid disabling validation in AWS Glue (glue.skip-name-validation=true), as it creates drift between Terraform state and actual catalog state
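As a concrete sketch, the Trino setting above lives in the Iceberg catalog properties file. The connector and property names follow the list above; the catalog URI is a placeholder:

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://polaris.example.com/api/catalog
# Build the lowercase-to-original name mapping described above
iceberg.rest-catalog.case-insensitive-name-matching=true
```

Note this mapping helps Trino find mixed-case tables; it does not retroactively fix names that other engines or catalogs have already normalized differently.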

Figure 2. High-level flow of identifier resolution showing user intent, engine, catalog, and storage layers.
The deeper lesson is that shared storage and unified catalogs are necessary but insufficient for true interoperability. As one data architect put it: 'We solved the problem of having one copy of the data, but we forgot that engines need to agree on what to call it.' Organizations migrating to lakehouses must treat identifier resolution with the same rigor as schema evolution or data quality checks. This means:
- Documenting naming conventions in data contracts
- Including identifier validation in pipeline testing
- Treating casing inconsistencies as breaking changes requiring versioning
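As one way to make the first point concrete, the naming policy can live directly in the data contract. The fragment below is a made-up illustration, not any specific contract standard:

```yaml
# Hypothetical data-contract fragment capturing the naming convention
table: user_events
naming_policy:
  case: snake_case                        # lowercase with underscores only
  validated_engines: [spark, trino, flink]
columns:
  - name: user_id
  - name: event_timestamp
```

Because the contract is machine-readable, the same file can drive the CI validation job and the breaking-change check.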
Until the SQL dialect gap narrows—which remains challenging given vendors' historical commitments—the burden of maintaining cross-engine portability falls on the data engineering team. By treating identifier naming as a shared language rather than an engine quirk, teams can turn a potential source of silent failures into a predictable, governed aspect of their lakehouse architecture.
