Google Cloud Extends BigQuery with Cross‑Engine Apache Iceberg Support

Cloud Reporter

Google Cloud previewed a serverless Iceberg REST catalog that lets BigQuery share tables with Spark, Flink, Trino and other engines, adds managed metadata, automatic maintenance and cross‑cloud access, and positions BigQuery against AWS and Azure’s native Iceberg offerings.

What changed

At the Apache Iceberg Summit, Google announced a preview of a serverless Iceberg REST catalog integrated with BigQuery. The new service lets teams create, update and query the same Apache Iceberg tables from BigQuery and from external compute engines such as Spark, Flink, and Trino without moving data or converting formats. Key additions include:

  • Managed metadata and table lifecycle – Google now handles schema evolution, hidden partitioning, compaction and transaction logs that customers previously had to orchestrate themselves.
  • Cross‑cloud catalog access – Iceberg tables stored in Google Cloud can be queried from AWS, Azure, Databricks and Snowflake, enabling a true multi‑cloud lakehouse.
  • Unified access control – Permissions are defined once in the catalog and enforced across all query engines.
  • ObjectRefs for multimodal data – BigQuery can now reference unstructured objects in Cloud Storage alongside structured Iceberg rows, simplifying AI‑driven pipelines.
  • Knowledge Catalog integration – Metadata, lineage and policy management are exposed through Google’s governance layer (formerly Dataplex).

The preview is intended to eliminate the “hidden tax” of operating Iceberg on‑premises (manual compaction, metadata bloat and custom orchestration) by offering a fully managed control plane while preserving the open‑format benefits of Iceberg.
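
To make the cross‑engine idea concrete, here is a minimal sketch of how an external Spark job is typically pointed at an Iceberg REST catalog. The property names are the standard Iceberg Spark catalog settings; the catalog name `lake`, the endpoint URI and the warehouse path are hypothetical placeholders, and the announcement does not specify the actual endpoint or authentication settings for Google's preview.

```python
def iceberg_rest_catalog_conf(catalog_name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties that attach an Iceberg REST catalog under `catalog_name`."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions (for example, CALL procedures).
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog implementation and selects the REST catalog type.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_rest_catalog_conf(
    "lake",                                    # hypothetical catalog name
    "https://catalog.example.com/iceberg/v1",  # hypothetical endpoint, not Google's
    "gs://example-bucket/warehouse",           # hypothetical warehouse location
)
for key, value in conf.items():
    print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` in a job that has the Iceberg Spark runtime on its classpath. Credentials are deliberately left out here because they depend on the provider.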


Provider comparison

| Feature | Google BigQuery (Iceberg preview) | AWS (Athena + Glue + EMR) | Azure Synapse / Azure Data Lake | Databricks Lakehouse |
| --- | --- | --- | --- | --- |
| Native Iceberg support | Serverless REST catalog, managed metadata, automatic compaction (preview) | Athena supports Iceberg tables; Glue provides a catalog but requires manual maintenance; EMR can run Spark‑Iceberg workloads | Azure Synapse now reads Iceberg via open‑source connectors; Azure Data Lake Storage holds the files but catalog is separate | Databricks Runtime offers first‑class Iceberg tables with Delta‑compatible APIs; catalog is Databricks‑managed |
| Cross‑engine access | Same catalog visible to BigQuery, Spark, Flink, Trino, and external clouds | Separate catalogs per service; Athena can query S3 data, but Spark on EMR needs its own catalog configuration | Synapse can query Iceberg, but Spark pools require distinct catalog definitions | Databricks provides unified catalog for its own compute; external engines need federation layers |
| Managed metadata | Automatic schema evolution, hidden partitioning, transaction log cleanup | Glue crawlers generate schema; compaction and cleanup are user‑driven | Azure Purview tracks metadata but does not automate Iceberg housekeeping | Databricks Unity Catalog handles metadata, but still requires Spark‑side tuning |
| Security model | Centralized IAM policies on the REST catalog; fine‑grained column‑level controls propagated to all engines | IAM on S3 + Lake Formation policies; enforcement varies per service | Azure RBAC + Lakehouse ACLs; consistency across services is still evolving | Unity Catalog provides row‑ and column‑level security, but only inside Databricks |
| Pricing | Pay‑as‑you‑go for BigQuery queries; catalog API calls are billed per request; storage costs follow standard GCS rates | Athena charges per TB scanned; Glue catalog charges per object; EMR charges for cluster uptime | Synapse charges per DWU for compute and per TB stored; additional costs for Purview metadata | Databricks charges per DBU for compute and per TB stored; Unity Catalog adds a per‑catalog fee |
| Multi‑cloud reach | Explicit support for querying Iceberg tables from AWS, Azure, Snowflake and Databricks via the REST catalog | No native cross‑cloud catalog; customers build custom federation or replicate data | Primarily single‑cloud; cross‑cloud requires Azure Arc or third‑party tools | Focused on Databricks‑hosted clouds; external access requires data replication |

Takeaway: Google’s preview narrows the functional gap with AWS and Azure by delivering a single, managed control plane that works across engines and clouds. The biggest differentiator is the centralized REST catalog that eliminates the need for separate Glue, Purview or Unity Catalog instances when a company wants to run heterogeneous workloads.


Business impact

Reduced operational overhead

Enterprises that have already adopted Iceberg for its ACID guarantees often spend significant engineering time on metadata compaction and catalog synchronization. By moving those responsibilities to Google’s managed service, teams can reallocate resources to higher‑value activities such as model training or business‑logic development. The preview’s automatic table maintenance also cuts the risk of stale snapshots that can cause query failures.
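
For context, the housekeeping that self‑managed Iceberg deployments usually schedule themselves can be sketched with Iceberg's built‑in Spark procedures. The snippet below only generates the SQL statements; the catalog name `lake`, the table `sales.orders` and the seven‑day retention window are hypothetical, and nothing here describes how Google's managed service performs maintenance internally.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(catalog: str, table: str, retain_days: int, now: datetime) -> list[str]:
    """Return the Spark SQL calls a scheduled Iceberg maintenance job would typically run."""
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compact small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Drop snapshots older than the retention cutoff.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Delete files no longer referenced by any table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


statements = maintenance_statements(
    "lake", "sales.orders", retain_days=7,
    now=datetime(2025, 1, 31, tzinfo=timezone.utc),
)
for sql in statements:
    print(sql)
```

In a self‑managed setup each statement would be run through `spark.sql(...)` on a schedule, which is the kind of recurring job the managed catalog is meant to make unnecessary.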

Cost predictability

BigQuery’s query‑based pricing model means organizations pay only for the data they actually scan. When combined with a managed catalog, the cost of metadata operations becomes negligible compared with the variable compute spend on Spark or Flink clusters in other clouds. Companies can therefore run exploratory analytics in BigQuery while keeping production pipelines on Spark without incurring duplicate storage or catalog fees.

Multi‑cloud flexibility

The ability to query the same Iceberg tables from AWS, Azure or Snowflake gives CIOs a real lever for vendor negotiation. If a downstream team prefers Athena for ad‑hoc analysis, they can do so without replicating data. This flexibility also supports data‑gravity strategies where raw files stay in Google Cloud Storage, but compute moves to the most cost‑effective engine for a given workload.

Faster AI integration

ObjectRefs let data scientists join structured Iceberg rows with unstructured assets (images, PDFs, logs) stored in Cloud Storage. In practice, a single BigQuery SQL statement can enrich a training dataset with metadata from a video file, streamlining multimodal AI pipelines that previously required separate ETL jobs.

Competitive positioning

Google’s move signals a shift from “storage‑only” to “context‑aware” data services. While AWS and Azure continue to charge separately for compute, storage and catalog, Google bundles the catalog into its serverless stack, making the total cost of ownership more transparent for lakehouse adopters. Companies evaluating a migration from on‑premise Iceberg deployments or from competing managed services now have a concrete benchmark: compare BigQuery’s per‑TB query cost and catalog request fees against Athena’s per‑TB scan cost plus Glue catalog charges.
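
The benchmark described above comes down to simple arithmetic, sketched below under stated assumptions. The `PriceSheet` fields and every number are placeholders chosen for illustration, not published prices from any provider, and the request‑based catalog fee is a simplification (the article notes, for example, that Glue charges per object).

```python
from dataclasses import dataclass

BYTES_PER_TIB = 1024 ** 4  # assumes binary terabytes; check each provider's billing unit


@dataclass(frozen=True)
class PriceSheet:
    per_tib_scanned: float               # USD per TiB scanned by queries
    per_million_catalog_requests: float  # USD per million catalog requests


def monthly_cost(prices: PriceSheet, bytes_scanned: int, catalog_requests: int) -> float:
    """Estimate monthly spend as scan cost plus catalog request cost."""
    scan = bytes_scanned / BYTES_PER_TIB * prices.per_tib_scanned
    catalog = catalog_requests / 1_000_000 * prices.per_million_catalog_requests
    return scan + catalog


# Placeholder price sheets: substitute figures from the providers' pricing pages.
option_a = PriceSheet(per_tib_scanned=5.0, per_million_catalog_requests=1.0)
option_b = PriceSheet(per_tib_scanned=6.0, per_million_catalog_requests=0.5)

# Hypothetical pilot workload: 40 TiB scanned and 2 million catalog requests per month.
scanned, requests = 40 * BYTES_PER_TIB, 2_000_000
cost_a = monthly_cost(option_a, scanned, requests)
cost_b = monthly_cost(option_b, scanned, requests)
print(f"option A: ${cost_a:.2f}  option B: ${cost_b:.2f}")
```

With measured bytes scanned from a pilot workload in place of the placeholder volume, the same function gives a like‑for‑like comparison; compute charged by cluster uptime would need a separate term.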


Migration considerations

  1. Catalog migration – Existing Iceberg tables can be registered in the new REST catalog via the iceberg-rest API. Google provides a migration script that reads the current metadata.json files from GCS and creates corresponding catalog entries.
  2. Access‑control alignment – Review IAM policies on the REST catalog and map them to existing Lake Formation or Unity Catalog permissions to avoid privilege gaps.
  3. Query rewrite – Spark queries against existing Iceberg tables need only a change of catalog endpoint; the SQL itself stays the same because Iceberg’s table format is unchanged.
  4. Cost modeling – Run a pilot workload in BigQuery and compare the query‑bytes‑processed metric against Athena’s scanned‑bytes cost to validate the expected savings.
  5. Hybrid compute – For workloads that still require low‑latency Spark processing, keep Spark clusters on AWS or Azure and point them at the same Iceberg REST catalog. This avoids data duplication and ensures consistent schema evolution.
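
As an illustration of the catalog migration step, the sketch below builds (but does not send) a table registration request in the shape defined by the open Iceberg REST catalog specification: a POST to the namespace's `register` path with the table name and the location of its current metadata file. This is not Google's migration script, whose details the article does not give. The base URL, prefix, namespace, metadata path and token are hypothetical, and whether the preview exposes this endpoint in exactly this form is an assumption.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_register_request(base_uri: str, prefix: str, namespace: str,
                           table: str, metadata_location: str, token: str) -> Request:
    """Build a register-table request for an Iceberg REST catalog (single-level namespace)."""
    url = (
        f"{base_uri.rstrip('/')}/v1/{prefix.strip('/')}"
        f"/namespaces/{quote(namespace, safe='')}/register"
    )
    body = json.dumps({"name": table, "metadata-location": metadata_location}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
    )


req = build_register_request(
    "https://catalog.example.com/iceberg",  # hypothetical base URL
    "example-warehouse",                    # hypothetical catalog prefix
    "sales",                                # hypothetical namespace
    "orders",                               # hypothetical table name
    "gs://example-bucket/warehouse/sales/orders/metadata/00003-abc.metadata.json",
    "TOKEN",                                # placeholder credential
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would submit it. Because registration only records a pointer to existing metadata, the data files stay where they are, which is consistent with the no‑copy migration described above.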

Final thoughts

Google Cloud’s cross‑engine Iceberg preview transforms BigQuery from a pure analytics engine into a central hub for lakehouse governance. By offering a managed catalog that works across compute engines and clouds, Google reduces the friction that has traditionally kept organizations locked into a single analytics stack. The move also raises the bar for AWS and Azure, which will need to provide comparable cross‑cloud catalog services to stay competitive.

For teams already invested in Iceberg, the logical next step is to run a proof‑of‑concept that registers a production table in the REST catalog, queries it from BigQuery and from an external Spark job, and measures both performance and cost. The results will clarify whether the managed approach delivers the promised operational savings and whether the cross‑cloud flexibility aligns with the organization’s broader cloud strategy.


Renato Losio is a principal cloud architect and AWS Data Hero. Connect with him on LinkedIn.