Google BigQuery introduces global queries in preview, enabling SQL analysis across geographically distributed datasets without ETL pipelines, though with added latency and costs.
Google Cloud has unveiled a preview of global queries for BigQuery, a feature that allows developers to run SQL queries across data stored in different geographic regions without first moving or copying the data. This new capability simplifies analytics for companies with distributed datasets while letting them control where the query runs, eliminating the need for complex ETL pipelines.

According to Wawrzek Hyska, product manager at Google, and Oleh Khoma, software engineer at Google, the feature works by automatically handling data movement required to execute queries across regions. "In the background, BigQuery automatically handles the data movement required to execute the query, giving you a seamless, zero-ETL experience for multi-location analytics," they explain. The system identifies different parts of the query that must be executed in different regions and runs them accordingly, then transfers results to a selected location while attempting to minimize transfer size.
For example, developers can now combine transaction data from Europe and Asia with customer data from the US using a standard SQL query. While similar results could be achieved using ETL pipelines to copy and centralize data before running SQL statements, the new feature lets BigQuery run queries across data in different regions directly, making data analysis simpler and faster.
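As an illustration, such a cross-region query might look like the following sketch. The project, dataset, and table names are hypothetical, as is the assumption that the transaction datasets reside in EU and Asia regions and the customer dataset in a US region:

```sql
-- Hypothetical example: transactions stored in EU and Asia datasets,
-- customer data stored in a US dataset, queried with one SQL statement.
SELECT
  c.customer_id,
  c.country,
  SUM(t.amount) AS total_spend
FROM (
  -- Union the regionally stored transaction tables.
  SELECT customer_id, amount FROM `my-project.sales_eu.transactions`
  UNION ALL
  SELECT customer_id, amount FROM `my-project.sales_asia.transactions`
) AS t
JOIN `my-project.crm_us.customers` AS c
  ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.country;
```

With global queries, BigQuery would execute the regional subqueries where the data lives and join the intermediate results in the location selected for the query.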
However, global queries come with trade-offs. They incur higher latency than single-region queries due to the time required to transfer data between regions. The feature also adds data-transfer and compute costs, and regulations may prohibit data from leaving its original location. Developers must explicitly opt in by specifying the location where a global query is executed.
The new feature allows data engineers to control where data is processed, aligning with their data residency and compliance requirements. "What's different is that BigQuery now executes it across datasets that are thousands of miles apart," Hyska and Khoma note. "This both dramatically simplifies your architecture and accelerates your time to insight."
Google Cloud is not alone in offering options to query distributed data with a single SQL statement. AWS provides cross-region data sharing for Amazon Redshift, and Athena can query data across regions, but it does not automatically coordinate distributed execution across regions the way BigQuery global queries do.
To enable global queries, data engineers must update the project or organization configuration, setting enable_global_queries_execution to true in the region where the query runs and enable_global_queries_data_access to true in the regions where the data is stored. Queries can run in one project and access data from regions in another project, though no cache is used in order to avoid transferring data between regions.
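Assuming these settings follow BigQuery's usual region-scoped configuration syntax via ALTER PROJECT (the project name and the choice of US and EU regions here are hypothetical), enabling the feature could look roughly like:

```sql
-- Sketch only: allow global queries to execute in the US region...
ALTER PROJECT `my-project`
SET OPTIONS (`region-us.enable_global_queries_execution` = TRUE);

-- ...and allow data stored in the EU region to be accessed by them.
ALTER PROJECT `my-project`
SET OPTIONS (`region-eu.enable_global_queries_data_access` = TRUE);
```

Both sides must opt in: the execution region enables running global queries, while each data region separately permits its data to participate.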
The cost structure for global queries includes the compute cost of each subquery in remote regions based on local pricing, the cost of the final query in the region where it runs, the cost of copying data between regions under data replication pricing, and the cost of storing copied data in the primary region for eight hours.
This preview feature represents a significant step toward simplifying distributed data analytics, though organizations will need to carefully weigh the benefits against the additional latency, costs, and compliance considerations before adopting it for production workloads.
