December 19, 2024

Mastering BigQuery Cost Optimization: Insights from a LinkedIn Live Deep Dive

Yuliia Tkachova
Co-founder & CEO, Masthead Data

In a recent LinkedIn Live session, Yuliia, CEO and co-founder of Masthead Data, talked about a pressing challenge facing data teams today: the rising cost of Google BigQuery. With organizations seeing BigQuery consume up to 50% of their Google Cloud bills, understanding cost optimization has become crucial for data teams of all sizes.

The Google BigQuery Cost Challenge

The landscape of cloud data warehousing costs is shifting dramatically. What once represented a modest 5-10% of organizations’ Google Cloud bills has grown to dominate cloud spending, with some companies seeing BigQuery account for up to half of their cloud costs. Through her experience at Masthead Data, which helps data teams understand their data platform usage through log analysis, Yuliia has observed a critical pattern: compute costs consistently make up 85-90% of total BigQuery spending.

This insight challenges the traditional focus on storage optimization and suggests a need to rethink cost-saving approaches. While storage optimization remains relevant, the real opportunity for significant cost reduction lies in optimizing compute usage.

Storage Optimization: The Foundation

Google’s storage model has evolved significantly, introducing important distinctions between active and long-term storage. Active storage covers data modified within the last 90 days, while long-term storage applies to data untouched for over 90 days. What makes this particularly interesting is the choice between physical and logical storage types.
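
To see how much of your own data has already aged into long-term pricing, the storage INFORMATION_SCHEMA views can be queried directly. Below is a minimal sketch using the Python BigQuery client; it assumes your data lives in the US region, so swap region-us for your own location.

```python
# a minimal sketch: active vs. long-term logical storage per dataset
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

sql = """
SELECT
  table_schema                                           AS dataset,
  ROUND(SUM(active_logical_bytes)    / POW(1024, 3), 1)  AS active_gib,     -- modified within the last 90 days
  ROUND(SUM(long_term_logical_bytes) / POW(1024, 3), 1)  AS long_term_gib   -- untouched for 90+ days
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE NOT deleted
GROUP BY dataset
ORDER BY active_gib DESC
"""
for row in client.query(sql).result():
    print(f"{row.dataset:<40} active: {row.active_gib} GiB   long-term: {row.long_term_gib} GiB")
```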

Logical storage, while appearing cheaper at first glance, includes hidden benefits such as time travel and fail-safe protection, whose bytes are not billed separately. These features allow teams to recover data from up to seven days back and provide an additional seven-day buffer for recovery through Google Cloud support. Physical storage, though its per-gibibyte rate is higher, can be more cost-effective when working with highly compressible data.
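
To check which billing model would be cheaper for your own datasets before switching anything, the same TABLE_STORAGE view exposes both logical and physical byte counts. The sketch below estimates a monthly cost under each model using US multi-region list prices (logical: $0.02/GiB active and $0.01/GiB long-term; physical: $0.04/GiB active and $0.02/GiB long-term); treat those rates as assumptions and check current pricing for your region.

```python
# a hedged sketch: estimated monthly storage cost per dataset under logical vs. physical billing
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  table_schema AS dataset,
  ROUND(SUM(active_logical_bytes)    / POW(1024, 3) * 0.02
      + SUM(long_term_logical_bytes) / POW(1024, 3) * 0.01, 2) AS logical_usd,
  -- under physical billing, fail-safe bytes are billed too;
  -- active_physical_bytes already includes time-travel bytes
  ROUND(SUM(active_physical_bytes + fail_safe_physical_bytes) / POW(1024, 3) * 0.04
      + SUM(long_term_physical_bytes)                         / POW(1024, 3) * 0.02, 2) AS physical_usd
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE NOT deleted
GROUP BY dataset
ORDER BY logical_usd DESC
"""
for row in client.query(sql).result():
    print(f"{row.dataset:<40} logical: ${row.logical_usd:,.2f}   physical: ${row.physical_usd:,.2f}")
```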

A key insight worth remembering is the importance of understanding storage billing units: BigQuery bills storage per gibibyte (GiB), while Google Cloud Storage prices are quoted per gigabyte (GB). This distinction becomes crucial when considering data migration between services. For instance, moving data from BigQuery’s long-term logical storage to standard Cloud Storage could actually double storage costs, a detail often overlooked in optimization efforts.

Compute Optimization: The Game Changer

The most impactful portion of the session focused on compute optimization strategies. Google’s introduction of new pricing models, particularly the shift from traditional on-demand pricing to the new “editions” model, has created opportunities for significant cost savings. The editions model comes in three packages: Standard, Enterprise, and Enterprise Plus, each with its own capabilities and limitations.

Through real-world examples, Yuliia demonstrated how strategic use of these pricing models can transform BigQuery costs. One organization discovered they could reduce their monthly costs from $60,000 to $19,000 by switching specific workloads from on-demand to editions pricing. Even more dramatically, another case showed how a single pipeline’s cost could drop from $5,000 to just $58 through careful pricing model selection.
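
You can run the same comparison on your own workloads before committing to anything: the jobs INFORMATION_SCHEMA view records both bytes billed (the on-demand cost driver) and slot milliseconds (the editions cost driver). The sketch below uses illustrative US list prices of $6.25 per TiB on-demand and $0.04 per slot-hour for the Standard edition; both numbers are assumptions to verify against current pricing, and autoscaling minimums mean the editions figure is only a rough lower bound.

```python
# a hedged sketch: last 30 days of query spend, estimated under on-demand vs. Standard edition pricing
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2)  AS est_on_demand_usd,         -- $6.25 per TiB scanned
  ROUND(SUM(total_slot_ms) / (1000 * 60 * 60) * 0.04, 2)   AS est_standard_edition_usd   -- $0.04 per slot-hour
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
"""
row = list(client.query(sql).result())[0]
print(f"on-demand estimate:        ${row.est_on_demand_usd:,.2f}")
print(f"Standard edition estimate: ${row.est_standard_edition_usd:,.2f}")
```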

Strategic Implementation: A Practical Approach

Rather than advocating for massive changes, Yuliia recommended a measured, strategic approach to implementation. The key insight was that pipeline residency doesn’t have to match data residency – a concept that opens up new possibilities for cost optimization.
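
In practice this simply means the project that runs (and pays for) a query does not have to be the project that stores the tables. Here is a minimal sketch, with hypothetical project and table names, of a job billed to a reservation-backed project while the data stays where it is.

```python
# a minimal sketch: query jobs run and are billed in "editions-project",
# while the data remains in "data-project" (both project IDs are hypothetical)
from google.cloud import bigquery

client = bigquery.Client(project="editions-project")  # billing and slots come from this project

sql = """
SELECT DATE(event_ts) AS day, COUNT(*) AS events
FROM `data-project.analytics.events`   -- data residency is unchanged
GROUP BY day
ORDER BY day
"""
for row in client.query(sql).result():
    print(row.day, row.events)
```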

She outlined a practical strategy: start by identifying high-volume, frequent pipelines and moving them to a separate project with optimized pricing (a query like the sketch after this list can help surface candidates). This approach allows teams to test and validate cost savings before making broader changes. When evaluating whether to switch to editions pricing, teams should consider:

  1. The current volume of data processed
  2. The frequency of high-volume pipelines
  3. The sustainability of current usage patterns
  4. The potential cost impact across different pricing models
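
The jobs view can also surface those high-volume, frequent pipelines. A hedged sketch follows; it assumes pipelines tag their jobs with a "pipeline" label (a naming convention chosen here purely for illustration) and falls back to the submitting user when no label is present.

```python
# a hedged sketch: the 20 heaviest workloads over the last 30 days, by bytes billed
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  COALESCE(
    (SELECT value FROM UNNEST(labels) WHERE key = 'pipeline'),  -- hypothetical label convention
    user_email)                                        AS pipeline,
  COUNT(*)                                             AS runs,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2)     AS tib_billed,
  ROUND(SUM(total_slot_ms) / (1000 * 60 * 60), 1)      AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY pipeline
ORDER BY tib_billed DESC
LIMIT 20
"""
for row in client.query(sql).result():
    print(f"{row.pipeline:<50} runs={row.runs:<6} TiB={row.tib_billed:<8} slot-hours={row.slot_hours}")
```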

Advanced Optimization Techniques

The discussion then returned to systemic approaches to cost optimization. Yuliia emphasized that effective cost optimization isn’t just about choosing the right pricing model or storage type: it’s about understanding the interplay between the different aspects of your BigQuery usage.

For teams considering a switch to editions pricing, Yuliia recommended starting with a separate project for testing. This allows organizations to validate cost benefits without disrupting existing workflows. The editions model, particularly the Standard edition, comes with some limitations, such as a maximum of 1,600 slots compared to on-demand’s 2,000 slots, but these limits often don’t impact typical workloads.
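
Setting up such a test is mostly a matter of creating a small reservation in the isolated project and assigning that project to it. Below is a hedged sketch using the BigQuery Reservation API's Python client; the project ID, location, and slot ceiling are illustrative, and it assumes a recent google-cloud-bigquery-reservation release that exposes the edition and autoscale fields.

```python
# a hedged sketch: a small Standard edition reservation for an isolated test project
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = client.common_location_path("editions-test-project", "US")  # hypothetical project and location

# Create the reservation, allowing autoscaling up to 100 slots
res = client.create_reservation(
    parent=parent,
    reservation_id="standard-trial",
    reservation=reservation.Reservation(
        edition=reservation.Edition.STANDARD,
        autoscale=reservation.Reservation.Autoscale(max_slots=100),
    ),
)

# Route the test project's query jobs to the new reservation
client.create_assignment(
    parent=res.name,
    assignment=reservation.Assignment(
        assignee="projects/editions-test-project",
        job_type=reservation.Assignment.JobType.QUERY,
    ),
)
print(f"Created {res.name} and assigned editions-test-project to it")
```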

Future-Proofing Your Cost Strategy

As data volumes continue to grow and organizations become more dependent on cloud data warehouses, the ability to optimize costs while maintaining performance becomes increasingly crucial. Yuliia suggested that successful organizations will be those that develop a culture of cost awareness and continuous optimization.

The session concluded with practical recommendations for teams looking to start their optimization journey:

  • Begin with a thorough analysis of current usage patterns
  • Identify high-impact, frequent workloads for initial optimization
  • Test new pricing models in isolated projects
  • Monitor and measure the impact of changes
  • Gradually expand successful optimizations across the organization

For data teams looking to optimize their BigQuery costs, the Live session provided a clear roadmap: understand your usage patterns, start with targeted optimizations, and build on successful changes. It’s an approach that combines technical knowledge with practical business sense, making it particularly valuable for organizations at any stage of their cloud journey. For teams looking to get started with storage optimization, the bonus below is a practical first step.

BONUS

Here is a helpful batch script (available here) that allows you to view storage models for all datasets in one place, making it easier to identify optimization opportunities.
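
If you prefer to stay in Python, a minimal alternative sketch of the same idea is below: it lists every dataset in a project alongside its storage billing model, assuming a reasonably recent google-cloud-bigquery client (which exposes the storage_billing_model property on datasets).

```python
# a minimal sketch: print each dataset's storage billing model (logical vs. physical)
from google.cloud import bigquery

client = bigquery.Client()

for item in client.list_datasets():
    dataset = client.get_dataset(item.reference)          # fetch full dataset metadata
    model = dataset.storage_billing_model or "LOGICAL"    # unset means the default, logical billing
    print(f"{dataset.dataset_id:<40} {model}")
```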