November 28, 2024

Understanding and Managing Projects with Ad-Hoc Usage: A Better Fit for the BigQuery On-Demand Billing Model

Yuliia Tkachova
Co-founder & CEO, Masthead Data


BigQuery’s on-demand billing model offers a flexible and cost-effective solution for ad-hoc querying needs. However, understanding and managing projects with ad-hoc usage is crucial to avoid unexpected costs. This article delves into key strategies to optimize your BigQuery usage and minimize expenses, providing practical tips and best practices.

On-demand is the more traditional billing type and is fairly straightforward to manage.
Interestingly, the maximum number of slots allocated to a project using on-demand compute is 2,000. That is a lot, but it is worth noting that even though on-demand pricing is based on TiB scanned, Google Cloud's internal limits are still expressed in slots.
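To see how your workloads translate into slots, you can estimate per-job slot consumption yourself. Below is a minimal sketch that assumes the region-us multi-region and uses the documented INFORMATION_SCHEMA.JOBS_BY_PROJECT view:

-- Approximate average slots per query job over the last day
SELECT
  job_id,
  total_bytes_billed,
  total_slot_ms,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 10;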

Pro Tip: The best way to keep BigQuery efficient under on-demand pricing is to avoid surprises. If you have not yet implemented a limit on query cost, be sure to do it.

Use the maximum bytes billed setting to limit query cost. A query that would go above the limit fails without incurring any cost, as shown below.
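For example, with the bq CLI you can cap a single query at roughly 1 GB scanned (the query below is a placeholder); a query that would exceed the cap fails before any bytes are billed:

bq query --use_legacy_sql=false \
  --maximum_bytes_billed=1000000000 \
  'SELECT partner FROM appsflyer_user_level_attribution'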

Pro Tip: In addition to enabling cost control at the query level, you can apply similar logic at the user and project levels with custom quotas ("Query usage per day" and "Query usage per user per day") in the Google Cloud console.

Also worth knowing is an undocumented limitation of the on-demand pricing model: it limits the CPU resources a query may consume relative to the bytes it scans. Below is an example of a relatively complicated join of two tables (200K rows each). It used 135,086 CPU seconds but would be billed for only about 88 MB of analysis bytes, exceeding the ratio supported by the on-demand pricing model, and failed with the error "135086 CPU seconds were used, and this query must use less than 22500 CPU seconds." To continue running this type of workload, consider moving it to the capacity-based pricing model (Editions), which has no such limit.

Error message from BigQuery

WITH apps_flyer AS (
  SELECT event_timestamp AS time_appsflyer,
    android_id AS aid,
    device_id AS asid,
    gaid,
    idfa,
    idfv,
    partner,
    campaign_name
  FROM appsflyer_user_level_attribution
),
registrations AS (
  SELECT profile_id,
    aid,
    asid,
    gaid,
    idfa,
    idfv
  FROM user_registrations
)
SELECT *
FROM registrations a
LEFT JOIN apps_flyer b
-- Join on the first device identifier that is present in the attribution data
ON
  CASE
    WHEN b.asid IS NOT NULL THEN a.asid = b.asid
    WHEN b.aid IS NOT NULL THEN a.aid = b.aid
    WHEN b.gaid IS NOT NULL THEN a.gaid = b.gaid
    WHEN b.idfa IS NOT NULL THEN a.idfa = b.idfa
    WHEN b.idfv IS NOT NULL THEN a.idfv = b.idfv
  END;

At Masthead, we suggest that our customers and users keep on-demand projects for ad-hoc querying rather than heavy batch processing. To reduce the compute cost of ad-hoc querying, we recommend sharing the following set of rules with all users who have read access to BigQuery in the organization.

  1. Adopt a "no SELECT *" policy. Users need to develop the habit of selecting only the columns they need. When most columns are required, SELECT * EXCEPT lets you exclude the ones you do not need:
SELECT * EXCEPT (dim1, dim2)
FROM dataset.table
  2. Partition tables when they get large, and make sure you use the partition key in your queries to reduce the size of your table scans and, with it, your costs. You can use the Recommender in the Google Cloud console, which today offers BigQuery recommendations for partitioning, clustering, and materialized views to improve query performance and cost efficiency.
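    A minimal sketch (table and column names here are hypothetical): a table partitioned by date lets BigQuery prune every partition outside the filtered range.
-- Create a date-partitioned copy of a hypothetical events table
CREATE TABLE dataset.events_partitioned
PARTITION BY DATE(event_timestamp) AS
SELECT * FROM dataset.events;

-- Filtering on the partition key scans only the matching partition
SELECT event_name, COUNT(*) AS events
FROM dataset.events_partitioned
WHERE DATE(event_timestamp) = '2024-11-01'
GROUP BY event_name;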
  3. The ultimate idea behind every query should be to use less data for analysis. This can be done in various ways; here is one of the best practices:
    – Test your query against a subset of the data (e.g., a single day) before running it against the entire table. Operations like table sampling let you get a sense of the data without scanning the entire table or even partitioning it.

    Note: TABLESAMPLE is more cost-efficient than LIMIT, which does not reduce the amount of data scanned. Sampling drastically reduces the data scanned and thus the bytes billed.
SELECT *
FROM dataset.my_table 
TABLESAMPLE SYSTEM(10 PERCENT)
  4. Extract all the columns and rows you need into a smaller temporary or permanent table, then build your queries against that. For example:
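    A minimal sketch, assuming hypothetical table and column names, run as a multi-statement query so a temporary table is allowed:
-- Materialize only the needed columns and rows once
CREATE TEMP TABLE recent_signups AS
SELECT profile_id, partner, campaign_name
FROM joined_registrations
WHERE signup_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

-- Then iterate cheaply against the much smaller table
SELECT partner, COUNT(*) AS signups
FROM recent_signups
GROUP BY partner;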
  5. Use approximate aggregate functions. If you do not need pharmacy-level precision in a large data set, you can use APPROX_COUNT_DISTINCT() instead of operations like COUNT(DISTINCT ...), which demand a lot of memory and processing. For example, on a hypothetical user_id column:
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM dataset.my_table
  6. Use a CREATE SEARCH INDEX statement for one or more columns in a table, which increases efficiency when searching for particular data in a large table.
CREATE SEARCH INDEX [IF NOT EXISTS] index_name
ON table_name({ALL COLUMNS | column_name [,...]})
[OPTIONS(index_option_list)]
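    For instance (index and table names here are hypothetical), after the index is created, the SEARCH function can look up matching rows efficiently:
CREATE SEARCH INDEX IF NOT EXISTS logs_index
ON dataset.application_logs(ALL COLUMNS);

SELECT *
FROM dataset.application_logs
WHERE SEARCH(application_logs, 'error_42');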
  7. Before running a query, users can perform a dry run to estimate the query's cost and performance. It charges you neither slots nor bytes, but lets you validate the query and estimate its cost without running it. This is worth doing because, under the Editions model, BigQuery charges even for queries that fail with an error and are never actually executed.
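    With the bq CLI, for example, the --dry_run flag reports how many bytes a query would process without executing it (the query below is a placeholder):
bq query --use_legacy_sql=false --dry_run \
  'SELECT partner FROM appsflyer_user_level_attribution'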
  8. There is an option to run small queries in Short Query Optimized Mode. The BigQuery team introduced this mode to improve the overall latency of short queries that are common in workloads such as dashboards or data exploration. The main difference between Short Query Optimized Mode and the regular mode is that the former is designed only for short, lightweight queries: if the BigQuery engine determines that a query qualifies as a short query, it runs the query without creating a job, so there is no jobID in INFORMATION_SCHEMA, while the queryID still remains. Potentially, this is a much lighter and cheaper workload compared to regular queries, though there is no proof of it yet since the functionality is in Pre-GA.

    To use Short Query Optimized Mode, you need to enable it:

Screenshot from Google BigQuery Studio

  9. Leverage Materialized Views. Materialized Views are pre-computed snapshots of the data. While they still require resources to query, such views are faster since they already hold pre-computed results. This reduces the cost of analysis, as the BigQuery engine does not need to recompute all the data from the underlying table. They are a perfect match for resource-intensive queries, and they simplify and optimize workflows as well.
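    A minimal sketch of an aggregating Materialized View, assuming a hypothetical orders table:
CREATE MATERIALIZED VIEW dataset.daily_revenue_mv AS
SELECT
  DATE(order_ts) AS order_date,
  SUM(amount) AS revenue
FROM dataset.orders
GROUP BY order_date;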

    Additional points to consider when using Materialized Views:
  • Refresh frequency. We recommend reviewing how often the upstream sources update, looking at least two steps upstream.
  • Query Caching. Enabling query caching for the Materialized Views can further optimize performance for frequently used queries.
  • Heads-up: the Standard Edition plan can only query existing Materialized Views; it cannot manage, refresh, or create new ones.

By implementing these strategies, you can significantly improve the efficiency and cost-effectiveness of your BigQuery on-demand projects. Focus on optimizing query performance, leveraging features like partitioning, clustering, and materialized views, and adopting best practices for data analysis, and you will maximize the value of BigQuery while minimizing your costs.