Beyond basic optimization lies a realm of advanced techniques that can significantly enhance BigQuery’s performance and cost-efficiency. This section explores sophisticated approaches to data organization and query optimization, including strategic clustering, effective use of primary/foreign keys, and smart handling of temporary tables. These techniques, while sometimes overlooked, can provide substantial benefits in terms of query performance and cost reduction when implemented correctly.
Here are the next steps you can try:
- Clustering
Every BigQuery cost and compute optimization guide highlights clustering as one of the primary methods for reducing costs and improving efficiency in BigQuery. I won’t rehash the official documentation, which is already well done, but I will cover some key points:
- Clustering Columns: You can cluster data using up to four columns, and the order of these columns matters. Columns frequently used in filters, joins, string comparisons, or aggregations are ideal candidates for clustering.
- Cardinality Considerations: Aim for columns with a balanced number of unique values: too few distinct values produce overly broad clusters, while too many produce overly narrow ones, and neither prunes scanned data well.
- Performance Impact: Clustering improves query efficiency by reducing the amount of data scanned. However, the effectiveness of clustering depends on the chosen columns and their order.
In essence, clustering in BigQuery functions like a clustered index that BigQuery manages automatically. It helps reduce the data scanned by queries that filter rows, making it beneficial in both on-demand (less billed data scanned) and slot-based (less compute used) scenarios.
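To make this concrete, here is a minimal sketch of a clustered table definition and a query that benefits from it (the dataset, table, column names, and values are all hypothetical):

CREATE TABLE mydataset.events
(
  event_date  DATE,
  customer_id STRING,
  event_type  STRING,
  payload     JSON
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type;  -- up to four columns; order matters

-- Filtering on the leading clustering column lets BigQuery prune storage
-- blocks, so this scans less data than it would on an unclustered table:
SELECT event_type, COUNT(*) AS events
FROM mydataset.events
WHERE customer_id = 'C-42'
GROUP BY event_type;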
Practical Tips:
- Clustering New Data Only: Adding clustering to an existing table won’t affect the data already in it; only newly written data is clustered. If you need clustering to apply to historical data, you’ll need to create a new table (see the sketch after this list).
- Table Size Consideration: If a table is less than 10GB, the efficiency gains from clustering may be insignificant.
- Impact on DML Operations: Clustering doesn’t perform well if there are many DML operations (UPDATE, DELETE, MERGE) on the table. Frequent modifications leave the table partially unsorted until BigQuery’s automatic background re-clustering catches up, which can dramatically reduce clustering efficiency.
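Regarding the first tip, here is a hedged sketch of applying clustering to historical data by recreating the table (names reuse the hypothetical example above):

CREATE TABLE mydataset.events_clustered
PARTITION BY event_date
CLUSTER BY customer_id, event_type
AS
SELECT * FROM mydataset.events;

-- After validating the copy, swap the tables:
-- DROP TABLE mydataset.events;
-- ALTER TABLE mydataset.events_clustered RENAME TO events;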
- Usage of primary/foreign keys
The power of primary and foreign keys is often underestimated. Why do we recommend using them? Because the query optimizer can use this information to build better query plans.
Let’s recall the theory: In typical relational databases, primary keys and foreign keys are used to ensure data integrity. A primary key value is unique for each row and is not NULL. Each foreign key value in a row must be present in the primary key column of the primary key table or be NULL.
When defining a primary key during table creation, or when adding foreign keys by altering the table schema, you must use the NOT ENFORCED qualifier, as BigQuery does not yet support constraint enforcement.
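As a minimal sketch (dataset, table, and constraint names are hypothetical), declaring a primary key at creation time and adding a foreign key afterwards looks like this:

CREATE TABLE mydataset.customers (
  customer_id STRING NOT NULL,
  name        STRING,
  PRIMARY KEY (customer_id) NOT ENFORCED
);

ALTER TABLE mydataset.orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id)
  REFERENCES mydataset.customers (customer_id) NOT ENFORCED;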
To put it simply, when these key constraints are defined, the BigQuery optimizer may eliminate joins (inner join, left outer join, right outer join) while building the execution plan and limit the tables it scans, for better performance; a sketch follows the list below. This is only possible because of the declared key constraints, so keep in mind that maintaining them is your responsibility:
- Primary Key column values must be unique across all rows and cannot contain NULL values.
- Each Foreign Key must either be NULL or correspond to an existing Primary Key row in the referenced table.
- You cannot create primary/foreign keys for tables in different datasets, let alone different projects. This can be quite limiting if you have many datasets and need to join data across different data marts, such as financial and product data.
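To illustrate join elimination, reusing the hypothetical customers/orders tables from above: because orders.customer_id is declared as a foreign key referencing the customers primary key, and the query selects no columns from customers, the optimizer can drop the join and avoid scanning customers entirely:

SELECT o.order_id, o.amount  -- no columns from customers are referenced
FROM mydataset.orders AS o
LEFT JOIN mydataset.customers AS c
  ON o.customer_id = c.customer_id;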
- Output of a query as a temp table
It often happens that after executing a query, you need to manipulate the output in various ways: re-ordering, additional aggregation, cleaning the data, and so on. BigQuery offers several options for saving query results, such as exporting the output as a physical file, saving it to Google Sheets, or storing it in a BigQuery table. The first two options are convenient but come with size limitations; saving results to a BigQuery table avoids those limits but incurs storage costs, which can accumulate quickly if you frequently create “temporary” tables.
Google BigQuery Screenshot
Pro tip: The way to avoid this issue is to leverage the temporary tables that BigQuery automatically creates after each query execution. You can easily access these tables through the BigQuery UI:
Google BigQuery Screenshot
By clicking on the “Temporary table” link, you’ll be directed to a table containing your query results, which you can query like any regular table.
Google BigQuery Screenshot
For example, let’s say you’re querying an unclustered, non-partitioned table under an on-demand cost model with 750M rows:
SELECT *
FROM raw_events
LIMIT 100
Executing this query would cost approximately $4 (with billable bytes = 642.65 GB).
Google BigQuery Screenshot
If you need to adjust the initial query and run it again, it will cost an additional ~$4 each time. However, by querying the temporary table generated by the initial run, you can work with the results at no extra cost ($0), since they are already cached in that table. This approach avoids a full scan of the non-partitioned, non-clustered table, saving both time and money.
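As a sketch, a follow-up transformation against that temporary table might look like the query below. The anonymous table name and the created_at column are purely illustrative; copy the real name from the “Temporary table” link. Also keep in mind that these cached result tables expire after roughly 24 hours.

SELECT *
FROM `myproject._9bc87a3e.anon0d1f`  -- hypothetical name; copy yours from the UI
ORDER BY created_at DESC  -- illustrative column
LIMIT 1000;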
Google BigQuery Screenshot
- LIMIT operation usage
Tables stored and managed in BigQuery are usually quite large, so when you need only a sample, there are several ways to get one. The most obvious method can be expensive, while other, less evident ones are much more cost-effective:
- Free Method via the BigQuery Studio UI: You can get a sample of the data for free using the Preview tab on the table information screen. This method is limited, however: you can view all fields, but you cannot filter, sort, or exclude any data.
Google BigQuery Screenshot
- LIMIT Clause: The most obvious way to sample data is with the LIMIT clause. However, it does not limit the amount of data processed. For example, if your table has 750 million rows and you run a query with LIMIT 1, BigQuery will still scan the entire table but return only one row. The official Google documentation isn’t very clear on this; the only place where we have found a clarification is here.
- Sampling with TABLESAMPLE and RAND(): A lesser-known but more effective way to sample data is to use TABLESAMPLE in combination with LIMIT and RAND(). This approach can return a random sample of the data at a much lower cost. An example query might look like this (reference):
SELECT *
FROM some_table
TABLESAMPLE SYSTEM (10 PERCENT)  -- reads only ~10% of the table's blocks, so only those bytes are billed
WHERE RAND() < 1                 -- always true as written; lower the threshold (e.g., 0.1) to thin rows further
ORDER BY RAND()                  -- shuffle the sampled rows
LIMIT 100                        -- keep 100 random rows
This method allows you to get a randomly sampled subset of your data more efficiently and affordably.
The advanced optimization techniques covered in this section—from strategic clustering to efficient temporary table usage and smart sampling methods—form a comprehensive toolkit for BigQuery optimization. While some of these approaches might seem technical or complex at first, their implementation can lead to substantial improvements in both performance and cost-efficiency. The key is to understand not just how to implement these techniques, but when and where they’ll provide the most benefit. By carefully considering factors such as table size, query patterns, and data update frequency, you can make informed decisions about which optimization strategies will best serve your specific use case.