
TABLESAMPLE in BigQuery: How I Stopped Querying Billions of Rows to Test My SQL

A practical guide to using TABLESAMPLE for faster query development in BigQuery without burning through your quota or budget.


Eric Badarou

bigquery · sql · performance · testing

I spent way too long learning this the hard way: you don't need to scan your entire 500GB table just to check if your CASE WHEN logic works correctly.

Last month, I was building a new attribution model on our events table. Standard stuff — about 2 billion rows, ~400GB uncompressed. Every time I tweaked my SQL and hit "Run", I'd wait 40 seconds and watch $0.20 disappear. After the 15th iteration (yeah, I'm bad at SQL), I did the math: I'd just burned through $3 testing syntax.

That's when I actually read the docs and discovered TABLESAMPLE.

What TABLESAMPLE Actually Does

The concept is simple: BigQuery randomly samples a percentage of your table before running your query. Instead of scanning 2 billion rows, you scan 20 million (1%). Your query runs in 2 seconds instead of 40, and costs pennies.

Here's the basic syntax:

SELECT 
  user_id,
  event_name,
  COUNT(*) as event_count
FROM 
  `project.dataset.events` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE 
  event_date >= '2026-01-01'
GROUP BY 
  user_id, event_name

The SYSTEM (1 PERCENT) tells BigQuery to randomly sample 1% of the table's blocks. Not exactly 1% of rows — we'll get to that in a sec.

When This Actually Helps

TABLESAMPLE shines when you're in development mode:

  • Testing JOIN logic on large tables
  • Checking if your WINDOW function works before running it on the full dataset
  • Validating data quality on a new pipeline
  • Prototyping a dashboard query

I now have a standard workflow: write the query with TABLESAMPLE SYSTEM (0.1 PERCENT), make sure it doesn't error out, bump to 1 PERCENT to validate the logic looks sane, then remove it for the final run.

For my attribution model, the 1% sample gave me ~18 million rows to work with. Plenty to see if my attribution windows were correct, if I was handling nulls properly, etc.
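As a concrete illustration, here's the kind of null-handling sanity check I run against a sample — the column names are placeholders, not the real schema:

```sql
-- Quick data-quality pass on a 1% sample (illustrative column names).
SELECT
  COUNTIF(user_id IS NULL) AS null_user_ids,
  COUNTIF(event_name IS NULL) AS null_event_names,
  COUNT(*) AS sampled_rows
FROM
  `project.dataset.events` TABLESAMPLE SYSTEM (1 PERCENT)
```

If the null counts look wrong on 18 million rows, they'll look wrong on 2 billion — no need to pay for the full scan to find out.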

The Block-Level Sampling Gotcha

Here's the thing nobody tells you upfront: SYSTEM sampling works at the storage block level, not the row level.

BigQuery stores data in blocks (chunks of compressed columnar data). When you do TABLESAMPLE SYSTEM (10 PERCENT), it randomly picks 10% of those blocks and scans them entirely.

This means:

  1. You won't get exactly 10% of rows — you'll get roughly 10%, but it varies
  2. If your table is small (< 1GB), you might get 0 rows or way more than expected
  3. The sample is consistent within a single query execution, but each new run picks a new set of blocks — and sampled queries bypass the results cache, so you re-sample (and re-pay) every time

Let me show you what I mean:

-- Run this a few times
SELECT COUNT(*) as sampled_rows
FROM `project.dataset.big_table` TABLESAMPLE SYSTEM (1 PERCENT)

First run: 1,234,567 rows
Second run: 1,198,432 rows
Third run: 1,267,891 rows

All close to 1%, but not identical. For development work, this is totally fine. For statistical analysis or ML training, you'd want to be more careful.
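If you want a feel for how far off a given sample is, scale the sampled count back up and compare it against the table's metadata row count, which is free to read. A sketch, assuming the `__TABLES__` metadata view and a table named `big_table`:

```sql
-- Estimate total rows from a 1% sample...
SELECT COUNT(*) * 100 AS estimated_total_rows
FROM `project.dataset.big_table` TABLESAMPLE SYSTEM (1 PERCENT);

-- ...then compare with the exact count from metadata (no scan, no cost).
SELECT row_count
FROM `project.dataset.__TABLES__`
WHERE table_id = 'big_table';
```

The gap between the two numbers is your block-level sampling error for that run.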

Row-Level Sampling (When You Need Precision)

If you actually need exactly X% of rows — say, for splitting train/test datasets — you can do row-level sampling manually:

SELECT *
FROM `project.dataset.events`
WHERE RAND() < 0.01  -- roughly 1% of rows, in expectation

But here's the catch: this still scans the full table. You're paying for 100% of the data, just filtering most of it out. So for cost optimization during development, it's useless.

The one time I use row-level filtering is when I need reproducible sampling — and for that, RAND() won't do (it changes every run). A hash does the job:

WHERE ABS(MOD(FARM_FINGERPRINT(user_id), 100)) < 10  -- always the same 10% of users

This is deterministic (same user_id always falls in or out of the sample) and doesn't scan extra data if you combine it with TABLESAMPLE:

FROM 
  `project.dataset.events` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE 
  ABS(MOD(FARM_FINGERPRINT(user_id), 100)) < 10

Now you're sampling blocks AND deterministically filtering users. Best of both worlds for testing user-level aggregations.
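The same hashing trick extends naturally to train/test splits: carve the hash range into disjoint buckets so no user leaks across sets. A sketch — the 80/20 cut-off and column names are illustrative:

```sql
-- Deterministic 80/20 user-level split (illustrative names and threshold).
SELECT
  *,
  IF(ABS(MOD(FARM_FINGERPRINT(user_id), 100)) < 80, 'train', 'test') AS split
FROM `project.dataset.events`
```

Because the assignment is a pure function of user_id, the split is stable across runs and across tables that share the same key.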

Partitioned Tables: The Fast Path

If your table is partitioned (and it should be), combine TABLESAMPLE with partition filters:

SELECT *
FROM 
  `project.dataset.events` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE 
  event_date = '2026-02-12'  -- only scans yesterday's partition

This is chef's kiss for cost control. You're sampling 1% of one day's data instead of 1% of three years of data.

On our events table, querying all partitions with 1% sample scans ~4GB. Adding a single-day partition filter drops it to ~15MB. The query finishes in under a second.
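To verify what a query actually billed after the fact, I look it up in INFORMATION_SCHEMA.JOBS_BY_USER. The region qualifier depends on where your data lives — `region-us` here is an assumption:

```sql
-- Bytes billed for my recent queries (region qualifier is an assumption).
SELECT
  job_id,
  total_bytes_billed / POW(1024, 3) AS gb_billed,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10
```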

What TABLESAMPLE Doesn't Help With

A few scenarios where this won't save you:

1. Small tables
If your table is 500MB, sampling 10% might give you one block or zero blocks. You're better off just querying the whole thing.

2. Final production queries
Obviously. TABLESAMPLE is for development only. Don't put it in your dbt models or scheduled queries.

3. Aggregations that need full accuracy
If you're calculating company revenue, don't sample. If you're checking that your COALESCE logic works, go ahead.

4. Tables with extreme skew
If 90% of your data is in one partition and you sample 1%, you might get nothing from the important partition. Be aware of your data distribution.
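One way to spot that kind of skew before you sample is to check per-partition row counts in INFORMATION_SCHEMA.PARTITIONS — a cheap metadata read, no table scan. Table name is illustrative:

```sql
-- Row counts per partition for a partitioned table (illustrative name).
SELECT partition_id, total_rows
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'events'
ORDER BY total_rows DESC
```

If one partition dwarfs the rest, sample within that partition explicitly rather than across the whole table.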

My Current Setup

I have a Sublime Text snippet that inserts TABLESAMPLE SYSTEM (1 PERCENT) after the table name when I type tbs. Then I just delete it before committing the query to dbt.

In dbt, I've thought about adding a macro that auto-enables sampling in dev:

-- macros/sample_in_dev.sql
{% macro sample_in_dev(percent=1) %}
  {% if target.name == 'dev' %}
    TABLESAMPLE SYSTEM ({{ percent }} PERCENT)
  {% endif %}
{% endmacro %}

Then in models:

FROM {{ ref('big_table') }} {{ sample_in_dev(1) }}

Haven't pulled the trigger on this yet because I worry about edge cases where sampling breaks the logic in non-obvious ways. But for pure syntax checking, it's tempting.

The Bottom Line

Since I started using TABLESAMPLE religiously, my BigQuery costs during development dropped by ~70%. More importantly, iteration speed went up — I'm not sitting around waiting 30 seconds for a query that's going to fail because I forgot a comma.

The gotcha is remembering that it's block-level, not row-level. For 99% of development work, that's fine. For the 1% where it matters, combine it with partition filters or use deterministic user sampling.

Anyway, if you're still scanning billions of rows to test if your LEFT JOIN should've been an INNER JOIN, give TABLESAMPLE a shot. Your wallet will thank you.
