📖

Key Terms — Clustering

Clustering Key

A user-defined set of columns (or expressions) used by Snowflake to co-locate related data in the same micro-partitions for better pruning.

—

Cluster Depth

A metric that measures how many overlapping micro-partitions exist for a clustering key's values. Lower depth = better clustering.

Automatic Clustering

A serverless Snowflake service that continuously re-clusters data in the background. Credits charged per TB reclustered.

—

Natural Clustering

Data is naturally ordered by insertion order. Tables loaded chronologically are naturally clustered on timestamp columns.

—

SYSTEM$CLUSTERING_INFORMATION

Built-in function that returns clustering metadata including cluster depth, total partitions, and partition overlap for a table.

Why Clustering Matters

Micro-partition pruning is Snowflake’s most impactful optimisation. When data is well-clustered on the columns you filter most often, Snowflake can skip the vast majority of partitions — resulting in faster queries and lower credit consumption.

Well-Clustered vs Poorly-Clustered Data

Two diagrams side by side. Left: Well-clustered table where micro-partitions MP1-MP4 each contain a distinct date range (Jan, Feb, Mar, Apr). A WHERE clause for February scans only MP2 — 75% pruning. Right: Poorly-clustered table where each partition MP1-MP4 contains a mix of all months. The same WHERE clause must scan all 4 partitions — 0% pruning. Both have the same data, but clustering makes the difference.

Comparison: well-clustered data with distinct ranges per partition vs poorly-clustered with mixed ranges

🎯Exam Focus: When Clustering Helps

Clustering keys are most valuable on large tables (hundreds of GB to multi-TB) where queries consistently filter on the same column(s). If the Query Profile shows most partitions being scanned despite selective WHERE clauses, clustering is the solution.

Natural Clustering

Data inserted into a Snowflake table is naturally sorted by insertion order. For tables loaded chronologically (e.g., daily batch loads of events), data is naturally clustered on timestamp columns.

💡Natural Clustering Is Free

If your queries predominantly filter by date/time and your data is loaded in chronological order, you may not need an explicit clustering key at all. Snowflake’s natural ordering already provides good pruning. Check pruning stats in the Query Profile before adding keys.

Cluster Depth

Cluster depth measures how overlapping micro-partitions are for a given column’s value range. Lower depth = better clustering.

Checking Cluster Depth

1-- Check clustering information for a table and column
2SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');
3
4-- Returns JSON with:
5-- cluster_depth: average overlap depth (lower = better)
6-- total_partition_count: total micro-partitions
7-- total_constant_partition_count: partitions with single distinct value
8-- average_overlaps: average number of overlapping partitions
9
10-- Check depth for multi-column key
11SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(region, order_date)');
12
13-- Quick numeric depth check
14SELECT SYSTEM$CLUSTERING_DEPTH('sales.orders', '(order_date)');

Cluster Depth Visualisation

Diagram showing micro-partitions as horizontal bars on a number line representing column values. Well-clustered: bars are non-overlapping or minimally overlapping — cluster depth = 1. Poorly-clustered: bars heavily overlap, requiring scans of 4-5 partitions for any single value range — cluster depth = 4-5. Arrow showing that Automatic Clustering moves from high depth to low depth over time.

Horizontal bars representing partition value ranges: non-overlapping (depth 1) vs heavily overlapping (depth 5)

When to Add a Clustering Key

Should You Add a Clustering Key?

Feature

Yes — Add Clustering

No — Skip Clustering

Table size

Hundreds of GB or multi-TB

Under 10 GB

Query pattern

Consistent filters on same column(s)

Varied, unpredictable filters

Pruning effectiveness

Poor pruning in Query Profile

Good pruning already

DML pattern

Append-heavy, infrequent updates

Frequent updates/deletes

Table type

Permanent production tables

Transient/temporary staging tables

⚠️When NOT to Cluster

Small tables: Snowflake already scans them efficiently without clustering. Frequently updated tables: constant DML causes continuous reclustering, incurring high serverless credit costs. Very low cardinality columns (booleans, small enums): not enough distinct values to benefit from clustering.

CLUSTER BY Syntax

Creating and Managing Clustering Keys

1-- Add clustering key at table creation
2CREATE TABLE sales.orders (
3order_id INT,
4customer_id INT,
5order_date DATE,
6region VARCHAR,
7total DECIMAL(12,2)
8) CLUSTER BY (order_date, region);
9
10-- Add clustering key to existing table
11ALTER TABLE sales.orders CLUSTER BY (order_date, region);
12
13-- Use an expression as clustering key
14ALTER TABLE events CLUSTER BY (TO_DATE(event_timestamp));
15
16-- View current clustering key
17SHOW TABLES LIKE 'ORDERS' IN SCHEMA sales;
18
19-- Drop clustering key
20ALTER TABLE sales.orders DROP CLUSTERING KEY;
21
22-- Manual recluster (prefer Automatic Clustering instead)
23ALTER TABLE sales.orders RECLUSTER;

🔑Column Order Matters

In CLUSTER BY (col1, col2), col1 is the primary sort dimension and col2 is secondary. Choose the column used most frequently in WHERE clauses as the first column. A common pattern is CLUSTER BY (date_col, high_cardinality_col).

Automatic Clustering

When you define a clustering key, Snowflake’s Automatic Clustering service continuously re-organises data in the background to maintain optimal clustering.

Setting Up Automatic Clustering

Identify the target table

Choose a large table with poor pruning on frequently filtered columns. Check current cluster depth with SYSTEM$CLUSTERING_INFORMATION.

SQL

SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');

Add the clustering key

This automatically enables Automatic Clustering. Snowflake begins reclustering in the background using serverless compute.

SQL

ALTER TABLE sales.orders CLUSTER BY (order_date, region);

💡The initial recluster of a large table may consume significant credits. Monitor the first few days closely.

Monitor credit consumption

Check how many credits Automatic Clustering is consuming. If costs are excessive, consider whether the clustering key is appropriate.

SQL

SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY credits_used DESC;

Verify improvement

Re-check cluster depth and run your typical queries. Confirm improved pruning in the Query Profile.

SQL

SELECT SYSTEM$CLUSTERING_DEPTH('sales.orders', '(order_date)');

Clustering Keys vs Search Optimisation Service

Clustering Keys vs SOS

Feature

Clustering Keys

Search Optimisation Service

Best for

Range queries and large table scans

Point lookups and equality predicates

How it works

Re-sorts data into better micro-partitions

Builds persistent search access paths

Setup cost

✓Free to define — Automatic Clustering has serverless cost

Build + maintenance serverless credits

Query type

WHERE date BETWEEN x AND y

WHERE id = 12345 or ILIKE '%john%'

Minimum edition

✓All editions

Enterprise+

Can coexist?

Yes — complementary

Cheat Sheet

📋

Quick Reference

Clustering Keys Quick Reference

⌨️

Key Commands

Add

ALTER TABLE t CLUSTER BY (col1, col2)

Drop

ALTER TABLE t DROP CLUSTERING KEY

Check

SYSTEM$CLUSTERING_INFORMATION('t', '(col)')— Returns JSON

Depth

SYSTEM$CLUSTERING_DEPTH('t', '(col)')— Numeric

Manual

ALTER TABLE t RECLUSTER— Prefer auto

History

AUTOMATIC_CLUSTERING_HISTORY view— Credits used

✅

Best Practices

Size

Only cluster tables > 100 GB

Columns

Most-filtered column first in CLUSTER BY

Expressions

CLUSTER BY (TO_DATE(ts_col)) for timestamps

Avoid

Don't cluster staging/temp tables or small tables

Practice Quiz

Clustering

A 5 TB table filtered on region and order_date shows 95% of partitions scanned in Query Profile. What is the BEST first action?

Clustering

Which Snowflake service continuously maintains clustering in the background without a user warehouse?

Clustering

A table has CLUSTER BY (region). The region column has only 4 distinct values. Is this an effective clustering key?

Flashcards

Clustering

QUESTION

What does cluster depth measure?

Click to reveal answer

ANSWER

Cluster depth measures how many overlapping micro-partitions exist for a clustering key's values. Lower depth = better clustering = better pruning. Check with SYSTEM$CLUSTERING_INFORMATION or SYSTEM$CLUSTERING_DEPTH.

Click to see question

Clustering

QUESTION

What is the difference between natural clustering and explicit clustering keys?

Click to reveal answer

ANSWER

Natural clustering: data ordered by insertion time automatically. Free. Explicit clustering keys: user-defined columns for Snowflake to sort data by — enables Automatic Clustering (serverless credit cost). Natural clustering works well for time-series; explicit keys needed when queries filter on non-time columns.

Click to see question

Clustering

QUESTION

What are the credit implications of Automatic Clustering?

Click to reveal answer

ANSWER

Automatic Clustering runs as a serverless service — no user warehouse needed. Credits are consumed per TB of data reclustered. Initial reclustering of a large table can be expensive. Monitor with AUTOMATIC_CLUSTERING_HISTORY view.

Click to see question

Clustering

QUESTION

Can you use expressions in a CLUSTER BY definition?

Click to reveal answer

ANSWER

Yes. CLUSTER BY (TO_DATE(timestamp_col)) is valid. Expressions are evaluated and the result is used for clustering. Useful for clustering high-precision timestamps by date only.

Click to see question

Clustering

QUESTION

What is the relationship between clustering and the Search Optimisation Service?

Click to reveal answer

ANSWER

They are complementary, not alternatives. Clustering improves range query pruning (WHERE date BETWEEN x AND y). SOS accelerates point lookups (WHERE id = 123, ILIKE). Both can be active on the same table. SOS requires Enterprise+.

Click to see question

Resources

Next Steps

Reinforce what you just read

Study the All flashcards with spaced repetition to lock it in.

Study flashcards →