Skip to main content
πŸ“–

Key Terms β€” Clustering

CK

Clustering Key

A user-defined set of columns (or expressions) used by Snowflake to co-locate related data in the same micro-partitions for better pruning.

β€”

Cluster Depth

A metric that measures how many overlapping micro-partitions exist for a clustering key's values. Lower depth = better clustering.

AC

Automatic Clustering

A serverless Snowflake service that continuously re-clusters data in the background. Credits charged per TB reclustered.

β€”

Natural Clustering

Data is naturally ordered by insertion order. Tables loaded chronologically are naturally clustered on timestamp columns.

β€”

SYSTEM$CLUSTERING_INFORMATION

Built-in function that returns clustering metadata including cluster depth, total partitions, and partition overlap for a table.


Why Clustering Matters

Micro-partition pruning is Snowflake’s most impactful optimisation. When data is well-clustered on the columns you filter most often, Snowflake can skip the vast majority of partitions β€” resulting in faster queries and lower credit consumption.

Well-Clustered vs Poorly-Clustered Data

Two diagrams side by side. Left: Well-clustered table where micro-partitions MP1-MP4 each contain a distinct date range (Jan, Feb, Mar, Apr). A WHERE clause for February scans only MP2 β€” 75% pruning. Right: Poorly-clustered table where each partition MP1-MP4 contains a mix of all months. The same WHERE clause must scan all 4 partitions β€” 0% pruning. Both have the same data, but clustering makes the difference.

Comparison: well-clustered data with distinct ranges per partition vs poorly-clustered with mixed ranges
🎯Exam Focus: When Clustering Helps

Clustering keys are most valuable on large tables (hundreds of GB to multi-TB) where queries consistently filter on the same column(s). If the Query Profile shows most partitions being scanned despite selective WHERE clauses, clustering is the solution.


Natural Clustering

Data inserted into a Snowflake table is naturally sorted by insertion order. For tables loaded chronologically (e.g., daily batch loads of events), data is naturally clustered on timestamp columns.

πŸ’‘Natural Clustering Is Free

If your queries predominantly filter by date/time and your data is loaded in chronological order, you may not need an explicit clustering key at all. Snowflake’s natural ordering already provides good pruning. Check pruning stats in the Query Profile before adding keys.


Cluster Depth

Cluster depth measures how overlapping micro-partitions are for a given column’s value range. Lower depth = better clustering.

Checking Cluster Depth
1-- Check clustering information for a table and column
2SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');
3
4-- Returns JSON with:
5-- cluster_depth: average overlap depth (lower = better)
6-- total_partition_count: total micro-partitions
7-- total_constant_partition_count: partitions with single distinct value
8-- average_overlaps: average number of overlapping partitions
9
10-- Check depth for multi-column key
11SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(region, order_date)');
12
13-- Quick numeric depth check
14SELECT SYSTEM$CLUSTERING_DEPTH('sales.orders', '(order_date)');

Cluster Depth Visualisation

Diagram showing micro-partitions as horizontal bars on a number line representing column values. Well-clustered: bars are non-overlapping or minimally overlapping β€” cluster depth = 1. Poorly-clustered: bars heavily overlap, requiring scans of 4-5 partitions for any single value range β€” cluster depth = 4-5. Arrow showing that Automatic Clustering moves from high depth to low depth over time.

Horizontal bars representing partition value ranges: non-overlapping (depth 1) vs heavily overlapping (depth 5)

When to Add a Clustering Key

Should You Add a Clustering Key?

Feature
Yes β€” Add Clustering
No β€” Skip Clustering
Table size
Hundreds of GB or multi-TB
Under 10 GB
Query pattern
Consistent filters on same column(s)
Varied, unpredictable filters
Pruning effectiveness
Poor pruning in Query Profile
Good pruning already
DML pattern
Append-heavy, infrequent updates
Frequent updates/deletes
Table type
Permanent production tables
Transient/temporary staging tables
⚠️When NOT to Cluster

Small tables: Snowflake already scans them efficiently without clustering. Frequently updated tables: constant DML causes continuous reclustering, incurring high serverless credit costs. Very low cardinality columns (booleans, small enums): not enough distinct values to benefit from clustering.


CLUSTER BY Syntax

Creating and Managing Clustering Keys
1-- Add clustering key at table creation
2CREATE TABLE sales.orders (
3order_id INT,
4customer_id INT,
5order_date DATE,
6region VARCHAR,
7total DECIMAL(12,2)
8) CLUSTER BY (order_date, region);
9
10-- Add clustering key to existing table
11ALTER TABLE sales.orders CLUSTER BY (order_date, region);
12
13-- Use an expression as clustering key
14ALTER TABLE events CLUSTER BY (TO_DATE(event_timestamp));
15
16-- View current clustering key
17SHOW TABLES LIKE 'ORDERS' IN SCHEMA sales;
18
19-- Drop clustering key
20ALTER TABLE sales.orders DROP CLUSTERING KEY;
21
22-- Manual recluster (prefer Automatic Clustering instead)
23ALTER TABLE sales.orders RECLUSTER;
πŸ”‘Column Order Matters

In CLUSTER BY (col1, col2), col1 is the primary sort dimension and col2 is secondary. Choose the column used most frequently in WHERE clauses as the first column. A common pattern is CLUSTER BY (date_col, high_cardinality_col).


Automatic Clustering

When you define a clustering key, Snowflake’s Automatic Clustering service continuously re-organises data in the background to maintain optimal clustering.

Setting Up Automatic Clustering

1
Identify the target table

Choose a large table with poor pruning on frequently filtered columns. Check current cluster depth with SYSTEM$CLUSTERING_INFORMATION.

SQL
SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');
2
Add the clustering key

This automatically enables Automatic Clustering. Snowflake begins reclustering in the background using serverless compute.

SQL
ALTER TABLE sales.orders CLUSTER BY (order_date, region);
πŸ’‘The initial recluster of a large table may consume significant credits. Monitor the first few days closely.
3
Monitor credit consumption

Check how many credits Automatic Clustering is consuming. If costs are excessive, consider whether the clustering key is appropriate.

SQL
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY credits_used DESC;
4
Verify improvement

Re-check cluster depth and run your typical queries. Confirm improved pruning in the Query Profile.

SQL
SELECT SYSTEM$CLUSTERING_DEPTH('sales.orders', '(order_date)');

Clustering Keys vs Search Optimisation Service

Clustering Keys vs SOS

Feature
Clustering Keys
Search Optimisation Service
Best for
Range queries and large table scans
Point lookups and equality predicates
How it works
Re-sorts data into better micro-partitions
Builds persistent search access paths
Setup cost
βœ“Free to define β€” Automatic Clustering has serverless cost
Build + maintenance serverless credits
Query type
WHERE date BETWEEN x AND y
WHERE id = 12345 or ILIKE '%john%'
Minimum edition
βœ“All editions
Enterprise+
Can coexist?
Yes β€” complementary
Yes β€” complementary

Cheat Sheet

πŸ“‹
Quick Reference

Clustering Keys Quick Reference

⌨️
Key Commands
Add
ALTER TABLE t CLUSTER BY (col1, col2)
Drop
ALTER TABLE t DROP CLUSTERING KEY
Check
SYSTEM$CLUSTERING_INFORMATION('t', '(col)')β€” Returns JSON
Depth
SYSTEM$CLUSTERING_DEPTH('t', '(col)')β€” Numeric
Manual
ALTER TABLE t RECLUSTERβ€” Prefer auto
History
AUTOMATIC_CLUSTERING_HISTORY viewβ€” Credits used
βœ…
Best Practices
Size
Only cluster tables > 100 GB
Columns
Most-filtered column first in CLUSTER BY
Expressions
CLUSTER BY (TO_DATE(ts_col)) for timestamps
Avoid
Don't cluster staging/temp tables or small tables

Practice Quiz

Clustering

A 5 TB table filtered on region and order_date shows 95% of partitions scanned in Query Profile. What is the BEST first action?

Clustering

Which Snowflake service continuously maintains clustering in the background without a user warehouse?

Clustering

A table has CLUSTER BY (region). The region column has only 4 distinct values. Is this an effective clustering key?


Flashcards

Clustering
QUESTION

What does cluster depth measure?

Click to reveal answer
ANSWER

Cluster depth measures how many overlapping micro-partitions exist for a clustering key's values. Lower depth = better clustering = better pruning. Check with SYSTEM$CLUSTERING_INFORMATION or SYSTEM$CLUSTERING_DEPTH.

Click to see question
Clustering
QUESTION

What is the difference between natural clustering and explicit clustering keys?

Click to reveal answer
ANSWER

Natural clustering: data ordered by insertion time automatically. Free. Explicit clustering keys: user-defined columns for Snowflake to sort data by β€” enables Automatic Clustering (serverless credit cost). Natural clustering works well for time-series; explicit keys needed when queries filter on non-time columns.

Click to see question
Clustering
QUESTION

What are the credit implications of Automatic Clustering?

Click to reveal answer
ANSWER

Automatic Clustering runs as a serverless service β€” no user warehouse needed. Credits are consumed per TB of data reclustered. Initial reclustering of a large table can be expensive. Monitor with AUTOMATIC_CLUSTERING_HISTORY view.

Click to see question
Clustering
QUESTION

Can you use expressions in a CLUSTER BY definition?

Click to reveal answer
ANSWER

Yes. CLUSTER BY (TO_DATE(timestamp_col)) is valid. Expressions are evaluated and the result is used for clustering. Useful for clustering high-precision timestamps by date only.

Click to see question
Clustering
QUESTION

What is the relationship between clustering and the Search Optimisation Service?

Click to reveal answer
ANSWER

They are complementary, not alternatives. Clustering improves range query pruning (WHERE date BETWEEN x AND y). SOS accelerates point lookups (WHERE id = 123, ILIKE). Both can be active on the same table. SOS requires Enterprise+.

Click to see question

Resources


Next Steps

Reinforce what you just read

Study the All flashcards with spaced repetition to lock it in.

Study flashcards β†’