
Data Storage & Clustering

Snowflake’s storage layer is fundamentally different from traditional databases. Understanding how micro-partitions work, how data is physically organised on disk, and how clustering influences query performance is essential knowledge for the COF-C02 exam.

🎯Exam Domain Coverage

This module covers the Database Objects & Storage domain of the COF-C02 exam. Expect 2–4 questions on micro-partition internals, clustering key configuration, and storage billing across table types. Pay particular attention to when clustering keys are — and are not — recommended.


1. Micro-Partitions: The Foundation of Snowflake Storage

Every table in Snowflake is divided into micro-partitions — the fundamental unit of Snowflake’s physical storage. Unlike traditional databases that require you to manually define partitions, Snowflake creates and sizes micro-partitions automatically with no intervention required.

Micro-Partition Architecture

Each Snowflake table is divided into contiguous micro-partitions. Data within each micro-partition is stored in columnar format, compressed independently, and catalogued with rich metadata in the Cloud Services layer. Queries skip entire micro-partitions when the predicate values fall outside the stored min/max range for a column.

Diagram showing a Snowflake table split into multiple micro-partitions, each containing columnar data and associated metadata including min, max, NDV, and null count per column.

Key Characteristics of Micro-Partitions

  • Size (compressed): 50 MB – 500 MB
  • Storage format: columnar (PAX-style)
  • Creation: automatic — no manual definition
  • Immutability: immutable once written
  • Compression: automatic, per-column
  • Metadata: stored in Cloud Services layer
🔑Micro-Partitions Are Immutable

Once a micro-partition is written, it cannot be modified. Any DML operation (UPDATE, DELETE, MERGE) causes Snowflake to write entirely new micro-partitions and mark the old ones as deleted. This is why Snowflake’s Time Travel is possible — old micro-partitions are retained for the duration of the Time Travel period.
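Because the old micro-partitions remain on storage, historical reads and recovery become simple metadata operations. A minimal sketch, assuming a hypothetical orders table still within its retention window (the query ID placeholder is illustrative):

```sql
-- Read the table as it existed one hour ago (old micro-partitions still retained)
SELECT * FROM orders AT (OFFSET => -3600);

-- Read the state immediately before a specific DML statement ran
SELECT * FROM orders BEFORE (STATEMENT => '<query_id>');

-- Restore a dropped table while its micro-partitions are still in Time Travel
UNDROP TABLE orders;
```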


2. Micro-Partition Metadata

Snowflake automatically tracks rich metadata for every micro-partition, stored at the column level in the Cloud Services layer. This metadata is the engine that powers query pruning.

Per-Column Metadata in Every Micro-Partition

For each column in each micro-partition, Snowflake records four key statistics: minimum value, maximum value, number of distinct values (NDV), and null count. Because this metadata lives in the Cloud Services layer rather than inside the micro-partition data files themselves, consulting it requires no virtual warehouse compute.

Diagram showing a micro-partition record with four metadata fields per column: min_value, max_value, ndv (number of distinct values), and null_count, all stored outside the data files in the Cloud Services layer.

The Four Metadata Fields (Per Column, Per Micro-Partition)

  • Min value — the smallest value present in that column within this micro-partition
  • Max value — the largest value present in that column within this micro-partition
  • NDV (Number of Distinct Values) — the cardinality of the column within this micro-partition
  • Null count — how many null values exist in the column within this micro-partition
💡Metadata is Free to Query

Because micro-partition metadata lives in the Cloud Services layer, Snowflake can evaluate whether to skip a micro-partition entirely before ever reading from cloud object storage. This means pruning decisions happen with zero virtual warehouse compute cost.
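One visible consequence: certain simple aggregates can be answered from this metadata alone. A sketch, assuming a hypothetical orders table; whether Snowflake serves a given query purely from metadata depends on the optimizer:

```sql
-- Row counts are tracked per micro-partition, so this needs no data scan
SELECT COUNT(*) FROM orders;

-- MIN/MAX can often be answered from the per-column min/max statistics
SELECT MIN(order_date), MAX(order_date) FROM orders;
```

When a query is resolved this way, the Query Profile shows a metadata-based result instead of a TableScan.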


3. Query Pruning (Partition Elimination)

Query pruning is Snowflake’s mechanism for skipping micro-partitions that cannot possibly contain rows matching a query’s WHERE clause. This dramatically reduces I/O and accelerates analytical queries.

How Pruning Works

When you run a query with a filter predicate, Snowflake’s Cloud Services layer:

  1. Reads the metadata (min/max) for the filtered column across all micro-partitions
  2. Compares the predicate value to each micro-partition’s min/max range
  3. Marks any micro-partition where the predicate value falls outside the min/max range as pruned
  4. Sends only the surviving micro-partitions to the virtual warehouse for scanning
Example: Range Query That Benefits from Pruning
-- Assume the ORDERS table has 10,000 micro-partitions
-- and ORDER_DATE is stored in insertion order (chronological)
SELECT
    order_id,
    customer_id,
    total_amount
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';

-- Snowflake reads ORDER_DATE min/max for all 10,000 micro-partitions
-- Micro-partitions where max(order_date) < '2024-01-01' → PRUNED
-- Micro-partitions where min(order_date) > '2024-03-31' → PRUNED
-- Only micro-partitions that OVERLAP the range are scanned
-- Check pruning efficiency in Query Profile → "Partitions scanned" vs "Partitions total"
ℹ️Viewing Pruning in Query Profile

After running a query, open the Query Profile in Snowsight. Under the TableScan operator you will see two metrics:

  • Partitions total — all micro-partitions in the table
  • Partitions scanned — micro-partitions that were actually read

A large difference between these two numbers indicates effective pruning. If partitions scanned ≈ partitions total, your data may benefit from a clustering key.
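The same two metrics are also recorded per query in ACCOUNT_USAGE, which makes it possible to scan for poorly pruned workloads in bulk. A sketch using the QUERY_HISTORY view (the '%orders%' filter is an illustrative placeholder):

```sql
-- Find recent queries against ORDERS and how well each one pruned
SELECT
    query_id,
    partitions_scanned,
    partitions_total,
    ROUND(100 * partitions_scanned / NULLIF(partitions_total, 0), 1) AS pct_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE partitions_total > 0
  AND query_text ILIKE '%orders%'
ORDER BY start_time DESC
LIMIT 20;
```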


4. Natural Clustering

Before you define any explicit clustering key, Snowflake naturally orders data within micro-partitions based on insertion order. This is called natural clustering.

Natural Clustering via Insertion Order

When rows are inserted chronologically (e.g., daily ETL loads appending new records), each new micro-partition naturally contains data from a specific time period. A query filtering on a recent date range will automatically prune older micro-partitions without any clustering key defined.

Diagram showing four micro-partitions labelled with date ranges: MP1 (Jan), MP2 (Feb), MP3 (Mar), MP4 (Apr). A query filtering WHERE month = 'Apr' prunes MP1, MP2, and MP3 automatically.

When Natural Clustering is Sufficient

  • Time-series tables loaded chronologically (e.g., event logs, transaction history)
  • Append-only tables where new data always represents newer time periods
  • Small-to-medium tables where full scans are fast regardless
πŸ“Natural Clustering Degrades Over Time

As tables grow and accumulate updates, deletes, and out-of-order inserts, natural clustering degrades. Micro-partitions that once contained data from a single time period become mixed. This is when you should evaluate whether a clustering key and Automatic Clustering are warranted.
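A quick way to watch for that degradation without defining a key is SYSTEM$CLUSTERING_DEPTH, which accepts any candidate column list. A sketch, assuming a hypothetical orders table filtered mainly on order_date:

```sql
-- Spot-check how well natural clustering is holding up on the main filter column
SELECT SYSTEM$CLUSTERING_DEPTH('orders', '(order_date)');
-- A value near 1 means insertion order is still doing the job;
-- a steadily climbing value suggests evaluating an explicit clustering key.
```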


5. Clustering Keys

A clustering key is one or more columns (or expressions) that you designate as the primary sort dimension for a table. When a clustering key is defined, Snowflake will re-sort data into new micro-partitions optimised for pruning on those columns.

Defining a Clustering Key

Clustering Key — CREATE TABLE and ALTER TABLE

-- Option 1: Define clustering key at table creation
CREATE TABLE sales (
    sale_id NUMBER,
    sale_date DATE,
    region VARCHAR(50),
    product_id NUMBER,
    amount NUMBER(12,2)
)
CLUSTER BY (sale_date, region);

-- Option 2: Add clustering key to an existing table
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Option 3: Use an expression as a clustering key
-- Useful when you want to cluster on a derived value
ALTER TABLE events CLUSTER BY (DATE_TRUNC('month', event_timestamp));

-- Remove a clustering key entirely
ALTER TABLE sales DROP CLUSTERING KEY;

-- Suspend Automatic Clustering (keeps key definition, pauses background re-clustering)
ALTER TABLE sales SUSPEND RECLUSTER;

-- Resume Automatic Clustering
ALTER TABLE sales RESUME RECLUSTER;
⚠️Composite Clustering Keys — Order Matters

When defining a composite clustering key with multiple columns, the order of columns matters. Snowflake clusters primarily by the first column, then by the second within each first-column bucket, and so on. Put the column your queries filter on most often first; when cardinalities differ greatly, Snowflake’s guidance is to order columns from lowest to highest cardinality. Example: CLUSTER BY (region, sale_date) works well if queries almost always filter on region first.
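Since SYSTEM$CLUSTERING_INFORMATION accepts an arbitrary column list, candidate orders can be compared before committing to a key. A sketch using the sales table from the example above:

```sql
-- Evaluate both candidate column orders; neither call requires a defined clustering key
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(region, sale_date)');
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');
-- Prefer the order that yields the lower average_depth for your dominant filter pattern
```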

When to Use Clustering Keys

Should You Define a Clustering Key?

  • Table size: good for multi-terabyte tables (hundreds of millions of rows or more); poor for small-to-medium tables (< 1 TB), where full scans are fast anyway.
  • Query patterns: good when queries consistently filter on the same 1–2 columns; poor for ad-hoc queries with varied filter columns each time.
  • Data mutation rate: good for append-heavy or infrequently updated tables; poor for highly volatile tables with frequent updates/deletes (constant re-clustering is expensive).
  • Pruning effectiveness today: a good candidate when partitions scanned ≈ partitions total in Query Profile; a poor candidate when natural clustering already achieves high pruning.
  • Column cardinality: good for medium-to-high cardinality columns (dates, regions, categories); poor for very low cardinality columns (boolean flags, Y/N), which offer poor selectivity.
  • Cost consideration: good for high-value production tables where query savings justify the serverless cost; poor for dev/test tables or tables queried infrequently.

6. Measuring Cluster Quality: Cluster Depth

Snowflake provides the SYSTEM$CLUSTERING_INFORMATION function to evaluate how well-clustered a table is on a given set of columns.

Checking Cluster Depth and Clustering Information
-- Check clustering information for a table on a specific column set
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');

-- Example output (JSON):
-- {
--   "cluster_by_keys" : "LINEAR(SALE_DATE, REGION)",
--   "total_partition_count" : 1250,
--   "total_constant_partition_count" : 42,
--   "average_overlaps" : 3.14,
--   "average_depth" : 2.87,        ← lower is better; 1.0 = perfectly clustered
--   "partition_depth_histogram" : {
--     "00000" : 42,
--     "00001" : 418,
--     "00002" : 561,
--     "00003" : 229
--   }
-- }

-- Key metrics to interpret:
-- average_depth: ideally low (e.g. < 3); higher means poor clustering (more overlap between micro-partitions)
-- average_overlaps: how many other micro-partitions share overlapping key ranges
-- total_constant_partition_count: micro-partitions with only one distinct key value (perfectly prunable)

-- Check the AUTOMATIC_CLUSTERING_HISTORY view to see re-clustering activity
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE TABLE_NAME = 'SALES'
ORDER BY START_TIME DESC
LIMIT 10;
🎯Exam Tip: Cluster Depth Interpretation

The exam tests whether you can interpret cluster depth values. Remember: lower cluster depth = better clustering. A depth of 1 means each micro-partition’s key range is unique with no overlap — perfect pruning. A depth of 10 means a query filtering on a specific value must scan an average of 10 micro-partitions for every logical key value.
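The function returns a JSON string, so the depth can be extracted directly for monitoring or alerting. A sketch, assuming the sales table from earlier examples:

```sql
-- Pull average_depth out of the JSON returned by SYSTEM$CLUSTERING_INFORMATION
SELECT PARSE_JSON(
         SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)')
       ):average_depth::FLOAT AS avg_depth;
```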


7. Automatic Clustering

Automatic Clustering is a Snowflake-managed serverless service that continuously monitors a clustered table and triggers background re-clustering when the cluster depth degrades beyond a threshold.

Setting Up Automatic Clustering on a Large Table

Step 1: Evaluate whether clustering is needed

Run a representative query and check Query Profile. Compare 'Partitions scanned' to 'Partitions total'. Also run SYSTEM$CLUSTERING_INFORMATION to get a baseline cluster depth.

SQL
SELECT SYSTEM$CLUSTERING_INFORMATION('my_large_table', '(event_date, region)');
💡Only proceed if the table is multi-TB and queries show poor pruning. Small tables do not need clustering keys.
Step 2: Define the clustering key on the table

Use ALTER TABLE ... CLUSTER BY to add the clustering key. Choose columns that appear most frequently in WHERE clauses. This action alone enables Automatic Clustering.

SQL
ALTER TABLE my_large_table CLUSTER BY (event_date, region);
💡Automatic Clustering begins immediately after defining the clustering key. Initial re-clustering of a large table may take hours and will consume serverless credits.
Step 3: Monitor re-clustering progress

Query AUTOMATIC_CLUSTERING_HISTORY in SNOWFLAKE.ACCOUNT_USAGE to see when re-clustering jobs ran, how many credits were consumed, and how many bytes were processed.

SQL
SELECT
  table_name,
  start_time,
  end_time,
  credits_used,
  bytes_reclustered,
  partitions_reclustered
FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE table_name = 'MY_LARGE_TABLE'
ORDER BY start_time DESC;
💡Credits consumed by Automatic Clustering appear as 'Serverless' credits in billing — they are NOT charged to your virtual warehouses.
Step 4: Validate improvement with SYSTEM$CLUSTERING_INFORMATION

After re-clustering completes, re-run the clustering information function. You should see average_depth decrease. Re-run your representative query and compare Partitions scanned.

SQL
SELECT SYSTEM$CLUSTERING_INFORMATION('my_large_table', '(event_date, region)');
💡If cluster depth is still high after re-clustering, consider whether your clustering key columns are the right choice, or whether query patterns require a different column order.
Step 5: Optionally suspend clustering on non-production tables

If the table enters a read-only phase (e.g., archival), you can suspend Automatic Clustering to stop incurring serverless credits while preserving the clustering key definition.

SQL
ALTER TABLE my_large_table SUSPEND RECLUSTER;
💡Suspended tables retain their clustering key. Resume with ALTER TABLE my_large_table RESUME RECLUSTER when needed.
⚠️Automatic Clustering Costs Serverless Credits

Automatic Clustering consumes serverless credits — not your warehouse credits. Costs depend on the amount of data re-clustered. Enabling clustering on a table with frequent random inserts and updates can generate significant ongoing costs. Always weigh the query performance improvement against the re-clustering cost before enabling.


8. Table Types and Storage Costs

Snowflake has three table types with different storage cost profiles. Understanding the differences is critical for both the exam and real-world cost optimisation.

Snowflake Table Types: Storage Cost Comparison

  • Time Travel retention: permanent tables get 0–90 days (default 1 day; up to 90 days on Enterprise Edition); transient tables get 0–1 day only, regardless of edition.
  • Fail-safe period: permanent tables always get 7 days, with no configuration; transient tables get none, ever.
  • Storage cost: permanent is higher, paying for active + Time Travel + 7-day fail-safe storage; transient is lower, paying for active + at most 1 day of Time Travel.
  • Use case: permanent for production tables requiring disaster recovery and data auditing; transient for staging tables, ETL intermediaries, and temporary result sets.
  • UNDROP support: both, within the applicable Time Travel retention period (up to 1 day for transient).
  • Data recoverability: high for permanent (Snowflake Support can recover data from fail-safe); low for transient (once Time Travel expires, the data is gone permanently).
ℹ️Temporary Tables

Temporary tables behave like transient tables (no fail-safe, up to 1-day Time Travel) but are additionally scoped to the session — they are automatically dropped when the session ends. They are ideal for intermediate query results within a single session and are not visible to other sessions.

Creating Transient and Temporary Tables
-- Transient table: persists across sessions, no fail-safe
CREATE TRANSIENT TABLE staging.order_staging (
    order_id NUMBER,
    raw_json VARIANT,
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
DATA_RETENTION_TIME_IN_DAYS = 1;  -- max for transient

-- Temporary table: session-scoped, dropped automatically when session ends
CREATE TEMPORARY TABLE session_work.calc_results (
    customer_id NUMBER,
    lifetime_value NUMBER(12,2)
);

-- Convert a permanent table to transient (must recreate)
-- There is no ALTER TABLE ... SET TYPE = TRANSIENT in Snowflake
-- You must CREATE TRANSIENT TABLE ... AS SELECT * FROM original_table

-- Check current table type
SHOW TABLES LIKE 'ORDER_STAGING' IN SCHEMA staging;
-- Look at the "kind" column: TABLE (permanent), TRANSIENT, TEMPORARY

9. Storage Metrics: TABLE_STORAGE_METRICS

Snowflake provides the INFORMATION_SCHEMA.TABLE_STORAGE_METRICS view (and the SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS view for historical data) to inspect exactly how much storage each table is consuming across its three storage layers.

Querying TABLE_STORAGE_METRICS for Storage Breakdown
-- Per-table storage breakdown (current state)
SELECT
    table_schema,
    table_name,
    is_transient,
    -- Active data currently in the table
    ROUND(active_bytes / POWER(1024, 3), 2) AS active_gb,
    -- Data retained for Time Travel (deleted/updated rows still accessible via AT/BEFORE)
    ROUND(time_travel_bytes / POWER(1024, 3), 2) AS time_travel_gb,
    -- Data retained in fail-safe (7 days for permanent, 0 for transient/temporary)
    ROUND(failsafe_bytes / POWER(1024, 3), 2) AS failsafe_gb,
    -- Total billed storage
    ROUND((active_bytes + time_travel_bytes + failsafe_bytes) / POWER(1024, 3), 2) AS total_billed_gb
FROM information_schema.table_storage_metrics
WHERE active_bytes > 0
ORDER BY total_billed_gb DESC;

-- Find tables with the most fail-safe storage (potential cost saving opportunity)
SELECT
    table_name,
    ROUND(failsafe_bytes / POWER(1024, 3), 2) AS failsafe_gb
FROM information_schema.table_storage_metrics
WHERE failsafe_bytes > 1073741824  -- > 1 GB in fail-safe
ORDER BY failsafe_bytes DESC;

-- Identify staging tables that should probably be TRANSIENT
SELECT
    table_name,
    is_transient,
    ROUND(failsafe_bytes / POWER(1024, 3), 2) AS failsafe_gb
FROM information_schema.table_storage_metrics
WHERE table_name ILIKE '%staging%'
  AND is_transient = 'NO'  -- permanent, not transient
ORDER BY failsafe_bytes DESC;
🎯Exam Tip: Three Storage Cost Components

The exam frequently tests understanding of Snowflake’s three storage billing components. Remember: active_bytes + time_travel_bytes + failsafe_bytes = total billed. Transient tables have no failsafe_bytes. Temporary tables have neither failsafe_bytes nor cross-session persistence. Using transient tables for staging data is a common cost-optimisation strategy tested in scenario questions.


10. Columnar vs Row-Based Storage

Snowflake stores data in a columnar format. This is one of the most important architectural decisions underpinning its analytical performance. Understanding why columnar beats row-based storage for OLAP workloads is tested in the COF-C02 exam.

Columnar vs Row-Based Storage Layout

In row-based storage (left), all columns for a single row are stored contiguously. To read a single column across millions of rows, the database must scan every byte of every row. In columnar storage (right), all values for a single column are stored contiguously. Reading one column across millions of rows requires only a tiny fraction of the total I/O, and values within a column compress far more efficiently because they share the same data type and often similar values.

Side-by-side diagram. Left: rows stored as ROW1(col1,col2,col3,col4), ROW2(col1,col2,col3,col4)... Right: columns stored as COL1(r1,r2,r3...), COL2(r1,r2,r3...) β€” highlighting that analytical queries only touch relevant columns.

Columnar vs Row-Based Storage for Analytics

  • Analytical query I/O: columnar reads only the columns referenced in SELECT and WHERE, dramatically reducing I/O; row-based must read entire rows even when only 2 of 50 columns are needed.
  • Compression ratio: columnar is high, since values in the same column share a data type and often a similar range, making run-length encoding and dictionary compression highly effective; row-based is lower, since mixed data types within a row reduce compression effectiveness.
  • Row-level INSERT/UPDATE speed: columnar is slower, since modifying a single row requires rewriting entire micro-partitions; row-based is faster, since individual row pointers allow in-place updates.
  • Aggregation performance: columnar is excellent, since SUM/AVG/COUNT on one column reads only that column's data; row-based is poor, since it must read all columns to access one.
  • OLAP workload fit: columnar is an excellent fit for analytics, reporting, and BI queries; row-based is a poor fit, since OLAP on row-based systems requires column-store indexes.
  • OLTP workload fit: columnar is poor, since it is not designed for high-frequency single-row transactions; row-based is excellent, being optimised for transactional INSERT/UPDATE/DELETE.

Key Terms

📖

Data Storage & Clustering Key Terms


Micro-partition

The fundamental unit of Snowflake storage. Each micro-partition is a contiguous slice of a table, stored in columnar format, automatically sized between 50–500 MB compressed, and immutable once written.


Query pruning

The process by which Snowflake's Cloud Services layer compares query predicate values to per-column min/max metadata and skips micro-partitions that cannot contain matching rows, reducing I/O.


Clustering key

One or more columns (or expressions) designated as the primary sort dimension for a table. Enables Snowflake to re-organise data into micro-partitions that are optimally prunable for queries filtering on those columns.


Cluster depth

A metric returned by SYSTEM$CLUSTERING_INFORMATION indicating the average number of micro-partitions that contain overlapping key ranges. Lower values indicate better clustering. Ideal value is close to 1.


Automatic Clustering

A Snowflake-managed serverless service that continuously monitors a clustered table's depth and triggers background re-clustering jobs when degradation is detected. Charged as serverless credits.


Number of Distinct Values

A per-column metadata statistic stored for each micro-partition indicating the cardinality (count of unique values) within that micro-partition. Part of the four key metadata fields tracked.


Fail-safe

A 7-day period following the Time Travel retention window during which Snowflake retains deleted or expired micro-partition data. Data can only be recovered by Snowflake Support. Only applies to permanent tables.


Transient table

A Snowflake table type with no fail-safe period and a maximum of 1 day of Time Travel retention. Lower storage cost than permanent tables. Suitable for staging and ETL intermediate data.


Natural clustering

The implicit ordering of data within micro-partitions based on insertion order, without any defined clustering key. Works well for chronologically loaded time-series data.


TABLE_STORAGE_METRICS

An INFORMATION_SCHEMA view that reports active_bytes, time_travel_bytes, and failsafe_bytes per table, enabling detailed analysis of storage costs across Snowflake table types.


Flashcard Review

Micro-Partitions
QUESTION

What is the compressed size range of a Snowflake micro-partition?

ANSWER

50 MB to 500 MB compressed. Snowflake automatically determines the exact size — you cannot configure it. Micro-partitions are contiguous, immutable, and stored in columnar format.

Micro-Partition Metadata
QUESTION

What four metadata statistics does Snowflake store per column per micro-partition?

ANSWER

1. Min value — smallest value in that column within the micro-partition. 2. Max value — largest value. 3. NDV (Number of Distinct Values) — cardinality within the micro-partition. 4. Null count — number of null values. This metadata is stored in the Cloud Services layer and used for query pruning.

Clustering Metrics
QUESTION

What does a cluster depth of 1.0 indicate versus a depth of 8.0?

ANSWER

A cluster depth of 1.0 means each micro-partition has a unique, non-overlapping key range — perfect clustering. A depth of 8.0 means an average of 8 micro-partitions share overlapping key ranges for any given key value — poor clustering requiring more micro-partitions to be scanned per query.

Storage Billing
QUESTION

What are the three storage cost components billed by Snowflake for a permanent table?

ANSWER

1. Active bytes — current live data. 2. Time Travel bytes — data retained within the Time Travel retention window (0–90 days for permanent tables). 3. Fail-safe bytes — data retained for 7 days after Time Travel expires. Transient tables have no fail-safe bytes. Visible via INFORMATION_SCHEMA.TABLE_STORAGE_METRICS.

Clustering Keys
QUESTION

Name two situations where you should NOT define a clustering key on a Snowflake table.

ANSWER

1. Small tables (< 1 TB) — full micro-partition scans are fast enough that the overhead and cost of Automatic Clustering are not justified. 2. Highly volatile tables with frequent updates/deletes — constant DML triggers constant re-clustering, generating significant and continuous serverless credit consumption.


Practice Quizzes

Query Pruning

A data engineer notices that a query scanning a 5 TB SALES table reads 98% of all micro-partitions every time. The WHERE clause filters on SALE_DATE. What is the MOST likely cause?

Clustering Metrics

Which Snowflake function returns the average cluster depth of a table for a given set of columns?

Table Types

An organisation has a STAGING_ORDERS table used only as an ETL landing zone. Data is loaded, transformed, and moved to production tables within 24 hours. The table currently accumulates significant fail-safe storage costs. What table type change would ELIMINATE fail-safe costs?


Cheat Sheet

📋
Quick Reference

Data Storage & Clustering β€” COF-C02 Quick Reference

🗂️ Micro-Partition Facts
  • Size: 50–500 MB compressed — automatic, not configurable
  • Format: columnar (PAX) — within each micro-partition
  • Mutability: immutable — DML creates new micro-partitions
  • Metadata location: Cloud Services layer — free to query, no compute cost
  • Metadata fields: min, max, NDV, null count — per column, per micro-partition
🔑 Clustering Commands
  • Add clustering key: ALTER TABLE t CLUSTER BY (col) — enables Automatic Clustering
  • Remove clustering key: ALTER TABLE t DROP CLUSTERING KEY
  • Suspend re-clustering: ALTER TABLE t SUSPEND RECLUSTER — keeps key, stops jobs
  • Check cluster depth: SYSTEM$CLUSTERING_INFORMATION('t','(col)') — lower depth = better
  • View AC history: SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY — credits, bytes, partitions
💾 Table Type Storage
  • Permanent: active + Time Travel (0–90d) + fail-safe (7d) — highest cost
  • Transient: active + Time Travel (0–1d) + no fail-safe — medium cost
  • Temporary: active + Time Travel (0–1d) + no fail-safe — session-scoped
  • Storage view: INFORMATION_SCHEMA.TABLE_STORAGE_METRICS — active / time travel / fail-safe bytes
✅ When to Cluster
  • DO cluster: multi-TB tables, consistent filter columns
  • DO NOT cluster: small tables, highly volatile tables
  • Cluster cost type: serverless credits — not warehouse credits
  • Ideal cluster depth: close to 1.0 — higher = worse pruning

Summary

Snowflake’s storage architecture is built on immutable, columnar micro-partitions that are automatically sized and created without any DBA intervention. The rich per-column metadata stored in the Cloud Services layer powers query pruning — the ability to skip entire micro-partitions without reading them — which is fundamental to Snowflake’s query performance.

Natural clustering works well for chronologically loaded time-series data, but large tables with varied query patterns benefit from explicitly defined clustering keys. Automatic Clustering maintains cluster quality continuously as a serverless background service, charged separately from virtual warehouse credits.

For the COF-C02 exam, internalise:

  • Micro-partition size: 50–500 MB compressed
  • Metadata tracked: min, max, NDV, null count per column per micro-partition
  • Cluster depth: lower is better, check with SYSTEM$CLUSTERING_INFORMATION
  • Clustering cost: serverless credits (not warehouse credits)
  • Table types: permanent (fail-safe 7d) vs transient (no fail-safe) vs temporary (session-scoped, no fail-safe)
  • Storage billing: active + time_travel + failsafe bytes, visible in INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
