
Introduction to Bulk Data Loading

Bulk loading is the process of efficiently loading large volumes of data into Snowflake tables from external sources. Snowflake's COPY INTO command provides high-performance, parallel data loading capabilities.


Loading Data: The Big Picture

Data Loading Architecture

Figure: end-to-end data loading flow, from external data sources → stages → COPY INTO → Snowflake tables (cloud storage, stages, file formats, and target tables).

The typical bulk loading workflow:

  1. Prepare - Format and stage your data files
  2. Create - Define file formats and stages
  3. Load - Use COPY INTO to load data
  4. Validate - Verify and handle errors
  5. Optimise - Monitor and improve performance
Data Loading
QUESTION

What is the primary command for bulk loading data into Snowflake?

ANSWER

COPY INTO. This command loads data from staged files into existing tables with high performance and automatic parallelisation.


Understanding Stages

Stages are locations where data files are stored before loading into Snowflake. There are three types:

1. User Stages (@~)

Personal staging area for each user.

Using User Stages
-- Upload file to user stage (PUT runs from SnowSQL or a client driver, not the web UI)
PUT file:///tmp/sales_data.csv @~;

-- List files in user stage
LIST @~;

-- Load from user stage
COPY INTO sales_table
FROM @~/sales_data.csv
FILE_FORMAT = (TYPE = 'CSV');

-- Remove file from user stage
REMOVE @~/sales_data.csv;

2. Table Stages (@%table_name)

Automatically created for each table.

Using Table Stages
-- Upload to table stage
PUT file:///tmp/orders.csv @%orders_table;

-- List files in table stage
LIST @%orders_table;

-- Load from table stage
COPY INTO orders_table
FROM @%orders_table
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"');

3. Named Stages (Internal or External)

Created explicitly and can be shared across the account.

Creating Named Stages
-- Internal named stage
CREATE STAGE my_internal_stage;

-- External stage (AWS S3)
CREATE STAGE my_s3_stage
  URL = 's3://mybucket/data/'
  CREDENTIALS = (AWS_KEY_ID = 'xxx' AWS_SECRET_KEY = 'yyy');

-- External stage (Azure Blob)
CREATE STAGE my_azure_stage
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/'
  CREDENTIALS = (AZURE_SAS_TOKEN = 'zzz');

-- External stage (GCP)
CREATE STAGE my_gcs_stage
  URL = 'gcs://mybucket/path/'
  STORAGE_INTEGRATION = my_gcp_integration;

Stage Types Comparison

Figure: comparison of User, Table, and Named stages with their scope and use cases.
Stages
QUESTION

What's the difference between internal and external stages?

ANSWER

Internal stages store data within Snowflake's cloud storage. External stages reference data in your own cloud storage (S3, Azure Blob, GCS) without moving it into Snowflake.


File Formats

File formats define how Snowflake interprets the structure and content of your data files.

Supported File Types

Format  | Use Case               | Key Benefits
CSV     | Delimited text files   | Universal compatibility, simple
JSON    | Semi-structured data   | Flexible schema, nested data
Avro    | Binary format          | Compact, schema evolution
Parquet | Columnar format        | High compression, fast queries
ORC     | Optimised row columnar | Efficient for big data workloads
XML     | Hierarchical data      | Legacy systems integration

Creating File Formats

CSV File Format Examples
-- Basic CSV format
CREATE FILE FORMAT csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  TRIM_SPACE = TRUE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  NULL_IF = ('NULL', 'null', '');

-- CSV with custom delimiter
CREATE FILE FORMAT pipe_delimited_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1;

-- Tab-delimited format
CREATE FILE FORMAT tsv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '\t'
  COMPRESSION = 'GZIP';
JSON File Format Examples
-- Basic JSON format
CREATE FILE FORMAT json_format
  TYPE = 'JSON'
  COMPRESSION = 'AUTO'
  STRIP_OUTER_ARRAY = TRUE;

-- JSON with specific date format
CREATE FILE FORMAT json_with_dates
  TYPE = 'JSON'
  DATE_FORMAT = 'YYYY-MM-DD'
  TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS';
Parquet Format
-- Parquet format (binary, no delimiters needed)
CREATE FILE FORMAT parquet_format
  TYPE = 'PARQUET'
  COMPRESSION = 'SNAPPY';
File Formats
QUESTION

Why would you choose Parquet over CSV for large datasets?

ANSWER

Parquet offers better compression (smaller files), faster query performance (columnar format), and built-in schema definition. It's ideal for large analytical workloads.


The COPY INTO Command

The COPY INTO command loads data from staged files into tables.

Basic Syntax

Basic COPY INTO Syntax
COPY INTO target_table
FROM @stage_name/path/
FILE_FORMAT = (FORMAT_NAME = 'format_name')
PATTERN = '.*\.csv'
ON_ERROR = 'CONTINUE';

Comprehensive Example

Complete Data Loading Example
-- 1. Create target table
CREATE TABLE customer_orders (
  order_id INTEGER,
  customer_id INTEGER,
  order_date DATE,
  amount DECIMAL(10,2),
  status VARCHAR(50),
  product_category VARCHAR(100)
);

-- 2. Create file format
CREATE FILE FORMAT csv_orders_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('NULL', 'null', '')
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  COMPRESSION = 'AUTO';

-- 3. Create named stage
CREATE STAGE orders_stage
  FILE_FORMAT = csv_orders_format;

-- 4. Load data with error handling
COPY INTO customer_orders
FROM @orders_stage/2024/
FILE_FORMAT = (FORMAT_NAME = 'csv_orders_format')
PATTERN = '.*orders.*\.csv\.gz'
ON_ERROR = 'SKIP_FILE_5%' -- Skip a file if more than 5% of its rows error
PURGE = TRUE              -- Delete files after successful load
FORCE = FALSE;            -- Don't reload previously loaded files

-- 5. Verify the load
SELECT COUNT(*) FROM customer_orders;

COPY INTO Execution Flow

Figure: COPY INTO execution flow: validation → parallelisation → loading → error handling.

Error Handling Options

Error Handling Strategies
-- Continue loading valid rows, skipping bad ones
COPY INTO table1 FROM @stage1
ON_ERROR = 'CONTINUE';

-- Skip an entire file on the first error
COPY INTO table1 FROM @stage1
ON_ERROR = 'SKIP_FILE';

-- Skip a file if it has more than N errors
COPY INTO table1 FROM @stage1
ON_ERROR = 'SKIP_FILE_5'; -- Skip if >5 errors

-- Skip a file if more than N% of its rows error
COPY INTO table1 FROM @stage1
ON_ERROR = 'SKIP_FILE_10%'; -- Skip if >10% error rate

-- Abort on the first error (the default for bulk COPY INTO)
COPY INTO table1 FROM @stage1
ON_ERROR = 'ABORT_STATEMENT';
COPY INTO
QUESTION

What does the PURGE option do in COPY INTO?

ANSWER

PURGE = TRUE automatically deletes successfully loaded files from the stage after loading completes. This helps manage storage costs and prevents reloading the same data.


Advanced Loading Techniques

Column Mapping and Transformation

Loading with Transformations
-- Load specific columns with transformations
COPY INTO customer_orders (
  order_id,
  customer_id,
  order_date,
  amount,
  status
)
FROM (
  SELECT
    $1::INTEGER,               -- Column 1 → order_id
    $2::INTEGER,               -- Column 2 → customer_id
    TO_DATE($3, 'YYYY-MM-DD'), -- Column 3 → order_date (converted)
    $4::DECIMAL(10,2),         -- Column 4 → amount
    UPPER($5)                  -- Column 5 → status (uppercase)
  FROM @orders_stage/raw/
)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
PATTERN = '.*\.csv';
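The positional references map CSV columns to target columns, with casts applied in flight. For intuition, here is the same row transform sketched in Python; the column layout and the `transform_row` helper are illustrative, not Snowflake tooling:

```python
from datetime import datetime
from decimal import Decimal

def transform_row(fields):
    """Mirror of the COPY INTO transform: cast, parse the date, uppercase status."""
    return {
        "order_id": int(fields[0]),                                    # $1::INTEGER
        "customer_id": int(fields[1]),                                 # $2::INTEGER
        "order_date": datetime.strptime(fields[2], "%Y-%m-%d").date(), # TO_DATE($3, 'YYYY-MM-DD')
        "amount": Decimal(fields[3]),                                  # $4::DECIMAL(10,2)
        "status": fields[4].upper(),                                   # UPPER($5)
    }

row = transform_row(["1001", "42", "2024-01-15", "99.95", "shipped"])
# row["status"] is "SHIPPED"; row["order_date"] is a proper date object
```

The difference in practice: Snowflake applies this transform in parallel across all staged files during the load, so no separate preprocessing pass is needed.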

Loading JSON Data

Loading Semi-Structured JSON
-- Create table with VARIANT column
CREATE TABLE user_events (
  event_id INTEGER,
  event_data VARIANT,
  loaded_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Load JSON data
COPY INTO user_events (event_id, event_data)
FROM (
  SELECT
    $1:event_id::INTEGER,
    $1 -- Entire JSON object
  FROM @json_stage/events/
)
FILE_FORMAT = (TYPE = 'JSON')
PATTERN = '.*\.json';

-- Query JSON data
SELECT
  event_id,
  event_data:user_id::INTEGER AS user_id,
  event_data:event_type::STRING AS event_type,
  event_data:properties AS properties
FROM user_events;

Loading from Multiple Files

Pattern Matching for Multiple Files
-- Load all CSV files from a path
COPY INTO orders
FROM @my_stage/2024/01/
PATTERN = '.*\.csv';

-- Load files matching a specific pattern
COPY INTO orders
FROM @my_stage/
PATTERN = '.*orders_202401.*\.csv\.gz';

-- Load a specific list of files
COPY INTO orders
FROM @my_stage/
FILES = ('file1.csv', 'file2.csv', 'file3.csv');
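PATTERN is a regular expression matched against the full path of each staged file, so a stray prefix or an unescaped dot can silently match nothing. A quick way to sanity-check a pattern before running COPY INTO is to replay it against a file listing locally; a small Python sketch (the staged paths are made up for illustration):

```python
import re

# Hypothetical staged file paths, for illustration only
staged_files = [
    "2024/01/orders_20240115.csv.gz",
    "2024/01/orders_20240131.csv.gz",
    "2024/02/orders_20240201.csv.gz",
    "2024/01/refunds_20240115.csv.gz",
]

# Same regex as the COPY INTO example above
pattern = re.compile(r".*orders_202401.*\.csv\.gz")

# The pattern is applied to the entire path, hence fullmatch
matched = [p for p in staged_files if pattern.fullmatch(p)]
print(matched)  # only the two January orders files
```

In a real pipeline you would feed this from the output of LIST @my_stage rather than a hard-coded list.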
Advanced Loading
QUESTION

How do you reference columns in a CSV file during COPY INTO with transformations?

ANSWER

Use positional notation: $1 for first column, $2 for second, etc. You can cast and transform these: $1::INTEGER, UPPER($2), TO_DATE($3, 'format').


Monitoring and Troubleshooting

View Load History

Monitoring Data Loads
-- View recent load history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'CUSTOMER_ORDERS',
  START_TIME => DATEADD(HOUR, -24, CURRENT_TIMESTAMP())
))
ORDER BY LAST_LOAD_TIME DESC;

-- Check for errors
SELECT
  FILE_NAME,
  STATUS,
  ERROR_COUNT,
  ERROR_LIMIT,
  FIRST_ERROR
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'CUSTOMER_ORDERS',
  START_TIME => DATEADD(DAY, -7, CURRENT_TIMESTAMP())
))
WHERE STATUS != 'LOADED'
ORDER BY LAST_LOAD_TIME DESC;

Validate Before Loading

Validation Techniques
-- See all errors without loading any data
COPY INTO customer_orders
FROM @orders_stage/
FILE_FORMAT = (FORMAT_NAME = 'csv_orders_format')
VALIDATION_MODE = 'RETURN_ERRORS';

-- Validate only the first N rows
COPY INTO customer_orders
FROM @orders_stage/
FILE_FORMAT = (FORMAT_NAME = 'csv_orders_format')
VALIDATION_MODE = 'RETURN_5_ROWS'; -- Parse and return the first 5 rows, failing on the first error

Load Monitoring Dashboard

Figure: load monitoring dashboard showing load status, error rates, throughput, and file processing metrics.

Best Practices

1. File Sizing

  • Optimal file size: 100-250 MB compressed
  • Avoid: Files < 10 MB (overhead) or > 5 GB (reduced parallelism)
  • Split large files into smaller chunks for better performance
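One way to hit that size target is to split source files before uploading them to a stage. A minimal Python sketch, assuming CSV input; the 200 MB default and the `split_csv` helper name are our own, not Snowflake tooling:

```python
import gzip
import os

def split_csv(src_path, out_dir, target_bytes=200 * 1024 * 1024):
    """Split a large CSV into gzipped chunks of roughly target_bytes
    (uncompressed) each, repeating the header in every chunk."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(src_path, newline="") as src:
        header = src.readline()
        chunk_idx, written, out = 0, 0, None
        for line in src:
            # Start a new chunk when the current one reaches the target size
            if out is None or written >= target_bytes:
                if out is not None:
                    out.close()
                chunk_idx += 1
                path = os.path.join(out_dir, f"part_{chunk_idx:04d}.csv.gz")
                out = gzip.open(path, "wt", newline="")
                out.write(header)
                written = 0
                chunk_paths.append(path)
            out.write(line)
            written += len(line)
        if out is not None:
            out.close()
    return chunk_paths
```

The resulting part_*.csv.gz files can then be uploaded with PUT (or copied to an external stage) and loaded in parallel with a single COPY INTO using PATTERN = '.*part_.*\.csv\.gz'.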
Best Practices
QUESTION

What is the optimal file size for bulk loading in Snowflake?

ANSWER

100-250 MB compressed per file. This balances parallelism and overhead. Avoid very small files (<10 MB) and very large files (>5 GB).


2. Compression

Always compress your files before loading:

  • gzip - Good compression, widely supported
  • Snappy - Faster decompression (Parquet)
  • Brotli - Higher compression ratio
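If your pipeline produces uncompressed files, gzipping them before upload is a one-liner in most languages. A Python sketch (note that for internal stages, PUT compresses automatically by default via AUTO_COMPRESS = TRUE, so pre-compression matters mainly for files landing in external stages):

```python
import gzip
import shutil

def gzip_file(src_path: str) -> str:
    """Compress src_path to src_path + '.gz' and return the new path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Stream in chunks so large files don't need to fit in memory
        shutil.copyfileobj(src, dst)
    return dst_path
```

Keep the .gz suffix on the uploaded files so COMPRESSION = 'AUTO' can detect the codec from the extension.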
Handling Compressed Files
-- Snowflake auto-detects compression
COPY INTO orders
FROM @my_stage/
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'AUTO')
PATTERN = '.*\.csv\.gz';

-- Explicitly specify compression
COPY INTO orders
FROM @my_stage/
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP');

3. Use Dedicated Warehouse

Dedicated Loading Warehouse
-- Create warehouse for data loading
CREATE WAREHOUSE LOADING_WH
  WITH WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Use it for loading
USE WAREHOUSE LOADING_WH;

COPY INTO large_table
FROM @data_stage/
FILE_FORMAT = (FORMAT_NAME = 'my_format');

4. Partition Data by Date

Organised Stage Structure
-- Organize files by date in the stage
-- s3://mybucket/data/2024/01/15/orders_*.csv.gz
-- YYYY/MM/DD/

-- Load a specific date's files
COPY INTO orders
FROM @my_stage/2024/01/15/
FILE_FORMAT = (FORMAT_NAME = 'csv_format')
PATTERN = '.*\.csv\.gz';

5. Enable PURGE for Production

Remove loaded files automatically to save costs:

Auto-Cleanup After Load
COPY INTO orders
FROM @my_stage/
FILE_FORMAT = (FORMAT_NAME = 'csv_format')
PURGE = TRUE; -- Delete files after successful load
Best Practices
QUESTION

Why should you use a dedicated warehouse for large data loads?

ANSWER

A dedicated warehouse prevents data loading operations from competing with query workloads, ensures predictable performance, and allows independent scaling and monitoring.


Common Patterns

Incremental Loading Pattern

Daily Incremental Load Process
-- 1. Create or replace a transient staging table
CREATE OR REPLACE TRANSIENT TABLE orders_staging LIKE orders;

-- 2. Load into staging
COPY INTO orders_staging
FROM @orders_stage/{{ current_date }}/
FILE_FORMAT = (FORMAT_NAME = 'csv_orders')
PURGE = TRUE;

-- 3. Deduplicate and merge
MERGE INTO orders o
USING (
  SELECT * FROM orders_staging
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY loaded_at DESC) = 1
) s
ON o.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  o.amount = s.amount,
  o.status = s.status
WHEN NOT MATCHED THEN INSERT VALUES (
  s.order_id, s.customer_id, s.order_date, s.amount, s.status
);

-- 4. Clean up staging
DROP TABLE orders_staging;
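The QUALIFY step keeps only the newest staging row per order_id before the merge, so a re-delivered file cannot overwrite fresh data with stale data. The same keep-latest-per-key logic, sketched in Python for intuition (the row shape is illustrative):

```python
def dedupe_latest(rows, key, order_by):
    """Keep only the row with the greatest order_by value per key, mirroring
    QUALIFY ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_by DESC) = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"order_id": 1, "status": "PENDING", "loaded_at": "2024-01-15T01:00"},
    {"order_id": 1, "status": "SHIPPED", "loaded_at": "2024-01-15T09:00"},
    {"order_id": 2, "status": "PENDING", "loaded_at": "2024-01-15T02:00"},
]
deduped = dedupe_latest(rows, key="order_id", order_by="loaded_at")
# order_id 1 keeps only the later SHIPPED row
```

ISO-formatted timestamps compare correctly as strings here; with real data you would compare proper timestamp values, as the SQL version does.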

Full Refresh Pattern

Full Table Refresh
-- 1. Create new table
CREATE OR REPLACE TABLE orders_new LIKE orders;

-- 2. Load all data
COPY INTO orders_new
FROM @orders_stage/full_export/
FILE_FORMAT = (FORMAT_NAME = 'csv_orders')
PURGE = TRUE;

-- 3. Swap tables atomically (zero-downtime)
ALTER TABLE orders SWAP WITH orders_new;

-- 4. Verify, then drop the previous data (held by orders_new after the swap)
-- SELECT COUNT(*) FROM orders;
DROP TABLE orders_new;

Practice Questions

Exam Prep
QUESTION

Which stage type is automatically created for every table?

ANSWER

Table stage (@%table_name). It's created automatically when a table is created and can only be used for that specific table.

Exam Prep
QUESTION

What does VALIDATION_MODE = 'RETURN_ERRORS' do?

ANSWER

It validates the COPY INTO operation and returns all errors without actually loading data. Useful for testing before running the actual load.

Exam Prep
QUESTION

What's the default value for ON_ERROR in COPY INTO?

ANSWER

ABORT_STATEMENT. The load operation will fail and rollback if any error is encountered. This is the safest but least flexible option.

Exam Prep
QUESTION

Can you transform data during the COPY INTO operation?

ANSWER

YES. You can use a SELECT subquery in the FROM clause to apply transformations, cast data types, rename columns, and filter rows during the load.

Exam Prep
QUESTION

What happens if you run COPY INTO twice with the same files and FORCE = FALSE?

ANSWER

The second run loads nothing. Snowflake tracks loaded files for 64 days and skips them unless FORCE = TRUE is specified.


Additional Resources

Best Practice Guides

  • Snowflake Data Loading Best Practices
  • Optimising Bulk Loads
  • Error Handling Strategies

Summary

Key Takeaways:

  • Use stages to organise files before loading
  • Define file formats for reusability and consistency
  • COPY INTO is the primary bulk loading command
  • Compress files (100-250 MB each) for optimal performance
  • Use ON_ERROR and VALIDATION_MODE to handle errors
  • PURGE = TRUE to auto-delete loaded files
  • Monitor loads with COPY_HISTORY in the information schema
  • Use dedicated warehouses for large loads

