The COPY INTO command is Snowflake's primary method for bulk loading data from external sources into tables. This command enables efficient data ingestion from staged files located in internal stages (user, table, or named internal stages) or external stages that reference cloud storage (Amazon S3, Azure Blob Storage, Google Cloud Storage).
The basic syntax follows: COPY INTO <table_name> FROM <stage_location>. The command supports various file formats including CSV, JSON, Avro, Parquet, ORC, and XML. You specify the format using the FILE_FORMAT parameter, either referencing a named file format object or defining format options inline.
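For example, a minimal load of CSV files might look like the following (the table sales_raw and the named stage sales_stage are hypothetical):

COPY INTO sales_raw
  FROM @sales_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

Alternatively, FILE_FORMAT = (FORMAT_NAME = 'my_csv_format') would reference a named file format object instead of defining options inline.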
Key parameters include (a combined example follows this list):
1. FILES - Specifies exact file names to load
2. PATTERN - Uses regex patterns to match specific files
3. ON_ERROR - Controls behavior when errors occur (CONTINUE, SKIP_FILE, ABORT_STATEMENT)
4. VALIDATION_MODE - Validates data before loading using RETURN_ERRORS, RETURN_ALL_ERRORS, or RETURN_n_ROWS
5. PURGE - Automatically removes files after successful loading when set to TRUE
6. FORCE - Reloads files even if previously loaded
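A sketch combining several of these parameters, with placeholder table and stage names:

COPY INTO orders
  FROM @orders_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  PATTERN = '.*orders_2024.*[.]csv'
  ON_ERROR = 'SKIP_FILE'
  PURGE = TRUE;

Here PATTERN restricts the load to matching files, ON_ERROR skips any file containing an error, and PURGE removes files from the stage once they load successfully.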
Snowflake maintains load metadata for 64 days, preventing duplicate loading of the same files. The COPY INTO command processes staged files in parallel, using the compute resources of the virtual warehouse that executes it for optimal performance.
Transformation capabilities allow column reordering, omitting columns, casting data types, and applying functions during the load process. You can select specific columns using SELECT statements within the COPY command.
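For instance, a load that reorders columns, casts values, and applies a function might look like this (positional references $1-$3 point to fields in the staged file; the table and stage names are illustrative):

COPY INTO customers (id, full_name, signup_date)
  FROM (
    SELECT $1::NUMBER, UPPER($2), TO_DATE($3, 'YYYY-MM-DD')
    FROM @customer_stage
  )
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);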
Best practices include:
- Sizing data files to roughly 100-250 MB (compressed) for optimal parallel processing
- Using appropriate file formats for your data structure
- Implementing error handling strategies based on data quality requirements
- Utilizing stages for secure and organized file management
The COPY INTO command tracks which files have been loaded through metadata, ensuring data integrity and preventing accidental duplicate loads. This makes it ideal for batch processing scenarios where reliability and performance are critical requirements.
COPY INTO Command for Bulk Loading in Snowflake
Why COPY INTO for Bulk Loading is Important
The COPY INTO command is the primary method for loading large volumes of data into Snowflake tables. Understanding this command is essential because it offers the highest throughput for data ingestion, supports multiple file formats, and provides robust error handling capabilities. For the SnowPro Core exam, this topic frequently appears as it represents a fundamental operation in Snowflake.
What is COPY INTO for Bulk Loading?
COPY INTO is a SQL command that loads data from staged files into a Snowflake table. It can load data from:
- Internal stages (user, table, or named stages)
- External stages (Amazon S3, Azure Blob Storage, Google Cloud Storage)
- External locations using cloud storage URLs
The command supports multiple file formats including CSV, JSON, Avro, ORC, Parquet, and XML.
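Semi-structured formats are typically loaded into a single VARIANT column. A minimal sketch, assuming a hypothetical table raw_json and stage json_stage:

CREATE TABLE raw_json (v VARIANT);

COPY INTO raw_json
  FROM @json_stage
  FILE_FORMAT = (TYPE = JSON);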
How COPY INTO Works
1. Stage the Data: Files must first be placed in a stage, internal or external (see the example after these steps)
2. Execute COPY INTO: The command reads files from the stage and loads them into the target table
3. Metadata Tracking: Snowflake maintains metadata for 64 days to prevent reloading the same files
4. Parallel Processing: Staged files are loaded in parallel across the warehouse's compute resources; splitting data into multiple appropriately sized files maximizes this parallelism
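To illustrate steps 1 and 2 with a table's internal stage (the local path and table name are placeholders; PUT is issued from a client such as SnowSQL):

-- Step 1: upload a local file to the table's internal stage
PUT file:///tmp/orders.csv @%orders;

-- Step 2: load the staged file into the table
COPY INTO orders
  FROM @%orders
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);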
Key Syntax:
COPY INTO table_name
  FROM @stage_name
  FILE_FORMAT = (TYPE = CSV)
  PATTERN = '.*[.]csv'
  ON_ERROR = 'CONTINUE';
Important Parameters:
- FILES: Specify exact file names to load
- PATTERN: Use regex to match file names
- FILE_FORMAT: Define format type and options
- ON_ERROR: Controls behavior when errors occur (CONTINUE, SKIP_FILE, ABORT_STATEMENT)
- VALIDATION_MODE: Validates data before actual loading
- FORCE: Reloads files even if previously loaded (bypasses metadata check)
- PURGE: Automatically deletes files after successful load
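As a sketch of VALIDATION_MODE, which checks the staged data and returns any errors without loading rows (names are placeholders):

COPY INTO orders
  FROM @orders_stage
  FILE_FORMAT = (TYPE = CSV)
  VALIDATION_MODE = 'RETURN_ERRORS';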
Load History and Duplicate Prevention
Snowflake tracks loaded files using metadata stored for 64 days. This prevents accidental reloading of the same files. To reload files within this period, use the FORCE = TRUE option.
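For example, to reload files that were already loaded within the 64-day window (names are illustrative):

COPY INTO orders
  FROM @orders_stage
  FILE_FORMAT = (TYPE = CSV)
  FORCE = TRUE;

Because this bypasses the metadata check, it can produce duplicate rows in the target table.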
Error Handling Options
- ABORT_STATEMENT (default): Stops the load when the first error occurs
- CONTINUE: Skips problematic rows and continues loading
- SKIP_FILE: Skips the entire file when an error is found
- SKIP_FILE_n: Skips a file after n errors
- SKIP_FILE_n%: Skips a file when n% of its rows contain errors
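A sketch of a percentage-based threshold (placeholder names again):

COPY INTO orders
  FROM @orders_stage
  FILE_FORMAT = (TYPE = CSV)
  ON_ERROR = 'SKIP_FILE_10%';

Any file in which more than 10% of rows contain errors is skipped in its entirety; files below the threshold load normally.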
Exam Tips: Answering Questions on COPY INTO Command for Bulk Loading
1. Remember the 64-day metadata retention: Questions often ask about duplicate file prevention and how long Snowflake tracks loaded files
2. Know the difference between FILES and PATTERN: FILES uses exact names while PATTERN uses regex expressions
3. Understand ON_ERROR options: Be able to identify which option to use for different scenarios
4. VALIDATION_MODE does not load data: It only validates and returns errors - this is a common exam question
5. FORCE = TRUE bypasses metadata: Remember this is required to reload previously loaded files
6. File format precedence: Inline options override named file format options, which override stage file format options
7. Supported file formats: Know that CSV, JSON, Avro, ORC, Parquet, and XML are supported
8. PURGE option: Understand that setting PURGE = TRUE deletes staged files after successful loading
9. Transformations during load: COPY INTO supports column reordering, column omission, and simple transformations using SELECT statements
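Related to tips 1 and 5, load history can be inspected with the COPY_HISTORY table function; a minimal sketch, assuming a target table named orders:

SELECT file_name, status, row_count
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'orders',
  START_TIME => DATEADD(day, -7, CURRENT_TIMESTAMP())
));

This returns the files loaded into the table over the past seven days, along with their load status.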