Small File Compaction
Small File Compaction is a critical optimization technique in Azure data engineering, particularly relevant when working with distributed storage systems like Azure Data Lake Storage (ADLS) and processing engines such as Apache Spark and Azure Databricks. It addresses the 'small file problem,' which occurs when a large number of small files accumulate in a data lake, leading to significant performance degradation.

The small file problem arises from several factors: frequent micro-batch ingestion, over-partitioning of data, streaming workloads writing many tiny files, or poorly configured write operations. When query engines like Spark must read thousands of small files instead of fewer large ones, they incur excessive overhead from file listing, metadata management, and increased I/O, resulting in slower queries and higher costs.

Small File Compaction is the process of merging multiple small files into fewer, optimally sized larger files (typically 128 MB to 1 GB). In Azure, this can be achieved through several approaches:
1. **Delta Lake OPTIMIZE Command**: The most common method in Databricks, the OPTIMIZE command compacts small files into larger ones within Delta tables. It can be combined with Z-ORDER for further query optimization.
2. **Auto Compaction**: Delta Lake supports automatic compaction that triggers after writes, reducing manual intervention.
3. **Spark Repartitioning**: Using `repartition()` or `coalesce()` before writing data to control output file sizes.
4. **Scheduled Maintenance Jobs**: Periodic pipelines in Azure Data Factory or Synapse that read and rewrite partitions with optimal file sizes.
From a security and monitoring perspective, compaction jobs should be monitored for execution time and resource consumption using Azure Monitor and Spark UI metrics. Access control via Azure RBAC and ACLs ensures that compaction processes have appropriate permissions. Optimized file sizes also reduce storage costs and improve the efficiency of encryption and access auditing operations, contributing to a well-governed and performant data platform.
Small File Compaction: A Comprehensive Guide for DP-203
Why Small File Compaction Is Important
In distributed data systems like Azure Data Lake Storage, Apache Spark, and Delta Lake, data is often written in parallel across multiple nodes. This parallelism is great for throughput but frequently results in a large number of small files being created. This phenomenon is commonly referred to as the small file problem.
Small files are problematic for several reasons:
• Degraded Read Performance: Each file requires metadata operations (opening, reading headers, closing). When thousands or millions of small files exist, the overhead of these operations becomes significant, drastically slowing down queries.
• Increased Storage Costs: Small files lead to inefficient storage utilization. Metadata overhead per file adds up, and storage systems are optimized for larger, sequential reads.
• Strain on the Metastore: Catalog services like the Hive Metastore or Unity Catalog must track every file. A large number of small files increases metastore pressure and slows down table listing and planning operations.
• Slower Job Execution: Spark and similar engines create one task per file (or per partition). Too many small files means too many tasks, leading to excessive scheduling overhead and poor cluster utilization.
• Higher API Costs: In cloud storage like Azure Data Lake Storage Gen2, each file operation (list, read, etc.) incurs API call costs. More files mean more API calls and higher bills.
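To get a feel for the scale of this overhead, a back-of-the-envelope calculation helps (the figures below are illustrative round numbers, not Azure's actual prices or latencies):

```python
# Illustrative comparison: reading 100 GB stored as many small files
# versus a few large files. Each file needs at least one metadata
# round trip (open/list/read-header), so the operation count scales
# with the file count, not the data volume.
TOTAL_BYTES = 100 * 1024**3          # 100 GB of data

def file_operations(file_size_bytes):
    """Minimum number of per-file operations to read the full dataset."""
    return TOTAL_BYTES // file_size_bytes

small = file_operations(1 * 1024**2)    # 1 MB files
large = file_operations(1 * 1024**3)    # 1 GB files

print(f"1 MB files -> {small:,} file operations")   # 102,400
print(f"1 GB files -> {large:,} file operations")   # 100
print(f"Reduction factor: {small // large}x")       # 1024x
```

The same data volume requires roughly a thousand times more file operations when stored as 1 MB files, which is why both query latency and API billing suffer.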
What Is Small File Compaction?
Small file compaction is the process of combining multiple small files into fewer, larger files to improve query performance, reduce storage overhead, and optimize resource utilization. It is essentially a maintenance operation that reorganizes how data is physically stored without changing the logical content of the data.
In the Azure ecosystem, small file compaction is most commonly associated with:
• Delta Lake's OPTIMIZE command
• Apache Spark repartitioning and coalescing
• Azure Synapse Analytics pipeline activities
• Auto-compaction features in Delta Lake
How Small File Compaction Works
1. Delta Lake OPTIMIZE Command
The most common and exam-relevant method of small file compaction in the DP-203 context is the OPTIMIZE command in Delta Lake.
Syntax:
OPTIMIZE delta_table_name;
Or with a WHERE clause to target specific partitions:
OPTIMIZE delta_table_name WHERE date = '2024-01-01';
When you run OPTIMIZE:
• Delta Lake reads all the small files in the table (or targeted partitions).
• It combines them into larger files, typically targeting a size of around 1 GB per file (this is the default target file size, though it can be configured).
• The old small files are not immediately deleted; they are marked for removal and cleaned up later via the VACUUM command.
• The transaction log is updated to reflect the new file layout.
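Conceptually, the compaction step is a bin-packing pass over the table's current file list. The steps above can be sketched in pure Python (the 1 GB target mirrors OPTIMIZE's default, but this is an illustration of the idea, not Delta Lake's actual algorithm):

```python
TARGET = 1024**3  # ~1 GB: the default target file size for OPTIMIZE

def plan_compaction(file_sizes, target=TARGET):
    """Greedily group small files into bins of at most `target` bytes.

    Returns a list of groups; each group of input files would be
    rewritten as one larger output file, and the originals marked
    as removed in the transaction log.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 3,000 files of ~10 MB each compact down to ~30 files of ~1 GB.
plan = plan_compaction([10 * 1024**2] * 3000)
print(len(plan))   # prints 30
```

The key point the sketch captures: the logical data is untouched; only the grouping of rows into physical files changes.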
2. OPTIMIZE with Z-ORDER
You can combine compaction with data co-location using Z-Ordering:
OPTIMIZE delta_table_name ZORDER BY (column_name);
Z-Ordering physically co-locates related data in the same files, which improves data skipping and query performance for filters on the Z-Ordered columns. This is a form of compaction + data layout optimization.
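The "Z" refers to the Z-order (Morton) curve, which derives a single sort key from multiple columns by interleaving their bits. A toy sketch for two 8-bit values (illustrative of the principle only; Delta Lake's real implementation operates on range-partitioned column values):

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of x and y into one Morton key.

    Sorting rows by this key keeps rows that are close in BOTH
    x and y physically close in the file layout, which is what
    makes data skipping effective for filters on either column.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return key

for point in [(0, 0), (1, 0), (0, 1), (1, 1), (255, 255)]:
    print(point, "->", z_order_key(*point))
# (0,0)->0, (1,0)->1, (0,1)->2, (1,1)->3, (255,255)->65535
```

Note the contrast with a plain ORDER BY (x, y), which clusters well on x but scatters y across the whole table.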
3. Auto Compaction in Delta Lake
Delta Lake supports auto compaction, which automatically triggers compaction after writes. This can be enabled with the table property:
delta.autoOptimize.autoCompact = true
When enabled, a lightweight OPTIMIZE operation runs after each write, compacting small files without manual intervention. This is particularly useful for streaming workloads where many small files are generated continuously.
4. Optimized Writes
Delta Lake also offers optimized writes, which is a complementary feature:
delta.autoOptimize.optimizeWrite = true
Optimized writes attempt to write larger files during the write operation itself, reducing the number of small files created in the first place. This is a preventive measure, whereas OPTIMIZE is a corrective measure.
5. Spark Repartition and Coalesce
Before writing data, you can control file output size using Spark operations:
• dataframe.repartition(n) — Shuffles data into n partitions (causes a full shuffle).
• dataframe.coalesce(n) — Reduces the number of partitions without a full shuffle (more efficient but can only reduce, not increase, partition count).
These are useful when writing to non-Delta formats (Parquet, CSV, etc.) where OPTIMIZE is not available.
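The behavioral difference between the two can be sketched with plain Python lists standing in for partitions (a conceptual model of the contract, not Spark's implementation):

```python
def coalesce(partitions, n):
    """Merge adjacent partitions down to at most n groups without
    redistributing individual rows (no shuffle). Can only reduce."""
    n = min(n, len(partitions))
    per_group = -(-len(partitions) // n)   # ceiling division
    return [sum(partitions[i:i + per_group], [])
            for i in range(0, len(partitions), per_group)]

def repartition(partitions, n):
    """Redistribute every row by hash into exactly n partitions
    (full shuffle). Can increase or decrease the count."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for row in part:
            out[hash(row) % n].append(row)
    return out

parts = [[i] for i in range(100)]         # 100 tiny partitions
print(len(coalesce(parts, 4)))            # 4 output files, no row movement
print(len(repartition(parts, 200)))       # 200 partitions, full shuffle
```

The sketch shows why coalesce is cheaper (it only concatenates existing partitions) and why it cannot increase the partition count, while repartition can go either direction at the cost of moving every row.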
6. VACUUM for Cleanup
After compaction, old small files remain on storage. The VACUUM command removes files that are no longer referenced by the Delta transaction log and are older than a retention threshold (default 7 days):
VACUUM delta_table_name RETAIN 168 HOURS;
VACUUM is a critical companion to OPTIMIZE — without it, old small files continue to consume storage.
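The retention check VACUUM applies can be sketched as follows (pure Python with made-up file records; the real command also reads the transaction log to determine which files the current table version still references):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=168)   # 7-day default retention

def vacuum_candidates(files, referenced, now):
    """Return paths that are BOTH unreferenced by the current
    table version AND older than the retention threshold."""
    cutoff = now - RETENTION
    return [f["path"] for f in files
            if f["path"] not in referenced and f["modified"] < cutoff]

now = datetime(2024, 6, 15)
files = [
    {"path": "part-001.parquet", "modified": now - timedelta(days=30)},  # compacted away long ago
    {"path": "part-002.parquet", "modified": now - timedelta(days=2)},   # too recent: kept for time travel
    {"path": "part-big.parquet", "modified": now - timedelta(days=30)},  # old but still referenced
]
referenced = {"part-big.parquet"}

print(vacuum_candidates(files, referenced, now))   # ['part-001.parquet']
```

The retention window exists so that time travel and in-flight readers of older table versions keep working; shortening it aggressively trades away that safety.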
Key Concepts to Remember
• OPTIMIZE compacts small files into larger ones in Delta Lake tables.
• Z-ORDER can be used with OPTIMIZE for co-locating data for better query performance.
• Auto compaction runs compaction automatically after writes.
• Optimized writes prevent small files from being created in the first place.
• VACUUM cleans up old, unreferenced files after compaction.
• Coalesce is preferred over repartition when reducing file count (less shuffling).
• Repartition is preferred when you need an exact number of output files or need to increase the partition count.
• The default target file size for OPTIMIZE is approximately 1 GB.
• OPTIMIZE is an idempotent operation — running it multiple times has no negative side effects.
• Small file compaction does not change the data; it only changes the physical file layout.
When Does the Small File Problem Occur?
• Streaming ingestion: Micro-batches in Structured Streaming create many small files.
• Frequent appends: ETL jobs that append small batches of data repeatedly.
• High-parallelism writes: Spark jobs with many partitions writing to the same table.
• Over-partitioned tables: Partitioning by a high-cardinality column creates many directories with small files.
Exam Tips: Answering Questions on Small File Compaction
Tip 1: Know the OPTIMIZE Command
If a question describes slow query performance on a Delta Lake table and mentions that data is ingested frequently in small batches, the answer is almost always OPTIMIZE. This is the primary compaction mechanism for Delta Lake.
Tip 2: Distinguish Between OPTIMIZE and VACUUM
OPTIMIZE compacts files. VACUUM deletes old files. They serve different purposes. If a question asks about reclaiming storage space, think VACUUM. If it asks about improving read performance, think OPTIMIZE.
Tip 3: Know When to Use Z-ORDER
If a question mentions that queries frequently filter on specific columns and performance is poor, the answer likely involves OPTIMIZE with Z-ORDER BY. Z-ORDER is about data co-location for better data skipping, not just compaction.
Tip 4: Understand Auto Compaction vs. Optimized Writes
Auto compaction runs AFTER writes to merge small files. Optimized writes work DURING writes to produce fewer, larger files. If the question asks about preventing small files, the answer is optimized writes. If it asks about fixing existing small files, the answer is auto compaction or OPTIMIZE.
Tip 5: Coalesce vs. Repartition
If the question is about reducing the number of output files when writing a DataFrame and minimizing shuffle overhead, choose coalesce. If the question requires redistributing data evenly or increasing partition count, choose repartition.
Tip 6: Watch for Streaming Scenarios
Streaming jobs are a classic source of the small file problem. If a question describes a Structured Streaming job with degraded downstream query performance, think about enabling auto compaction or scheduling periodic OPTIMIZE jobs.
Tip 7: Partition Pruning and File Compaction Are Different
Don't confuse partition pruning (skipping entire partitions based on query filters) with file compaction (merging small files). Some questions may try to conflate these concepts. Compaction helps within a partition; partition pruning helps across partitions.
Tip 8: Remember the Relationship with Data Skipping
Delta Lake uses file-level statistics (min/max values) for data skipping. When files are very small, data skipping is less effective because the statistical ranges per file are narrow but numerous. After compaction (especially with Z-ORDER), data skipping becomes much more effective because related data is co-located.
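The mechanics can be sketched with per-file min/max statistics (the file lists below are hypothetical and purely illustrative of the idea, not Delta's statistics format):

```python
def files_to_scan(file_stats, lo, hi):
    """Return files whose [min, max] value range overlaps the query
    filter [lo, hi]; all other files are skipped without being read."""
    return [path for path, (fmin, fmax) in file_stats.items()
            if fmax >= lo and fmin <= hi]

# After OPTIMIZE ZORDER BY (order_id): few large files, each covering
# a tight, non-overlapping range of the filtered column.
compacted = {
    "part-0.parquet": (1, 1000),
    "part-1.parquet": (1001, 2000),
    "part-2.parquet": (2001, 3000),
}
# Before compaction: many tiny files whose ranges all overlap, so
# almost nothing can be skipped for a range filter.
fragmented = {f"tiny-{i}.parquet": (1, 3000) for i in range(50)}

print(len(files_to_scan(compacted, 1500, 1600)))    # 1 file scanned
print(len(files_to_scan(fragmented, 1500, 1600)))   # 50 files scanned
```

Same data, same filter: the compacted layout reads one file while the fragmented layout reads all fifty, which is the data-skipping effect the tip describes.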
Tip 9: Know the Azure Synapse Context
In Azure Synapse dedicated SQL pools, the concept is slightly different — instead of small files, you deal with fragmented distributions and rowgroups in columnstore indexes. The equivalent maintenance operation is ALTER INDEX REBUILD or ALTER INDEX REORGANIZE. Be aware of which storage engine the question references.
Tip 10: OPTIMIZE Does Not Affect Data Integrity
OPTIMIZE is a safe, non-destructive operation. It does not change data content, only physical file layout. It respects Delta Lake's ACID transactions. If a question implies that compaction might cause data loss, that answer is incorrect.
Summary Table
Problem → Solution
• Many small files in Delta table → OPTIMIZE
• Slow queries with column filters → OPTIMIZE ZORDER BY
• Streaming creates small files → Auto compaction (autoCompact)
• Prevent small files during write → Optimized writes (optimizeWrite)
• Old files consuming storage → VACUUM
• Reduce output file count in Spark → coalesce() or repartition()
• Fragmented columnstore in Synapse → ALTER INDEX REBUILD
Understanding small file compaction is essential for the DP-203 exam because it directly relates to performance optimization, cost management, and data maintenance, all of which fall under the exam's Secure, Monitor, and Optimize Data Storage and Data Processing domain.