Open Table Formats with Apache Iceberg
Open Table Formats (OTFs) are metadata layers that sit on top of file formats like Parquet and ORC in data lakes, enabling database-like capabilities such as ACID transactions, schema evolution, and time travel. Apache Iceberg is one of the most prominent OTFs, alongside Apache Hudi and Delta Lake.

Apache Iceberg is an open-source table format originally developed at Netflix and later donated to the Apache Software Foundation. It addresses key limitations of traditional data lake architectures by providing a robust metadata management layer.

**Key Features of Apache Iceberg:**

1. **ACID Transactions**: Iceberg supports atomic, consistent, isolated, and durable transactions, ensuring reliable concurrent reads and writes without data corruption.
2. **Schema Evolution**: You can add, drop, rename, or reorder columns without rewriting existing data, making schema changes safe and backward-compatible.
3. **Partition Evolution**: Unlike Hive-style partitioning, Iceberg allows you to change partitioning strategies without rewriting data. Iceberg also offers hidden partitioning, where users do not need to know the partition layout to benefit from partition pruning.
4. **Time Travel**: Iceberg maintains snapshots of table states, enabling queries against historical versions of data for auditing, debugging, or rollback purposes.
5. **File-Level Metadata Tracking**: Iceberg tracks individual data files through manifest files and manifest lists, enabling efficient query planning by pruning unnecessary files.

**AWS Integration:** In the AWS ecosystem, Iceberg integrates seamlessly with services like AWS Glue, Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. The AWS Glue Data Catalog serves as the Iceberg catalog, managing table metadata, and AWS Glue ETL jobs natively support reading and writing Iceberg tables, enabling features like compaction, snapshot management, and upserts.

**Why It Matters for Data Engineers:** Iceberg solves the small-files problem through compaction, supports merge-on-read and copy-on-write strategies for updates and deletes, and provides reliable lakehouse architecture capabilities. For the AWS Data Engineer exam, understanding how Iceberg enables transactional data lakes on S3 is essential.
Open Table Formats with Apache Iceberg – A Comprehensive Guide for the AWS Data Engineer Associate Exam
Why Open Table Formats and Apache Iceberg Matter
Traditional data lake storage formats (such as plain Parquet or ORC files organized in Hive-style partitions) suffer from several well-known limitations: lack of ACID transactions, expensive partition discovery, no time-travel capability, and brittle schema evolution. Open Table Formats were created to solve these problems by adding a rich metadata layer on top of commodity file formats. Apache Iceberg is one of the most prominent open table formats and is deeply integrated into the AWS analytics ecosystem — making it a critical topic for the AWS Data Engineer Associate (DEA-C01) exam.
What Is an Open Table Format?
An open table format defines how data files, metadata files, and manifest files are organized in object storage (such as Amazon S3) so that query engines can treat a collection of files as a proper table — complete with schema, partitioning, transactions, and versioning. The three most common open table formats are:
• Apache Iceberg – Originally developed at Netflix, donated to the Apache Software Foundation.
• Apache Hudi – Originally developed at Uber.
• Delta Lake – Originally developed by Databricks.
AWS services such as Amazon Athena, Amazon EMR, AWS Glue, and Amazon Redshift Spectrum all support Apache Iceberg tables, making it a first-class citizen in the AWS data lake architecture.
What Is Apache Iceberg?
Apache Iceberg is an open table format specification for huge analytic datasets. It stores data in standard columnar formats (Parquet, ORC, or Avro) while maintaining a separate metadata tree that tracks every change to the table. Key characteristics include:
• ACID Transactions: Concurrent readers and writers can safely operate on the same table without conflicts or dirty reads. Iceberg uses optimistic concurrency control.
• Schema Evolution: You can add, drop, rename, or reorder columns without rewriting existing data files. Changes are tracked in the metadata layer.
• Hidden Partitioning: Partition transforms (e.g., day(timestamp), bucket(id, 16)) are defined at the table level. Users write queries using original column names and Iceberg automatically applies partition pruning — eliminating error-prone manual partition management.
• Partition Evolution: You can change the partitioning scheme of a table without rewriting historical data. New data is written with the new partition spec while old data retains the old spec.
• Time Travel & Snapshot Isolation: Every commit creates a new snapshot. You can query the table as of any previous snapshot ID or timestamp, enabling reproducible analytics, auditing, and rollback.
• Compaction & Data Maintenance: Iceberg supports operations like file compaction, orphan file removal, and expire snapshots to optimize storage and query performance.
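As a rough illustration of how partition transforms work, here is a pure-Python sketch. The function names mirror Iceberg's day() and bucket() transforms, but the CRC32 hash is a stand-in for Iceberg's actual 32-bit Murmur3 hash, so the bucket assignments are illustrative only:

```python
from datetime import datetime, timezone
import zlib

def day_transform(ts: datetime) -> str:
    """day(ts): map a timestamp to its date partition value."""
    return ts.date().isoformat()

def bucket_transform(value, num_buckets: int) -> int:
    """Sketch of bucket(col, N): hash the value, then mod N.
    Real Iceberg uses a 32-bit Murmur3 hash; CRC32 stands in here."""
    return zlib.crc32(str(value).encode()) % num_buckets

# Users filter on the raw column; the engine applies the transform for pruning.
event_time = datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc)
print(day_transform(event_time))           # 2024-05-17
print(0 <= bucket_transform(42, 16) < 16)  # True
```

Because the transform is declared on the table, a query filtering on `event_time` is automatically pruned to the matching day partition without the user ever referencing a partition column.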
How Apache Iceberg Works — Architecture Deep Dive
Iceberg organizes table information into four layers:
1. Metadata File (metadata.json)
This is the source of truth for an Iceberg table. It contains the table schema, partition spec, sort order, current snapshot pointer, and a history of all snapshots. Each write operation produces a new metadata file.
2. Manifest List (snap-*.avro)
Each snapshot references a manifest list file. The manifest list is an Avro file that enumerates all manifest files belonging to that snapshot along with partition-level statistics (min/max values, row counts).
3. Manifest Files (*.avro)
Each manifest file is an Avro file that tracks a subset of data files. It contains per-file metadata: file path, file format, partition values, column-level statistics (min, max, null count, value count), and file size. This allows engines to skip entire manifest files and data files during query planning — a process called manifest pruning and data file pruning.
4. Data Files (*.parquet, *.orc, *.avro)
The actual data stored in columnar (or row) format in Amazon S3.
When a query is executed:
• The engine reads the current metadata file to find the latest snapshot.
• It reads the manifest list to discover manifest files, pruning irrelevant ones using partition statistics.
• It reads relevant manifest files to discover data files, pruning irrelevant files using column statistics.
• It reads only the necessary data files — dramatically reducing I/O.
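The planning steps above can be sketched with a toy model of the metadata tree (the dictionary layout is a simplification for illustration, not Iceberg's actual Avro schemas):

```python
# Toy metadata tree: snapshot -> manifest list -> manifests -> data files.
# Each manifest carries partition bounds; each data file carries column stats.
manifest_list = [
    {"path": "m1.avro", "partition_min": "2024-01-01", "partition_max": "2024-01-31",
     "files": [
         {"path": "a.parquet", "col_min": 0, "col_max": 50},
         {"path": "b.parquet", "col_min": 51, "col_max": 100},
     ]},
    {"path": "m2.avro", "partition_min": "2024-02-01", "partition_max": "2024-02-29",
     "files": [{"path": "c.parquet", "col_min": 0, "col_max": 100}]},
]

def plan(partition_day: str, col_value: int) -> list:
    """Prune manifests by partition bounds, then data files by column stats."""
    chosen = []
    for manifest in manifest_list:
        if not (manifest["partition_min"] <= partition_day <= manifest["partition_max"]):
            continue  # manifest pruning: skip the whole manifest
        for f in manifest["files"]:
            if f["col_min"] <= col_value <= f["col_max"]:
                chosen.append(f["path"])  # data file pruning via min/max stats
    return chosen

print(plan("2024-01-15", 42))  # ['a.parquet']
```

A query for January 15 with a predicate value of 42 touches a single data file: the February manifest is skipped entirely, and within the January manifest only the file whose min/max range covers 42 is read.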
Apache Iceberg on AWS — Key Integrations
Amazon Athena: Athena engine version 3 natively supports creating, querying, and managing Iceberg tables. You create them with CREATE TABLE ... TBLPROPERTIES ('table_type' = 'ICEBERG'), and can perform INSERT, UPDATE, DELETE, MERGE INTO, time-travel queries (FOR TIMESTAMP AS OF / FOR VERSION AS OF), and maintenance operations like OPTIMIZE (compaction) and VACUUM (expire snapshots and remove orphan files). Athena stores Iceberg table metadata in the AWS Glue Data Catalog.
AWS Glue: AWS Glue ETL jobs (Spark-based) can read and write Iceberg tables. The Glue Data Catalog serves as the Iceberg catalog, storing the pointer to the current metadata file. Glue crawlers can also catalog Iceberg tables. Glue supports Iceberg's ACID writes, schema evolution, and compaction within ETL pipelines.
Amazon EMR: EMR clusters running Spark, Trino, or Presto can interact with Iceberg tables. EMR provides optimized Iceberg connectors and integrates with the Glue Data Catalog as the metastore.
Amazon Redshift Spectrum: Redshift can query Iceberg tables stored in S3 via Spectrum, enabling seamless lakehouse queries that join Redshift warehouse data with Iceberg-managed lake data.
Amazon S3: Iceberg data and metadata files reside in S3. S3's consistency model (strong read-after-write consistency) aligns well with Iceberg's atomic metadata file swaps.
Key Concepts for the Exam
Compaction (OPTIMIZE): Over time, many small files accumulate from streaming or frequent batch writes. Compaction rewrites small files into larger ones to improve read performance. In Athena, you run OPTIMIZE table_name REWRITE DATA USING BIN_PACK.
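Conceptually, bin packing groups small files into output files near a target size. Here is a greedy sketch of the idea; the 512 MB target and the file sizes are made-up values, and real compaction planners are more sophisticated:

```python
# Sketch of BIN_PACK compaction: greedily pack small files into output
# groups near a target size. Sizes are in MB and purely illustrative.
TARGET_MB = 512

def bin_pack(file_sizes_mb, target=TARGET_MB):
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if total + size > target and current:
            groups.append(current)   # close the current output file
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

small_files = [300, 200, 180, 120, 90, 60, 40]  # e.g. from streaming ingestion
groups = bin_pack(small_files)
print(len(groups), [sum(g) for g in groups])  # 2 [500, 490]
```

Seven small files become two files near the target size, which is exactly the effect OPTIMIZE aims for: fewer S3 GET requests and less per-file planning overhead per query.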
Snapshot Expiration (VACUUM): Old snapshots consume storage. VACUUM or expire_snapshots removes metadata and data files that are no longer referenced by any retained snapshot. This is essential for cost optimization.
MERGE INTO: Iceberg supports upsert operations. This is critical for CDC (Change Data Capture) workloads where you need to apply inserts, updates, and deletes from a source system into the data lake.
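The semantics of MERGE INTO for CDC can be sketched as an upsert over a keyed table. The event shape and op codes below are illustrative, not an actual DMS format:

```python
# Sketch of MERGE INTO semantics for CDC: apply insert/update/delete
# events from a source stream to a target table keyed by id.
target = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}

cdc_events = [
    {"op": "U", "id": 2, "name": "robert"},  # update existing row
    {"op": "D", "id": 1},                    # delete row
    {"op": "I", "id": 3, "name": "carol"},   # insert new row
]

def merge_into(table, events):
    for e in events:
        if e["op"] == "D":
            table.pop(e["id"], None)   # WHEN MATCHED ... THEN DELETE
        else:
            row = {k: v for k, v in e.items() if k != "op"}
            table[e["id"]] = row       # MATCHED -> UPDATE / NOT MATCHED -> INSERT
    return table

print(merge_into(target, cdc_events))
```

The key point for the exam is that a single MERGE INTO statement applies all three change types transactionally, so the lake table converges to the source system's current state.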
Copy-on-Write vs. Merge-on-Read: Iceberg supports two write modes. Copy-on-Write (COW) rewrites affected data files on every update/delete — optimized for read-heavy workloads. Merge-on-Read (MOR) writes delete files alongside data files and merges at read time — optimized for write-heavy workloads. The exam may ask you to choose the appropriate mode based on workload patterns.
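A minimal sketch of the difference, assuming a positional-style delete file for MOR (real Iceberg delete files are more structured):

```python
# Contrast of the two write modes. COW rewrites the data file on delete;
# MOR records the delete separately and filters at read time.
data_file = [{"id": 1}, {"id": 2}, {"id": 3}]

# Copy-on-Write: deleting id=2 rewrites the whole data file immediately.
cow_file = [row for row in data_file if row["id"] != 2]

# Merge-on-Read: deleting id=2 just appends to a small delete file;
# the expensive merge is deferred to read time.
mor_delete_file = [{"deleted_id": 2}]

def mor_read(rows, delete_file):
    deleted = {d["deleted_id"] for d in delete_file}
    return [r for r in rows if r["id"] not in deleted]

assert cow_file == mor_read(data_file, mor_delete_file)  # same logical result
```

Both modes yield the same query result; the trade-off is purely where the work happens: COW pays at write time (cheap reads), MOR pays at read time (cheap writes).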
Row-Level Operations: Unlike Hive tables, Iceberg supports row-level UPDATE and DELETE operations, enabling lakehouse patterns that were previously only possible in data warehouses.
Catalog: Iceberg needs a catalog to store the pointer to the latest metadata.json. On AWS, the AWS Glue Data Catalog is the standard Iceberg catalog. This is important because it provides a centralized, managed metastore.
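The catalog's role can be sketched as an atomic compare-and-swap on the current-metadata pointer, which is how Iceberg's optimistic concurrency resolves conflicting commits (the class and file names below are hypothetical):

```python
# Sketch of the catalog's job: hold one pointer to the current metadata
# file and swap it atomically, so exactly one concurrent writer wins.
class Catalog:
    def __init__(self, pointer):
        # e.g. "s3://bucket/table/metadata/v1.metadata.json" (illustrative path)
        self.pointer = pointer

    def commit(self, expected, new):
        """Compare-and-swap: succeed only if no other writer committed first."""
        if self.pointer != expected:
            return False  # conflict: caller must retry on top of new metadata
        self.pointer = new
        return True

cat = Catalog("v1.metadata.json")
print(cat.commit("v1.metadata.json", "v2.metadata.json"))   # True: this commit wins
print(cat.commit("v1.metadata.json", "v2b.metadata.json"))  # False: stale writer retries
```

The losing writer does not corrupt anything; it simply re-reads the new current metadata and reattempts its commit, which is the optimistic concurrency control mentioned earlier.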
Iceberg vs. Hudi vs. Delta Lake (Exam Relevance)
While the exam is not deeply comparative, you should know:
• Iceberg is the preferred open table format across most AWS analytics services (Athena, EMR, Glue, Redshift).
• Hudi is well-supported on EMR and is optimized for incremental data processing and CDC use cases.
• Delta Lake is primarily associated with Databricks but can also be used on EMR.
• When the exam mentions "open table format" or "ACID transactions on S3" or "time travel on a data lake" — think Apache Iceberg first (unless the scenario specifically mentions Hudi or Delta).
Common Use Cases Tested in the Exam
1. Building a Lakehouse Architecture: Using Iceberg tables in S3 with Athena/Redshift for combined warehouse and lake workloads.
2. CDC Ingestion: Streaming CDC events from DMS into S3, then using MERGE INTO on Iceberg tables to apply changes.
3. Data Compliance (GDPR): Using row-level DELETE on Iceberg tables to remove individual records for regulatory compliance — something that was extremely difficult with traditional Hive-style tables.
4. Cost Optimization: Running compaction (OPTIMIZE) and snapshot expiration (VACUUM) to reduce storage costs and improve query performance.
5. Schema Changes Without Downtime: Evolving schemas (adding/removing columns) without rewriting existing data.
6. Reproducible Analytics: Using time-travel queries to access historical versions of a table for auditing or debugging.
Exam Tips: Answering Questions on Open Table Formats with Apache Iceberg
1. "ACID on S3" = Iceberg (or Hudi/Delta): Whenever a question asks about ACID transactions on Amazon S3, the answer involves an open table format. If the question mentions Athena or Glue, Apache Iceberg is almost always the best choice.
2. Time Travel Keywords: If a question mentions querying historical data, point-in-time queries, or rolling back to a previous version of a table, think Iceberg snapshots and time travel. Remember the Athena syntax: SELECT * FROM table FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00'.
3. Row-Level Deletes for Compliance: Questions about GDPR, CCPA, or right-to-be-forgotten on a data lake point to Iceberg's row-level DELETE capability. Traditional Hive tables cannot do this efficiently.
4. Small File Problem: If a scenario describes degraded query performance due to many small files from streaming ingestion, the answer is to run compaction (OPTIMIZE with BIN_PACK in Athena, or rewrite_data_files in Spark).
5. Hidden Partitioning: If a question describes partition-related user errors (e.g., users forgetting to filter on the partition column), Iceberg's hidden partitioning is the solution — the partitioning is transparent to the user.
6. Schema Evolution Without Rewrite: If a question asks about adding columns to a large existing table without downtime or data rewriting, Iceberg's schema evolution is the answer.
7. Glue Data Catalog as Iceberg Catalog: Remember that on AWS, the Glue Data Catalog is the standard catalog for Iceberg tables. Questions about centralized metadata management for Iceberg tables point to this.
8. MERGE INTO for Upserts: CDC or upsert scenarios on a data lake should make you think of Iceberg's MERGE INTO statement. This is supported in both Athena and Spark on EMR/Glue.
9. Copy-on-Write vs. Merge-on-Read: If a question specifies a read-heavy workload with infrequent updates, choose Copy-on-Write. For write-heavy workloads with frequent updates, choose Merge-on-Read.
10. VACUUM / Expire Snapshots for Cost: Questions about reducing S3 storage costs for Iceberg tables point to expiring old snapshots and removing orphan files.
11. Eliminate Wrong Answers: If an answer choice suggests recreating or rewriting the entire table to change the schema or partitioning, it is likely wrong — Iceberg supports schema evolution and partition evolution without full rewrites.
12. Know the Service Pairings: Athena + Iceberg + Glue Data Catalog is the most commonly tested combination. If a question mentions a serverless query engine on S3 with ACID support, this trio is the answer.
13. Distinguish Iceberg from Plain Parquet: Plain Parquet files on S3 queried via Athena do NOT support ACID, time travel, or row-level mutations. If the question requires any of these features, the table must use an open table format like Iceberg.
14. Watch for "Open Table Format" Phrasing: The exam may not always say "Iceberg" explicitly. It may use terms like "open table format," "table format with ACID," or "lakehouse table format." Recognize these as references to Iceberg (or Hudi/Delta) and apply the appropriate concepts.
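As a closing illustration of tip 2, FOR TIMESTAMP AS OF resolution amounts to picking the latest snapshot committed at or before the requested time (the snapshot IDs and timestamps below are made up):

```python
# Sketch of FOR TIMESTAMP AS OF: choose the newest snapshot whose commit
# time is <= the requested timestamp, or nothing if none qualifies.
snapshots = [
    {"id": 101, "committed_at": "2024-01-01T00:00:00"},
    {"id": 102, "committed_at": "2024-01-15T00:00:00"},
    {"id": 103, "committed_at": "2024-02-01T00:00:00"},
]

def snapshot_as_of(ts):
    eligible = [s for s in snapshots if s["committed_at"] <= ts]
    return max(eligible, key=lambda s: s["committed_at"]) if eligible else None

print(snapshot_as_of("2024-01-20T00:00:00")["id"])  # 102
```

A query as of January 20 resolves to snapshot 102, the last commit before that instant, which is why time travel gives reproducible results even while new data keeps arriving.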
By understanding these core concepts and exam strategies, you will be well-prepared to tackle any question about Apache Iceberg and open table formats on the AWS Data Engineer Associate exam.