Delta Lake Read and Write Operations
Why Are Delta Lake Read and Write Operations Important?
Delta Lake is a cornerstone technology in the Azure data engineering ecosystem, and understanding its read and write operations is critical for the DP-203 (Data Engineering on Microsoft Azure) exam. Delta Lake provides ACID transaction guarantees on top of data lakes, which solves many of the reliability and consistency challenges that traditional data lakes face. As an Azure Data Engineer, you will frequently need to read from and write to Delta tables in Azure Synapse Analytics, Azure Databricks, and Azure Data Lake Storage. Mastering these operations ensures you can build robust, performant, and reliable data pipelines.
Delta Lake read and write operations are important because they:
- Enable ACID transactions on data lakes, preventing partial writes and ensuring data consistency
- Support schema enforcement and schema evolution, protecting data quality
- Allow time travel (querying historical versions of data)
- Provide optimized reads through data skipping and Z-ordering
- Support both batch and streaming workloads seamlessly
What Are Delta Lake Read and Write Operations?
Delta Lake read and write operations refer to the various methods and APIs used to interact with Delta tables stored in a data lake (typically Azure Data Lake Storage Gen2). Delta Lake uses the Parquet file format under the hood but adds a transaction log (the _delta_log directory) that tracks every change made to the table.
Write Operations include:
1. CREATE TABLE / WRITE: Creating a new Delta table or writing data to an existing one.
2. INSERT: Adding new rows to a Delta table.
3. UPDATE: Modifying existing rows based on a condition.
4. DELETE: Removing rows based on a condition.
5. MERGE (UPSERT): Combining insert, update, and delete operations in a single atomic operation based on matching conditions.
6. OVERWRITE: Replacing the entire table or specific partitions with new data.
Read Operations include:
1. Standard Reads: Reading the current version of a Delta table.
2. Time Travel Reads: Reading a specific historical version of a Delta table using version number or timestamp.
3. Streaming Reads: Using Delta as a streaming source with Structured Streaming.
How Do Delta Lake Read and Write Operations Work?
1. Writing Data to Delta Tables
Using DataFrame API (PySpark):
To create or overwrite a Delta table:
df.write.format("delta").mode("overwrite").save("/mnt/delta/my_table")
To append data:
df.write.format("delta").mode("append").save("/mnt/delta/my_table")
To write to a managed Delta table:
df.write.format("delta").saveAsTable("my_database.my_table")
Save Modes:
- overwrite: Replaces the existing data entirely
- append: Adds new data to the existing table
- ignore: Does nothing if the table already exists
- error/errorifexists: Throws an error if the table already exists (default)
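These save-mode semantics can be sketched in plain Python. This is a conceptual simulation, not the Spark API; the in-memory `tables` dict and the `write_table` helper are hypothetical names used only for illustration:

```python
# Conceptual simulation of Delta save-mode semantics (not the Spark API).
tables = {}  # table path -> list of rows

def write_table(path, rows, mode="error"):
    exists = path in tables
    if mode == "overwrite":
        tables[path] = list(rows)                 # replace existing data entirely
    elif mode == "append":
        tables.setdefault(path, []).extend(rows)  # add to existing data
    elif mode == "ignore":
        if not exists:
            tables[path] = list(rows)             # write only if the table is new
    elif mode in ("error", "errorifexists"):
        if exists:
            raise ValueError(f"Table already exists: {path}")
        tables[path] = list(rows)
    else:
        raise ValueError(f"Unknown save mode: {mode}")

write_table("/mnt/delta/t", [1, 2], mode="overwrite")
write_table("/mnt/delta/t", [3], mode="append")   # table now holds [1, 2, 3]
write_table("/mnt/delta/t", [9], mode="ignore")   # no-op: table already exists
```

The default `error` mode is the safest for pipelines that should never silently clobber or duplicate data.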
2. MERGE (Upsert) Operations
The MERGE operation is one of the most powerful features of Delta Lake. It allows you to match source data against a target Delta table and perform insert, update, or delete actions conditionally.
SQL Syntax:
MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
WHEN MATCHED AND s.action = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
Python DeltaTable API:
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/delta/my_table")
(deltaTable.alias("t")
    .merge(sourceDF.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
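The WHEN MATCHED / WHEN NOT MATCHED logic can be illustrated with a plain-Python upsert sketch. This is a conceptual model of MERGE semantics, not the DeltaTable API; the dict-of-rows layout keyed by `id` is an assumption made for illustration:

```python
# Conceptual MERGE (upsert): match source rows against target rows by key.
def merge(target, source):
    """target/source: dicts mapping id -> {"value": ..., "action": ...}."""
    for key, src in source.items():
        if key in target:
            if src.get("action") == "DELETE":
                del target[key]                      # WHEN MATCHED AND action = 'DELETE' THEN DELETE
            else:
                target[key]["value"] = src["value"]  # WHEN MATCHED THEN UPDATE
        elif src.get("action") != "DELETE":
            target[key] = {"value": src["value"]}    # WHEN NOT MATCHED THEN INSERT
    return target

target = {1: {"value": "a"}, 2: {"value": "b"}}
source = {
    2: {"value": "B"},                       # matched -> updated
    3: {"value": "c"},                       # not matched -> inserted
    1: {"value": None, "action": "DELETE"},  # matched with DELETE flag -> removed
}
merge(target, source)
```

Because Delta runs the real MERGE as a single atomic commit, readers never observe a state where only some of these actions have applied.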
3. UPDATE and DELETE Operations
SQL:
UPDATE my_table SET status = 'inactive' WHERE last_login < '2023-01-01'
DELETE FROM my_table WHERE status = 'inactive'
Python:
deltaTable.update(condition="last_login < '2023-01-01'", set={"status": "'inactive'"})
deltaTable.delete(condition="status = 'inactive'")
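The effect of these conditional statements can be mimicked over an ordinary list of row dicts. This is a sketch of the semantics only, not the Delta API; the sample rows are invented:

```python
# Conceptual UPDATE then DELETE over a list of row dicts (not the Delta API).
rows = [
    {"user": "ann", "last_login": "2022-06-01", "status": "active"},
    {"user": "bob", "last_login": "2024-03-15", "status": "active"},
]

# UPDATE my_table SET status = 'inactive' WHERE last_login < '2023-01-01'
for r in rows:
    if r["last_login"] < "2023-01-01":
        r["status"] = "inactive"

# DELETE FROM my_table WHERE status = 'inactive'
rows = [r for r in rows if r["status"] != "inactive"]
```

Under the hood Delta does not modify Parquet files in place: it rewrites the affected files and records the change as a new version in the transaction log.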
4. Reading Data from Delta Tables
Standard Read:
df = spark.read.format("delta").load("/mnt/delta/my_table")
# or
df = spark.table("my_database.my_table")
Time Travel by Version:
df = spark.read.format("delta").option("versionAsOf", 3).load("/mnt/delta/my_table")
Time Travel by Timestamp:
df = spark.read.format("delta").option("timestampAsOf", "2023-10-01").load("/mnt/delta/my_table")
SQL Time Travel:
SELECT * FROM my_table VERSION AS OF 3
SELECT * FROM my_table TIMESTAMP AS OF '2023-10-01'
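Time travel falls out of the versioned transaction log: every commit produces an immutable snapshot, and a read simply picks which snapshot to serve. A minimal sketch of that lookup logic, with invented snapshot data (not the Delta implementation):

```python
# Conceptual time travel: each commit appends an immutable snapshot with a
# version number and timestamp; a read selects the current version, an exact
# versionAsOf, or the latest version at-or-before a timestampAsOf.
snapshots = [
    {"version": 0, "timestamp": "2023-09-01", "data": ["a"]},
    {"version": 1, "timestamp": "2023-09-20", "data": ["a", "b"]},
    {"version": 2, "timestamp": "2023-10-05", "data": ["a", "b", "c"]},
]

def read(version_as_of=None, timestamp_as_of=None):
    if version_as_of is not None:
        return next(s["data"] for s in snapshots if s["version"] == version_as_of)
    if timestamp_as_of is not None:
        eligible = [s for s in snapshots if s["timestamp"] <= timestamp_as_of]
        return eligible[-1]["data"]   # latest version at or before the timestamp
    return snapshots[-1]["data"]      # current version

read()                              # ["a", "b", "c"]
read(version_as_of=1)               # ["a", "b"]
read(timestamp_as_of="2023-10-01")  # ["a", "b"]  (version 1 was latest then)
```

Note the timestamp lookup resolves to the last commit at or before the given time, which is why a timestamp between two commits returns the earlier version.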
5. Streaming Read and Write
Delta Lake integrates seamlessly with Spark Structured Streaming.
Streaming Read (Delta as Source):
streamDF = spark.readStream.format("delta").load("/mnt/delta/my_table")
Streaming Write (Delta as Sink):
streamDF.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoints/") \
    .start("/mnt/delta/output_table")
Key points about streaming with Delta:
- Delta as a streaming source provides exactly-once processing guarantees
- The checkpointLocation is mandatory for streaming writes
- You can use trigger(once=True) or trigger(availableNow=True) for incremental batch processing
- maxFilesPerTrigger and maxBytesPerTrigger options control the read rate
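The role of the checkpoint in those guarantees can be sketched in plain Python. This is a conceptual model of offset tracking, not Structured Streaming itself; the `log`, `checkpoint`, and `run_micro_batch` names are invented for illustration:

```python
# Conceptual sketch of checkpointed incremental reads: the checkpoint persists
# the last committed offset, so a restarted stream resumes exactly where it
# left off instead of reprocessing committed data.
log = ["f1", "f2", "f3", "f4"]  # files added to the Delta table, in commit order
checkpoint = {"offset": 0}      # stored in checkpointLocation in real Spark
processed = []

def run_micro_batch(max_files_per_trigger=2):
    start = checkpoint["offset"]
    # rate-limited slice, analogous to maxFilesPerTrigger
    batch = log[start:start + max_files_per_trigger]
    processed.extend(batch)
    checkpoint["offset"] = start + len(batch)  # commit the new offset

run_micro_batch()   # processes f1, f2
run_micro_batch()   # processes f3, f4
run_micro_batch()   # no new files: processes nothing
```

This is also why deleting or reusing a checkpoint directory is dangerous: the stream loses its record of what has already been processed.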
6. Schema Enforcement and Evolution
Schema Enforcement (Schema on Write): By default, Delta Lake rejects writes that do not match the table's schema. This prevents data corruption.
Schema Evolution: You can enable automatic schema evolution by setting:
df.write.format("delta").option("mergeSchema", "true").mode("append").save("/mnt/delta/my_table")
Or for overwrite with schema replacement:
df.write.format("delta").option("overwriteSchema", "true").mode("overwrite").save("/mnt/delta/my_table")
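The enforcement-versus-evolution decision can be reduced to a small column-set check. A conceptual sketch only, not Delta's validation code; `table_schema` and `append` are hypothetical names:

```python
# Conceptual schema enforcement vs. mergeSchema (not the Delta API).
table_schema = {"id", "name"}

def append(row_columns, merge_schema=False):
    global table_schema
    extra = row_columns - table_schema
    if extra and not merge_schema:
        # schema enforcement: reject writes with unexpected columns
        raise ValueError(f"Schema mismatch, unexpected columns: {sorted(extra)}")
    if merge_schema:
        # schema evolution: the table schema gains the new columns
        table_schema = table_schema | extra

append({"id", "name"})                               # matches the schema: accepted
append({"id", "name", "email"}, merge_schema=True)   # schema evolves to include email
```

The key distinction: `mergeSchema` only adds columns during an append, while `overwriteSchema` replaces the schema wholesale and therefore requires overwrite mode.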
7. Partition Overwrite
Delta supports dynamic partition overwrite, which replaces only the partitions present in the new data:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.format("delta").mode("overwrite").save("/mnt/delta/partitioned_table")
Alternatively, use replaceWhere to conditionally overwrite specific partitions:
df.write.format("delta").option("replaceWhere", "date = '2023-10-01'").mode("overwrite").save("/mnt/delta/my_table")
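replaceWhere behaves like an atomic "delete matching rows, then insert the new data", with a guard that the new data itself satisfies the predicate. A minimal sketch of those semantics (not the Delta implementation; `overwrite_where` and the sample rows are invented):

```python
# Conceptual replaceWhere: atomically replace only the rows matching the
# predicate; by default, all incoming rows must satisfy the predicate.
def overwrite_where(table, new_rows, predicate):
    if not all(predicate(r) for r in new_rows):
        raise ValueError("replaceWhere: new data contains rows outside the predicate")
    kept = [r for r in table if not predicate(r)]  # rows untouched by the overwrite
    return kept + new_rows

table = [{"date": "2023-09-30", "v": 1}, {"date": "2023-10-01", "v": 2}]
table = overwrite_where(
    table,
    [{"date": "2023-10-01", "v": 99}],
    lambda r: r["date"] == "2023-10-01",
)
# only the 2023-10-01 partition is replaced; 2023-09-30 is untouched
```

This is what makes replaceWhere a good idempotency tool: re-running the same daily load replaces that day's partition with identical results instead of duplicating it.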
8. Transaction Log and Versioning
Every write operation in Delta Lake creates a new version in the transaction log (_delta_log). Each version is represented as a zero-padded JSON file (e.g., 00000000000000000001.json). By default, every 10 commits a checkpoint Parquet file is created to speed up log reading. You can view the history using:
DESCRIBE HISTORY my_table
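The naming scheme described above can be sketched in a few lines. This only illustrates the version-numbering convention, not real _delta_log output; `commit_filename` is a hypothetical helper:

```python
# Conceptual transaction log naming: each commit writes a zero-padded JSON
# file, and every 10th commit also writes a checkpoint Parquet file.
def commit_filename(version):
    return f"{version:020d}.json"   # 20-digit zero-padded version number

log_files = []
for version in range(12):
    log_files.append(commit_filename(version))
    if version > 0 and version % 10 == 0:
        log_files.append(f"{version:020d}.checkpoint.parquet")

commit_filename(1)   # "00000000000000000001.json"
```

Readers reconstruct the table state by loading the latest checkpoint and replaying only the JSON commits after it, which keeps log reads fast even for long-lived tables.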
9. VACUUM and Table Maintenance
After many write operations, old data files accumulate. The VACUUM command removes files no longer referenced by the transaction log:
VACUUM my_table RETAIN 168 HOURS
The default retention period is 7 days (168 hours). Warning: Setting retention below 7 days requires disabling a safety check and may break time travel for versions older than the retention period.
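The retention rule can be expressed as a simple filter over data files. A conceptual sketch only (real VACUUM inspects the transaction log and file modification times; `vacuum` and the file tuples here are invented):

```python
# Conceptual VACUUM: remove files that are unreferenced by the transaction
# log AND older than the retention window; everything else survives.
RETENTION_HOURS = 168  # default retention: 7 days

def vacuum(files, now_hours, retention_hours=RETENTION_HOURS):
    """files: list of (name, written_at_hour, still_referenced)."""
    return [
        (name, written_at, referenced)
        for name, written_at, referenced in files
        if referenced or (now_hours - written_at) < retention_hours
    ]

files = [
    ("old.parquet", 0, False),    # unreferenced and past retention: removed
    ("live.parquet", 0, True),    # still referenced: always kept
    ("new.parquet", 200, False),  # unreferenced but within retention: kept
]
files = vacuum(files, now_hours=300)
```

Once a file like old.parquet is gone, any historical version that referenced it can no longer be read, which is exactly how VACUUM limits time travel.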
10. OPTIMIZE and Z-ORDER
To improve read performance after many small writes:
OPTIMIZE my_table ZORDER BY (column_name)
This compacts small files and co-locates related data for faster queries.
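The compaction half of OPTIMIZE is essentially bin-packing many small files into fewer large ones. A conceptual sketch under the assumption of a fixed target file size (`compact` and the 1 GB target are illustrative, not the real implementation):

```python
# Conceptual OPTIMIZE compaction: bin-pack small files up to a target size
# so queries open far fewer files.
def compact(file_sizes_mb, target_mb=1024):
    bins, current = [], 0
    for size in sorted(file_sizes_mb):
        if current + size > target_mb and current > 0:
            bins.append(current)  # close the full output file
            current = 0
        current += size
    if current:
        bins.append(current)
    return bins  # sizes of the compacted output files

compact([100] * 12, target_mb=1024)   # 12 small files -> 2 compacted files
```

Z-ORDER goes further than compaction: within the rewritten files it sorts data along the specified columns so data skipping can prune files during filtered reads.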
Exam Tips: Answering Questions on Delta Lake Read and Write Operations
1. Know the MERGE syntax thoroughly: The DP-203 exam frequently tests your understanding of MERGE (UPSERT) operations. Be comfortable with SQL syntax and the Python DeltaTable API. Understand WHEN MATCHED, WHEN NOT MATCHED, and WHEN NOT MATCHED BY SOURCE clauses.
2. Understand save modes: Know the difference between overwrite, append, ignore, and error. Exam questions may present scenarios where the wrong save mode leads to data loss or duplication.
3. Time travel is a favorite topic: Expect questions about how to query historical data using versionAsOf and timestampAsOf. Remember that VACUUM removes the ability to time travel to versions older than the retention period.
4. Schema enforcement vs. evolution: Know that Delta Lake enforces schema by default and rejects mismatched writes. Understand when to use mergeSchema (add new columns during append) vs. overwriteSchema (completely replace schema during overwrite).
5. Streaming + Delta: Understand that Delta can serve as both a streaming source and a streaming sink. Remember that checkpointLocation is required for streaming writes. Know that Delta streaming provides exactly-once guarantees.
6. VACUUM retention: Remember the default retention is 7 days (168 hours). Questions may ask what happens if you VACUUM with 0 hours retention — it breaks time travel and may affect concurrent readers.
7. Partition strategies: Know how replaceWhere and dynamic partition overwrite work. These are common in scenarios involving incremental data loads into partitioned tables.
8. OPTIMIZE and Z-ORDER: Understand that OPTIMIZE compacts small files and Z-ORDER organizes data by specified columns. This is important for read performance optimization questions.
9. Watch for ACID transaction questions: Delta Lake guarantees atomicity — if a write fails midway, the partial data is not visible. This is a key differentiator from plain Parquet files.
10. Distinguish Delta from Parquet: If a question asks about UPDATE, DELETE, MERGE, or time travel capabilities, the answer almost certainly involves Delta Lake, not plain Parquet, as Parquet does not natively support these operations.
11. Practice scenario-based questions: The exam often presents a business scenario (e.g., "You need to update existing records and insert new records from a daily feed") and expects you to identify MERGE as the correct approach.
12. Remember CREATE TABLE AS SELECT (CTAS): You can create Delta tables using CTAS syntax: CREATE TABLE my_delta_table USING DELTA AS SELECT * FROM source_table. This is a common pattern in exam questions.
13. Synapse Serverless SQL Pool: Remember that Synapse serverless SQL pool can read Delta Lake tables using OPENROWSET, but it has limited write support. Most write operations are performed through Spark pools or Databricks.
14. Idempotent writes: In pipeline design questions, look for patterns that ensure idempotent writes. MERGE operations and replaceWhere with overwrite mode are common strategies for making pipelines re-runnable without data duplication.