Batch and Stream Processing Overview – DP-900 Exam Guide
Why Is Batch and Stream Processing Important?
In today's data-driven world, organizations must process massive volumes of data to gain timely insights and make informed decisions. Understanding how data is processed — whether in large scheduled chunks or in real time as it arrives — is fundamental to designing efficient data solutions. The DP-900 (Microsoft Azure Data Fundamentals) exam tests your understanding of these core data processing paradigms because they underpin virtually every modern data architecture, from simple reporting pipelines to complex IoT and analytics platforms.
What Is Batch Processing?
Batch processing is a method of processing data in which large volumes of data are collected, stored, and then processed together as a single group (or batch) at a scheduled time or when a certain data threshold is reached.
Key Characteristics of Batch Processing:
• Data is collected over a period of time before being processed.
• Processing typically occurs on a set schedule (e.g., hourly, daily, weekly).
• High throughput — large datasets are processed efficiently.
• Latency is higher because results are not available until the entire batch completes.
• Well-suited for complex transformations, aggregations, and historical analysis.
• Examples: End-of-day financial reconciliation, monthly payroll processing, ETL (Extract, Transform, Load) jobs that populate data warehouses.
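The "collect, then process as one group" pattern above can be sketched in a few lines of Python. This is a minimal illustration with made-up data, not an Azure API; the function stands in for whatever transformation a nightly job would run:

```python
from datetime import date

# Hypothetical transactions collected over the course of one day (illustrative data).
daily_transactions = [
    {"order_id": 1, "amount": 120.00, "day": date(2024, 5, 1)},
    {"order_id": 2, "amount": 75.50,  "day": date(2024, 5, 1)},
    {"order_id": 3, "amount": 19.99,  "day": date(2024, 5, 1)},
]

def run_nightly_batch(transactions):
    """Process the whole bounded dataset in one pass, as a scheduled job would.
    The result is only available once the entire batch completes."""
    total = sum(t["amount"] for t in transactions)
    return {"orders": len(transactions), "revenue": round(total, 2)}

summary = run_nightly_batch(daily_transactions)
print(summary)  # {'orders': 3, 'revenue': 215.49}
```

Note the trade-off baked into the code: nothing is reported until the loop over the full dataset finishes, which is exactly why batch latency is measured in hours rather than seconds.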
Azure Services for Batch Processing:
• Azure Data Factory — orchestrates and automates data movement and transformation pipelines on a schedule.
• Azure Synapse Analytics — enables large-scale data warehousing and batch analytics using SQL pools or Spark pools.
• Azure Databricks — Apache Spark-based analytics platform for big data batch processing.
• Azure HDInsight — managed clusters for Hadoop, Spark, and other big data frameworks.
What Is Stream Processing?
Stream processing (also called real-time processing) is a method in which data is processed continuously and immediately as individual records or small micro-batches arrive, rather than waiting for a complete dataset to accumulate.
Key Characteristics of Stream Processing:
• Data is processed in real time or near real time as it is generated.
• Low latency — insights and outputs are produced within seconds or milliseconds.
• Ideal for time-sensitive scenarios where immediate action is required.
• Data is often unbounded — it has no defined beginning or end (a continuous flow).
• Suitable for detecting patterns, anomalies, and triggering alerts instantly.
• Examples: Fraud detection on credit card transactions, live social media sentiment analysis, IoT sensor monitoring, real-time stock ticker dashboards.
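By contrast, a stream processor evaluates each record the moment it arrives. The sketch below is a simplified stand-in for the fraud-detection example (the threshold rule and field names are assumptions for illustration; a real service like Azure Stream Analytics would express this as a query over an event source):

```python
def is_suspicious(event, threshold=1000.0):
    """Evaluate a single event immediately on arrival.
    Illustrative rule: flag any transaction above a fixed amount."""
    return event["amount"] > threshold

def process_stream(events):
    """Consume events one at a time, emitting alerts with low latency
    instead of waiting for a complete dataset to accumulate."""
    alerts = []
    for event in events:  # in production this loop never ends (unbounded input)
        if is_suspicious(event):
            alerts.append(event["txn_id"])
    return alerts

incoming = iter([
    {"txn_id": "t1", "amount": 25.00},
    {"txn_id": "t2", "amount": 4999.00},  # exceeds the threshold
    {"txn_id": "t3", "amount": 12.50},
])
print(process_stream(incoming))  # ['t2']
```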
Azure Services for Stream Processing:
• Azure Stream Analytics — a fully managed real-time analytics service that uses SQL-like queries to process streaming data from sources like Azure Event Hubs, IoT Hub, and Azure Blob Storage.
• Azure Event Hubs — a big data streaming platform and event ingestion service capable of receiving millions of events per second.
• Azure IoT Hub — a managed service that enables secure, bidirectional communication with IoT devices and ingests device telemetry at scale for downstream processing.

• Apache Spark Structured Streaming (via Azure Synapse Analytics or Azure Databricks) — processes streaming data using Spark.
How Do Batch and Stream Processing Compare?
Latency:
• Batch: High latency (minutes, hours, or days)
• Stream: Low latency (milliseconds to seconds)
Data Scope:
• Batch: Processes a bounded, finite dataset
• Stream: Processes unbounded, continuous data
Complexity:
• Batch: Easier to implement for large-scale transformations
• Stream: More complex due to handling out-of-order events, windowing, and state management
Use Cases:
• Batch: Historical reporting, data warehouse loading, periodic aggregations
• Stream: Real-time dashboards, alerting, fraud detection, live monitoring
Throughput:
• Batch: Very high throughput for large volumes
• Stream: Optimized for continuous processing of smaller units of data
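The bounded/unbounded distinction in the comparison above can be made concrete with a small Python sketch, using a finite list for batch data and a never-ending generator for stream data (the sensor is hypothetical):

```python
import itertools

# Bounded (batch): a finite dataset with a clear start and end,
# so an aggregate over the whole dataset is well-defined.
bounded = [10, 20, 30]
batch_result = sum(bounded)

# Unbounded (stream): a generator that never terminates on its own.
def sensor_readings():
    value = 0
    while True:  # no defined end -- a continuous flow
        value += 1
        yield value

# A stream consumer can never "sum everything"; it must decide when to
# emit results. Here we just take a finite slice to demonstrate
# (real pipelines use windowing for this instead).
first_five = list(itertools.islice(sensor_readings(), 5))

print(batch_result)  # 60
print(first_five)    # [1, 2, 3, 4, 5]
```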
Hybrid / Lambda Architecture:
Many organizations use both batch and stream processing together. This is sometimes referred to as a Lambda Architecture, where:
• A batch layer handles comprehensive historical processing for accuracy.
• A speed layer (stream) handles real-time processing for immediacy.
• A serving layer merges results from both for querying.
Azure Synapse Analytics supports both batch and streaming workloads, making it a powerful platform for hybrid architectures.
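The three Lambda Architecture layers can be sketched as data structures: a batch view recomputed from history, a speed view of recent deltas, and a serving layer that merges them at query time. The merge rule below (add speed-layer deltas onto batch totals) is one common convention, shown here with invented product names:

```python
def serving_layer(batch_view, speed_view):
    """Merge the accurate-but-stale batch view with fresh speed-layer
    deltas; keys seen only by the speed layer appear with their delta."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch layer: totals recomputed from all historical data (through last night).
batch_view = {"product_a": 500, "product_b": 320}
# Speed layer: real-time increments observed since the last batch run.
speed_view = {"product_a": 7, "product_c": 2}

print(serving_layer(batch_view, speed_view))
# {'product_a': 507, 'product_b': 320, 'product_c': 2}
```

When the next batch run completes, its view absorbs the history the speed layer was covering, and the speed view resets; this is how the architecture gets both accuracy and immediacy.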
How It Works — A Practical Example:
Scenario: An e-commerce company tracks customer purchases.
1. Stream Processing Path: As each purchase occurs, the transaction event is sent to Azure Event Hubs. Azure Stream Analytics picks up the event in real time, checks for potentially fraudulent activity, and triggers an alert if needed. A live dashboard updates instantly to show current sales figures.
2. Batch Processing Path: At the end of each day, Azure Data Factory runs a pipeline that extracts all daily transactions from the operational database, transforms and cleans the data, and loads it into Azure Synapse Analytics for detailed historical reporting and trend analysis.
Both paths work together to give the business immediate operational awareness and deep historical insights.
Exam Tips: Answering Questions on Batch and Stream Processing Overview
1. Know the key differentiator: The most fundamental distinction the exam tests is latency. If a scenario requires immediate or real-time processing, the answer points to stream processing. If the scenario involves scheduled, periodic, or historical processing, the answer points to batch processing.
2. Map scenarios to the correct processing type: The DP-900 exam frequently presents real-world scenarios. Practice identifying whether a scenario describes batch or stream processing:
• "Process sensor data as it arrives" → Stream
• "Generate a weekly sales report" → Batch
• "Detect fraudulent transactions in real time" → Stream
• "Load data into a data warehouse nightly" → Batch
3. Know the Azure services: Be able to associate the correct Azure service with the correct processing type. Key associations:
• Stream → Azure Stream Analytics, Azure Event Hubs, Azure IoT Hub
• Batch → Azure Data Factory, Azure Synapse Analytics (SQL/Spark pools), Azure Databricks
4. Understand the term "unbounded" vs "bounded": Stream data is unbounded (continuous, no defined end). Batch data is bounded (finite, with a clear start and end). The exam may use these terms.
5. Remember windowing concepts for streaming: Stream processing often uses windowing (tumbling, hopping, sliding, session windows) to group events within time intervals. You don't need deep technical details for DP-900, but knowing that windowing is a stream processing concept can help.
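To make the windowing idea concrete (beyond what DP-900 requires), here is a minimal Python sketch of a tumbling window, the simplest type: fixed-size, non-overlapping intervals, with each event assigned to exactly one window. Timestamps and the window size are illustrative:

```python
from collections import defaultdict

def tumbling_window_counts(timestamps, window_seconds):
    """Group event timestamps into fixed, non-overlapping time windows
    (tumbling windows) and count the events in each window."""
    counts = defaultdict(int)
    for ts in timestamps:
        # Each event belongs to exactly one window, keyed by its start time.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Event times in seconds; a 10-second tumbling window produces
# windows [0, 10) and [10, 20).
events = [1, 3, 9, 11, 14, 19]
print(tumbling_window_counts(events, 10))  # {0: 3, 10: 3}
```

Hopping and sliding windows differ only in that their intervals overlap, so one event can fall into several windows; session windows instead close after a gap of inactivity.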
6. Don't confuse ETL with streaming: ETL (Extract, Transform, Load) is traditionally a batch processing pattern. If a question mentions ETL, scheduled pipelines, or data warehouse loading, think batch.
7. Watch for hybrid scenarios: Some questions may describe a solution that uses both batch and stream processing. Recognize that these are complementary, not mutually exclusive, and that modern architectures often combine both.
8. Pay attention to keywords in questions:
• Keywords suggesting batch: scheduled, periodic, nightly, weekly, historical, ETL, data warehouse, large volume processing
• Keywords suggesting stream: real-time, immediate, continuous, as it arrives, live, event-driven, alerts, IoT
9. Understand that stream processing trades completeness for speed: Batch processing ensures all data is available before processing (complete picture), while stream processing prioritizes speed and may need to handle late-arriving data. This trade-off is a key concept.
10. Practice with elimination: On multiple-choice questions, eliminate answers that mismatch the latency requirement of the scenario. If the scenario demands real-time results, any batch-only answer can be eliminated immediately, and vice versa.