Streaming Data Ingestion with Kinesis and MSK
Streaming data ingestion is a critical component for real-time data processing in AWS, primarily facilitated by Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (MSK).

**Amazon Kinesis** is a fully managed service suite for real-time streaming data. It includes:
- **Kinesis Data Streams (KDS):** Captures and stores streaming data in shards. Producers send records using the Kinesis Producer Library (KPL) or AWS SDK, while consumers process data using the Kinesis Client Library (KCL) or AWS Lambda. Retention ranges from 24 hours to 365 days. Scaling is managed by adjusting shard count.
- **Kinesis Data Firehose:** A fully managed delivery service that loads streaming data into destinations like S3, Redshift, OpenSearch, and third-party tools. It supports data transformation via Lambda functions, batching, compression, and encryption, and requires no manual capacity management.
- **Kinesis Data Analytics:** Enables real-time analytics using SQL or Apache Flink on streaming data from KDS or Firehose.

**Amazon MSK** is a fully managed Apache Kafka service that simplifies building and running Kafka-based streaming applications. Key features include:
- Manages Kafka broker infrastructure, ZooKeeper nodes, and cluster operations.
- Supports Kafka topics and partitions for parallel data ingestion.
- Offers MSK Connect for integrating with source/sink connectors.
- MSK Serverless eliminates capacity planning by auto-scaling.
- Retains data as long as needed with tiered storage.

**Key Differences:** Kinesis is AWS-native with simpler setup, ideal for AWS-centric architectures. MSK suits organizations already invested in the Kafka ecosystem, offering more flexibility and open-source compatibility.

**Common Patterns:** Both services integrate with AWS Glue, Lambda, S3, and Redshift for downstream transformation and storage. Data engineers typically choose Kinesis for quick AWS integration and MSK for complex event-driven architectures requiring Kafka's consumer group model and rich connector ecosystem. Understanding both services is essential for designing scalable, fault-tolerant streaming ingestion pipelines in the AWS ecosystem.
Streaming Data Ingestion with Kinesis and MSK – Complete Guide for AWS Data Engineer Associate
Why Streaming Data Ingestion Matters
In today's data-driven world, organizations need to process and analyze data in real time or near real time. Batch processing alone is no longer sufficient for use cases like fraud detection, real-time dashboards, IoT telemetry, clickstream analytics, and log monitoring. Streaming data ingestion enables continuous capture of data as it is generated, ensuring low-latency delivery to downstream consumers, analytics engines, and storage layers. For the AWS Data Engineer Associate exam, streaming ingestion is a major topic because it sits at the heart of modern data architectures on AWS.
What Is Streaming Data Ingestion?
Streaming data ingestion is the process of continuously collecting, buffering, and delivering data records from producers (applications, devices, sensors, logs) to consumers (analytics applications, data lakes, databases) with minimal delay. AWS provides two primary managed services for streaming ingestion:
1. Amazon Kinesis
2. Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Both services allow you to build real-time data pipelines, but they differ in architecture, APIs, ecosystem compatibility, and operational characteristics.
Amazon Kinesis – Deep Dive
Amazon Kinesis is a suite of services for real-time data streaming:
a) Amazon Kinesis Data Streams (KDS)
- A real-time data streaming service that captures and stores data records in shards.
- Each shard provides 1 MB/sec input and 2 MB/sec output capacity.
- Data is retained for a default of 24 hours, extendable up to 365 days.
- Producers send data using the Kinesis Producer Library (KPL), AWS SDK, or the Kinesis Agent.
- Consumers read data using the Kinesis Client Library (KCL), AWS Lambda, or the AWS SDK.
- Enhanced Fan-Out provides dedicated 2 MB/sec throughput per consumer per shard via HTTP/2 push, ideal for multiple consumers reading from the same stream simultaneously.
- On-Demand Mode: Automatically scales shard capacity based on throughput; no need to manage shards manually.
- Provisioned Mode: You manually specify the number of shards. You can split or merge shards to scale.
- Partition keys determine which shard a record goes to. Hot partitions occur when partition key distribution is uneven.
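Kinesis maps a record to a shard by taking the MD5 hash of its partition key as a 128-bit integer and locating the shard whose hash-key range contains it. A minimal sketch of that scheme, assuming evenly split hash-key ranges:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5-hash the key to a 128-bit integer, then find the shard whose
    hash-key range contains it (ranges assumed evenly split here)."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // num_shards  # each shard owns an equal slice of the keyspace
    return min(hash_value // range_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves their relative order.
assert shard_for_key("customer-42", 4) == shard_for_key("customer-42", 4)
```

If one partition key (or a few) dominates the traffic, every such record hashes to the same shard, producing the hot-partition problem described above.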
b) Amazon Kinesis Data Firehose
- A fully managed delivery service that loads streaming data into destinations like Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, and HTTP endpoints.
- No shard management – it auto-scales.
- Supports data transformation via AWS Lambda (e.g., format conversion, enrichment, filtering).
- Supports record format conversion to Apache Parquet or ORC using an AWS Glue Data Catalog table schema.
- Supports dynamic partitioning to write data to S3 prefixes based on keys in the data, enabling efficient downstream querying.
- Buffering: Configurable by size (1–128 MB) and time (60–900 seconds). Firehose delivers data as soon as either threshold is reached, whichever comes first.
- Near real-time, not true real-time (minimum ~60 seconds latency).
- Can read directly from Kinesis Data Streams as a source.
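A Firehose transformation Lambda receives a batch of base64-encoded records and must return each one with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the (re-encoded) `data`. A sketch that enriches JSON records and filters some out; the `event_type` field and `heartbeat` filter rule are illustrative assumptions, not part of the Firehose contract:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode each record, enrich or
    drop it, and return it with the required recordId/result/data shape."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("event_type") == "heartbeat":  # illustrative filter rule
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["processed"] = True  # illustrative enrichment
        data = base64.b64encode((json.dumps(payload) + "\n").encode()).decode()
        output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
    return {"records": output}
```

The trailing newline keeps delivered S3 objects line-delimited, which Athena and Glue crawlers expect for JSON data.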
c) Amazon Kinesis Data Analytics
- Now known as Amazon Managed Service for Apache Flink.
- Allows running Apache Flink applications on streaming data from Kinesis Data Streams or MSK.
- Supports SQL-based analytics (legacy) and Flink-based Java/Scala/Python applications.
- Use cases: real-time aggregations, anomaly detection, windowed computations, sessionization.
d) Amazon Kinesis Video Streams
- Ingests and stores video, audio, and time-serialized data from devices for analytics and ML processing.
- Less commonly tested on the Data Engineer exam but good to know it exists.
Amazon MSK (Managed Streaming for Apache Kafka) – Deep Dive
Amazon MSK is a fully managed Apache Kafka service that makes it easy to build and run applications that use Apache Kafka to process streaming data.
Key Concepts:
- Brokers: MSK manages Apache Kafka broker nodes across multiple Availability Zones for high availability.
- Topics: Data is organized into topics. Producers write to topics, and consumers read from topics.
- Partitions: Each topic is divided into partitions. Partitions enable parallelism and scaling.
- Consumer Groups: Multiple consumers can form groups to share the load of reading partitions.
- Retention: Configurable, default is 7 days but can be set to indefinite with tiered storage.
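The consumer-group model can be pictured as each partition being owned by exactly one consumer in the group. A simplified round-robin sketch (Kafka's real assignors are range, round-robin, and sticky, negotiated during rebalances):

```python
def assign_partitions(partitions, consumers):
    """Sketch of how a consumer group shares work: every partition is
    assigned to exactly one consumer in the group, round-robin here."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions split across 2 consumers -> 3 each; adding a third
# consumer would trigger a rebalance and redistribute the partitions.
assignment = assign_partitions(range(6), ["c1", "c2"])
```

Because a partition never has two owners within one group, ordering within a partition is preserved; a *different* consumer group gets its own independent assignment over all partitions.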
MSK Provisioned vs. MSK Serverless:
- MSK Provisioned: You select broker instance types, number of brokers, and storage. You manage capacity planning.
- MSK Serverless: Automatically provisions and scales capacity. No broker management. Pay per throughput. Ideal when you don't want to manage infrastructure.
MSK Connect:
- A feature that lets you deploy Kafka Connect connectors to move data in and out of MSK (e.g., from databases using Debezium CDC connectors, or to S3).
- Fully managed – auto-scales workers.
Security in MSK:
- Encryption at rest with AWS KMS.
- Encryption in transit with TLS.
- Authentication via mutual TLS (mTLS), SASL/SCRAM, or IAM access control.
- Authorization via Kafka ACLs or IAM policies.
- VPC-based networking; brokers run inside your VPC.
MSK vs. Kinesis Data Streams – Key Differences:
| Feature | Kinesis Data Streams | Amazon MSK |
|---|---|---|
| Protocol | AWS proprietary API | Apache Kafka protocol |
| Message size limit | 1 MB per record | 1 MB default, configurable (`message.max.bytes`) |
| Scaling unit | Shard | Partition (and broker nodes) |
| Retention | 24 hours to 365 days | Configurable; supports tiered storage |
| Ecosystem | AWS-native integrations | Open-source Kafka ecosystem (huge community, connectors) |
| Consumer model | KCL, Lambda, SDK | Kafka consumers, consumer groups |
| Serverless option | On-Demand mode | MSK Serverless |
| Ordering | Per shard (partition key) | Per partition |
| Use when | AWS-native, simpler use cases | Existing Kafka workloads, Kafka ecosystem compatibility |
How Streaming Ingestion Works – Architecture Patterns
Pattern 1: KDS → Lambda → DynamoDB/S3
Producers send events to Kinesis Data Streams. AWS Lambda reads records in batches and writes to DynamoDB or S3. Good for lightweight, event-driven processing.
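In this pattern the Lambda event source mapping delivers batches of Kinesis records whose payloads are base64-encoded under `Records[*].kinesis.data`. A minimal handler sketch; the DynamoDB/S3 write is left as a comment since it would need live AWS credentials:

```python
import base64
import json

def lambda_handler(event, context):
    """Process one batch of Kinesis records delivered by the Lambda
    event source mapping; payloads arrive base64-encoded."""
    items = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # A real handler would write here, e.g. via a boto3 DynamoDB table
        # or s3.put_object; this sketch just collects the decoded payloads.
        items.append(payload)
    return {"batchSize": len(items)}
```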
Pattern 2: KDS → Firehose → S3 (Data Lake Ingestion)
Kinesis Data Streams captures real-time data. Firehose consumes from KDS, optionally transforms with Lambda, converts to Parquet, and delivers to S3 with dynamic partitioning. This is a very common exam pattern.
Pattern 3: KDS → Managed Apache Flink → Redshift/S3
For complex real-time analytics such as windowed aggregations, joins between streams, and anomaly detection before writing results to storage.
Pattern 4: MSK → MSK Connect (S3 Sink Connector) → S3
Producers write to MSK topics. MSK Connect uses the S3 Sink Connector to deliver data to an S3-based data lake.
Pattern 5: MSK → Amazon Managed Service for Apache Flink → Downstream
Apache Flink application consumes from MSK topics, performs real-time processing, and outputs to S3, Redshift, OpenSearch, etc.
Pattern 6: MSK → Lambda
Lambda can be triggered by MSK as an event source, processing each batch of records from Kafka topics.
Key Concepts You Must Know for the Exam
Kinesis Shard Scaling:
- Resharding: Splitting (double capacity) or merging (reduce capacity) shards.
- On-Demand mode removes the need for manual resharding.
- Provisioned mode requires you to monitor and manage shard counts.
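An even shard split hands each child half of the parent's 128-bit hash-key range (and so half its traffic, assuming uniform keys). A sketch of the midpoint arithmetic behind the `SplitShard` call's `NewStartingHashKey` parameter:

```python
def split_point(start: int, end: int) -> int:
    """Midpoint hash key for an even SplitShard: each child shard
    receives half of the parent's hash-key range."""
    return start + (end - start) // 2 + 1  # the second child's starting hash key

parent = (0, 2**128 - 1)         # a single shard covering the full keyspace
mid = split_point(*parent)
child_a = (parent[0], mid - 1)   # first child keeps the lower half
child_b = (mid, parent[1])       # second child takes the upper half
```

Merging is the inverse: two shards with adjacent hash-key ranges are combined back into one.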
Kinesis Ordering:
- Records with the same partition key always go to the same shard, preserving order within that shard.
- If ordering across all records is needed, use a single shard (but this limits throughput).
Kinesis Enhanced Fan-Out:
- Without it, all consumers share 2 MB/sec per shard (GetRecords API with polling).
- With it, each registered consumer gets its own dedicated 2 MB/sec per shard via SubscribeToShard (push model).
- Use enhanced fan-out when you have multiple consumers or need lower latency (~70ms vs ~200ms).
Firehose Buffering:
- Buffer size and buffer interval determine delivery frequency.
- Smaller buffers = lower latency but more requests to destination.
- Minimum latency is approximately 60 seconds.
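The "whichever comes first" rule can be made concrete with a little arithmetic on the ingest rate (the numbers below are illustrative, not service quotas):

```python
def seconds_until_delivery(buffer_mb: float, interval_s: int,
                           ingest_mb_per_s: float) -> float:
    """Firehose flushes when EITHER the buffer size or the buffer
    interval is reached, whichever happens first."""
    time_to_fill = buffer_mb / ingest_mb_per_s if ingest_mb_per_s > 0 else float("inf")
    return min(time_to_fill, interval_s)

# 5 MB buffer, 300 s interval: at 0.1 MB/s the buffer fills in 50 s,
# so size triggers delivery; at 0.01 MB/s the 300 s interval fires first.
```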
Firehose Format Conversion:
- Converts JSON to columnar formats (Parquet, ORC) for efficient querying with Athena, Redshift Spectrum, etc.
- Requires a table definition in AWS Glue Data Catalog.
Firehose Dynamic Partitioning:
- Extracts keys from data records (using JQ expressions or Lambda) and uses them to create S3 prefixes.
- Example: partitioning by customer_id, region, or event_type.
- Reduces the need for post-ingestion ETL to reorganize data.
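The key extraction can be sketched as pulling fields out of each JSON record and expanding them into an S3 prefix; Firehose does this with jq expressions or a Lambda, and the `region`/`event_type` field names here are illustrative assumptions:

```python
import json

def s3_prefix(record: bytes,
              template: str = "region={region}/event_type={event_type}/") -> str:
    """Sketch of dynamic partitioning: extract partitioning keys from a
    JSON record and expand them into a Hive-style S3 prefix."""
    fields = json.loads(record)
    return template.format(region=fields["region"], event_type=fields["event_type"])

# {"region": "eu-west-1", "event_type": "click"} -> "region=eu-west-1/event_type=click/"
```

Landing data under Hive-style `key=value/` prefixes lets Athena and Glue treat them as partitions without a post-ingestion reorganization job.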
MSK Tiered Storage:
- Allows long-term retention of data in lower-cost storage while keeping recent data on broker disks.
- Consumers can still read older data transparently.
MSK IAM Access Control:
- Allows Kafka clients to authenticate and authorize using IAM policies instead of managing Kafka ACLs separately.
- Simplifies security management in AWS environments.
Data Deduplication and Idempotency:
- Kinesis: Producer retries can create duplicates, and Kinesis does not deduplicate records for you. Embed a unique ID in each record and implement idempotent consumers.
- MSK/Kafka: Supports idempotent producers and exactly-once semantics (EOS) with transactions.
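An idempotent consumer can be sketched as skipping any record whose identity it has already processed; in production the `seen` set would live in a durable store such as DynamoDB rather than in memory:

```python
def process_once(records, seen, apply):
    """Idempotent-consumer sketch: skip records already processed, keyed
    by sequence number (Kinesis) or (topic, partition, offset) for Kafka."""
    for record in records:
        key = record["sequenceNumber"]  # use (topic, partition, offset) with Kafka
        if key in seen:
            continue  # duplicate produced by a retry; safe to drop
        apply(record)
        seen.add(key)
```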
Error Handling in Streaming Consumers:
- Lambda with Kinesis/MSK: Configure bisect-on-error to isolate poison pill records, set max retry attempts, and configure on-failure destinations (SQS, SNS).
- Flink: Use checkpointing and savepoints for fault tolerance and exactly-once processing.
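The bisect-on-error behavior can be mimicked with a simple recursive halving that isolates a poison record; note that records in successful halves may be processed more than once, which is why consumers still need to be idempotent (at-least-once semantics):

```python
def bisect_process(batch, handler):
    """Mimic Lambda's BisectBatchOnFunctionError: when the handler fails
    on a batch, split it in half and retry each half until the poison
    record is isolated; return the isolated failures."""
    try:
        handler(batch)
        return []  # whole batch succeeded
    except Exception:
        if len(batch) == 1:
            return batch  # isolated poison record -> send to on-failure destination
        mid = len(batch) // 2
        return bisect_process(batch[:mid], handler) + bisect_process(batch[mid:], handler)
```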
Exam Tips: Answering Questions on Streaming Data Ingestion with Kinesis and MSK
Tip 1: Identify Real-Time vs. Near Real-Time Requirements
- If the question requires true real-time processing (sub-second latency), the answer likely involves Kinesis Data Streams or MSK with custom consumers or Flink.
- If the question says near real-time delivery to S3 or Redshift, think Kinesis Data Firehose.
- Firehose has a minimum latency of ~60 seconds. If the requirement is under 60 seconds, Firehose alone won't work.
Tip 2: Know When to Choose MSK over Kinesis
- If the scenario mentions existing Kafka workloads, Kafka ecosystem, Kafka APIs, Kafka Connect, or migrating from on-premises Kafka, the answer is MSK.
- If the scenario is purely AWS-native with no Kafka dependency, Kinesis is typically the simpler and preferred choice.
- MSK Serverless is the answer when they want Kafka without managing brokers.
Tip 3: Watch for Scaling Keywords
- "Automatically scale" + Kinesis = On-Demand mode for KDS or Firehose.
- "Automatically scale" + Kafka = MSK Serverless.
- "Hot shard" or "uneven distribution" = partition key redesign or shard splitting.
Tip 4: Format Conversion and Partitioning Questions
- If a question asks about converting streaming JSON data to Parquet before landing in S3, the answer is Firehose with record format conversion (backed by Glue Data Catalog).
- If data needs to be organized by specific keys in S3, the answer is Firehose dynamic partitioning.
Tip 5: Multiple Consumers Reading the Same Stream
- Kinesis: Use Enhanced Fan-Out for dedicated throughput per consumer.
- MSK: Use consumer groups for parallel consumption. Different consumer groups independently read all partitions.
Tip 6: Data Transformation During Ingestion
- Lightweight transformations (filtering, enrichment, format changes): Firehose + Lambda transformation.
- Complex transformations (windowing, joins, aggregations): Amazon Managed Service for Apache Flink.
Tip 7: Ordering Guarantees
- Kinesis guarantees ordering per shard using partition keys.
- Kafka/MSK guarantees ordering per partition.
- If order matters, ensure related records share the same partition key (Kinesis) or go to the same partition (MSK).
Tip 8: Security Questions
- Kinesis: Encryption at rest via KMS, encryption in transit via HTTPS, IAM policies for access control.
- MSK: Encryption at rest via KMS, encryption in transit via TLS, authentication via IAM/mTLS/SASL-SCRAM, authorization via IAM or Kafka ACLs.
- If the question emphasizes least operational overhead for authentication on MSK, choose IAM access control.
Tip 9: Cost Optimization Keywords
- Firehose has no shard management cost; you pay per data volume ingested.
- KDS On-Demand is simpler but more expensive than Provisioned mode at steady-state high throughput.
- MSK Serverless is more expensive per GB than MSK Provisioned for high, predictable throughput but cheaper for variable/bursty workloads.
Tip 10: Integration Patterns to Memorize
- KDS → Firehose → S3 (data lake ingestion with optional transformation)
- KDS → Lambda (lightweight event processing)
- KDS → Managed Flink → S3/Redshift (complex stream processing)
- MSK → MSK Connect → S3 (Kafka-native data lake ingestion)
- MSK → Lambda (event-driven Kafka consumption)
- MSK → Managed Flink → downstream (complex Kafka stream processing)
Tip 11: Disaster Recovery and Durability
- KDS replicates data across three AZs synchronously within a region.
- MSK replicates data across brokers in multiple AZs (default 3-AZ deployment recommended).
- For cross-region replication: KDS has no native feature (use Lambda or custom consumers); MSK supports MirrorMaker 2 or MSK Replicator for cross-region/cross-cluster replication.
Tip 12: Eliminate Wrong Answers Quickly
- If the answer says "Firehose" for real-time sub-second processing → eliminate it.
- If the answer says "Kinesis Data Streams" for automatic delivery to S3 without custom consumers → probably wrong (Firehose or Lambda needed).
- If the answer says "MSK" but there's no mention of Kafka, migration from Kafka, or open-source requirements → likely not the best choice.
Tip 13: Know the Limits
- KDS record size: max 1 MB.
- KDS shard: 1 MB/sec in, 2 MB/sec out, 1000 records/sec write.
- Firehose: max record size 1 MB (but can buffer up to 128 MB before delivery).
- If the scenario involves records larger than 1 MB, consider using S3 as the payload store with a reference pointer in the stream (claim check pattern).
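The claim check pattern can be sketched as writing oversized payloads to a side store and sending only a small pointer on the stream; the dict stands in for S3 here, and the `payloads/` key prefix is an illustrative assumption:

```python
import json
import uuid

MAX_RECORD_BYTES = 1_000_000  # the KDS per-record limit is 1 MB

def prepare_record(payload: bytes, object_store: dict) -> bytes:
    """Claim-check sketch: payloads over the record limit go to a side
    store (S3 in practice; a dict here) and are replaced by a pointer."""
    if len(payload) <= MAX_RECORD_BYTES:
        return payload  # small enough to travel on the stream directly
    key = f"payloads/{uuid.uuid4()}"
    object_store[key] = payload  # s3.put_object(...) in a real pipeline
    return json.dumps({"claim_check": key}).encode()  # small pointer record
```

Consumers that see a `claim_check` field fetch the full payload from the store before processing.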
Summary
Streaming data ingestion with Kinesis and MSK is a foundational topic for the AWS Data Engineer Associate exam. Remember that Kinesis is the AWS-native choice for simpler, serverless-friendly streaming architectures, while MSK is the choice when Apache Kafka compatibility is required. Firehose is the managed delivery mechanism for near real-time loading to destinations like S3 and Redshift. Managed Apache Flink handles complex stream processing. Always match the service to the requirements described in the question: latency, ecosystem, scaling model, transformation complexity, and cost.