Batch Data Ingestion with S3 and AWS Glue
Batch Data Ingestion with S3 and AWS Glue is a fundamental pattern in AWS data engineering for processing large volumes of data at scheduled intervals rather than in real time.

**Amazon S3 as a Data Lake:** Amazon S3 serves as the central storage layer for batch data ingestion. Data from various sources—on-premises databases, SaaS applications, or external systems—is uploaded to S3 buckets in raw format. S3 supports virtually unlimited storage, multiple file formats (CSV, JSON, Parquet, ORC, Avro), and provides 99.999999999% (eleven nines) durability. Data is typically organized using partitioning strategies (e.g., by date: s3://bucket/year/month/day/) to optimize downstream query performance.

**AWS Glue for ETL Processing:** AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service that processes batch data stored in S3. Key components include:
- **Glue Crawlers:** Automatically scan data in S3, infer schemas, and populate the AWS Glue Data Catalog with metadata, making data discoverable and queryable.
- **Glue Data Catalog:** A centralized metadata repository that stores table definitions, schemas, and partition information, acting as a Hive-compatible metastore.
- **Glue ETL Jobs:** Written in Python (PySpark) or Scala, these jobs transform raw data—performing operations like filtering, joining, deduplication, format conversion, and data cleansing. Glue uses Apache Spark under the hood for distributed processing.
- **Glue Workflows and Triggers:** Orchestrate multi-step ETL pipelines with time-based or event-based scheduling. Jobs can be triggered when new data lands in S3 using EventBridge notifications.
- **Glue Job Bookmarks:** Track previously processed data to prevent reprocessing, enabling incremental batch ingestion.

**Typical Pipeline Flow:** Raw data lands in S3 → Crawlers catalog the data → Glue ETL jobs transform it → Transformed data is written back to S3 in optimized formats (e.g., Parquet) → Data is available for analytics via Athena, Redshift Spectrum, or other services. This pattern is cost-effective, scalable, and ideal for periodic data processing workloads.
Batch Data Ingestion with S3 and AWS Glue – Complete Guide for AWS Data Engineer Associate
Introduction
Batch data ingestion is one of the most fundamental patterns in data engineering. In the AWS ecosystem, Amazon S3 and AWS Glue form a powerful combination that enables organizations to collect, catalog, transform, and prepare large volumes of data for analytics. For the AWS Data Engineer Associate exam, understanding how these services work together for batch ingestion is essential.
Why Is Batch Data Ingestion with S3 and AWS Glue Important?
Batch data ingestion is important for several key reasons:
• Cost-Effectiveness: Processing data in batches (as opposed to real-time) is significantly cheaper, making it the preferred approach when near-instant processing is not required.
• Scalability: Amazon S3 provides virtually unlimited storage, and AWS Glue scales compute resources automatically, enabling the processing of terabytes or petabytes of data.
• Data Lake Foundation: S3 serves as the backbone of most AWS data lake architectures. Batch ingestion into S3 is the starting point for countless analytics workloads.
• Schema Discovery and Data Cataloging: AWS Glue Crawlers automatically detect schemas and create metadata tables, making raw data queryable and discoverable.
• Integration Across the AWS Ecosystem: Data ingested into S3 and cataloged by Glue can be consumed by Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, Amazon QuickSight, and many other services.
What Is Batch Data Ingestion with S3 and AWS Glue?
Batch data ingestion refers to the process of collecting and moving data in discrete, scheduled groups (batches) rather than as a continuous stream. In the AWS context:
• Amazon S3 (Simple Storage Service) acts as the central data store – the landing zone, staging area, and often the permanent data lake storage layer. Data from various sources (databases, SaaS applications, on-premises file systems, APIs) is loaded into S3 buckets in raw form.
• AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service that handles the cataloging, transformation, and movement of that data. It includes several components:
- AWS Glue Data Catalog: A centralized metadata repository that stores table definitions, schemas, and partition information. It acts as a Hive-compatible metastore.
- AWS Glue Crawlers: Automated processes that scan data in S3 (or other data stores), infer schemas, and populate the Data Catalog with table definitions.
- AWS Glue ETL Jobs: Serverless Spark-based (PySpark/Scala) or Python Shell jobs that extract data from sources, apply transformations, and load the results into target destinations.
- AWS Glue Studio: A visual interface for creating, running, and monitoring ETL jobs without writing code.
- AWS Glue DataBrew: A visual data preparation tool for cleaning and normalizing data without writing code.
- AWS Glue Triggers and Workflows: Orchestration mechanisms to schedule and chain multiple crawlers and ETL jobs together.
How Does Batch Ingestion with S3 and AWS Glue Work?
A typical batch ingestion pipeline follows these steps:
Step 1: Data Landing in S3
Raw data arrives in an S3 bucket (often called the raw or landing zone). Data can arrive through various mechanisms:
• Direct file uploads (e.g., via AWS CLI, SDK, or the S3 console)
• AWS Database Migration Service (DMS) for database migrations
• AWS Transfer Family for SFTP/FTP transfers
• Third-party connectors or custom applications
• S3 Replication from other accounts or regions
Data is typically organized using a partitioning strategy such as:
s3://my-bucket/raw/year=2024/month=06/day=15/
This partitioning scheme improves query performance and reduces scan costs.
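To make the scheme concrete, here is a minimal sketch in Python that builds a Hive-style partition prefix for a given date (bucket and zone names are illustrative, not from any real account):

```python
from datetime import date

def partition_prefix(bucket: str, zone: str, d: date) -> str:
    """Build a Hive-style partition prefix (year=/month=/day=) for a date."""
    return f"s3://{bucket}/{zone}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix("my-bucket", "raw", date(2024, 6, 15)))
# s3://my-bucket/raw/year=2024/month=06/day=15/
```

Because the `key=value` form matches what Glue Crawlers and Athena expect, the crawler can register `year`, `month`, and `day` as partition keys automatically.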
Step 2: Schema Discovery with Glue Crawlers
An AWS Glue Crawler is configured to scan the S3 landing zone. It:
• Detects file formats (Parquet, CSV, JSON, ORC, Avro, etc.)
• Infers column names, data types, and partition keys
• Creates or updates table definitions in the Glue Data Catalog
• Groups files with similar schemas into the same table
Crawlers can be scheduled (e.g., hourly, daily) or triggered on-demand.
Step 3: Data Transformation with Glue ETL Jobs
AWS Glue ETL jobs read data from the source (referenced via the Data Catalog), apply transformations, and write the output to a target location. Common transformations include:
• Format conversion: Converting CSV or JSON to columnar formats like Parquet or ORC for better performance and compression
• Data cleansing: Removing duplicates, handling null values, fixing data types
• Filtering and aggregation: Selecting relevant columns, aggregating records
• Joining datasets: Combining data from multiple sources
• Schema normalization: Flattening nested structures, renaming columns
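In a real Glue job these transformations are written in PySpark against DynamicFrames; the following is only a plain-Python stand-in (with hypothetical field names `id` and `amount`) that shows the deduplication, null handling, and type fixing described above:

```python
def cleanse(records):
    """Deduplicate by id, drop records missing required fields, coerce types."""
    seen, out = set(), []
    for r in records:
        if r.get("id") is None or r.get("amount") is None:
            continue                      # drop rows with null keys/values
        if r["id"] in seen:
            continue                      # drop duplicates by primary key
        seen.add(r["id"])
        out.append({**r, "amount": float(r["amount"])})  # fix data type
    return out

raw = [
    {"id": 1, "amount": "10.5"},
    {"id": 1, "amount": "10.5"},   # duplicate
    {"id": 2, "amount": None},     # null value
    {"id": 3, "amount": "7"},
]
print(cleanse(raw))  # [{'id': 1, 'amount': 10.5}, {'id': 3, 'amount': 7.0}]
```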
Glue uses DynamicFrames (an extension of Spark DataFrames) which handle schema inconsistencies gracefully. Key features include:
• ResolveChoice: Handles columns with mixed data types
• Relationalize: Flattens nested JSON structures into relational tables
• Bookmarks: Track previously processed data so jobs only process new or changed files (incremental processing)
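The Relationalize idea—turning nested JSON into flat, relational columns—can be illustrated with a toy flattener in plain Python (this is a conceptual sketch, not the Glue `Relationalize` API, which also splits arrays into separate tables):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))  # recurse into structs
        else:
            flat[name] = value
    return flat

nested = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Berlin"}}}
print(flatten(nested))
# {'id': 7, 'customer.name': 'Ada', 'customer.address.city': 'Berlin'}
```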
Step 4: Writing to the Curated/Processed Zone
Transformed data is written to another S3 location (the processed, curated, or gold zone) typically in optimized formats like Parquet or ORC with snappy compression. The output is also cataloged in the Glue Data Catalog for downstream consumption.
Step 5: Orchestration
AWS Glue Workflows or external orchestrators (like AWS Step Functions, Amazon MWAA/Apache Airflow, or Amazon EventBridge) coordinate the sequence: trigger crawler → run ETL job → trigger another crawler → notify downstream consumers.
Key Concepts to Master for the Exam
1. S3 Bucket Organization and Partitioning
• Understand the multi-zone architecture: raw → staging → curated/processed
• Know how partition keys (year, month, day, hour) reduce data scanned by Athena and Redshift Spectrum
• Understand S3 event notifications that can trigger Glue workflows or Lambda functions
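Why partition keys reduce scanned data is easy to see in miniature: the query engine filters the partition list from the Data Catalog first, and only the surviving partitions' S3 objects are ever read. A toy illustration:

```python
# Partition metadata as the Data Catalog would store it (values are strings).
partitions = [
    {"year": "2024", "month": "05", "day": "31"},
    {"year": "2024", "month": "06", "day": "01"},
    {"year": "2024", "month": "06", "day": "02"},
]

# WHERE year = '2024' AND month = '06' prunes partitions before any S3 read;
# files under the May partition are never scanned, which cuts Athena costs.
pruned = [p for p in partitions if p["year"] == "2024" and p["month"] == "06"]
print(len(pruned))  # 2
```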
2. AWS Glue Data Catalog
• The Data Catalog is the central metadata store and is compatible with Hive metastore
• It can be shared across services: Athena, Redshift Spectrum, EMR, and Glue ETL
• Databases and tables in the Catalog are metadata only – they do not store actual data
• Resource policies and IAM can control access to the Catalog
3. AWS Glue Crawlers
• Crawlers use classifiers to determine file format; custom classifiers can be created
• Crawlers can handle schema evolution: adding new columns, detecting new partitions
• Crawler configuration options: grouping behavior, schema change policies (e.g., update table, add new table, log changes)
• Running crawlers too frequently increases cost; schedule them appropriately
4. AWS Glue ETL Jobs
• Glue ETL jobs support PySpark, Scala, and Python Shell runtimes
• DPU (Data Processing Unit): A unit of processing power. Each DPU provides 4 vCPUs and 16 GB of memory. You allocate DPUs to jobs to control performance and cost.
• Job bookmarks enable incremental data processing by tracking state from previous job runs
• Glue version: Newer Glue versions support newer Spark versions (e.g., Glue 4.0 supports Spark 3.3)
• Worker types: Standard (legacy), G.1X, and G.2X for typical Spark workloads; G.4X and G.8X for memory-intensive workloads; G.025X for low-volume streaming jobs
• Pushdown predicates: Allow Glue to push partition filtering to the Data Catalog, avoiding full S3 scans
• Error handling: Glue jobs can retry on failure; use CloudWatch Logs for debugging
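Since Glue ETL is billed in DPU-hours, the cost arithmetic is worth internalizing for exam questions. A small sketch (the $0.44 per DPU-hour rate is a hypothetical example; actual rates vary by region and job type, and there is a per-run billing minimum):

```python
def glue_job_cost(dpus: int, runtime_hours: float, rate_per_dpu_hour: float) -> float:
    """Glue ETL billing is DPU-hours multiplied by the regional rate."""
    return round(dpus * runtime_hours * rate_per_dpu_hour, 2)

# 10 DPUs for a 30-minute run at a hypothetical $0.44 per DPU-hour:
print(glue_job_cost(10, 0.5, 0.44))  # 2.2
```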
5. Job Bookmarks
• Bookmarks track which data has been processed so subsequent runs only process new data
• Bookmarks work with S3 sources (tracking files by path, timestamp, and size) and JDBC sources (tracking primary key values)
• Bookmark states: enabled, disabled, or paused. Pausing processes incremental data without updating the bookmark state, so that data can be reprocessed on later runs without fully resetting the bookmark.
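The bookmark mechanism for S3 sources can be simulated in a few lines: track each file by (path, last-modified timestamp, size), and a run only processes files whose tuple has not been seen. This is a conceptual model, not Glue's internal implementation:

```python
def new_files(listing, bookmark):
    """Return files not yet in the bookmark state, then record them.

    Each file is identified by (path, mtime, size), mirroring how job
    bookmarks detect new or changed S3 objects between runs.
    """
    todo = [f for f in listing if (f["path"], f["mtime"], f["size"]) not in bookmark]
    bookmark.update((f["path"], f["mtime"], f["size"]) for f in todo)
    return todo

state = set()
run1 = [{"path": "raw/a.csv", "mtime": 100, "size": 10}]
print(len(new_files(run1, state)))   # 1 -- first run sees a.csv as new
run2 = run1 + [{"path": "raw/b.csv", "mtime": 200, "size": 20}]
print(len(new_files(run2, state)))   # 1 -- second run processes only b.csv
```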
6. AWS Glue Workflows and Triggers
• Workflows orchestrate multiple crawlers and jobs into a single pipeline
• Triggers can be scheduled (cron-based), on-demand, or conditional (triggered when a previous job succeeds or fails)
• EventBridge integration allows triggering Glue jobs from external events
7. Data Formats and Compression
• Converting to columnar formats (Parquet, ORC) reduces storage cost and improves query performance
• Snappy compression is commonly used for a balance of speed and compression ratio
• File sizes matter: too many small files cause overhead; too few large files limit parallelism. The sweet spot is typically 128 MB to 1 GB per file.
• Glue's groupFiles and groupSize options help manage small file problems during reads
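The file-sizing guidance above reduces to simple arithmetic when deciding how many output files to coalesce or repartition to. A sketch, assuming a ~256 MB target (any value in the 128 MB–1 GB band works the same way):

```python
import math

def target_file_count(total_bytes: int, target_file_bytes: int = 256 * 1024**2) -> int:
    """Number of output files to coalesce to for a given target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 10 GB of output at a ~256 MB target -> 40 files instead of thousands of tiny ones
print(target_file_count(10 * 1024**3))  # 40
```

Passing this count to a `coalesce`/`repartition` call during the write step is the usual compaction fix for small-file problems.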
8. Security
• Glue uses IAM roles to access S3, the Data Catalog, and other AWS services
• S3 data can be encrypted using SSE-S3, SSE-KMS, or SSE-C
• Glue supports encryption at rest for the Data Catalog, job bookmarks, and job output
• Glue connections can access data in VPCs (e.g., RDS, Redshift) via ENIs in specified subnets
• Lake Formation can be used on top of Glue Data Catalog for fine-grained access control (column-level, row-level, cell-level)
9. Monitoring and Logging
• Glue jobs emit metrics to Amazon CloudWatch (execution time, DPU hours, bytes read/written)
• Spark UI is available for Glue jobs to debug performance issues
• CloudWatch Logs capture job output and error messages
• CloudTrail logs API calls to Glue for auditing
Common Batch Ingestion Architecture Pattern
A common exam scenario involves:
1. Data sources → S3 raw zone (landing)
2. Glue Crawler scans raw data → populates Data Catalog
3. Glue ETL job reads from raw zone → transforms → writes to curated zone in Parquet
4. Second Glue Crawler scans curated zone → updates Data Catalog
5. Athena/Redshift Spectrum queries the curated data via the Catalog
6. The entire workflow is orchestrated by Glue Workflows or Step Functions
Exam Tips: Answering Questions on Batch Data Ingestion with S3 and AWS Glue
1. When the question mentions "serverless ETL" or "no infrastructure management," think AWS Glue. It is the go-to fully managed, serverless ETL service. Do not confuse it with EMR, which requires cluster management (unless EMR Serverless is mentioned).
2. When the question asks about schema discovery or metadata management, the answer is Glue Crawlers and the Glue Data Catalog. If the question mentions making data queryable by Athena or Redshift Spectrum, a Crawler populating the Data Catalog is almost always part of the answer.
3. When incremental/delta processing is mentioned, think Glue Job Bookmarks. This is the mechanism that ensures Glue only processes new or changed data. If the question says "avoid reprocessing already-processed data," bookmarks are the answer.
4. For questions about optimizing query performance, look for answers involving: converting to Parquet/ORC, partitioning data in S3, and using pushdown predicates in Glue ETL jobs.
5. For small file problems, remember Glue's groupFiles parameter and the ability to coalesce/repartition output files. If a question describes slow queries due to thousands of small files, the answer likely involves compaction or merging files during ETL.
6. Understand the difference between Glue Workflows and Step Functions. Glue Workflows are native to Glue and orchestrate Glue-specific resources (crawlers, jobs, triggers). Step Functions provide broader orchestration across multiple AWS services. If the pipeline is purely Glue-based, Glue Workflows may suffice. If it involves Lambda, ECS, and other services, Step Functions is more appropriate.
7. For cost optimization questions, remember: right-size DPUs, use appropriate worker types, enable job bookmarks to avoid reprocessing, convert data to columnar formats, and use S3 lifecycle policies to archive or delete old data.
8. If the question involves data quality or data cleansing, consider AWS Glue DataBrew for visual, no-code data preparation, or Glue Data Quality rules (integrated with Glue ETL) for automated quality checks.
9. When security and access control come up, remember: IAM roles for Glue job execution, KMS for encryption, VPC endpoints for private connectivity, and AWS Lake Formation for fine-grained catalog-level permissions.
10. Pay attention to keywords in the question:
- "Catalog" or "metadata" → Glue Data Catalog
- "Automatically discover schema" → Glue Crawler
- "Serverless transformation" → Glue ETL Job
- "Central data lake storage" → Amazon S3
- "Optimize for analytics" → Parquet/ORC + Partitioning
- "Process only new data" → Job Bookmarks
- "Visual ETL" → Glue Studio
- "Visual data preparation" → Glue DataBrew
11. Elimination strategy: If a question presents options including both Glue and a self-managed Spark on EC2 approach, choose Glue unless there is a specific requirement for customization that Glue cannot handle. The exam favors managed, serverless solutions.
12. Remember the multi-zone data lake pattern: Raw → Staging/Cleaned → Curated/Processed. Many questions test whether you understand this layered architecture and which services operate at each layer.
Summary
Batch data ingestion with S3 and AWS Glue is a cornerstone topic for the AWS Data Engineer Associate exam. S3 provides durable, scalable storage for your data lake, while Glue offers serverless crawling, cataloging, and ETL capabilities. Together, they form the foundation of batch processing pipelines on AWS. Focus on understanding the Data Catalog, Crawlers, ETL job configurations (DPUs, bookmarks, worker types), data format optimization, partitioning strategies, and orchestration options. Mastering these concepts will prepare you to confidently answer batch ingestion questions on the exam.