Querying Data with Amazon Athena
Amazon Athena is a serverless, interactive query service that enables data engineers to analyze data directly in Amazon S3 using standard SQL. It requires no infrastructure setup or management, making it a powerful tool for ad-hoc querying and data exploration in AWS data pipelines.

**Key Concepts:**

1. **Serverless Architecture:** Athena eliminates the need to provision clusters or servers. You simply point to your data in S3, define the schema, and start querying. You pay only for the amount of data scanned per query, making it cost-effective.
2. **Schema-on-Read:** Athena uses a schema-on-read approach, meaning data doesn't need to be transformed before querying. You define table schemas in the AWS Glue Data Catalog or Athena's internal catalog, which maps to the underlying S3 data.
3. **Supported Formats:** Athena supports multiple data formats, including CSV, JSON, Parquet, ORC, and Avro. Columnar formats like Parquet and ORC are recommended because they significantly reduce the data scanned, improving performance and lowering costs.
4. **Integration with AWS Glue:** The Glue Data Catalog serves as a centralized metadata repository. Glue Crawlers can automatically discover schemas and partition structures, making data immediately queryable in Athena.
5. **Partitioning:** Partitioning data by commonly filtered columns (e.g., date, region) dramatically reduces the amount of data scanned, optimizing both performance and cost.
6. **Workgroups:** Athena workgroups help manage query access, track costs, and enforce data usage limits across teams.
7. **Federated Queries:** Athena supports querying data beyond S3, including relational databases, DynamoDB, and other sources, using Lambda-based connectors.
8. **CTAS and Views:** Create Table As Select (CTAS) statements allow you to transform and store query results in optimized formats. Views enable reusable query logic.

**Best Practices:** Compress data, use columnar formats, partition effectively, and leverage query result reuse to optimize performance. Athena is ideal for log analysis, data lake exploration, and building lightweight ETL workflows within the AWS ecosystem.
Querying Data with Amazon Athena: A Complete Guide for the AWS Data Engineer Associate Exam
Why Is Querying Data with Amazon Athena Important?
Amazon Athena is one of the most critical services tested on the AWS Data Engineer Associate (DEA-C01) exam. As a serverless, interactive query service, Athena enables data engineers to analyze data directly in Amazon S3 using standard SQL without provisioning or managing any infrastructure. Understanding Athena is essential because it sits at the intersection of data storage (S3), data cataloging (AWS Glue Data Catalog), and data analytics — three core pillars of the AWS data engineering ecosystem. In real-world scenarios, data engineers use Athena for ad-hoc analysis, ETL validation, log analysis, and building lightweight data pipelines.
What Is Amazon Athena?
Amazon Athena is a serverless, interactive analytics service built on the open-source Presto (now Trino) query engine. It allows you to run SQL queries directly against data stored in Amazon S3. Key characteristics include:
• Serverless: No infrastructure to manage. You simply point Athena at your data, define a schema, and start querying.
• Pay-per-query: You are charged based on the amount of data scanned by each query (approximately $5 per TB scanned).
• Standard SQL: Athena supports ANSI SQL, including complex joins, window functions, and nested queries.
• Integration with AWS Glue Data Catalog: Athena uses the AWS Glue Data Catalog as its metastore, allowing you to define databases and tables that reference data in S3.
• Support for multiple data formats: CSV, TSV, JSON, Parquet, ORC, Avro, and more.
• Federated queries: Athena can query data sources beyond S3 (e.g., DynamoDB, RDS, Redshift, on-premises databases) using Athena Federated Query with Lambda-based connectors.
How Does Amazon Athena Work?
Understanding the architecture and workflow of Athena is critical for the exam:
1. Data Storage in S3
Data resides in Amazon S3 buckets. Athena reads data directly from S3 — it does not load or move data. The data can be in various formats, but columnar formats like Parquet and ORC are strongly recommended for performance and cost optimization.
2. Schema Definition via AWS Glue Data Catalog
Before querying, you must define a schema (database and table) in the AWS Glue Data Catalog. This can be done by:
• Running a Glue Crawler to automatically discover and catalog data
• Using CREATE TABLE DDL statements directly in the Athena console
• Using the Glue console or API to manually define tables
The Glue Data Catalog stores metadata such as table names, column definitions, data types, S3 locations, and serialization/deserialization (SerDe) information.
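As a concrete sketch of the DDL route, the statement below registers a CSV dataset in S3 as a queryable table; the database, table, column names, and bucket path are illustrative placeholders, not values from this guide:

```sql
-- Hypothetical example: register raw CSV access logs in S3 as an Athena table.
-- Bucket, database, and column names are illustrative.
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs.access_log (
  request_time  string,
  client_ip     string,
  status_code   int,
  bytes_sent    bigint
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/raw/access-logs/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Behind the scenes, this DDL writes the table definition (schema, S3 location, SerDe settings) into the Glue Data Catalog; no data is loaded or moved.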
3. Query Execution
When you submit a SQL query in Athena:
• Athena parses the query and references the Glue Data Catalog for schema information
• It determines which S3 objects need to be read based on the table location and any partitions
• The query engine (Trino-based) reads and processes the data in parallel
• Results are returned to the user and stored in a designated S3 output location
4. Query Results
Query results are saved as CSV files in an S3 output bucket that you specify. You can also access results programmatically via the Athena API, CLI, or SDK.
Key Concepts and Features for the Exam
Partitioning
Partitioning is one of the most important optimization techniques in Athena. By organizing data in S3 into partition-based folder structures (e.g., s3://bucket/table/year=2024/month=01/day=15/), Athena can skip irrelevant data during queries. This significantly reduces the amount of data scanned, improving both performance and cost.
• Hive-style partitioning: Folder names follow the key=value format. Athena can automatically recognize these.
• Non-Hive-style partitioning: Requires partition projection or manual partition addition using ALTER TABLE ADD PARTITION.
• Partition Projection: A feature that allows Athena to calculate partition values at query time instead of reading them from the Glue Data Catalog. This is ideal for time-series data with predictable partition patterns and eliminates the need for MSCK REPAIR TABLE or Glue Crawlers to update partitions.
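A minimal sketch of partition projection on a date-partitioned table, assuming a hypothetical `dt=YYYY-MM-DD` folder layout (names and paths are illustrative):

```sql
-- Hypothetical example: a date-partitioned table using partition projection,
-- so new daily partitions are queryable without MSCK REPAIR TABLE or a Crawler run.
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs.events (
  event_id string,
  payload  string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'projection.dt.range'       = '2024-01-01,NOW',
  'storage.location.template' = 's3://example-bucket/events/dt=${dt}/'
);
```

At query time Athena computes the candidate `dt` values from the projection properties instead of looking partitions up in the Glue Data Catalog.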
Data Formats and Compression
• Columnar formats (Parquet, ORC): Dramatically reduce the amount of data scanned because Athena only reads the columns referenced in the query. This leads to faster queries and lower costs.
• Compression: Supported algorithms include GZIP, Snappy, LZO, and ZSTD. Snappy is commonly used with Parquet. Compression reduces the data volume scanned.
• Conversion tip: Use AWS Glue ETL jobs or Athena CTAS (CREATE TABLE AS SELECT) queries to convert raw CSV/JSON data into Parquet or ORC format.
CTAS (CREATE TABLE AS SELECT)
CTAS allows you to create a new table from the results of a SELECT query. This is powerful for:
• Converting data formats (e.g., CSV to Parquet)
• Creating partitioned tables from non-partitioned data
• Materializing query results for downstream consumption
• Data transformation and ETL within Athena
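A hedged sketch combining the first two uses: a CTAS that rewrites a raw CSV table (hypothetical `weblogs.access_log`) as partitioned, Snappy-compressed Parquet. The `write_compression` property assumes engine version 3; table and column names are illustrative:

```sql
-- Hypothetical example: convert CSV to partitioned Parquet via CTAS.
CREATE TABLE weblogs.access_log_parquet
WITH (
  format            = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://example-bucket/curated/access-logs/',
  partitioned_by    = ARRAY['event_date']
) AS
SELECT
  client_ip,
  status_code,
  bytes_sent,
  date(from_iso8601_timestamp(request_time)) AS event_date  -- partition columns must come last
FROM weblogs.access_log;
```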
Views
Athena supports creating views, which are saved queries that appear as tables. Views are useful for abstracting complex queries, enforcing access patterns, and simplifying downstream consumption.
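For instance, a view can expose only the rows and columns a downstream team needs (names below are hypothetical):

```sql
-- Hypothetical example: a view that hides filtering logic from consumers.
CREATE OR REPLACE VIEW weblogs.server_errors AS
SELECT client_ip, status_code, request_time
FROM weblogs.access_log
WHERE status_code >= 500;
```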
Workgroups
Workgroups allow you to separate users, teams, or applications, and to set:
• Data usage controls: Per-query and per-workgroup data scan limits
• Query result locations: Different S3 output paths per workgroup
• Cost tracking: Track and control query costs per workgroup using CloudWatch metrics and AWS Cost Explorer tags
• IAM policies: Restrict access to specific workgroups
Athena Query Federation
Using the Athena Federated Query feature, you can run SQL queries across data in:
• Amazon DynamoDB
• Amazon RDS and Aurora
• Amazon Redshift
• Amazon CloudWatch Logs
• On-premises databases
• Custom data sources
This uses Lambda-based data source connectors. Each connector is a Lambda function that translates between Athena and the target data source. AWS provides pre-built connectors, and you can build custom ones.
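Once a connector is deployed and registered as a data source, its catalog name can appear directly in SQL. A sketch, assuming a hypothetical registered DynamoDB data source named `ddb` (catalog, schema, and table names are illustrative):

```sql
-- Hypothetical example: join S3-backed data with a DynamoDB table through a
-- federated data source registered as "ddb".
SELECT o.order_id, o.total, c.customer_name
FROM "ddb"."default"."customers" AS c
JOIN sales.orders AS o
  ON o.customer_id = c.customer_id;
```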
Athena Engine Versions
• Athena engine version 2: Based on Presto 0.217
• Athena engine version 3: Based on Trino, with improved performance, new SQL functions, and better query planning
• The exam may reference engine version differences. Version 3 is the current default.
Integration with Other AWS Services
• AWS Glue Data Catalog: Serves as the central metastore
• Amazon S3: Primary data source and result storage
• AWS Lake Formation: Provides fine-grained access control (column-level and row-level security) for Athena queries
• Amazon QuickSight: Athena is a common data source for QuickSight dashboards
• AWS Step Functions: Orchestrate Athena queries in pipelines
• Amazon CloudWatch: Monitor query metrics, set alarms on data scanned
• AWS IAM: Controls access to Athena actions, Glue Data Catalog resources, and S3 data
Security in Athena
• Encryption at rest: Athena supports querying encrypted S3 data (SSE-S3, SSE-KMS, CSE-KMS). Query results can also be encrypted.
• Encryption in transit: TLS is used for data in transit
• Access control: IAM policies control who can run queries, access specific databases/tables, and read S3 data. Lake Formation adds fine-grained permissions.
• VPC endpoints: Athena supports VPC interface endpoints (PrivateLink) for private network access.
Performance Optimization Techniques
These are heavily tested on the exam:
1. Use columnar formats (Parquet/ORC): Reduces data scanned dramatically
2. Partition your data: Ensures Athena only scans relevant partitions
3. Use compression: Reduces file size and data scanned
4. Avoid small files: Many small files increase overhead. Aim for files between 128 MB and 512 MB. Use Glue ETL or S3DistCp to merge small files.
5. Use partition projection: Eliminates partition metadata lookups for predictable partition schemes
6. Bucket your data: Bucketing within partitions helps with joins and aggregations on specific columns
7. Limit columns in SELECT: Avoid SELECT *; query only the columns you need
8. Use LIMIT: For exploratory queries, use LIMIT to reduce output
9. Use CTAS for repeated queries: Materialize results to avoid re-scanning raw data
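Several of these tips can appear in a single query. The sketch below (against the hypothetical `weblogs.events` table partitioned by `dt`) names only the needed columns, filters on the partition key so Athena prunes partitions, and bounds exploratory output:

```sql
-- Hypothetical example: partition pruning + column pruning + LIMIT.
SELECT client_ip, count(*) AS requests
FROM weblogs.events
WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'  -- scans only January partitions
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 20;                                       -- bound exploratory output
```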
Common Use Cases
• Ad-hoc querying and exploration of data lake contents
• Log analysis (CloudTrail, VPC Flow Logs, ALB access logs, S3 access logs)
• ETL validation and data quality checks
• Creating derived datasets using CTAS
• Federated queries across multiple data stores
• Building dashboards via QuickSight integration
• Cost-effective alternative to running a full Redshift cluster for light analytics workloads
Athena vs. Other AWS Analytics Services
• Athena vs. Redshift: Athena is serverless and pay-per-query; Redshift is a provisioned (or serverless) data warehouse for heavy, complex analytics. Use Athena for ad-hoc queries on S3 data; use Redshift for repeated, complex workloads on structured data.
• Athena vs. Redshift Spectrum: Redshift Spectrum also queries S3 but requires a Redshift cluster. Athena is fully independent. Spectrum is best when you already have Redshift and want to extend queries to S3.
• Athena vs. EMR: EMR provides more flexibility and control (Spark, Hive, Presto). Athena is simpler for SQL-only workloads. EMR is better for large-scale, complex data processing pipelines.
Pricing Model
• Charged per TB of data scanned (~$5/TB in most regions)
• DDL statements (CREATE, ALTER, DROP) and failed queries are free
• Cancelled queries are charged for the data scanned before cancellation
• Using columnar formats, compression, and partitioning directly reduces costs
Exam Tips: Answering Questions on Querying Data with Amazon Athena
1. When you see "serverless" and "query S3 data with SQL": Think Athena immediately. This is the classic use case.
2. When the question mentions reducing query costs or improving performance: The answer almost always involves one or more of: converting to Parquet/ORC, partitioning data, compressing data, or avoiding SELECT *.
3. When the question mentions querying data from multiple sources (DynamoDB, RDS, etc.) with a single SQL query: Think Athena Federated Query with Lambda connectors.
4. When you see scenarios involving log analysis (CloudTrail logs, VPC Flow Logs, ALB logs): Athena is the recommended service. AWS provides pre-built table definitions for common log formats.
5. When the question involves managing costs or isolating query workloads for different teams: Think Athena Workgroups with data usage controls and per-query scan limits.
6. When the question mentions converting data formats within a data lake: Think CTAS (CREATE TABLE AS SELECT) in Athena or AWS Glue ETL jobs.
7. When the question mentions too many small files causing slow queries: The solution is to consolidate small files into larger ones (128-512 MB) using Glue ETL, CTAS, or S3DistCp.
8. When the question involves fine-grained access control (column-level or row-level): Think AWS Lake Formation integrated with Athena.
9. When the question mentions partition management overhead (e.g., running MSCK REPAIR TABLE frequently): The answer is likely Partition Projection, which eliminates the need to update partition metadata.
10. When choosing between Athena and Redshift: If the scenario emphasizes serverless, pay-per-query, ad-hoc analysis, or querying data in S3 without loading it, choose Athena. If the scenario requires complex, repeated analytics with high concurrency on structured data, choose Redshift.
11. Remember the Glue Data Catalog connection: Athena does NOT have its own metastore. It relies on the AWS Glue Data Catalog. Any question about defining schemas for Athena involves the Glue Data Catalog or Glue Crawlers.
12. Understand encryption: Athena can query SSE-S3, SSE-KMS, and CSE-KMS encrypted data. Query results can be separately encrypted. Know that SSE-C (server-side encryption with customer-provided keys) is NOT supported.
13. Watch for distractors: Questions may try to confuse Athena with EMR (Presto on EMR) or Redshift Spectrum. Always look for keywords like "serverless," "no infrastructure management," and "pay per query" to confirm Athena is the right answer.
14. Know that Athena query results go to S3: Results are stored as CSV in a specified S3 output location. This is a commonly tested detail.
15. Understand Athena's limitations: Athena is not ideal for transaction processing, real-time analytics, or extremely complex ETL pipelines. It is best for interactive, SQL-based analytics on data at rest in S3.
By mastering these concepts and tips, you will be well-prepared to answer any Athena-related question on the AWS Data Engineer Associate exam with confidence.