AWS Glue Flashcards

Question 1

AWS Glue Data Catalog

Accepted Answer

The AWS Glue Data Catalog is a fully managed metadata repository that is compatible with the Apache Hive metastore and can be used from Apache Spark. It lets you catalog, discover, and manage datasets held in data stores such as Amazon S3, Amazon Redshift, and Amazon RDS. The Data Catalog stores metadata about those datasets (table definitions, schemas, locations, and partitions) rather than the data itself, and you can use this metadata in ETL jobs or data analytics. Because the repository is centralized and searchable, data scientists and engineers can find data faster and work from the same consistent view of it.
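
To make "metadata about the datasets" concrete, the sketch below builds the kind of table definition the Data Catalog holds. The dict layout follows the TableInput structure of the Glue CreateTable API; the table name, bucket, and columns are hypothetical, and nothing here calls AWS.

```python
def build_table_input(name, location, columns, partition_keys):
    """Return a TableInput-style dict describing a Parquet table stored on S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        # Partition columns are listed separately from regular data columns.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def all_column_names(table_input):
    """List data columns followed by partition columns, as a query engine would see them."""
    data_cols = [c["Name"] for c in table_input["StorageDescriptor"]["Columns"]]
    part_cols = [c["Name"] for c in table_input["PartitionKeys"]]
    return data_cols + part_cols


# Hypothetical example table.
sales_table = build_table_input(
    name="sales",
    location="s3://example-bucket/sales/",
    columns=[("order_id", "bigint"), ("amount", "double")],
    partition_keys=[("sale_date", "string")],
)
print(all_column_names(sales_table))
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("glue").create_table(DatabaseName="sales_db", TableInput=sales_table)`, where `sales_db` is again a placeholder.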

Question 2

AWS Glue Crawlers

Accepted Answer

An AWS Glue crawler connects to a source or target data store, works through a prioritized list of classifiers to determine the data's format, schema, and related properties, and then writes the resulting metadata to the AWS Glue Data Catalog. A single crawler can crawl multiple data stores of different types in one run, making it an efficient way to discover and catalog your data. Classifiers, either built in or custom, let the crawler recognize different data formats automatically. Crawlers can be run on demand, on a schedule, or started by a trigger as part of a workflow, giving you flexibility in how metadata is kept up to date.
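
As a rough illustration of how a crawler is defined, the sketch below assembles the parameters that the Glue CreateCrawler API accepts for a scheduled crawler over two S3 paths. The crawler name, role ARN, database, and bucket paths are hypothetical, and the code only builds the request; it does not contact AWS.

```python
def build_crawler_request(name, role_arn, database, s3_paths, schedule=None):
    """Assemble keyword arguments for boto3's glue.create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": p} for p in s3_paths]},
    }
    if schedule is not None:
        # Glue schedules use cron syntax wrapped in cron(...).
        request["Schedule"] = schedule
    return request


# Hypothetical crawler that scans two prefixes every day at 02:00 UTC.
crawler = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_paths=["s3://example-bucket/sales/", "s3://example-bucket/returns/"],
    schedule="cron(0 2 * * ? *)",
)
print(crawler["Targets"])
```

With boto3 available, the request could be submitted as `boto3.client("glue").create_crawler(**crawler)`. Omitting `schedule` leaves the crawler to be started on demand.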

Question 3

AWS Glue ETL Jobs

Accepted Answer

An AWS Glue ETL job is the code that runs in a managed Apache Spark environment to perform the necessary data transformations. You can write the ETL code in Python or Scala, and AWS Glue manages the underlying Spark infrastructure for you. Using the metadata stored in the Data Catalog, a single Glue ETL job can extract data from a source, transform it, and load it into a target. You can create, run, test, and monitor these jobs from the AWS Glue console, giving you a scalable, serverless approach to your ETL processes.
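
A job definition pairs a script with the resources it runs on. The sketch below builds the parameters accepted by the Glue CreateJob API for a Python Spark job; the job name, role ARN, and script location are hypothetical, and the worker settings are only one plausible choice.

```python
def build_job_request(name, role_arn, script_location, workers=2):
    """Assemble keyword arguments for boto3's glue.create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # S3 path of the ETL script
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
        "DefaultArguments": {"--job-language": "python"},
    }


# Hypothetical job definition.
job = build_job_request(
    name="sales-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/sales_etl.py",
)
print(job["Command"])
```

With boto3 available, `boto3.client("glue").create_job(**job)` would register the job and `start_job_run(JobName="sales-etl")` would launch it.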

Question 4

AWS Glue Triggers

Accepted Answer

AWS Glue triggers coordinate the execution of ETL jobs and crawlers. Within a complex ETL workflow, certain jobs may need to complete before others start, or several jobs may need to start in response to a single event. AWS Glue supports scheduled, on-demand, and conditional (event-based) triggers, so you can start jobs or crawlers at set times, manually, or when earlier jobs or crawlers succeed or fail. Triggers keep an ETL process efficient by running each task only when its prerequisites are met, which avoids unnecessary processing.
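
The sketch below shows what a conditional trigger might look like: it builds the parameters of the Glue CreateTrigger API so that one job starts only after two upstream jobs succeed. The job and trigger names are hypothetical, and `would_fire` is a local illustration of the predicate logic, not part of any AWS library.

```python
def build_conditional_trigger(name, upstream_jobs, downstream_job):
    """Assemble keyword arguments for boto3's glue.create_trigger call."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": "AND",  # every condition must hold; "ANY" is the alternative
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in upstream_jobs
            ],
        },
        "Actions": [{"JobName": downstream_job}],
    }


def would_fire(trigger, job_states):
    """Local illustration only: evaluate the predicate against known job states."""
    predicate = trigger["Predicate"]
    results = [
        job_states.get(cond["JobName"]) == cond["State"]
        for cond in predicate["Conditions"]
    ]
    return any(results) if predicate.get("Logical") == "ANY" else all(results)


# Hypothetical workflow: build the report only after both loads succeed.
trigger = build_conditional_trigger(
    "after-loads", ["load-orders", "load-customers"], "build-report"
)
print(would_fire(trigger, {"load-orders": "SUCCEEDED", "load-customers": "FAILED"}))
```

A scheduled trigger would instead use `"Type": "SCHEDULED"` with a `Schedule` cron expression in place of the predicate.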

Question 5

AWS Glue Development Endpoints

Accepted Answer

AWS Glue development endpoints provide an interactive environment in which to develop, debug, and test your ETL code before deploying it as a Glue ETL job. Each endpoint provides a Spark environment that you can connect to from a notebook, such as Jupyter or Apache Zeppelin, to write and run code interactively. By using development endpoints, you can check that your ETL code is functional and efficient before deploying it as a serverless job, resulting in a faster and more reliable ETL process.

Question 6

AWS Glue DataBrew

Accepted Answer

AWS Glue DataBrew is a visual data preparation tool that allows you to clean and normalize your data for analysis and machine learning. With DataBrew, you can explore and experiment with your data by applying transformations such as filtering and aggregation, all without writing any code. This enables you to identify and prepare data for various analytics and ML use cases, helping ensure that the data is accurate, up to date, and properly formatted. DataBrew integrates with other AWS services, such as S3, Redshift, and RDS.

Question 7

AWS Glue Data Sink

Accepted Answer

A data sink in AWS Glue is the destination of the processed data in an AWS Glue ETL job. The sink configuration defines where and how the output data is written. The destination can be one of several AWS services, such as Amazon S3, Amazon Redshift, or Amazon RDS, depending on where you want to store the transformed data. In addition to specifying the destination, you can configure options for data format, partitioning, and compression to optimize storage and query performance.
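
As a sketch of a sink configuration, the helper below assembles the arguments that a Glue job script would pass to `glueContext.write_dynamic_frame.from_options` when writing partitioned Parquet to S3. The bucket path and partition column are hypothetical; the code only builds the options and needs no Glue runtime.

```python
def build_sink_options(path, partition_keys, fmt="parquet"):
    """Assemble sink arguments for write_dynamic_frame.from_options."""
    return {
        "connection_type": "s3",
        "connection_options": {
            "path": path,
            # Columns whose values become key=value folders in the output.
            "partitionKeys": list(partition_keys),
        },
        "format": fmt,
    }


# Hypothetical sink: Parquet files partitioned by sale date.
sink = build_sink_options("s3://example-bucket/curated/sales/", ["sale_date"])
print(sink)
```

Inside a Glue job, where `glueContext` and a DynamicFrame `dyf` already exist, the write would then read `glueContext.write_dynamic_frame.from_options(frame=dyf, **sink)`. Compression settings depend on the chosen format and are left out of this sketch.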

Question 8

AWS Glue Partitions

Accepted Answer

Data partitioning in AWS Glue enables you to divide your data into smaller, more manageable pieces, which can help improve query performance, reduce costs, and optimize storage. Partitions are created based on one or more columns in a table, allowing for efficient filtering of data when querying. For example, partitioning a sales dataset based on date allows you to query data for a specific day without scanning the entire dataset. AWS Glue can automatically discover and maintain partitions in your datasets as part of the crawler process, simplifying the management of your data catalog.
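
The sales-by-date example can be sketched in plain Python. The code below assumes Hive-style `column=value` folder names, which is the layout crawlers recognize as partitions, and shows how a filter on the partition column narrows the set of files to read. The object keys are made up for illustration.

```python
def parse_partition(key):
    """Extract partition columns from a Hive-style key such as
    'sales/sale_date=2024-01-01/part-0.parquet'."""
    values = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every requested filter."""
    return [
        key for key in keys
        if all(parse_partition(key).get(col) == val for col, val in wanted.items())
    ]


# Hypothetical object keys for a sales dataset partitioned by date.
keys = [
    "sales/sale_date=2024-01-01/part-0.parquet",
    "sales/sale_date=2024-01-01/part-1.parquet",
    "sales/sale_date=2024-01-02/part-0.parquet",
]
print(prune(keys, sale_date="2024-01-02"))
```

Only one of the three files survives the filter, so a query for that day never touches the others. In a Glue job the comparable effect comes from the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`.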

Question 9

AWS Glue Data Transformation

Accepted Answer

Data transformation in AWS Glue involves using AWS Glue ETL jobs to process, convert, and reshape your source data into the desired format and structure. This can involve tasks like extracting data from various sources, applying data cleansing, mapping and enrichment operations, and loading the transformed data into a target data store. AWS Glue ETL jobs are authored using Python or Scala programming languages and leverage built-in Glue libraries, known as Glue PySpark and Glue Scala libraries, to perform complex data transformations with ease. This process helps ensure that the data is accurate and consistent across all your analytics and machine learning workloads.
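
The mapping step can be illustrated without a Glue runtime. The sketch below is a plain-Python emulation, not the Glue library: it borrows the four-part mapping tuples (source column, source type, target column, target type) used by Glue's ApplyMapping transform and applies them to a list of dicts. The column names and the small set of supported types are assumptions made for the example.

```python
# Casts supported by this toy emulation; the real transform handles many more types.
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast columns; columns without a mapping are dropped."""
    output = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            value = record.get(source)
            if value is not None:
                row[target] = CASTS[target_type](value)
        output.append(row)
    return output


# Hypothetical raw rows in which every value arrived as text.
raw = [{"id": "1", "amt": "9.50", "junk": "x"}, {"id": "2", "amt": "3"}]
mappings = [
    ("id", "string", "order_id", "int"),
    ("amt", "string", "amount", "double"),
]
print(apply_mapping(raw, mappings))
```

In an actual job the same mappings list would be passed as `ApplyMapping.apply(frame=dyf, mappings=mappings)`, where `dyf` is a DynamicFrame.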

Question 10

AWS Glue Security and Access Control

Accepted Answer

AWS Glue provides various security and access control features to help protect your data and ensure compliance. These features include encryption at rest and in transit, fine-grained access control using AWS Identity and Access Management (IAM), and audit logging through AWS CloudTrail. With encryption, you can protect data at different levels, whether it is stored in S3 or when it is processed by AWS Glue ETL jobs. IAM allows you to define policies that specify which actions a user or group can perform on specific AWS Glue resources. AWS Glue also integrates with AWS Lake Formation to provide granular table and column-level access control for your data catalog. CloudTrail helps you monitor and audit AWS Glue API calls for visibility and compliance purposes.
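
As an example of IAM-based access control, the sketch below builds a policy document that grants read-only access to one Data Catalog table. The Glue action names and ARN formats are standard; the region, account ID, database, and table are placeholders.

```python
import json


def read_only_table_policy(region, account_id, database, table):
    """Return an IAM policy document allowing metadata reads on a single table."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                # Table-level calls also need the enclosing catalog and database.
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/{database}",
                    f"{base}:table/{database}/{table}",
                ],
            }
        ],
    }


# Placeholder identifiers for illustration.
policy = read_only_table_policy("us-east-1", "123456789012", "sales_db", "sales")
print(json.dumps(policy, indent=2))
```

A policy like this governs access to the metadata only; reading the underlying files still requires separate permissions on the data store, such as S3.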
