Azure Data Factory: Complete Guide for DP-900
Azure Data Factory (ADF) is Microsoft's cloud-based data integration service that allows you to create, schedule, and orchestrate data workflows (called pipelines) at scale. It is a critical component of the analytics workload on Azure and a key topic in the DP-900 exam.
Why Is Azure Data Factory Important?
In modern data environments, organizations deal with data from many different sources — on-premises databases, cloud storage, SaaS applications, IoT devices, and more. Azure Data Factory is important because it:
• Enables Extract, Transform, and Load (ETL) and Extract, Load, and Transform (ELT) processes at cloud scale
• Connects to 90+ data sources, both on-premises and in the cloud
• Provides a code-free or low-code visual interface for building data pipelines
• Allows organizations to automate data movement and transformation, reducing manual effort and errors
• Supports hybrid data integration, meaning it can move data between on-premises and cloud environments seamlessly
• Is a serverless service, meaning you pay only for what you use and don't need to manage infrastructure
What Is Azure Data Factory?
Azure Data Factory is a cloud-based data integration service that orchestrates the movement and transformation of data. Think of it as a conductor in an orchestra — it doesn't necessarily perform the data transformations itself but coordinates and manages the entire data workflow.
Key concepts in Azure Data Factory include:
1. Pipelines: A pipeline is a logical grouping of activities that together perform a task. For example, a pipeline might contain activities that ingest data from Azure Blob Storage, transform it using a data flow, and then load it into Azure Synapse Analytics.
2. Activities: Activities represent a processing step in a pipeline. There are three types of activities:
• Data movement activities — such as the Copy Activity, which moves data from a source to a destination
• Data transformation activities — such as Data Flow, HDInsight Hive, or Databricks Notebook activities
• Control activities — such as If Condition, ForEach, and Wait, which control the flow of execution
3. Datasets: A dataset is a named view of data that points to the data you want to use in your activities. It defines the structure of the data within a data store.
4. Linked Services: Linked services are like connection strings. They define the connection information needed for ADF to connect to external resources (e.g., an Azure SQL Database connection string or an Azure Blob Storage account key).
5. Triggers: Triggers determine when a pipeline execution should begin. They can be schedule-based (run at a specific time), tumbling window-based (run at periodic intervals), or event-based (triggered by an event like a file arriving in blob storage).
6. Data Flows: Data flows are visually designed data transformation logic within ADF. Mapping data flows allow you to build transformation logic without writing code. These run on Apache Spark clusters managed by ADF behind the scenes.
7. Integration Runtime (IR): The Integration Runtime is the compute infrastructure used by ADF. There are three types:
• Azure IR — for cloud-to-cloud data movement and transformation
• Self-hosted IR — for hybrid scenarios involving on-premises data sources
• Azure-SSIS IR — for running existing SQL Server Integration Services (SSIS) packages in the cloud
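To make the hierarchy concrete, here is a minimal sketch of a pipeline definition as it might appear in the JSON authoring view of ADF Studio. It groups a data movement activity (Copy) with a control activity (Wait); all names, including the dataset references, are hypothetical placeholders:

```json
{
  "name": "IngestSalesPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySalesData",
        "type": "Copy",
        "inputs": [ { "referenceName": "SalesCsvDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SalesSqlDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      },
      {
        "name": "WaitBeforeNextStep",
        "type": "Wait",
        "dependsOn": [ { "activity": "CopySalesData", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": { "waitTimeInSeconds": 30 }
      }
    ]
  }
}
```

Note that the pipeline references datasets by name rather than embedding connection details; those live in linked services.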
How Does Azure Data Factory Work?
The typical workflow in Azure Data Factory follows these steps:
Step 1 — Connect: You create linked services to connect to your data sources and destinations (e.g., Azure Blob Storage, Azure SQL Database, on-premises SQL Server).
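As a rough sketch, a linked service for Azure Blob Storage looks like the following JSON (the account name and key are placeholders):

```json
{
  "name": "AzureBlobStorageLS",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```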
Step 2 — Define: You define datasets that represent the data structures in your connected data stores.
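A dataset then points at specific data through a linked service — for example, a CSV file in a blob container. This is a sketch with hypothetical names, assuming the linked service from Step 1 is called AzureBlobStorageLS:

```json
{
  "name": "SalesCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```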
Step 3 — Build: You build pipelines that contain one or more activities. These activities use the datasets and linked services to move and transform data.
Step 4 — Transform: Within the pipeline, you can use mapping data flows for code-free transformations, or you can call external compute services like Azure Databricks, Azure HDInsight, or Azure Synapse Analytics for more complex transformations.
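Calling external compute is itself just another activity in the pipeline. A sketch of a Databricks Notebook activity, assuming a Databricks linked service named AzureDatabricksLS and the notebook path shown already exist:

```json
{
  "name": "TransformWithDatabricks",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "AzureDatabricksLS",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "notebookPath": "/Shared/transform_sales"
  }
}
```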
Step 5 — Schedule and Monitor: You create triggers to schedule pipeline runs. ADF provides a monitoring interface where you can track pipeline runs, view activity logs, and set up alerts for failures.
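For example, a schedule trigger that runs a pipeline once a day might look like this sketch (the pipeline name is a placeholder):

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "IngestSalesPipeline", "type": "PipelineReference" } }
    ]
  }
}
```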
Key Features to Remember:
• ADF is primarily an orchestration and data integration service — it coordinates data workflows
• ADF supports both ETL and ELT patterns
• ADF is serverless — no infrastructure to manage
• ADF can handle hybrid scenarios using the Self-hosted Integration Runtime
• The Copy Activity is the most commonly used activity for moving data between stores
• Mapping Data Flows provide a visual, code-free way to transform data at scale using Spark
• ADF integrates tightly with other Azure services like Azure Synapse Analytics, Azure Databricks, Azure Data Lake Storage, and Azure SQL Database
ADF vs. Azure Synapse Analytics Pipelines:
Azure Synapse Analytics includes pipeline functionality that is very similar to Azure Data Factory. In fact, Synapse pipelines are built on the same technology as ADF. The key difference is that Synapse pipelines are integrated within the Azure Synapse workspace, while ADF is a standalone service. For the DP-900 exam, understand that both can be used for data orchestration and integration.
Exam Tips: Answering Questions on Azure Data Factory
Tip 1: Remember that ADF is primarily about data integration and orchestration. If a question asks about moving data from one place to another or orchestrating a data workflow, ADF is likely the correct answer.
Tip 2: Know the difference between ADF and other services. ADF is not a data storage solution (that's Azure Data Lake or Blob Storage). ADF is not a data analytics engine (that's Azure Synapse, Databricks, or HDInsight). ADF orchestrates and coordinates these services.
Tip 3: If a question mentions moving data from on-premises to the cloud, think about the Self-hosted Integration Runtime. This is the component that enables ADF to access data behind a firewall or in a private network.
Tip 4: Understand the Copy Activity. It is the primary activity used for data movement in ADF. If a question asks how to copy data from one data store to another, the Copy Activity in ADF is the answer.
Tip 5: Questions may reference ETL vs. ELT. Remember that ADF supports both. In an ETL scenario, data is transformed before loading. In an ELT scenario, data is loaded first and then transformed in the destination (e.g., using Azure Synapse). ADF can orchestrate both patterns.
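The ELT pattern can be sketched as a pipeline in which a Copy activity loads raw data into a Synapse staging table, and a dependent Stored Procedure activity then transforms it inside the warehouse. All names here are hypothetical placeholders:

```json
{
  "name": "EltLoadThenTransform",
  "properties": {
    "activities": [
      {
        "name": "LoadRawToStaging",
        "type": "Copy",
        "inputs": [ { "referenceName": "RawBlobDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingTableDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "SqlDWSink" }
        }
      },
      {
        "name": "TransformInWarehouse",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "LoadRawToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "SynapseSqlLS", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "dbo.TransformStaging" }
      }
    ]
  }
}
```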
Tip 6: If a question mentions code-free data transformation within ADF, the answer is Mapping Data Flows. These use a visual designer and run on managed Spark clusters.
Tip 7: Know that ADF uses triggers to schedule pipeline execution. Be familiar with the types: schedule triggers, tumbling window triggers, and event-based triggers.
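As a sketch, an event-based trigger that fires when a new blob lands in a storage container might look like the following (the subscription, resource group, and pipeline names are placeholders):

```json
{
  "name": "OnFileArrival",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/blobs/",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "IngestSalesPipeline", "type": "PipelineReference" } }
    ]
  }
}
```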
Tip 8: Watch for questions about SSIS migration. If a question mentions lifting and shifting existing SSIS packages to the cloud, the answer is the Azure-SSIS Integration Runtime in ADF.
Tip 9: Remember the key terminology: Pipelines contain Activities. Activities use Datasets and Linked Services. This hierarchy is frequently tested.
Tip 10: ADF is a pay-per-use, serverless service. If a question asks about a cost-effective, scalable, and managed data integration solution, ADF fits the description.
Quick Summary for Exam Day:
Azure Data Factory = Cloud-based ETL/ELT orchestration service → Pipelines → Activities → Datasets → Linked Services → Triggers → Integration Runtimes. It moves and transforms data across 90+ sources, supports hybrid connectivity via Self-hosted IR, and provides code-free transformations through Mapping Data Flows.