Microsoft Purview Data Lineage
Microsoft Purview Data Lineage is a feature of Microsoft Purview (formerly Azure Purview) that provides a visual representation of how data moves, transforms, and flows across an organization's data estate. It is essential for Azure Data Engineers designing and implementing data storage solutions.
Data lineage in Microsoft Purview automatically captures and maps the end-to-end journey of data from its source systems through various transformations to its final destination. This includes tracking data as it moves through Azure Data Factory pipelines, Azure Synapse Analytics, SQL databases, Power BI reports, and other supported services.
Key aspects of Microsoft Purview Data Lineage include:
1. **Automated Lineage Capture**: Purview automatically extracts lineage metadata from supported systems without requiring manual documentation. When data pipelines run in Azure Data Factory or Synapse, lineage is captured at the dataset level and, for supported activities, at the column level.
2. **Column-Level Lineage**: Engineers can trace how individual columns are mapped, transformed, or derived across different datasets, enabling precise impact analysis.
3. **Visual Lineage Graph**: Purview provides an interactive graphical interface showing upstream sources, transformation processes, and downstream consumers, making it easier to understand complex data flows.
4. **Impact Analysis**: When changes are planned for a data source or schema, lineage helps engineers assess which downstream systems, reports, and processes will be affected.
5. **Regulatory Compliance**: Data lineage supports governance requirements by providing audit trails showing where sensitive data originates and how it is consumed, which is vital for GDPR, HIPAA, and other regulations.
6. **Cross-System Visibility**: Lineage spans multiple Azure services and can extend to on-premises or multi-cloud environments, providing a unified view.
For Azure Data Engineers, understanding data lineage is crucial when designing storage solutions because it ensures data traceability, supports debugging of data quality issues, facilitates governance, and enables informed decision-making when modifying data architectures. It bridges the gap between data producers and consumers across the organization.
Microsoft Purview Data Lineage: A Comprehensive Guide for the Azure Data Engineer DP-203 Exam
Why Is Microsoft Purview Data Lineage Important?
In modern data ecosystems, data flows through numerous systems, transformations, and storage layers before it reaches its final destination. Understanding where data originates, how it is transformed, and where it ends up is critical for several reasons:
• Regulatory Compliance: Organizations must demonstrate how data moves through their systems to meet regulatory requirements such as GDPR, HIPAA, and SOX. Data lineage provides an auditable trail of data movement.
• Data Quality and Trust: When analysts and data scientists consume data, they need confidence that the data is accurate. Lineage helps them trace issues back to the source, improving data quality and trust.
• Impact Analysis: Before modifying a data pipeline or schema, engineers need to understand downstream dependencies. Lineage provides this visibility, preventing unintended breakages.
• Root Cause Analysis: When data discrepancies arise, lineage enables rapid identification of where in the pipeline the issue occurred.
• Data Governance: Lineage is a foundational component of any governance strategy. It connects data assets, transformations, and consumers in a unified view.
What Is Microsoft Purview Data Lineage?
Microsoft Purview (formerly Azure Purview) is a unified data governance service that helps organizations manage and govern their on-premises, multi-cloud, and SaaS data. Data lineage is one of the core features of Microsoft Purview, providing a visual representation of how data flows from source to destination across your entire data estate.
Key characteristics of Purview Data Lineage:
• End-to-End Visibility: It captures and displays the complete journey of data, from ingestion through transformation to consumption.
• Automated Lineage Extraction: Purview automatically captures lineage from supported systems and services without requiring manual documentation.
• Visual Representation: Lineage is displayed as a directed graph in the Purview portal, showing sources, transformations (processes), and destinations as connected nodes.
• Column-Level Lineage: For supported connectors and activities, Purview can track lineage at the column level, showing exactly which source columns map to which destination columns.
• Integration with Azure Services: Purview natively integrates with Azure Data Factory, Azure Synapse Analytics, Azure SQL Database, Azure Data Lake Storage, Power BI, and many other services.
How Does Microsoft Purview Data Lineage Work?
1. Data Source Registration and Scanning
The first step is to register your data sources in Purview. Supported sources include Azure Data Lake Storage, Azure SQL Database, Azure Synapse Analytics, SQL Server, Amazon S3, and many more. Once registered, Purview scans these sources to discover and classify data assets, building a comprehensive catalog.
2. Automated Lineage Capture
Purview captures lineage automatically from integrated services. The primary mechanisms include:
• Azure Data Factory (ADF) and Azure Synapse Pipelines: When ADF or Synapse pipelines execute Copy activities, Data Flow activities (mapping data flows), or Execute SSIS Package activities, Purview automatically captures the lineage. This works through a native connection between ADF/Synapse and Purview, so you must connect your data factory or Synapse workspace to your Purview account to enable it.
• Power BI: Purview captures lineage from Power BI datasets, reports, and dashboards, showing how data flows from storage into Power BI models and visualizations.
• Azure SQL and Synapse SQL: Purview can capture lineage from stored procedures and SQL transformations executed within these environments.
• Apache Atlas and OpenLineage: Purview supports Apache Atlas hooks and the OpenLineage standard, enabling lineage capture from Spark jobs and other open-source frameworks.
3. Lineage Visualization
Once lineage is captured, it is displayed in the Purview Data Catalog. When you navigate to any asset in the catalog, you can click on the Lineage tab to see:
• Upstream sources: Where the data originates
• Processes/Transformations: The activities or operations that transform the data (e.g., ADF Copy Activity, Data Flow, stored procedures)
• Downstream destinations: Where the data is consumed or stored
The visualization uses a graph format with nodes (representing data assets and processes) connected by directed edges (representing data flow direction).
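To make the graph model concrete, here is a minimal in-memory sketch of upstream and downstream traversal over a lineage graph. It does not call Purview; the node names and the `EDGES` list are hypothetical and stand in for assets and process nodes as they might appear on a Lineage tab.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (from_node, to_node); names are illustrative only.
EDGES = [
    ("sales_raw.csv", "ADF Copy Activity"),
    ("ADF Copy Activity", "staging.sales"),
    ("staging.sales", "Mapping Data Flow"),
    ("Mapping Data Flow", "dw.fact_sales"),
    ("dw.fact_sales", "Power BI Sales Report"),
]


def build_graph(edges):
    """Index directed edges in both directions for upstream/downstream lookups."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(start, adjacency):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_graph(EDGES)

# Impact analysis: everything that depends on staging.sales.
impacted = reachable("staging.sales", downstream)

# Root cause analysis: everything staging.sales was derived from.
origins = reachable("staging.sales", upstream)
```

Walking the downstream index answers the impact-analysis question ("what breaks if I change this table?"), while walking the upstream index answers the root-cause question ("where did this data come from?").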
4. Column-Level Lineage
For certain activities (such as ADF Mapping Data Flows), Purview provides column-level lineage. This means you can drill into a specific table or dataset and see exactly which columns were derived from which source columns, including any transformations applied. This is extremely valuable for debugging and compliance.
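As an illustration of what column-level lineage lets you answer, the sketch below traces a destination column back to its root source columns. The `COLUMN_LINEAGE` mapping and all dataset and column names are hypothetical; this is a conceptual model, not Purview's internal representation, and it assumes the mappings contain no cycles.

```python
# Hypothetical mappings: (dataset, column) -> list of (dataset, column) it is derived from.
COLUMN_LINEAGE = {
    ("dw.fact_sales", "revenue"): [
        ("staging.sales", "unit_price"),
        ("staging.sales", "quantity"),
    ],
    ("dw.fact_sales", "customer_id"): [("staging.sales", "cust_id")],
    ("staging.sales", "unit_price"): [("sales_raw.csv", "price")],
    ("staging.sales", "quantity"): [("sales_raw.csv", "qty")],
    ("staging.sales", "cust_id"): [("sales_raw.csv", "customer")],
}


def trace_to_origin(column, lineage):
    """Return the set of root source columns that feed `column`.

    A column with no recorded sources is treated as a root.
    """
    sources = lineage.get(column)
    if not sources:
        return {column}
    roots = set()
    for src in sources:
        roots |= trace_to_origin(src, lineage)
    return roots


revenue_roots = trace_to_origin(("dw.fact_sales", "revenue"), COLUMN_LINEAGE)
```

In this example, `revenue` resolves to the `price` and `qty` columns of the raw file, which is the kind of answer a compliance or debugging question about a single column usually needs.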
5. Lineage from Custom Sources (Push Lineage via API)
For systems not natively supported, Purview provides REST APIs (based on Apache Atlas) to push custom lineage information. This allows organizations to capture lineage from on-premises ETL tools, custom applications, or third-party systems.
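The following sketch shows the general shape of a custom lineage payload in the Apache Atlas entity model: a `Process` entity whose `inputs` and `outputs` reference existing assets by qualified name. It only builds and serializes the JSON; no request is sent. The asset type names, qualified names, and helper function names are assumptions for illustration, and the exact endpoint URL and authentication should be taken from the Purview REST API reference.

```python
import json


def dataset_ref(type_name, qualified_name):
    """Reference an existing catalog asset by its type and qualified name."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs):
    """Build an Atlas-style Process entity linking input assets to output assets."""
    return {
        "typeName": "Process",
        "attributes": {
            "name": name,
            "qualifiedName": qualified_name,
            "inputs": inputs,
            "outputs": outputs,
        },
    }


# Hypothetical assets; the type names and qualified names are placeholders
# and must match assets that already exist in your catalog.
source = dataset_ref("azure_sql_table", "mssql://example-server/db/dbo/orders")
sink = dataset_ref("azure_datalake_gen2_path", "https://example.dfs.core.windows.net/curated/orders")

payload = {
    "entities": [
        build_process_entity(
            name="nightly_orders_export",
            qualified_name="custom-etl://nightly_orders_export",
            inputs=[source],
            outputs=[sink],
        )
    ]
}

# JSON body that would be sent to the Atlas bulk entity endpoint.
body = json.dumps(payload, indent=2)
```

Once such a process entity exists in the catalog, it appears as a process node between the referenced input and output assets in the lineage graph.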
Supported Services for Automatic Lineage Capture
The following Azure services support automatic lineage capture in Microsoft Purview:
• Azure Data Factory – Copy Activity, Data Flow, Execute SSIS Package
• Azure Synapse Analytics – Synapse Pipelines, Synapse Spark, Dedicated SQL Pools
• Power BI – Datasets, Reports, Dashboards, Dataflows
• Azure Data Share – Shared datasets
• Azure SQL Database – Stored procedures (with lineage extraction enabled)
• Azure Machine Learning – ML pipeline lineage
Key Concepts to Remember
• Assets: Individual data objects such as tables, files, or datasets registered in the Purview catalog.
• Processes: Activities or operations that move or transform data (e.g., ADF Copy Activity). These appear as process nodes in the lineage graph.
• Lineage Graph: The visual directed acyclic graph (DAG) that represents data flow from sources through processes to destinations.
• Collection: A logical grouping in Purview used to organize assets and manage access control. Lineage is captured regardless of collection membership.
• Managed Attributes and Classifications: Additional metadata that enriches lineage with context, such as sensitivity labels or business glossary terms.
Setting Up Lineage Capture with Azure Data Factory
This is a common exam scenario. Here are the steps:
1. Create or use an existing Microsoft Purview account.
2. In the Azure Data Factory portal, go to Manage → Microsoft Purview and connect your ADF instance to your Purview account.
3. Ensure the ADF managed identity has the Data Curator role on the Purview collection (typically the root collection).
4. Run your ADF pipelines as normal. Lineage will be automatically pushed to Purview after pipeline execution.
5. Navigate to the Purview portal, find your assets, and view the Lineage tab.
Permissions and Access Control
• To view lineage, users need at least the Data Reader role on the relevant collection in Purview.
• To push lineage (from ADF or via API), the service principal or managed identity needs the Data Curator role.
• The Purview root collection administrator manages overall access.
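The role model above can be summarized as a simple lookup, which is handy for "minimum role" questions. This is a study aid only: the action names and the `ROLE_PERMISSIONS` table are hypothetical simplifications of the roles described in this section, not an API that Purview exposes.

```python
# Simplified mapping of Purview collection roles to the lineage-related
# actions described in this guide; action names are made up for illustration.
ROLE_PERMISSIONS = {
    "Data Reader": {"view_lineage"},
    "Data Curator": {"view_lineage", "push_lineage"},
    "Collection Admin": {"manage_access"},
}


def can(roles, action):
    """Return True if any of the given roles allows the action."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


reader_can_view = can(["Data Reader"], "view_lineage")
reader_can_push = can(["Data Reader"], "push_lineage")
curator_can_push = can(["Data Curator"], "push_lineage")
```

Modeling the roles as sets rather than a strict hierarchy reflects that managing access and reading or editing catalog data are separate responsibilities.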
Limitations to Be Aware Of
• Lineage is captured after a pipeline runs successfully. Failed runs may not produce complete lineage.
• Not all ADF activities support lineage (e.g., Lookup, Get Metadata activities do not generate lineage).
• Column-level lineage is available primarily for Mapping Data Flows, not for all Copy Activities.
• There may be a delay between pipeline execution and lineage appearing in Purview.
• On-premises data sources require a Self-hosted Integration Runtime for scanning, but lineage is still captured through ADF pipeline execution.
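As a quick way to reason about the activity-type limitation, the sketch below checks which activities in a pipeline definition would be expected to emit lineage. The pipeline dictionary is a simplified, hypothetical stand-in for an ADF pipeline definition, and the set of lineage-capable types is taken from the activities listed in this guide.

```python
# Activity types this guide lists as lineage-capable, written as they
# appear in the "type" field of ADF pipeline JSON.
LINEAGE_ACTIVITY_TYPES = {"Copy", "ExecuteDataFlow", "ExecuteSSISPackage"}

# Hypothetical, simplified pipeline definition.
pipeline = {
    "name": "LoadSales",
    "activities": [
        {"name": "FindWatermark", "type": "Lookup"},
        {"name": "CheckFile", "type": "GetMetadata"},
        {"name": "CopyRaw", "type": "Copy"},
        {"name": "TransformSales", "type": "ExecuteDataFlow"},
    ],
}


def split_by_lineage_support(pipeline):
    """Split activity names into those expected to emit lineage and those that are not."""
    supported, unsupported = [], []
    for activity in pipeline["activities"]:
        if activity["type"] in LINEAGE_ACTIVITY_TYPES:
            supported.append(activity["name"])
        else:
            unsupported.append(activity["name"])
    return supported, unsupported


supported, unsupported = split_by_lineage_support(pipeline)
```

Here only `CopyRaw` and `TransformSales` would be expected to show up as process nodes in Purview; the Lookup and Get Metadata activities would not.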
Exam Tips: Answering Questions on Microsoft Purview Data Lineage
Tip 1: Know Which Services Automatically Capture Lineage
The exam frequently tests whether you know which services push lineage to Purview automatically. Remember: Azure Data Factory, Azure Synapse Analytics, and Power BI are the primary services. If a question asks about capturing lineage from a custom or unsupported source, the answer will involve the Purview REST API (Apache Atlas API).
Tip 2: Understand the ADF-Purview Connection
Questions often ask how to enable lineage from ADF to Purview. The key steps are: connect ADF to Purview through the ADF Manage hub, and assign the Data Curator role to the ADF managed identity. If you see a question about lineage not appearing, check whether the managed identity has the correct permissions.
Tip 3: Differentiate Between Scanning and Lineage
Scanning discovers and classifies data assets (schema, classifications, metadata). Lineage captures the flow and transformation of data. These are two separate concepts. A scan does not produce lineage by itself. Lineage comes from pipeline executions and connected services.
Tip 4: Column-Level Lineage
If a question specifically asks about tracking transformations at the column level, the answer typically involves ADF Mapping Data Flows or Synapse Data Flows. Copy Activities may provide asset-level lineage but not always detailed column-level lineage.
Tip 5: Lineage for Governance and Compliance Scenarios
When an exam question describes a scenario requiring audit trails, regulatory compliance, or impact analysis, the answer is almost always Microsoft Purview Data Lineage. Look for keywords like: trace data origin, track data transformations, understand data flow, impact analysis, or audit trail.
Tip 6: Know the Role of Apache Atlas
Purview's lineage model is built on Apache Atlas. If a question mentions pushing custom lineage or integrating with open-source tools, the answer involves the Apache Atlas-compatible REST APIs in Purview.
Tip 7: Distinguish Between Purview and Other Governance Tools
If a question presents options like Azure Policy, Azure Monitor, Microsoft Purview, or Azure Advisor, remember that only Microsoft Purview provides data lineage and data catalog capabilities. Azure Policy handles resource compliance, Azure Monitor handles operational monitoring, and Azure Advisor provides optimization recommendations.
Tip 8: Pay Attention to Access Control Questions
If an exam question asks about who can view lineage, the answer involves Purview collection-level roles. Data Reader can view; Data Curator can edit and push lineage; Collection Admin manages access. The exam may test whether you know the minimum role required for a specific action.
Tip 9: Remember Lineage Requires Successful Pipeline Execution
Lineage data is generated after pipelines complete execution. If a question mentions that lineage is missing, consider whether the pipeline actually ran successfully and whether there is a propagation delay.
Tip 10: Integration with Power BI
Purview can show end-to-end lineage from a data source through ADF transformations to a Power BI report. This is a common exam scenario. If a question asks how to trace data from source to a Power BI dashboard, the answer is to use Purview with both ADF and Power BI connected to the same Purview account.
Summary
Microsoft Purview Data Lineage is a critical capability for Azure Data Engineers. It provides automated, visual tracking of data as it moves through your Azure ecosystem. For the DP-203 exam, focus on understanding which services support automatic lineage capture, how to configure the ADF-to-Purview connection, the difference between scanning and lineage, and the role-based access model. When you see exam questions about data governance, compliance, impact analysis, or tracing data flow, Microsoft Purview Data Lineage should be your go-to answer.