Microsoft Purview Data Catalog
Microsoft Purview Data Catalog is a unified data governance service that helps organizations discover, understand, and manage their data assets across on-premises, multi-cloud, and SaaS environments. As a critical component of Azure data engineering, it plays a vital role in designing and implementing data storage solutions.
Key Features:
1. Automated Data Discovery: Purview Data Catalog automatically scans and registers data assets from various sources, including Azure Data Lake Storage, Azure SQL Database, Azure Synapse Analytics, SQL Server, Amazon S3, and more. This creates a comprehensive map of your data estate.
2. Data Classification and Labeling: It automatically classifies sensitive data using built-in and custom classification rules, identifying patterns like Social Security numbers, credit card numbers, and other sensitive information. This supports compliance with regulations such as GDPR and HIPAA.
3. Data Lineage: Purview provides end-to-end data lineage tracking, showing how data moves and transforms across systems. This is especially valuable for understanding data pipelines built with Azure Data Factory or Synapse Pipelines.
4. Business Glossary: Organizations can define standardized business terms and map them to technical data assets, bridging the gap between technical and business stakeholders.
5. Search and Browse: Users can easily search and discover data assets using a rich search experience with filters, classifications, and glossary terms.
Relevance to Data Storage Design:
When designing data storage solutions, Purview helps engineers understand existing data assets, avoid duplication, ensure proper governance, and maintain compliance. It integrates natively with Azure storage services, making it essential for managing metadata across data lakes, warehouses, and databases.
Integration Points:
- Azure Data Factory for lineage tracking
- Azure Synapse Analytics for governed data access
- Azure Data Lake Storage for scanning and classification
- Power BI for dataset governance
Purview Data Catalog ensures that data storage implementations are well-governed, discoverable, and compliant with organizational and regulatory standards.
Microsoft Purview Data Catalog: Complete Guide for DP-203
Why Is Microsoft Purview Data Catalog Important?
In modern data environments, organizations deal with vast amounts of data spread across multiple storage systems, databases, and cloud services. Without a centralized way to discover, understand, and govern this data, teams face significant challenges:
- Data Silos: Teams cannot find or access data owned by other departments.
- Lack of Trust: Without understanding data lineage and quality, analysts cannot trust the data they consume.
- Compliance Risks: Organizations must classify and protect sensitive data (PII, financial records, health data) to comply with regulations like GDPR, HIPAA, and CCPA.
- Duplicate Efforts: Without a catalog, multiple teams may recreate the same datasets independently, wasting resources.
Microsoft Purview Data Catalog addresses all of these challenges by providing a unified data governance solution that enables data discovery, classification, lineage tracking, and access management across the entire data estate.
What Is Microsoft Purview Data Catalog?
Microsoft Purview (formerly Azure Purview) is a unified data governance service that helps organizations manage and govern their on-premises, multi-cloud, and SaaS data. The Data Catalog is a core component of Microsoft Purview that provides:
- Automated Data Discovery: Scans and registers data assets from various sources automatically.
- Data Classification: Automatically classifies data using built-in and custom classification rules (e.g., identifying Social Security Numbers, credit card numbers, email addresses).
- Data Lineage: Visualizes how data flows from source to destination across pipelines, transformations, and reports.
- Business Glossary: Allows organizations to define standard business terms and map them to technical data assets.
- Search and Browse: Provides a searchable catalog where data consumers can find relevant datasets.
- Access Policies: Enables centralized data access governance.
Key Components of Microsoft Purview Data Catalog:
1. Data Map: The foundation layer that scans, discovers, and classifies data assets. It captures metadata and creates a map of your entire data estate.
2. Data Catalog: The consumer-facing experience where business users and data professionals search for, discover, and understand data assets.
3. Data Estate Insights: Provides dashboards and reports about the health and governance status of your data estate.
4. Data Sharing: Enables in-place data sharing across organizations without data duplication.
5. Data Policy: Centralized access management for data assets.
How Does Microsoft Purview Data Catalog Work?
Step 1: Register Data Sources
You register the data sources you want to scan. Supported sources include:
- Azure Data Lake Storage (Gen1 and Gen2)
- Azure Blob Storage
- Azure SQL Database and Azure Synapse Analytics
- Azure Cosmos DB
- Power BI
- SQL Server (on-premises)
- Amazon S3, Google BigQuery, and other multi-cloud sources
- SAP, Teradata, Oracle, and many more
Step 2: Scan Data Assets
Once registered, you configure and run scans. During scanning:
- Purview connects to the data source using a managed identity, service principal, or stored credentials in Azure Key Vault.
- For sources behind firewalls, a Self-Hosted Integration Runtime (SHIR) is used.
- Metadata (schema, column names, data types) is extracted and stored in the Data Map. Purview may sample data in order to classify it, but the data itself is not copied into Purview.
- Scans can be scheduled to run on a recurring basis (daily, weekly, monthly).
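The idea that a scan keeps metadata rather than data can be illustrated with a small, self-contained sketch. This is not how Purview's scanner is implemented; the function names (`infer_type`, `extract_schema`), the sample CSV, and the coarse type inference are all assumptions made for illustration. The sketch reads a bounded sample of rows and returns only column names and inferred types.

```python
import csv
import io

def infer_type(values):
    """Return a coarse type name for a list of string values (illustrative only)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if values and all_parse(int):
        return "int"
    if values and all_parse(float):
        return "float"
    return "string"

def extract_schema(csv_text, sample_rows=100):
    """Read the header plus a bounded sample and return only column names and types."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return [
        {"name": col, "type": infer_type([r[col] for r in sample])}
        for col in reader.fieldnames
    ]

# Hypothetical sample file content
SAMPLE_CSV = "customer_id,email,balance\n1,a@example.com,10.5\n2,b@example.com,7\n"
schema = extract_schema(SAMPLE_CSV)
print(schema)
```

The returned schema describes the columns but contains none of the row values, which is the property the scanning step relies on.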
Step 3: Automatic Classification
During scanning, Purview applies system classification rules that detect sensitive data patterns such as:
- Credit card numbers
- Social Security Numbers (SSN)
- Email addresses
- Passport numbers
- Custom patterns defined by the organization using custom classification rules (regex-based or dictionary-based).
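A regex-based classification rule can be sketched in plain Python as follows. The patterns, the rule names, and the `min_match_ratio` threshold are hypothetical and are not Purview's built-in rules or its actual matching logic; the sketch only shows the general technique of labeling a column when enough sampled values match a pattern.

```python
import re

# Hypothetical rules for illustration; not Purview's built-in patterns.
RULES = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_column(values, rules=RULES, min_match_ratio=0.6):
    """Return rule names whose pattern matches at least min_match_ratio of non-empty values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    hits = []
    for name, pattern in rules.items():
        matched = sum(1 for v in non_empty if pattern.match(v))
        if matched / len(non_empty) >= min_match_ratio:
            hits.append(name)
    return hits

# Hypothetical sampled column values
columns = {
    "contact": ["ann@example.com", "bob@example.org", "n/a"],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "notes": ["called twice", "ann@example.com", "left voicemail"],
}
labels = {col: classify_column(vals) for col, vals in columns.items()}
print(labels)
```

Using a match-ratio threshold rather than a single hit keeps a free-text column such as `notes` from being labeled because of one stray email address.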
Step 4: Curate with Business Glossary
Data stewards create glossary terms that represent business concepts (e.g., "Customer ID", "Revenue", "Churn Rate"). These terms are then linked to specific data assets and columns, bridging the gap between technical metadata and business understanding.
Step 5: Data Lineage Tracking
Purview automatically captures lineage from:
- Azure Data Factory (ADF) pipelines: Copy activities and Data Flow activities
- Azure Synapse Analytics pipelines
- Power BI reports and datasets
- Azure Databricks (with OpenLineage integration)
- SQL Server Integration Services (SSIS) packages run through the ADF Execute SSIS Package activity
Lineage shows the complete journey of data: where it originated, how it was transformed, and where it is consumed.
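Conceptually, lineage is a directed graph of assets connected by processes. The sketch below is a minimal illustration of answering "where did this report's data originate?" by walking that graph upstream. The asset and process names are hypothetical, and the structure is a simplification rather than Purview's internal model.

```python
from collections import defaultdict

# Hypothetical edges: (source asset, process, target asset)
EDGES = [
    ("sql.sales.orders", "adf_copy_orders", "adls.raw.orders"),
    ("adls.raw.orders", "dataflow_clean_orders", "adls.curated.orders"),
    ("adls.raw.customers", "dataflow_clean_orders", "adls.curated.orders"),
    ("adls.curated.orders", "powerbi_refresh", "powerbi.sales_report"),
]

def build_upstream_index(edges):
    """Map each target asset to the set of assets that feed it directly."""
    index = defaultdict(set)
    for source, _process, target in edges:
        index[target].add(source)
    return index

def upstream(asset, index):
    """Return every asset that feeds into `asset`, directly or indirectly."""
    seen, stack = set(), [asset]
    while stack:
        for parent in index.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

index = build_upstream_index(EDGES)
origins = sorted(upstream("powerbi.sales_report", index))
print(origins)
```

The same traversal run in the opposite direction would give impact analysis: which downstream assets are affected if a source table changes.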
Step 6: Search and Discover
Data consumers use the Purview portal to:
- Search for data assets by name, classification, glossary term, or owner.
- Browse the data catalog by source type or collection.
- View asset details including schema, classifications, lineage, and contacts.
- Request access to data assets through integrated workflows.
Step 7: Data Access Policies
Purview can enforce data access policies on supported sources such as Azure Storage and Azure SQL, so access can be governed centrally instead of being configured at each individual data source. Availability varies by source, and some policy types are still in preview.
Key Concepts for DP-203:
Collections:
Collections are organizational units in Purview that help manage access control and organize data sources. They form a hierarchy, and permissions are inherited down the collection tree. Collections control:
- Who can register and scan sources
- Who can view and curate assets
- Who can manage the collection
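Permission inheritance down the collection tree can be modeled with a short sketch. The `Collection` class, the collection names, and the users are hypothetical, and the sketch ignores any option to restrict inheritance on a child collection; it only shows that a role assigned on a parent is effective on every descendant.

```python
class Collection:
    """Hypothetical model of a collection node; not a Purview SDK type."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.assignments = {}  # role name -> set of users assigned at this level

    def assign(self, role, user):
        self.assignments.setdefault(role, set()).add(user)

    def effective_roles(self, user):
        """Collect roles assigned here or on any ancestor collection."""
        roles, node = set(), self
        while node is not None:
            for role, users in node.assignments.items():
                if user in users:
                    roles.add(role)
            node = node.parent
        return roles

root = Collection("contoso")
finance = Collection("finance", parent=root)
payroll = Collection("payroll", parent=finance)

root.assign("Data Reader", "ana")
finance.assign("Data Curator", "ana")
payroll.assign("Data Source Admin", "raj")

print(payroll.effective_roles("ana"))
print(finance.effective_roles("raj"))
```

In this model, `ana` can read everywhere and curate under `finance`, while `raj` can register and scan sources only in `payroll`, because assignments flow down the tree and never up.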
Roles in Purview:
- Collection Admin: Manages the collection and assigns roles.
- Data Source Admin: Registers data sources and manages scans.
- Data Curator: Edits asset metadata, manages glossary terms, and annotates assets.
- Data Reader: Read-only access to browse and search the catalog.
- Insights Reader: Access to data estate insights dashboards.
Integration with Azure Data Engineering Services:
- Azure Data Factory: Purview integrates natively with ADF. When you connect an ADF instance to Purview, lineage for supported activities is pushed to the catalog automatically each time a pipeline runs. This is critical for DP-203.
- Azure Synapse Analytics: Synapse workspaces can be connected to Purview, allowing users to search the Purview catalog directly from within Synapse Studio. Synapse pipelines also push lineage to Purview.
- Azure Data Lake Storage: Purview scans ADLS Gen2 to discover and classify files (Parquet, CSV, JSON, Delta Lake).
- Power BI: Purview scans Power BI tenants to catalog datasets, reports, and dashboards, and tracks lineage from data source to Power BI report.
Sensitivity Labels:
Microsoft Purview can extend Microsoft Information Protection sensitivity labels to data assets. This means the same labels used in Microsoft 365 (e.g., "Confidential", "Highly Confidential") can be applied to files in Azure Data Lake or columns in Azure SQL.
Resource Set Rules:
When scanning data lakes, Purview automatically groups partitioned files (e.g., /data/year=2023/month=01/*.parquet) into a single resource set rather than cataloging each individual file. Custom resource set rules can be configured to control this grouping behavior.
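The grouping idea behind resource sets can be sketched by normalizing the variable parts of each path and grouping files that share the resulting pattern. The two regular expressions below are a hypothetical simplification (key=value folders and `part-NNNN` files only) and do not reproduce Purview's actual resource set detection rules.

```python
import re
from collections import defaultdict

# Hypothetical simplification: treat key=value folders and numbered part files
# as the variable segments of a path.
PARTITION = re.compile(r"^(\w+)=[^/]+$")
PART_FILE = re.compile(r"^part-\d+(\.\w+)$")

def resource_set_key(path):
    """Replace variable path segments with placeholders to get a grouping key."""
    segments = []
    for seg in path.strip("/").split("/"):
        m = PARTITION.match(seg)
        if m:
            segments.append(f"{m.group(1)}={{{m.group(1)}}}")
            continue
        m = PART_FILE.match(seg)
        segments.append(f"part-{{N}}{m.group(1)}" if m else seg)
    return "/" + "/".join(segments)

def group_into_resource_sets(paths):
    """Group file paths that share the same normalized pattern."""
    groups = defaultdict(list)
    for p in paths:
        groups[resource_set_key(p)].append(p)
    return dict(groups)

# Hypothetical data lake listing
paths = [
    "/data/year=2023/month=01/part-0001.parquet",
    "/data/year=2023/month=01/part-0002.parquet",
    "/data/year=2023/month=02/part-0001.parquet",
    "/reference/countries.csv",
]
sets = group_into_resource_sets(paths)
for key, members in sets.items():
    print(key, len(members))
```

Here the three partitioned Parquet files collapse into one logical asset, while the standalone reference file remains its own entry.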
Exam Tips: Answering Questions on Microsoft Purview Data Catalog
1. Know the Difference Between Data Map, Data Catalog, and Data Estate Insights: The Data Map is the foundational scanning and classification layer. The Data Catalog is the user-facing search and discovery experience. Data Estate Insights provides governance reports. Exam questions may test whether you understand which component handles which responsibility.
2. Understand Lineage Sources: Remember that lineage is automatically captured from Azure Data Factory, Azure Synapse pipelines, and Power BI. For Azure Databricks, additional configuration (OpenLineage connector) is needed. If a question asks how to track data lineage across pipelines, Purview + ADF integration is typically the answer.
3. Self-Hosted Integration Runtime (SHIR): If a question mentions scanning on-premises data sources (like on-premises SQL Server) or sources behind a firewall, the answer involves deploying a Self-Hosted Integration Runtime. This is a frequent exam topic.
4. Classification vs. Sensitivity Labels: Classifications are applied by Purview during scanning based on data patterns. Sensitivity labels are from Microsoft Information Protection and represent organizational policies about data handling. Both can coexist. Know the distinction.
5. Managed Identity for Authentication: When Purview scans Azure-native resources (Azure SQL, ADLS Gen2, Synapse), it typically uses its system-assigned managed identity. The managed identity must be granted appropriate permissions on the target resource (e.g., Storage Blob Data Reader for ADLS Gen2). Expect questions on this.
6. Business Glossary Purpose: If a question asks about creating a common vocabulary or mapping business terms to technical assets, the answer is the Business Glossary in Purview.
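7. Collections and Access Control: Questions about organizing data governance or controlling who can scan versus who can curate will involve collections and Purview roles. Remember that the roles are separate responsibilities rather than a strict hierarchy: a Collection Admin manages the collection and its role assignments but still needs Data Reader or Data Curator to view or edit assets. Role assignments made on a parent collection are inherited by its child collections.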
8. Purview Does NOT Store Data: A common trick question — Purview only stores metadata, not the actual data. It scans and extracts metadata, schemas, and classification results. The data stays in its original location.
9. Synapse Studio Integration: If a question asks about searching for data assets from within Synapse Studio, the answer is connecting the Synapse workspace to a Purview account. This enables the Purview search bar inside Synapse.
10. Resource Sets for Data Lakes: If a question describes a scenario with thousands of partitioned files in a data lake and asks how Purview handles them, the answer is resource sets — Purview groups them automatically into a single logical asset.
11. Scan Scheduling: Purview supports scheduling scans at regular intervals. If a question asks about keeping the catalog up to date as new data arrives, the answer is to configure a recurring scan schedule.
12. Key Vault for Credentials: When Purview needs to use credentials (username/password, service principal secrets) for scanning non-Azure or specific data sources, it retrieves them from Azure Key Vault. This is a security best practice that may appear in exam scenarios.
13. Watch for Distractors: Azure Data Catalog (the older, now-deprecated service) may appear as a distractor. The current, supported solution is Microsoft Purview (formerly Azure Purview). Always choose Microsoft Purview over the legacy Azure Data Catalog.
14. Scenario-Based Questions: When you see a scenario involving data governance, data discovery, data classification, or data lineage tracking, think Microsoft Purview first. If the scenario involves restricting access to sensitive data at the column or row level, that might involve dynamic data masking or row-level security in the data source itself, not Purview directly.
15. Cost Awareness: Purview pricing is based on Data Map capacity units (for scanning and storage of metadata) and Data Estate Insights. While detailed pricing is unlikely on the exam, knowing that scanning frequency and volume of metadata affect costs may help in scenario-based optimization questions.