Learn Describe Considerations for Working with Non-Relational Data on Azure (DP-900) with Interactive Flashcards
Master the key concepts in Describe Considerations for Working with Non-Relational Data on Azure with the flashcards below. Each card presents a topic followed by a detailed explanation.
Azure Blob Storage
Azure Blob Storage is a massively scalable object storage solution provided by Microsoft Azure, designed to store large amounts of unstructured, non-relational data. Unstructured data refers to data that does not adhere to a specific data model or schema, such as text files, images, videos, audio files, backups, logs, and binary data.
Blob stands for Binary Large Object, and Azure Blob Storage organizes data into three key components: Storage Accounts, Containers, and Blobs. A Storage Account provides a unique namespace in Azure for your data. Within a storage account, Containers act as logical groupings, similar to folders, that help organize blobs. Blobs are the actual data objects stored within containers.
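The account-container-blob hierarchy maps directly onto a blob's URL. A minimal sketch (the account, container, and blob names are hypothetical, and the default public endpoint suffix is assumed):

```python
# Sketch: how the storage account / container / blob hierarchy maps onto a URL.
# "myaccount", "media", and "videos/intro.mp4" are made-up names.

def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the endpoint URL for a blob (default endpoint suffix assumed)."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob_name}"

# The storage account supplies the unique namespace (the hostname);
# the container and blob name form the path.
print(blob_url("myaccount", "media", "videos/intro.mp4"))
# → https://myaccount.blob.core.windows.net/media/videos/intro.mp4
```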
Azure Blob Storage supports three types of blobs: Block Blobs, which are optimized for uploading large amounts of data efficiently and are ideal for storing documents, media files, and backups; Append Blobs, which are optimized for append operations, making them suitable for logging scenarios; and Page Blobs, which are designed for random read and write operations and are commonly used for virtual hard disk (VHD) files.
Blob Storage offers multiple access tiers to help manage costs based on data usage patterns. The Hot tier is for frequently accessed data, the Cool tier is for infrequently accessed data stored for at least 30 days, the Cold tier is for data stored for at least 90 days, and the Archive tier is for rarely accessed data stored for at least 180 days that can tolerate retrieval latencies of up to several hours.
Key features include high availability, redundancy options (LRS, GRS, ZRS, RA-GRS), security through encryption at rest and in transit, and integration with other Azure services. Azure Blob Storage is accessible via REST APIs, Azure SDKs, Azure CLI, and PowerShell, making it versatile for developers and administrators. It is an ideal solution for serving content, data analytics, disaster recovery, and archiving scenarios.
Blob Storage Access Tiers and Lifecycle
Azure Blob Storage offers multiple access tiers designed to optimize costs based on how frequently data is accessed. The three primary access tiers are:
1. **Hot Tier**: Optimized for data that is accessed frequently. It has the highest storage costs but the lowest access costs. This is ideal for data in active use, such as images on a website or documents being regularly processed.
2. **Cool Tier**: Suited for data that is infrequently accessed and stored for at least 30 days. It offers lower storage costs compared to the Hot tier but has higher access costs. Examples include short-term backups and older datasets that are occasionally referenced.
3. **Archive Tier**: Designed for data that is rarely accessed and stored for at least 180 days. It has the lowest storage cost but the highest access cost and latency, as data must be rehydrated before it can be read. This tier is ideal for long-term backups, compliance archives, and historical data.
Additionally, Azure provides a **Cold Tier**, which sits between Cool and Archive, optimized for data stored for at least 90 days with infrequent access.
**Lifecycle Management** in Azure Blob Storage allows you to define rule-based policies to automatically transition blobs between access tiers or delete them based on specified conditions. For example, you can create a rule that moves a blob from Hot to Cool after 30 days of no access, then to Archive after 90 days, and finally deletes it after 365 days. These policies help organizations reduce costs by ensuring data is stored in the most cost-effective tier throughout its lifecycle.
Lifecycle management rules are defined at the storage account level and can be applied based on blob age, last access time, or creation date. This automation eliminates the need for manual data management, ensuring efficient storage utilization while maintaining compliance and data retention requirements.
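The example policy above (Hot to Cool after 30 days, to Archive after 90, delete after 365) can be sketched as a rule plus a tiny evaluator. This is an illustrative model, not the exact service behavior; the dict mirrors the shape of a lifecycle policy JSON document, and the rule name is made up:

```python
# Sketch of a lifecycle management rule, expressed as a Python dict mirroring
# the JSON policy document. The rule name is hypothetical; thresholds match
# the example in the text (Hot → Cool at 30 days, → Archive at 90, delete at 365).

rule = {
    "name": "age-based-tiering",
    "enabled": True,
    "type": "Lifecycle",
    "definition": {
        "filters": {"blobTypes": ["blockBlob"]},
        "actions": {
            "baseBlob": {
                "tierToCool": {"daysAfterModificationGreaterThan": 30},
                "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                "delete": {"daysAfterModificationGreaterThan": 365},
            }
        },
    },
}

def evaluate(days_since_modified: int) -> str:
    """Return the action the rule above would take for a blob of a given age."""
    actions = rule["definition"]["actions"]["baseBlob"]
    if days_since_modified > actions["delete"]["daysAfterModificationGreaterThan"]:
        return "delete"
    if days_since_modified > actions["tierToArchive"]["daysAfterModificationGreaterThan"]:
        return "archive"
    if days_since_modified > actions["tierToCool"]["daysAfterModificationGreaterThan"]:
        return "cool"
    return "hot"

print(evaluate(45))   # → cool
```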
Azure File Storage
Azure File Storage is a fully managed cloud-based file sharing service offered by Microsoft Azure that enables organizations to create and manage file shares accessible via the industry-standard Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API. It is designed to replace or supplement traditional on-premises file servers and network-attached storage (NAS) devices.
Key features of Azure File Storage include:
1. **Fully Managed Service**: Azure handles hardware maintenance, OS updates, and security patches, eliminating the need for managing physical infrastructure.
2. **Cross-Platform Access**: File shares can be mounted concurrently by cloud or on-premises deployments of Windows, Linux, and macOS, making it highly versatile across different environments.
3. **Shared Access**: Multiple virtual machines or applications can read and write to the same file share simultaneously, enabling collaborative scenarios and shared data access.
4. **Azure File Sync**: This feature allows organizations to cache Azure file shares on Windows Server, enabling fast local access while maintaining centralized storage in the cloud. This is ideal for lift-and-shift scenarios.
5. **Storage Tiers**: Azure File Storage offers multiple tiers—Premium (SSD-backed), Transaction Optimized, Hot, and Cool—allowing users to balance performance and cost based on workload requirements.
6. **Security**: Data is encrypted at rest and in transit. Access can be controlled through Azure Active Directory, shared access signatures, and storage account keys.
7. **Snapshots**: Azure Files supports share snapshots, providing point-in-time read-only copies of data for backup and recovery purposes.
Common use cases include replacing traditional file servers, storing shared application settings and configuration files, diagnostic data logging, and facilitating development tool sharing across teams. It is particularly useful for applications that rely on file system APIs and need shared storage without code modifications.
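Because a mounted Azure file share looks like any other path, code written against ordinary file-system APIs needs no changes. A small sketch, using a temporary directory to stand in for a mounted share (real mount points like /mnt/myshare or Z:\ are hypothetical here):

```python
# Azure Files appears as a normal mounted path, so plain file I/O works
# unchanged. The same function writes a shared config file whether
# `share_root` is a local folder or a mounted Azure file share.
import os
import tempfile

def write_shared_config(share_root: str, name: str, contents: str) -> str:
    path = os.path.join(share_root, name)
    with open(path, "w") as f:      # standard file APIs, no Azure SDK required
        f.write(contents)
    return path

# Demo against a temporary directory standing in for the mounted share:
with tempfile.TemporaryDirectory() as share:
    p = write_shared_config(share, "app.cfg", "log_level=debug")
    with open(p) as f:
        print(f.read())             # → log_level=debug
```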
Azure File Storage integrates seamlessly with other Azure services and is billed based on provisioned or consumed storage capacity, transactions, and data transfer, making it a scalable and cost-effective non-relational data storage solution.
Azure Table Storage
Azure Table Storage is a NoSQL key-value data store within Microsoft Azure that allows you to store large amounts of structured, non-relational data in the cloud. It is part of the Azure Storage account services and is designed for scalability, flexibility, and cost-effectiveness.
At its core, Azure Table Storage organizes data into tables, which contain entities (similar to rows in a relational database). Each entity consists of a set of properties (similar to columns), and every entity must include three system properties: a PartitionKey, a RowKey, and a Timestamp. The combination of PartitionKey and RowKey uniquely identifies each entity within a table.
The PartitionKey is particularly important because it determines how data is distributed across storage partitions. Choosing an effective PartitionKey ensures optimal performance and scalability by enabling Azure to efficiently load-balance data across multiple servers. The RowKey serves as a unique identifier within a given partition.
One of the key advantages of Azure Table Storage is its schema-less design. Unlike relational databases, entities within the same table can have different sets of properties. This flexibility makes it ideal for storing diverse datasets where the structure may vary across records.
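The schema-less, composite-key model can be sketched in a few lines. The table name, entity values, and in-memory dict are all illustrative stand-ins for the real service:

```python
# Sketch: two entities in the same (hypothetical) "Customers" table with
# different property sets, each uniquely identified by PartitionKey + RowKey.

table = {}  # in-memory stand-in, keyed the way Table Storage keys entities

def upsert(entity: dict) -> None:
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

upsert({"PartitionKey": "Seattle", "RowKey": "cust-001",
        "Name": "Alice", "Email": "alice@example.com"})
upsert({"PartitionKey": "Seattle", "RowKey": "cust-002",
        "Name": "Bob", "Phone": "555-0100"})   # different properties: no fixed schema

# Point lookup by the composite key:
print(table[("Seattle", "cust-002")]["Name"])   # → Bob
```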
Azure Table Storage is well-suited for scenarios such as storing terabytes of structured data without complex joins or relationships, web application user data, device information for IoT solutions, address books, and any dataset that does not require foreign keys or complex queries.
Access to Table Storage data is provided through the Azure Storage REST API or client libraries available in multiple programming languages. It supports OData protocol for querying and LINQ queries through supported client libraries.
It is worth noting that Azure Cosmos DB also offers a Table API, which provides enhanced capabilities over standard Azure Table Storage, including global distribution, dedicated throughput, and single-digit millisecond latency. Organizations can migrate from Azure Table Storage to Cosmos DB Table API with minimal code changes when they need premium features.
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is a highly scalable and cost-effective data lake solution built into Azure, designed to handle massive volumes of structured, semi-structured, and unstructured data. It combines the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage, creating a unified and powerful storage platform optimized for big data analytics.
At its core, Data Lake Storage Gen2 adds a hierarchical namespace on top of Azure Blob Storage. This hierarchical file system organizes data into directories and subdirectories, similar to a traditional file system, which significantly improves performance for analytical workloads. Operations like renaming or deleting directories become atomic and efficient, unlike flat namespace blob storage where such operations can be slow and costly.
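The cost of a "directory rename" in a flat namespace can be seen in a toy model: without a hierarchy, every blob sharing the prefix must be moved individually, whereas a hierarchical namespace makes the rename a single metadata operation. The blob names below are invented for illustration:

```python
# Toy model: in a flat blob namespace, "renaming a directory" means one
# operation per blob under the prefix; with a hierarchical namespace the
# rename is a single atomic metadata operation.

flat = {"raw/2024/a.csv": b"...", "raw/2024/b.csv": b"..."}

def rename_prefix(store: dict, old: str, new: str) -> int:
    """Flat namespace: one operation per blob under the prefix."""
    ops = 0
    for name in list(store):            # snapshot keys before mutating
        if name.startswith(old):
            store[new + name[len(old):]] = store.pop(name)
            ops += 1
    return ops

print(rename_prefix(flat, "raw/", "archive/"))  # → 2 (one op per blob)
```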
Key features of Azure Data Lake Storage Gen2 include:
1. **Hadoop Compatibility**: It is fully compatible with the Hadoop Distributed File System (HDFS), allowing seamless integration with big data frameworks like Azure Databricks, Azure HDInsight, and Azure Synapse Analytics.
2. **Cost-Effectiveness**: Built on Azure Blob Storage, it leverages tiered storage pricing (hot, cool, and archive tiers), making it affordable to store petabytes of data.
3. **Security**: It supports fine-grained access control through POSIX-compliant Access Control Lists (ACLs), role-based access control (RBAC), encryption at rest, and integration with Azure Active Directory for authentication.
4. **Scalability**: It can handle exabytes of data with high throughput, making it suitable for enterprise-level analytics workloads.
5. **Multi-Protocol Access**: Data can be accessed via both Blob Storage APIs and Data Lake Storage Gen2 APIs, providing flexibility for different use cases.
In the context of non-relational data on Azure, Data Lake Storage Gen2 serves as an ideal repository for raw data ingestion, data exploration, and feeding data into analytical processing pipelines. It is commonly used as the foundation for modern data architectures like data lakehouses, supporting diverse analytical and machine learning workloads efficiently.
Choosing Azure Storage Services
Choosing the right Azure Storage service is crucial for efficiently managing non-relational data on Azure. Azure offers several storage services, each designed for specific use cases and data types.
**Azure Blob Storage** is ideal for storing massive amounts of unstructured data such as text, binary data, images, videos, and documents. It supports multiple access tiers—Hot, Cool, Cold, and Archive—allowing cost optimization based on how frequently data is accessed. Blob Storage is commonly used for serving content to browsers, distributed file access, streaming, backup, and disaster recovery.
**Azure Data Lake Storage Gen2** builds on top of Blob Storage and adds hierarchical namespace capabilities, making it particularly well-suited for big data analytics workloads. It combines the scalability and cost benefits of Blob Storage with file system semantics, enabling efficient data processing with analytics frameworks like Hadoop and Spark.
**Azure File Storage** provides fully managed cloud file shares accessible via the SMB (Server Message Block) and NFS protocols. It is ideal for lift-and-shift scenarios where applications rely on traditional file system access, shared application settings, and diagnostic data.
**Azure Table Storage** offers a NoSQL key-value store for semi-structured data. It is suitable for storing flexible datasets like user data for web applications, metadata, and address books. It provides fast access and cost-effective storage for large volumes of structured, non-relational data.
**Azure Queue Storage** provides reliable cloud messaging between application components, supporting asynchronous processing and decoupling of services.
When choosing among these services, consider factors such as **data structure** (structured, semi-structured, or unstructured), **access patterns** (frequency and latency requirements), **scalability needs**, **cost considerations**, and **integration requirements** with other Azure services or analytics tools. Additionally, evaluate security features, redundancy options (LRS, GRS, ZRS), and compliance requirements. The right choice depends on aligning your workload characteristics with the strengths of each storage service to achieve optimal performance, cost-efficiency, and reliability.
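The guidance above can be condensed into a rough decision sketch. This is not official Microsoft guidance, just the factors in this section turned into a lookup; the input labels are invented for illustration:

```python
# Rough decision sketch: map data shape and access pattern to the storage
# service this section recommends. Input labels are made up for illustration.

def suggest_service(data_shape: str, access: str) -> str:
    if access == "smb_file_share":
        return "Azure Files"
    if access == "messaging":
        return "Azure Queue Storage"
    if data_shape == "unstructured":
        # Analytics workloads benefit from the hierarchical namespace.
        return "Azure Data Lake Storage Gen2" if access == "analytics" else "Azure Blob Storage"
    if data_shape == "key_value":
        return "Azure Table Storage"
    return "review workload against each service"

print(suggest_service("unstructured", "analytics"))  # → Azure Data Lake Storage Gen2
```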
Azure Cosmos DB Overview and Use Cases
Azure Cosmos DB is Microsoft's globally distributed, multi-model NoSQL database service designed for mission-critical applications. It provides turnkey global distribution, elastic scalability of throughput and storage, single-digit millisecond latency, and comprehensive SLAs covering throughput, latency, availability, and consistency.
**Key Features:**
- **Global Distribution:** Data can be replicated across multiple Azure regions worldwide, enabling low-latency access for users anywhere.
- **Multi-Model Support:** Cosmos DB supports multiple APIs including SQL (Core), MongoDB, Cassandra, Gremlin (graph), and Table API, allowing developers to use familiar tools and query languages.
- **Elastic Scalability:** Throughput and storage scale independently and elastically, accommodating unpredictable workloads seamlessly.
- **Five Consistency Models:** Offers strong, bounded staleness, session, consistent prefix, and eventual consistency, giving developers flexibility to balance performance and data accuracy.
- **Guaranteed Low Latency:** Provides single-digit millisecond read and write latencies at the 99th percentile.
**Common Use Cases:**
1. **IoT and Telematics:** Ingesting massive volumes of sensor data in real-time from connected devices, supporting high write throughput and flexible schemas.
2. **E-Commerce Applications:** Managing product catalogs, user profiles, shopping carts, and order histories that require flexible data models and global availability.
3. **Gaming:** Handling player profiles, leaderboards, and game state with low-latency reads and writes for real-time gaming experiences.
4. **Web and Mobile Applications:** Powering social media interactions, content management, and personalization engines requiring fast, globally distributed data access.
5. **Real-Time Analytics:** Processing and serving real-time data for dashboards, recommendation engines, and event-driven architectures.
6. **Graph-Based Applications:** Using the Gremlin API for social networks, fraud detection, and knowledge graphs.
Azure Cosmos DB is ideal when applications demand high availability (99.999% SLA), low latency, global reach, and flexible schema design. Its serverless and provisioned throughput pricing models make it suitable for both startups and enterprise-scale solutions, providing a fully managed, cost-effective NoSQL database platform.
Azure Cosmos DB API for NoSQL
Azure Cosmos DB API for NoSQL is a native, document-oriented API within Azure Cosmos DB, Microsoft's globally distributed, multi-model database service. It is designed for storing and querying semi-structured data in JSON document format, making it highly flexible for modern application development.
Key features of the API for NoSQL include:
1. **Document Storage**: Data is stored as JSON documents within containers, which are grouped inside databases. Each document can have a different structure, allowing schema-free data modeling. This flexibility makes it ideal for applications where data structures evolve over time.
2. **SQL-Like Query Language**: Despite being a NoSQL database, it supports a familiar SQL-like query syntax for reading and filtering JSON documents. This lowers the learning curve for developers already familiar with SQL, enabling powerful queries including filtering, projections, aggregations, and joins within documents.
3. **Global Distribution**: Azure Cosmos DB allows you to replicate your data across multiple Azure regions worldwide, providing low-latency access to users regardless of their geographic location. The API for NoSQL fully supports this capability with automatic and manual failover options.
4. **Scalability and Performance**: It offers elastic scalability of both throughput and storage. Throughput is measured in Request Units (RUs), allowing you to scale up or down based on application demand. It guarantees single-digit millisecond response times at the 99th percentile.
5. **Multiple Consistency Levels**: Azure Cosmos DB provides five consistency levels—Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual—giving developers fine-grained control over the trade-off between consistency, availability, and latency.
6. **Indexing**: All data is automatically indexed without requiring schema or index management, enabling fast and efficient queries.
The API for NoSQL is the recommended choice for new projects on Azure Cosmos DB, as it provides the most complete feature set, including the latest performance optimizations, SDK support, and native integration with other Azure services. It is well-suited for web, mobile, IoT, and gaming applications that require flexible schemas and global scale.
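The SQL-like query model can be illustrated with sample documents. The documents and property names below are made up; the Python comprehension mimics what a query such as `SELECT c.name FROM c WHERE c.price < 100` would return over them:

```python
# Sketch: what a SQL-like Cosmos DB query returns over sample JSON documents.
# Documents and property names are invented for illustration.

docs = [
    {"id": "1", "name": "helmet", "price": 49.99},
    {"id": "2", "name": "bike", "price": 499.00},
    {"id": "3", "name": "gloves", "price": 19.99},
]

# Equivalent of: SELECT c.name FROM c WHERE c.price < 100
results = [{"name": d["name"]} for d in docs if d["price"] < 100]
print(results)  # → [{'name': 'helmet'}, {'name': 'gloves'}]
```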
Azure Cosmos DB API for MongoDB
Azure Cosmos DB API for MongoDB is a fully managed, cloud-based database service provided by Microsoft Azure that allows developers to use their existing MongoDB skills, tools, and code to interact with Azure Cosmos DB. Essentially, it provides a MongoDB-compatible interface on top of Azure Cosmos DB's globally distributed, multi-model database engine.
Key features and considerations include:
1. **Compatibility**: The API supports the MongoDB wire protocol, meaning applications that already use MongoDB can migrate to Azure Cosmos DB with minimal code changes. Developers can use familiar MongoDB drivers, SDKs, and tools such as MongoDB Shell, Compass, and other third-party tools.
2. **Global Distribution**: Unlike traditional MongoDB deployments, Azure Cosmos DB API for MongoDB enables seamless global distribution of data across multiple Azure regions. This provides low-latency access to data for users worldwide and ensures high availability with automatic failover capabilities.
3. **Scalability**: It offers elastic scalability for both throughput and storage. You can scale request units (RUs) up or down based on demand, ensuring cost-effective performance management without worrying about infrastructure provisioning.
4. **Data Model**: Data is stored in JSON-like documents (BSON format), making it ideal for semi-structured or unstructured data. This flexible schema design allows developers to store varying data structures within the same collection.
5. **SLA-Backed Guarantees**: Azure Cosmos DB provides industry-leading SLAs covering availability (99.999%), latency, throughput, and consistency, offering five well-defined consistency models ranging from strong to eventual consistency.
6. **Serverless and Provisioned Options**: Users can choose between serverless mode for sporadic workloads or provisioned throughput mode for predictable performance requirements.
7. **Integrated Security**: It includes enterprise-grade security features such as encryption at rest and in transit, role-based access control (RBAC), and virtual network integration.
Azure Cosmos DB API for MongoDB is ideal for organizations looking to leverage MongoDB's flexible document model while benefiting from Azure Cosmos DB's globally distributed, highly available, and fully managed infrastructure without managing database servers themselves.
Azure Cosmos DB API for PostgreSQL
Azure Cosmos DB for PostgreSQL (formerly Azure Database for PostgreSQL - Hyperscale (Citus)) is a managed service that enables you to build highly scalable applications using the familiar PostgreSQL database engine combined with the distributed capabilities of Azure Cosmos DB. It extends PostgreSQL by distributing data and queries across multiple nodes, allowing you to handle massive workloads efficiently.
At its core, this API leverages the Citus open-source extension for PostgreSQL, which transforms a standard PostgreSQL database into a distributed database. Data is automatically sharded (partitioned) across multiple worker nodes, enabling horizontal scaling. This means you can start with a single node and scale out to multiple nodes as your data and performance requirements grow.
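Hash-based sharding on a distribution column can be sketched in a few lines. This is a toy model of the idea, not the Citus implementation; the worker count and the choice of tenant_id as the distribution column are assumptions:

```python
# Toy model of hash sharding: each row's distribution column (here tenant_id)
# is hashed to pick a worker node, so one tenant's rows co-locate on one
# worker. Node count and column choice are hypothetical.
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2"]

def worker_for(tenant_id: str) -> str:
    h = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    return WORKERS[h % len(WORKERS)]

# Every row for the same tenant lands on the same worker, which is what makes
# tenant-scoped queries single-node operations:
assert worker_for("tenant-42") == worker_for("tenant-42")
print(worker_for("tenant-42"))
```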
Key features include:
1. **Distributed SQL**: You can use standard PostgreSQL SQL syntax, and the system automatically distributes queries across nodes for parallel execution, significantly improving performance for large datasets.
2. **Scalability**: It supports both scale-up (increasing resources on existing nodes) and scale-out (adding more nodes) strategies, making it suitable for applications handling terabytes of data or millions of transactions.
3. **Familiar PostgreSQL Interface**: Developers can use existing PostgreSQL tools, drivers, and extensions, reducing the learning curve. It supports standard PostgreSQL features like JSONB, full-text search, and geospatial queries via PostGIS.
4. **Multi-Tenant and Real-Time Analytics**: It is well-suited for multi-tenant SaaS applications where each tenant's data can be co-located on the same node, as well as real-time analytics dashboards that require fast aggregations over large datasets.
5. **Managed Service Benefits**: Azure handles maintenance tasks such as backups, patching, high availability, and monitoring, allowing teams to focus on application development.
This service bridges the gap between traditional relational PostgreSQL databases and non-relational distributed data architectures, making it a versatile choice for modern cloud applications that need both SQL capabilities and massive scalability within the Azure ecosystem.
Azure Cosmos DB APIs for Cassandra, Table, and Gremlin
Azure Cosmos DB is a globally distributed, multi-model database service that supports multiple APIs, allowing developers to interact with data using familiar interfaces.
**Cassandra API:**
The Cassandra API in Azure Cosmos DB enables developers who are already familiar with Apache Cassandra to use Cosmos DB as the underlying data store with minimal code changes. It supports the Cassandra Query Language (CQL), Cassandra drivers, and tools, making migration seamless. Data is stored in a column-family format, which is ideal for handling large volumes of data across distributed systems. This API is particularly useful for applications requiring high write throughput, flexible schemas, and wide-column storage. It provides the benefits of Cosmos DB such as global distribution, elastic scalability, and guaranteed low latency while maintaining Cassandra compatibility.
**Table API:**
The Table API offers a key-value storage model similar to Azure Table Storage but with significant enhancements. It provides premium capabilities including global distribution, automatic indexing, dedicated throughput, and single-digit millisecond latency. Applications already using Azure Table Storage can migrate to Cosmos DB Table API with minimal changes and benefit from higher SLAs and performance guarantees. Data is organized in tables with rows identified by partition keys and row keys. This API is ideal for applications that need simple key-value lookups, semi-structured data storage, and don't require complex querying or relationships.
**Gremlin API:**
The Gremlin API supports graph database functionality, allowing you to model, store, and query data as graphs consisting of vertices (nodes) and edges (relationships). It uses the Apache TinkerPop Gremlin traversal language for querying graph structures. This API is perfect for scenarios involving complex relationships such as social networks, recommendation engines, fraud detection, and knowledge graphs. It enables efficient traversal of deeply connected datasets where relationships between entities are as important as the entities themselves.
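The vertex-and-edge traversal idea can be sketched with a toy graph. The names are made up, and the adjacency dict stands in for a real graph store; an actual query would use Gremlin syntax such as `g.V('alice').out('knows').out('knows')`:

```python
# Toy traversal in the spirit of Gremlin's vertex/edge model: vertices are
# people, edges are "knows" relationships, and two hops finds friends-of-friends.

knows = {
    "alice": {"bob"},
    "bob": {"carol", "dave"},
    "carol": set(),
    "dave": set(),
}

def out(vertices: set, edges: dict) -> set:
    """One traversal step: follow outgoing edges from each vertex."""
    return set().union(*(edges[v] for v in vertices)) if vertices else set()

friends_of_friends = out(out({"alice"}, knows), knows)
print(sorted(friends_of_friends))   # → ['carol', 'dave']
```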
All three APIs benefit from Cosmos DB's core features: turnkey global distribution, elastic scalability, comprehensive SLAs, and automatic indexing.
Cosmos DB Request Units and Partitioning
Azure Cosmos DB is a globally distributed, multi-model database service that uses two fundamental concepts for managing performance and scalability: Request Units (RUs) and Partitioning.
**Request Units (RUs):**
Request Units represent a normalized measure of the cost of database operations in Cosmos DB. Every operation—whether it's a read, write, update, or delete—consumes a certain number of RUs. The cost depends on factors like item size, property count, data consistency level, query complexity, and indexing patterns. For example, a simple point read of a 1 KB item costs approximately 1 RU, while writes and complex queries consume more.
You provision throughput in terms of RUs per second (RU/s). This can be set at the container or database level. Azure offers two provisioning models: **manual throughput**, where you set a fixed RU/s, and **autoscale throughput**, which automatically scales between a minimum and maximum based on demand. If your requests exceed provisioned RU/s, Cosmos DB will rate-limit (throttle) subsequent requests until capacity becomes available.
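Rate limiting against a per-second RU budget can be sketched with a simple model. This is an illustration of the concept, not the service's actual algorithm; the 400 RU/s provisioning and 5 RU query cost are example numbers:

```python
# Sketch of RU/s rate limiting: a container provisioned at 400 RU/s serves
# requests until the current second's budget is spent, then throttles
# (HTTP 429 in the real service). Numbers are illustrative.

class RuBudget:
    def __init__(self, provisioned_rus: int):
        self.provisioned = provisioned_rus
        self.remaining = provisioned_rus

    def try_request(self, cost_rus: float) -> bool:
        """Return True if served, False if rate-limited this second."""
        if self.remaining >= cost_rus:
            self.remaining -= cost_rus
            return True
        return False   # caller should retry once the budget resets

    def next_second(self):
        self.remaining = self.provisioned

budget = RuBudget(400)
served = sum(budget.try_request(5) for _ in range(100))   # 100 queries at 5 RU each
print(served)   # → 80 (400 RU/s ÷ 5 RU per query)
```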
**Partitioning:**
Cosmos DB uses partitioning to horizontally scale data. Each container is divided into **logical partitions** based on a **partition key** you define. Choosing the right partition key is critical for even data distribution and optimal performance. A good partition key should have high cardinality and distribute reads/writes evenly across partitions.
Logical partitions are mapped to **physical partitions**, which are managed automatically by Azure. Each physical partition can hold up to approximately 50 GB of data and handle up to 10,000 RU/s. As data grows, Cosmos DB automatically splits physical partitions to maintain performance.
Queries that target a single partition (single-partition queries) are the most efficient. Cross-partition queries, which span multiple partitions, consume more RUs and have higher latency.
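The difference can be made concrete with a toy partition layout. The device data and the choice of deviceId as partition key are invented for illustration:

```python
# Toy illustration: with the partition key in the filter, only one logical
# partition is scanned; without it, every partition must be consulted.

partitions = {
    "device-1": [{"deviceId": "device-1", "temp": t} for t in (20, 21)],
    "device-2": [{"deviceId": "device-2", "temp": t} for t in (30, 31, 32)],
}

def query(partition_key=None):
    """Return (rows, partitions_scanned)."""
    scanned = [partitions[partition_key]] if partition_key else list(partitions.values())
    rows = [r for part in scanned for r in part]
    return rows, len(scanned)

_, scanned = query("device-1")   # single-partition query
print(scanned)                   # → 1
_, scanned = query()             # cross-partition query
print(scanned)                   # → 2
```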
Together, RUs and partitioning form the backbone of Cosmos DB's performance model, enabling predictable, scalable, and globally distributed data management.
Cosmos DB Consistency Levels
Azure Cosmos DB offers five well-defined consistency levels that allow developers to make precise trade-offs between consistency, availability, and performance. These levels, from strongest to weakest, are:
1. **Strong Consistency**: This guarantees linearizability, meaning reads always return the most recent committed version of an item. A write is only visible after it is durably committed by the majority of replicas. This offers the highest consistency but comes with higher latency and lower availability.
2. **Bounded Staleness**: Reads may lag behind writes by at most a configured number of versions (K) or a time interval (T). This provides a predictable consistency window and is ideal for applications that need strong consistency but can tolerate a small, defined lag. It is often recommended for globally distributed applications requiring high consistency.
3. **Session Consistency**: This is the most widely used and is the default level. It guarantees that within a single client session, reads will always see writes made in that session (read-your-writes, monotonic reads, and monotonic writes). Outside the session, consistency is eventually achieved. It is perfect for user-centric applications.
4. **Consistent Prefix**: This level guarantees that reads never see out-of-order writes. If writes are performed in order A, B, C, a reader will see A, AB, or ABC but never out-of-order combinations like AC. However, there is no guarantee on how quickly a reader will see the latest writes.
5. **Eventual Consistency**: This is the weakest level, where there is no ordering guarantee for reads. Replicas will eventually converge in the absence of further writes. It provides the lowest latency and highest availability, making it suitable for scenarios like counting likes, retweets, or non-threaded comments.
Choosing the right consistency level depends on application requirements. Strong and bounded staleness consume more Request Units (RUs) due to replication overhead, while session, consistent prefix, and eventual consistency are more cost-effective. Cosmos DB's flexibility in offering these levels makes it highly adaptable for diverse global application scenarios.
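The Consistent Prefix guarantee described above can be expressed as a small checker over a toy replication log: a read is valid if and only if it is a prefix of the global write order. This is a conceptual sketch, not how the service validates reads:

```python
# Sketch of what Consistent Prefix permits: if the write order is A, B, C,
# a replica may expose any prefix of that order, but never a reordering.

def valid_under_consistent_prefix(observed: list, write_order: list) -> bool:
    """A read is valid iff it is a prefix of the global write order."""
    return observed == write_order[: len(observed)]

writes = ["A", "B", "C"]
print(valid_under_consistent_prefix(["A", "B"], writes))   # → True
print(valid_under_consistent_prefix(["A", "C"], writes))   # → False (out of order)
```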