Features of Semi-Structured Data – DP-900 Exam Guide
Introduction
Semi-structured data is one of the most important concepts tested on the Microsoft DP-900: Azure Data Fundamentals exam. Understanding how it differs from structured and unstructured data, and knowing its key features, is essential for answering exam questions accurately. This guide covers what semi-structured data is, why it matters, how it works, and tips for answering exam questions.
Why Is Semi-Structured Data Important?
In today's data landscape, a significant portion of real-world data does not fit neatly into traditional relational tables. Semi-structured data bridges the gap between fully structured data (like rows and columns in a relational database) and completely unstructured data (like images, audio, and free-form text). Understanding semi-structured data is important because:
• It is widely used in modern applications, APIs, IoT systems, and web services.
• Cloud platforms like Microsoft Azure provide multiple services to store and process semi-structured data (e.g., Azure Cosmos DB, Azure Blob Storage).
• Many data integration and analytics scenarios require working with semi-structured formats such as JSON, XML, and YAML.
• It enables flexibility in how data is organized without sacrificing the ability to query and analyze it.
What Is Semi-Structured Data?
Semi-structured data is data that does not conform to a rigid, fixed schema like a relational database table, but still has some organizational structure through the use of tags, keys, markers, or hierarchies that separate elements and establish relationships.
Key Characteristics of Semi-Structured Data:
1. No Fixed Schema (Schema-on-Read): Unlike structured data, semi-structured data does not require a predefined schema. The schema is interpreted when the data is read, not when it is written. This means different records or documents in the same collection can have different fields.
2. Self-Describing: Semi-structured data contains metadata or tags within the data itself that describe the structure and meaning of the content. For example, in JSON, each data element is labeled with a key name, and in XML, tags describe the data they contain.
3. Hierarchical or Nested Structure: Semi-structured data supports nesting and hierarchies. An entity can contain sub-entities, arrays, or nested objects. For example, a JSON document can have an array of addresses inside a customer object.
4. Flexible and Varied: Each data entity (document or record) can have a different set of fields. One customer record might have a phone number while another does not. This is sometimes referred to as having a variable structure.
5. Common Formats: The most common semi-structured data formats include:
• JSON (JavaScript Object Notation): Lightweight, human-readable, widely used in web APIs and NoSQL databases like Azure Cosmos DB.
• XML (Extensible Markup Language): Uses tags to define elements, commonly used in enterprise data exchange and configuration files.
• YAML (YAML Ain't Markup Language): Human-friendly format often used for configuration files.
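• Avro, Parquet, ORC: File formats that store a schema alongside the data and can represent nested, semi-structured records. Avro is row-based, while Parquet and ORC are columnar. All three are often used in big data processing.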
6. Entities May Have Unique Keys: Semi-structured data often uses unique identifiers (keys) for each entity or document, enabling retrieval and indexing without a relational schema.
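The characteristics above can be illustrated with a minimal Python sketch. The two customer documents, their field names, and the "id" key are hypothetical and chosen only for illustration; the sketch uses Python's standard json module, not any Azure service.

```python
import json

# Two hypothetical customer documents that live in the same collection.
# Field names and values are illustrative assumptions, not real data.
raw_documents = [
    '{"id": "c1", "name": "Jane Smith", "phone": "555-0100"}',
    '{"id": "c2", "name": "Sam Lee", "addresses": [{"type": "home", "city": "Seattle"}]}',
]

documents = [json.loads(text) for text in raw_documents]

# Self-describing: each document carries its own field names.
field_names = [sorted(doc.keys()) for doc in documents]

# Variable structure: a field present in one document may be missing in another,
# so the reader decides how to handle the gap (schema-on-read).
phones = {doc["id"]: doc.get("phone") for doc in documents}

# Unique keys: index documents by identifier for direct lookup.
by_id = {doc["id"]: doc for doc in documents}

print(field_names)  # [['id', 'name', 'phone'], ['addresses', 'id', 'name']]
print(phones)       # {'c1': '555-0100', 'c2': None}
print(by_id["c2"]["addresses"][0]["city"])  # Seattle
```

Both documents load without any table definition, and the code that reads them decides what a missing phone field means.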
How Does Semi-Structured Data Work?
Semi-structured data is typically stored in document databases, key-value stores, or data lakes. Here is how it works in practice:
Storage:
• In Azure Cosmos DB, semi-structured data is stored as JSON documents in containers. Each document can have a different structure.
• In Azure Blob Storage or Azure Data Lake Storage, semi-structured files like JSON or XML can be stored as files and processed by analytics engines.
Querying:
• Azure Cosmos DB for NoSQL supports a SQL-like query language for retrieving data from JSON documents.
• Azure Synapse Analytics and Azure Data Factory can parse and process semi-structured data files.
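To give a feel for what querying documents with varying structure involves, here is a small local sketch in Python. It does not call Azure Cosmos DB or any Azure SDK; the in-memory list, the sample customers, and the function name names_with_city are all assumptions made for illustration. In Cosmos DB for NoSQL, a query similar in spirit might look roughly like SELECT c.name FROM c JOIN a IN c.addresses WHERE a.city = "Redmond", shown here for orientation only.

```python
# Hypothetical in-memory stand-in for a container of JSON documents.
container = [
    {"customerId": 1, "name": "Jane Smith",
     "addresses": [{"type": "home", "city": "Seattle"},
                   {"type": "work", "city": "Redmond"}]},
    {"customerId": 2, "name": "Sam Lee"},  # this document has no addresses field
    {"customerId": 3, "name": "Ana Diaz",
     "addresses": [{"type": "home", "city": "Redmond"}]},
]


def names_with_city(documents, city):
    """Return names of customers with at least one address in `city`."""
    return [
        doc["name"]
        for doc in documents
        # A missing "addresses" field is treated as an empty list.
        if any(addr.get("city") == city for addr in doc.get("addresses", []))
    ]


print(names_with_city(container, "Redmond"))  # ['Jane Smith', 'Ana Diaz']
```

The filter has to tolerate documents that lack the addresses field, which is the practical consequence of storing records whose structure can vary.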
Example of Semi-Structured Data (JSON):
{
  "customerId": 1,
  "name": "Jane Smith",
  "email": "jane@example.com",
  "addresses": [
    {"type": "home", "city": "Seattle"},
    {"type": "work", "city": "Redmond"}
  ]
}
Notice how this document is self-describing (each value has a key), hierarchical (addresses are nested), and flexible (another customer document could omit the addresses field entirely).
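Because the document is self-describing, a program can discover its structure at read time without being given a schema. The following Python sketch walks the example document and lists every field path it finds. The helper name describe and the dotted path notation are assumptions made for this illustration.

```python
import json

document_text = """
{
  "customerId": 1,
  "name": "Jane Smith",
  "email": "jane@example.com",
  "addresses": [
    {"type": "home", "city": "Seattle"},
    {"type": "work", "city": "Redmond"}
  ]
}
"""


def describe(value, path=""):
    """Yield (path, type name) pairs for every leaf value in parsed JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from describe(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from describe(child, f"{path}[{index}]")
    else:
        yield path, type(value).__name__


customer = json.loads(document_text)
for field_path, type_name in describe(customer):
    print(field_path, type_name)
# customerId int
# name str
# email str
# addresses[0].type str
# addresses[0].city str
# addresses[1].type str
# addresses[1].city str
```

The keys supply the field names and the nesting supplies the hierarchy, so nothing outside the document is needed to interpret it.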
Comparing Data Types:
Structured Data: Fixed schema, rows and columns, relational databases (e.g., Azure SQL Database). Every record has the same fields.
Semi-Structured Data: Flexible schema, self-describing with tags/keys, stored in NoSQL databases or files (e.g., JSON in Azure Cosmos DB). Records can vary.
Unstructured Data: No predefined schema or internal structure. Examples include images, videos, audio files, PDFs, and free-form text (e.g., stored in Azure Blob Storage).
Exam Tips: Answering Questions on Features of Semi-Structured Data
1. Know the definition: If a question asks what semi-structured data is, remember it has some organizational properties (tags, keys, markers) but does not conform to a rigid tabular schema. It is NOT the same as unstructured data.
2. Recognize formats: JSON, XML, and YAML are the go-to examples of semi-structured data. If an exam question mentions any of these formats, the answer is almost certainly semi-structured data.
3. Self-describing is key: A hallmark feature of semi-structured data tested on the exam is that it is self-describing. The data contains its own metadata (through keys or tags). If you see this phrase, think semi-structured.
4. Schema-on-read vs. schema-on-write: Semi-structured data uses a schema-on-read approach. Structured/relational data uses schema-on-write. This distinction is commonly tested.
5. Differentiate from structured and unstructured: Many exam questions present scenarios and ask you to identify the data type. If data has tags or keys but no fixed table schema, it is semi-structured. If it is in rows and columns with a fixed schema, it is structured. If it has no organizational structure (images, video), it is unstructured.
6. Azure Cosmos DB connection: The exam frequently associates semi-structured data with Azure Cosmos DB, which is a NoSQL database designed to store JSON documents. If a question mentions Cosmos DB and asks about the type of data, the answer is likely semi-structured.
7. Variable fields: Remember that in semi-structured data, not all entities need to have the same set of fields. This flexibility is a distinguishing feature and is often tested through scenario-based questions.
8. Watch for tricky wording: Some questions may describe data that has some structure (like key-value pairs) but is not in a relational table. Do not be misled into selecting structured data. If it is not in a rigid tabular format with a predefined schema, it is most likely semi-structured.
9. Serialization formats in big data: If a question mentions Avro, Parquet, or ORC in the context of data that is not purely tabular, consider semi-structured as a possible answer, though these formats can also represent structured data depending on context.
10. Practice with examples: Before the exam, make sure you can look at a sample JSON or XML snippet and identify it as semi-structured data. The exam may present visual or text-based examples.
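The schema-on-write versus schema-on-read distinction from tip 4 can be seen in a short Python sketch. It uses an in-memory SQLite table as a stand-in for a relational database and plain JSON strings as a stand-in for a document store. The table name, column names, and sample values are assumptions for illustration, and neither part involves an Azure service.

```python
import json
import sqlite3

# Schema-on-write: the set of columns is fixed before any row is stored.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
)
connection.execute(
    "INSERT INTO customer (id, name, email) VALUES (1, 'Jane Smith', 'jane@example.com')"
)

try:
    # "phone" is not a column in the schema, so the write is rejected.
    connection.execute(
        "INSERT INTO customer (id, name, phone) VALUES (2, 'Sam Lee', '555-0100')"
    )
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True

row_count = connection.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# Schema-on-read: documents are stored as-is and interpreted when read.
stored_documents = [
    '{"id": 1, "name": "Jane Smith", "email": "jane@example.com"}',
    '{"id": 2, "name": "Sam Lee", "phone": "555-0100"}',
]
emails = [json.loads(text).get("email", "(none)") for text in stored_documents]

print(write_rejected)  # True
print(row_count)       # 1
print(emails)          # ['jane@example.com', '(none)']
```

The relational table refuses the record that does not match its columns, while the document collection accepts both records and leaves the handling of the missing email to the reader.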
Summary
Semi-structured data is a flexible, self-describing data type that uses tags, keys, or markers to provide organizational structure without requiring a fixed schema. Common formats include JSON, XML, and YAML. It is a critical concept on the DP-900 exam, and understanding its features, especially how it differs from structured and unstructured data, will help you answer exam questions confidently. Focus on recognizing the key characteristics: no fixed schema, self-describing nature, hierarchical structure, variable fields across entities, and its association with NoSQL databases like Azure Cosmos DB.