Common Data File Formats – DP-900 Study Guide
Why Common Data File Formats Matter
Understanding common data file formats is foundational for the DP-900: Microsoft Azure Data Fundamentals exam. Data is stored, transferred, and processed in a variety of file formats, and knowing the characteristics of each is essential for choosing the right one for a given scenario. In practice, every data solution, from a simple data lake to a complex analytics pipeline, relies on data being stored in a particular format; the wrong choice leads to inefficiency, higher storage costs, and poor query performance. On the exam, Microsoft expects you to distinguish between formats and understand when each is appropriate.
What Are Common Data File Formats?
Common data file formats are standardized ways of structuring and encoding data for storage and transmission. The key formats you need to know for the DP-900 exam include:
1. CSV (Comma-Separated Values)
- A plain-text, delimited format where each row is a line and fields are separated by commas (or sometimes tabs, semicolons, etc.).
- The first row often contains column headers.
- Human-readable and widely supported by virtually every application.
- Limitations: No enforcement of data types. All values are stored as text. Does not support hierarchical or nested data. Large files can be inefficient because there is no compression by default and every value is stored as a string.
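The "everything is text" limitation is easy to see with Python's standard-library csv module; in this illustrative sketch, the numeric values written out come back as plain strings:

```python
import csv
import io

# Write a small table to CSV: a header row plus one data row.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name", "price"])
writer.writerow([1, "widget", 9.99])

# Read it back: the csv module returns every field as a string,
# because CSV itself carries no data-type information.
buf.seek(0)
reader = csv.reader(buf)
header = next(reader)
row = next(reader)
print(row)  # ['1', 'widget', '9.99'] -- all strings
```

Any typing (integers, decimals, dates) must be reapplied by the consuming application, which is exactly the "no enforcement of data types" limitation the exam expects you to know.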
2. JSON (JavaScript Object Notation)
- A semi-structured, text-based format that represents data as key-value pairs and arrays.
- Supports hierarchical (nested) data structures, making it ideal for representing complex objects.
- Human-readable and widely used in web APIs, NoSQL databases, and configuration files.
- Each document can have a different structure (schema-on-read), providing flexibility.
- Limitations: Can be verbose due to repeated key names. Not the most space-efficient format for large-scale analytical workloads.
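A short sketch with Python's json module shows both the nesting and the schema-on-read flexibility: the two documents below deliberately have different shapes, and both round-trip without any declared schema.

```python
import json

# Two documents with different structures -- JSON does not impose a
# fixed schema, so each record can carry its own shape.
docs = [
    {"id": 1, "name": "Ada", "address": {"city": "Seattle", "zip": "98101"}},
    {"id": 2, "name": "Grace", "tags": ["admin", "beta"]},
]

text = json.dumps(docs)    # serialize to a JSON string
parsed = json.loads(text)  # parse it back

# Nested values survive the round trip as dictionaries and lists.
city = parsed[0]["address"]["city"]
print(city)  # Seattle
```

Note how the repeated key names ("id", "name") appear in every document; that repetition is what makes JSON verbose at analytical scale.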
3. XML (Extensible Markup Language)
- A text-based, tag-delimited format that supports hierarchical data using opening and closing tags.
- Historically the dominant format for data interchange, before JSON gained popularity.
- Supports schemas (XSD) for strict validation and namespaces for avoiding element name conflicts.
- Human-readable but more verbose than JSON.
- Limitations: Very verbose, leading to larger file sizes. More complex to parse than JSON.
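For comparison with the JSON example, here is the same kind of hierarchical record expressed as XML and parsed with Python's standard-library ElementTree (the element names are illustrative):

```python
import xml.etree.ElementTree as ET

# Hierarchical data expressed with opening and closing tags --
# the same structure as nested JSON, but noticeably more verbose.
xml_text = """
<customer id="1">
  <name>Ada</name>
  <address>
    <city>Seattle</city>
  </address>
</customer>
"""

root = ET.fromstring(xml_text)
name = root.find("name").text          # child element text
city = root.find("address/city").text  # nested element via a simple path
print(name, city)  # Ada Seattle
```

Attributes (like id="1" here) and namespaces are XML features with no direct JSON equivalent, which is one reason parsing XML is more involved.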
4. Avro
- A row-based binary format developed within the Apache Hadoop ecosystem.
- Stores data with an embedded JSON schema in the file header, making it self-describing.
- Excellent for write-heavy workloads and data serialization (e.g., streaming data or message passing).
- Supports schema evolution—fields can be added or removed over time without breaking compatibility.
- Compact and efficient for storage and network transfer.
- Key point: Row-based means entire rows are stored together, which is ideal for writing complete records.
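Schema evolution is the Avro feature the exam most often probes. The sketch below is plain Python (real Avro would use a library such as fastavro) showing the idea behind it: a reader schema supplies defaults for fields that older records were written without, so old and new records can be read side by side.

```python
# A plain-Python sketch of Avro-style schema evolution; the schema
# layout here is simplified and illustrative, not the real Avro spec.
reader_schema = {
    "fields": [
        {"name": "id", "default": None},
        {"name": "name", "default": None},
        {"name": "email", "default": ""},  # field added later, with a default
    ]
}

def resolve(record, schema):
    """Fill in any missing fields using the reader schema's defaults."""
    return {f["name"]: record.get(f["name"], f["default"])
            for f in schema["fields"]}

old_record = {"id": 1, "name": "Ada"}  # written before 'email' existed
new_record = {"id": 2, "name": "Grace", "email": "g@example.com"}

print(resolve(old_record, reader_schema))
# {'id': 1, 'name': 'Ada', 'email': ''}
```

Because the schema travels in the file header, an Avro reader always knows how to resolve records like this, which is what "self-describing" means in practice.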
5. Parquet
- A columnar binary format, also an Apache project originating in the Hadoop ecosystem.
- Stores data column by column rather than row by row.
- Excellent for read-heavy, analytical workloads because queries often only need a subset of columns. Columnar storage allows the engine to read only the columns needed, greatly improving performance and reducing I/O.
- Supports efficient compression because similar data types are stored together in columns, leading to smaller file sizes.
- Supports schema information and complex nested data structures.
- Key point: The go-to format for data lakes and big data analytics scenarios.
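The row-based vs. columnar distinction can be sketched in plain Python (real Parquet files would be written with a library such as pyarrow): once the same records are regrouped by column, an aggregate query touches only the column it needs.

```python
# Row-based layout: one record (dict) per row, fields stored together.
rows = [
    {"id": 1, "city": "Seattle", "sales": 100},
    {"id": 2, "city": "Austin",  "sales": 250},
    {"id": 3, "city": "Seattle", "sales": 175},
]

# Columnar layout: one list per column instead of one dict per row.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# An analytical query like "SELECT SUM(sales)" reads only the
# 'sales' column; 'id' and 'city' are never touched.
total = sum(columns["sales"])
print(total)  # 525
```

In a real Parquet file the per-column data is also encoded and compressed, so skipping unneeded columns saves both I/O and decompression work.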
6. ORC (Optimized Row Columnar)
- Another columnar binary format optimized for large-scale data processing (commonly associated with Apache Hive).
- Similar benefits to Parquet: efficient compression, column pruning, and fast analytical reads.
- Supports predicate pushdown (filtering at the storage level) and lightweight indexing.
- Often mentioned alongside Parquet as a columnar analytics format.
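Predicate pushdown is worth seeing concretely. This plain-Python sketch mimics the idea behind ORC's lightweight stripe indexes (the stripe layout here is invented for illustration): each chunk of rows carries min/max statistics, and a filter can skip whole chunks without reading their values.

```python
# Each "stripe" holds a block of values plus min/max statistics,
# loosely modeled on ORC stripe metadata.
stripes = [
    {"min": 1,   "max": 100, "values": [5, 40, 100]},
    {"min": 101, "max": 200, "values": [110, 150, 200]},
    {"min": 201, "max": 300, "values": [250, 260, 300]},
]

def scan_greater_than(stripes, threshold):
    """Return values > threshold, skipping stripes the stats rule out."""
    hits = []
    for stripe in stripes:
        if stripe["max"] <= threshold:
            continue  # whole stripe eliminated without reading its values
        hits.extend(v for v in stripe["values"] if v > threshold)
    return hits

print(scan_greater_than(stripes, 150))  # [200, 250, 260, 300]
```

The first stripe (max 100) is eliminated by its statistics alone; that storage-level filtering is what "predicate pushdown" refers to.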
How It All Works Together
When data enters a system, the choice of file format depends on the workload:
- Data ingestion and streaming: Row-based formats like Avro are preferred because entire records are written quickly and schema evolution is supported.
- Data analytics and querying: Columnar formats like Parquet or ORC are preferred because analytical queries typically scan specific columns across many rows, and columnar storage minimizes unnecessary I/O.
- Data exchange between systems: Text-based formats like CSV, JSON, or XML are commonly used because of their broad compatibility and human readability.
- Web APIs and NoSQL stores: JSON is the dominant format for REST APIs and document databases like Azure Cosmos DB.
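The workload-to-format guidance above can be condensed into a small lookup; the category names and the helper function here are hypothetical, just a way to rehearse the mapping:

```python
# Illustrative mapping of workload categories to preferred formats,
# summarizing the guidance above (names are not an official taxonomy).
PREFERRED_FORMATS = {
    "streaming_ingestion": ["Avro"],
    "analytics": ["Parquet", "ORC"],
    "data_exchange": ["CSV", "JSON", "XML"],
    "web_api": ["JSON"],
}

def suggest_formats(workload):
    # Fall back to CSV, the most broadly compatible format.
    return PREFERRED_FORMATS.get(workload, ["CSV"])

print(suggest_formats("analytics"))  # ['Parquet', 'ORC']
```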
In Azure, these formats are used throughout the platform:
- Azure Data Lake Storage stores data in Parquet, Avro, CSV, JSON, and other formats.
- Azure Synapse Analytics and Azure Databricks heavily leverage Parquet for analytics.
- Azure Event Hubs and Azure Stream Analytics commonly use Avro or JSON for streaming data.
- Azure Blob Storage can hold data in any format.
Key Comparisons to Remember
Row-based vs. Columnar:
- Row-based (CSV, Avro): Data is stored one row at a time. Better for writing full records and transactional operations.
- Columnar (Parquet, ORC): Data is stored one column at a time. Better for reading subsets of columns in analytical queries.
Text-based vs. Binary:
- Text-based (CSV, JSON, XML): Human-readable, larger file sizes, and no built-in compression (they can still be compressed externally, e.g. gzipped, but the format itself does nothing to reduce size).
- Binary (Avro, Parquet, ORC): Not human-readable, compact, highly efficient with built-in compression support.
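Why columnar layouts compress so well can be demonstrated with a small sketch: the same bytes, once grouped by column so that similar values sit next to each other, compress better than the row-ordered version. (This uses generic zlib compression on made-up data; real Parquet/ORC encodings are more sophisticated.)

```python
import zlib

# 1000 rows of (id, category), where the category column is repetitive.
rows = [(i, "electronics") for i in range(1000)]

# Row order: values of different columns interleaved, line by line.
row_wise = "".join(f"{i},{cat}\n" for i, cat in rows).encode()

# Column order: all ids first, then all categories grouped together.
col_wise = ("".join(f"{i}\n" for i, _ in rows)
            + "".join(f"{cat}\n" for _, cat in rows)).encode()

row_size = len(zlib.compress(row_wise))
col_size = len(zlib.compress(col_wise))
# The raw byte counts are identical; only the ordering differs,
# and the column-grouped version typically compresses smaller.
print(col_size < row_size)
```

The repeated "electronics" values collapse to almost nothing once grouped, which is the intuition behind the exam's claim that columnar formats give the best compression.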
Structured vs. Semi-structured:
- CSV is considered structured (tabular rows and columns).
- JSON and XML are semi-structured (flexible schema, hierarchical nesting).
- Avro, Parquet, and ORC support both structured and semi-structured data with embedded schemas.
Exam Tips: Answering Questions on Common Data File Formats
1. Know the key characteristics of each format. The exam frequently asks you to identify a format based on a description. If a question mentions columnar storage optimized for analytics, think Parquet. If it mentions row-based with embedded schema for data serialization, think Avro.
2. Understand row-based vs. columnar. This is one of the most commonly tested distinctions. Remember: columnar = analytical reads; row-based = writes and record-level operations.
3. Remember that Parquet is the preferred format for analytical queries in Azure data lakes. If a question involves Azure Synapse, Azure Databricks, or a data lake optimized for analysis, Parquet is almost always the correct answer.
4. Remember that Avro is preferred for data ingestion and streaming scenarios. Questions about Azure Event Hubs, Kafka, or write-heavy workloads will often point to Avro.
5. JSON is the default for semi-structured data and web APIs. If a question asks about REST APIs, Azure Cosmos DB, or flexible/hierarchical document storage, JSON is the answer.
6. CSV is the simplest and most universally compatible format but lacks data type enforcement and is not efficient for large-scale analytics. Questions about basic data export, broad compatibility, or human readability often point to CSV.
7. Don't confuse XML with JSON. XML uses tags and is more verbose. JSON uses key-value pairs and is more lightweight. If a question mentions tags and namespaces, it's XML.
8. Watch for keywords in the question. Words like compressed, efficient, optimized, analytical, columnar → Parquet/ORC. Words like serialization, streaming, schema evolution, row-based → Avro. Words like human-readable, plain text, delimited → CSV. Words like nested, hierarchical, key-value, documents, API → JSON.
9. Understand compression benefits. Binary columnar formats (Parquet, ORC) achieve the best compression because columns of the same data type compress efficiently. This leads to cost savings in storage and faster reads.
10. Practice scenario-based questions. The DP-900 exam often presents a scenario and asks which format is best. Focus on whether the scenario is about writing data (Avro, CSV), reading/analyzing data (Parquet, ORC), or exchanging data (JSON, CSV, XML). This simple framework will help you quickly narrow down the correct answer.