Delta Lake, Iceberg and Hudi are the three open source table formats that bring ACID transactions to the data lake. In 2025, this architectural choice comes up systematically in Senior Data Engineer interviews.
1. The Limits of Parquet Without a Table Format
Discriminating question
Why is Parquet alone not enough for a production data lake?
- No ACID guarantees: two jobs writing to the same path at the same time can corrupt the data
- No rollback: if a job fails halfway through, the files it already wrote stay in place and the dataset is left in an inconsistent state
- Fragile schema evolution: adding a column can break existing readers
- No native upsert: Parquet files are immutable, so updating a row means rewriting at least the file that contains it, and in practice often the whole partition
- No time travel: yesterday's data cannot be queried unless a separate copy was saved
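The rollback problem above can be reproduced without any engine. The sketch below is a toy simulation, not Parquet itself: small CSV text files stand in for Parquet part files, and `write_batch`, `read_all` and the `orders` directory are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_batch(target: Path, chunks: list, fail_after: Optional[int] = None) -> None:
    """Write one file per chunk straight into the table directory, like a naive job."""
    target.mkdir(parents=True, exist_ok=True)
    for i, rows in enumerate(chunks):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("executor lost")  # simulated crash in the middle of the job
        (target / f"part-{i:05d}.csv").write_text("\n".join(rows))


def read_all(target: Path) -> list:
    """A reader that trusts the directory listing, with no notion of a commit."""
    rows = []
    for part in sorted(target.glob("part-*.csv")):
        rows.extend(part.read_text().splitlines())
    return rows


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "orders"
    batch = [["1,10.0", "2,20.0"], ["3,30.0"], ["4,40.0"]]
    try:
        write_batch(table, batch, fail_after=2)
    except RuntimeError:
        pass  # the job is gone, but the two files it wrote are still there
    visible_after_crash = read_all(table)
```

After the crash, a reader sees three of the four rows and has no way to tell that the batch is incomplete. A table format solves this by making files visible only once a commit references them.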
2. Delta Lake: The Databricks/Microsoft Standard
Discriminating question
What are the main components of Delta Lake?
- Transaction log: the _delta_log/ directory holds one JSON file per commit, plus periodic Parquet checkpoints. It is the source of truth for ACID and time travel
- OPTIMIZE + ZORDER: compact small files and co-locate data that is frequently queried together
- Auto Optimize: automatic compaction on Databricks. Reduces the small files created by streaming writes
- Ecosystem: native on Databricks, supported by Spark and PySpark. Increasingly supported outside Databricks (Trino, Flink)
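To make the transaction log idea concrete, here is a minimal sketch of a log that is replayed to rebuild a table snapshot. It borrows the shape of Delta's commit files (zero-padded JSON files containing `add` and `remove` actions) but leaves out everything else in the real protocol, such as metadata actions, checkpoints and statistics. The helper names `commit` and `snapshot` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(log_dir: Path, version: int, actions: list) -> None:
    """Write one commit as a zero-padded JSON-lines file; never overwrite a version."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file already exists, so two writers cannot claim the same version.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir: Path, version_as_of: Optional[int] = None) -> list:
    """Replay commits in order and return the data files live at the requested version."""
    live = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        if version_as_of is not None and int(commit_file.stem) > version_as_of:
            break
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "orders" / "_delta_log"
    commit(log, 0, [{"add": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}])
    # Version 1 compacts both files into one: two removes and one add in a single commit.
    commit(log, 1, [
        {"remove": {"path": "part-0.parquet"}},
        {"remove": {"path": "part-1.parquet"}},
        {"add": {"path": "part-2.parquet"}},
    ])
    latest = snapshot(log)
    version_0 = snapshot(log, version_as_of=0)
    try:
        commit(log, 1, [{"add": {"path": "conflict.parquet"}}])
        conflict_rejected = False
    except FileExistsError:
        conflict_rejected = True
```

Time travel falls out of the design: reading version 0 simply stops the replay early, and the old files are still on disk until they are vacuumed.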
```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta 2.4.0 targets Spark 3.4; from Delta 3.0 the artifact is named delta-spark
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a Delta table (df is an existing DataFrame with a "date" column)
df.write.format("delta").partitionBy("date").save("/data/delta/orders")

# Read with time travel: versionAsOf pins a commit version, timestampAsOf a point in time
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/data/delta/orders")

# MERGE (upsert) on Delta: source_df holds the new and changed rows
delta_table = DeltaTable.forPath(spark, "/data/delta/orders")
(
    delta_table.alias("t")
    .merge(source_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# OPTIMIZE + ZORDER, then delete unreferenced files older than 7 days
spark.sql("OPTIMIZE delta.`/data/delta/orders` ZORDER BY (customer_id)")
spark.sql("VACUUM delta.`/data/delta/orders` RETAIN 168 HOURS")
```
- ACID transactions: Delta Lake guarantees atomicity, so a job that fails halfway never exposes partial data to readers. Raw Parquet cannot offer this
- Schema enforcement: Delta rejects writes whose schema is incompatible with the table. Schema evolution is opt-in, with the mergeSchema=true write option to add columns
- Small files problem: OPTIMIZE compacts small Parquet files into larger ones. ZORDER clusters rows on the chosen columns so that per-file min/max statistics let the engine skip files
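The effect of Z-ordering on file pruning can be illustrated with a Morton code, which interleaves the bits of two columns. This is a simplified sketch of the general technique, not Delta's actual implementation; the 8 x 8 grid of `(customer_id, day)` pairs and the four-rows-per-file layout are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton code: bit i of x goes to position 2*i, bit i of y to position 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_order(rows, bits: int = 16):
    """Sort (customer_id, day) pairs along the Z-curve."""
    return sorted(rows, key=lambda r: interleave_bits(r[0], r[1], bits))


def split_into_files(rows, rows_per_file: int):
    """Cut the ordered rows into fixed-size files and keep per-file min/max statistics."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "min": (min(r[0] for r in chunk), min(r[1] for r in chunk)),
            "max": (max(r[0] for r in chunk), max(r[1] for r in chunk)),
        })
    return files


def files_to_scan(files, column: int, value: int) -> int:
    """Count the files whose min/max range for the column could contain the value."""
    return sum(1 for f in files if f["min"][column] <= value <= f["max"][column])


# Hypothetical 8 x 8 grid of (customer_id, day) pairs, four rows per file.
grid = [(c, d) for c in range(8) for d in range(8)]
by_customer_only = split_into_files(sorted(grid), 4)
z_ordered = split_into_files(z_order(grid), 4)
```

With a plain sort on `customer_id`, a filter on customer 3 reads 2 of the 16 files but a filter on day 5 reads 8. With the Z-order layout both filters read 4 files: a little worse on the first column, much better on the second.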
3. Apache Iceberg: The Multi-Engine Standard
Discriminating question
How is Iceberg superior to Delta Lake in multi-engine environments?
- Open and portable: natively supported by Spark, Trino, Flink, Hive, Dremio, Athena (AWS), BigQuery. No vendor dependency
- Partition evolution: change the partitioning scheme without rewriting existing data. Delta Lake does not support this natively
- Hidden partitioning: users filter on the source column (for example order_date) and Iceberg derives the partition values, so they do not need to know the partition layout to write efficient queries
- Catalog: Iceberg catalogs (REST, Hive, Glue, Nessie) centralize table metadata for governance
- 2025 adoption: AWS, Google Cloud and Azure support Iceberg natively. Strong trend
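Hidden partitioning is easier to see in a toy model. The sketch below assumes a table partitioned by the day transform, which the Iceberg specification defines as the number of days since 1970-01-01. The `ToyTable` class and its methods are hypothetical and only mimic the behavior: writers and readers deal with `order_date`, never with the partition value.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def days_transform(d: date) -> int:
    """Day transform: number of days since 1970-01-01."""
    return (d - EPOCH).days


class ToyTable:
    """Hypothetical table that files rows under days(order_date) without exposing it."""

    def __init__(self):
        self.partitions = {}  # partition value -> list of (order_id, order_date)

    def append(self, order_id: int, order_date: date) -> None:
        # The writer only supplies order_date; the partition value is derived from it.
        self.partitions.setdefault(days_transform(order_date), []).append((order_id, order_date))

    def scan(self, start: date, end: date):
        """Filter on order_date; return the matching rows and the number of partitions read."""
        lo, hi = days_transform(start), days_transform(end)
        touched = [p for p in self.partitions if lo <= p <= hi]
        rows = [r for p in touched for r in self.partitions[p] if start <= r[1] <= end]
        return sorted(rows), len(touched)


table = ToyTable()
table.append(1, date(2025, 1, 1))
table.append(2, date(2025, 1, 1))
table.append(3, date(2025, 1, 2))
table.append(4, date(2025, 3, 15))
rows, partitions_read = table.scan(date(2025, 1, 1), date(2025, 1, 2))
```

The predicate on `order_date` is translated into a range of partition values, so the March partition is never opened even though the query does not mention any partition column.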
```sql
-- Iceberg with Spark SQL
CREATE TABLE catalog.db.orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE,
    order_date DATE
) USING iceberg
PARTITIONED BY (days(order_date));

-- Iceberg Time Travel
SELECT * FROM catalog.db.orders
FOR SYSTEM_TIME AS OF '2025-01-01 00:00:00';

-- Iceberg Compaction
CALL catalog.system.rewrite_data_files(
    table => 'db.orders',
    strategy => 'sort',
    sort_order => 'zorder(customer_id, amount)'
);

-- Expire old snapshots (expired snapshots are no longer reachable by time travel)
CALL catalog.system.expire_snapshots(
    table => 'db.orders',
    older_than => TIMESTAMP '2025-01-01 00:00:00'
);
```
- Open format: Iceberg is not tied to Databricks. Works with Spark, Trino, Flink, DuckDB, BigQuery Omni
- Row-level deletes: DELETE FROM ... WHERE works without rewriting the entire partition. Copy-on-write (default) or merge-on-read (better suited to frequent DELETEs)
- Partition evolution: change the partition strategy without rewriting data. Impossible with Hive-style partitioning
- Iceberg catalog: REST Catalog (Polaris, Nessie), AWS Glue, Hive Metastore. The catalog maintains metadata and snapshots
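Partition evolution works because each data file remembers the partition spec it was written under, so old and new layouts can coexist in one table. The following is a minimal sketch of that idea under simplifying assumptions: the `SPECS` mapping, `make_file` and `plan_scan` are hypothetical, and partition values are plain tuples rather than Iceberg's real transform outputs.

```python
from datetime import date

# Two partition specs for the same table: monthly at first, daily after the change.
SPECS = {
    0: lambda d: (d.year, d.month),         # stands in for months(order_date)
    1: lambda d: (d.year, d.month, d.day),  # stands in for days(order_date)
}


def make_file(path: str, spec_id: int, order_date: date) -> dict:
    """Record a data file with the spec it was written under and its partition value."""
    return {"path": path, "spec_id": spec_id, "partition": SPECS[spec_id](order_date)}


def plan_scan(files: list, wanted: date) -> list:
    """Keep the files whose partition value matches the filter under their own spec."""
    return [f["path"] for f in files if SPECS[f["spec_id"]](wanted) == f["partition"]]


files = [
    make_file("jan-2025.parquet", 0, date(2025, 1, 10)),   # written before the change
    make_file("feb-2025.parquet", 0, date(2025, 2, 3)),
    make_file("2025-03-01.parquet", 1, date(2025, 3, 1)),  # written after the change
    make_file("2025-03-02.parquet", 1, date(2025, 3, 2)),
]

old_data = plan_scan(files, date(2025, 1, 25))
new_data = plan_scan(files, date(2025, 3, 2))
```

Nothing is rewritten when the spec changes: the January file is still pruned by month, while the March files are pruned by day.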
4. Apache Hudi: The CDC Specialist
Discriminating question
When do you use Hudi over Delta or Iceberg?
- Hudi: optimized for frequent upserts and CDC (Change Data Capture). Two table types: COW (Copy-on-Write) and MOR (Merge-on-Read)
- COW: rewrite the affected data files on every update. Fast reads, slow writes
- MOR: append updates to log files next to the base files and merge them at read time, until compaction folds them in. Fast writes, slower reads
- Use cases: CDC pipelines from transactional databases (Debezium + Hudi), frequently changing data
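The COW versus MOR trade-off can be sketched with two small in-memory classes. This is a toy model, not Hudi's API: a dictionary stands in for a base file, a list of dictionaries for the log files, and `rows_rewritten` is a hypothetical counter used as a rough proxy for write cost.

```python
class CopyOnWriteTable:
    """Every upsert rewrites the base file; reads return it as-is."""

    def __init__(self, rows: dict):
        self.base = dict(rows)
        self.rows_rewritten = 0

    def upsert(self, changes: dict) -> None:
        merged = dict(self.base)
        merged.update(changes)
        self.rows_rewritten += len(merged)  # the whole file is written again
        self.base = merged

    def read(self) -> dict:
        return dict(self.base)


class MergeOnReadTable:
    """Upserts are appended to a log; reads merge base and log; compaction folds them."""

    def __init__(self, rows: dict):
        self.base = dict(rows)
        self.log = []
        self.rows_rewritten = 0

    def upsert(self, changes: dict) -> None:
        self.log.append(dict(changes))  # cheap append, base file untouched

    def read(self) -> dict:
        merged = dict(self.base)
        for changes in self.log:  # later log entries win
            merged.update(changes)
        return merged

    def compact(self) -> None:
        self.base = self.read()
        self.rows_rewritten += len(self.base)
        self.log = []


initial = {i: "pending" for i in range(1000)}
cow, mor = CopyOnWriteTable(initial), MergeOnReadTable(initial)
for order_id in (1, 2, 3):  # three small CDC batches
    cow.upsert({order_id: "shipped"})
    mor.upsert({order_id: "shipped"})
same_result = cow.read() == mor.read()
mor.compact()
```

Both tables return the same rows, but the COW table rewrote 3,000 rows for three single-row updates, while the MOR table deferred the cost and rewrote 1,000 rows once at compaction. The price is that every MOR read before compaction has to merge the log.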
5. Comparison of the Three Formats
| Criterion | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| ACID | Yes | Yes | Yes |
| Time Travel | Yes (_delta_log) | Yes (snapshots) | Yes (commits) |
| Partition Evolution | Limited | Full | Limited |
| Multi-engine | Moderate (Databricks-centric) | Excellent | Good |
| Upsert/CDC | Good (MERGE) | Good | Excellent (MOR) |
| Cloud ecosystem | Azure (Fabric), Databricks | AWS, GCP, Azure | AWS EMR |
6. How to Choose in Practice
- Databricks or Azure stack → Delta Lake. Native, very well integrated
- Multi-cloud, multi-engine, vendor independence → Iceberg. Emerging standard
- Intensive CDC from transactional databases → Hudi. Optimized upserts
- Greenfield 2025 → Iceberg. The ecosystem is converging toward Iceberg as the open standard
- Delta Lake: de facto standard in the Databricks/Spark ecosystem. Best performance with Photon, the Databricks engine. Open source since 2019
- Apache Iceberg: open multi-engine format. Strategic choice to avoid vendor lock-in with Databricks. Adopted by Netflix, Apple, AWS
- Apache Hudi: optimized for frequent upserts (fintech, IoT use cases). Less widespread than Delta/Iceberg in 2025
- 2025 trend: Iceberg is gaining ground as AWS, Google and Snowflake support it natively. Delta remains dominant on Databricks stacks
Convergence in 2025: Delta Lake (through UniForm) and Hudi have announced compatibility with the Iceberg format. The format war is ending in favor of Iceberg as the common read format.
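The rules of thumb in this section can be condensed into a small decision helper. The function below is a hypothetical sketch: the three boolean inputs and, above all, the order in which the rules are applied are assumptions, since the section does not say which rule wins when several apply.

```python
def choose_table_format(databricks_or_azure: bool, multi_engine: bool, cdc_heavy: bool) -> str:
    """Apply the section's rules of thumb in order; the precedence is an assumption."""
    if databricks_or_azure and not multi_engine:
        return "Delta Lake"  # native and well integrated on that stack
    if cdc_heavy and not multi_engine:
        return "Hudi"        # upsert-optimized, fed by CDC pipelines
    return "Iceberg"         # multi-engine, multi-cloud, or greenfield default


greenfield = choose_table_format(databricks_or_azure=False, multi_engine=False, cdc_heavy=False)
```

In an interview, the value lies less in the answer than in stating the constraint that drives it: existing stack first, then engine diversity, then write pattern.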
7. Level Grid
| Level | Mastery | GO Signal | NO-GO |
|---|---|---|---|
| Mid-level | Delta Lake or Iceberg, ACID, Time Travel, MERGE | Explains why Parquet alone is not enough, has used Delta MERGE | Does not know what a table format is |
| Senior | Delta/Iceberg/Hudi comparison, choice based on context | Justifies table format choice based on ecosystem, knows Iceberg partition evolution | Only knows one table format |