Data recruiting guide

Technical Test, Delta Lake vs Apache Iceberg: Choosing a Table Format

Delta Lake, Iceberg and Hudi are the three open source table formats that bring ACID transactions to the data lake. In 2025, interviews for Senior Data Engineer roles routinely probe this architectural choice.

Data Builder · June 2025 · 7 min read · Data Engineer

Contents

The limits of plain Parquet
Delta Lake
Apache Iceberg
Apache Hudi
Comparison of the three
How to choose
Grid

1. The Limits of Parquet Without a Table Format

Discriminating question

Why is Parquet alone not enough for a production data lake?

No ACID: two jobs writing at the same time can leave readers with a mix of both outputs, or with corrupted data
No rollback: if a job fails halfway through, the files it already wrote stay visible and the table is left in an inconsistent state
Fragile schema evolution: adding a column can break existing readers
No native upsert: Parquet files are immutable, so updating a single row means rewriting at least the file that contains it, and in practice often the whole partition
No time travel: yesterday's data cannot be queried unless it was saved separately
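
The "no rollback" point can be reproduced without Spark. The sketch below is a toy model, not real Parquet I/O: the files are plain text stand-ins, and `write_batch`, `read_table` and the `fail_after` parameter are hypothetical names chosen for the illustration. It shows that a reader who simply lists a directory sees whatever a crashed job left behind.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_batch(table_dir: Path, parts: list[str], fail_after: Optional[int] = None) -> None:
    """Write one file per part directly into the table directory.

    fail_after simulates a crash once that many files have been written.
    """
    for i, payload in enumerate(parts):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated job failure")
        # Text stand-in for a Parquet data file (assumption of this sketch).
        (table_dir / f"part-{i:05d}.parquet").write_text(payload)


def read_table(table_dir: Path) -> list[str]:
    """A reader of a plain directory lists whatever files exist right now."""
    return sorted(p.name for p in table_dir.glob("part-*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp)
    try:
        write_batch(table, ["a", "b", "c", "d"], fail_after=2)
    except RuntimeError:
        pass  # the job died, but its first two files stay on disk
    visible_after_crash = read_table(table)

print(visible_after_crash)  # two of the four expected files are visible
```

Nothing tells the reader that these two files belong to an unfinished write. A table format closes that gap by making a separate commit record, not the directory listing, the definition of what the table contains.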

2. Delta Lake: The Databricks/Microsoft Standard

Discriminating question

What are the main components of Delta Lake?

Transaction Log: the _delta_log/ directory, which holds one JSON commit file per table version plus periodic Parquet checkpoints. Source of truth for ACID and time travel
OPTIMIZE + ZORDER: compact small files and co-locate data that is frequently queried together
Auto Optimize: automatic compaction on Databricks. Reduces the small files created by streaming writes
Ecosystem: native on Databricks, supported by Spark (Scala and PySpark). Increasingly supported outside Databricks (Trino, Flink)
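
To make the role of the transaction log concrete, here is a minimal sketch of a log-structured table in plain Python. It is a toy model under stated assumptions, not Delta Lake's implementation: the helper names `commit` and `snapshot` are hypothetical, only `add` and `remove` actions are modeled, and checkpoints are left out. What it does borrow from Delta is the layout of zero-padded, numbered JSON commit files that are replayed in order.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(log_dir: Path, version: int, actions: list[dict]) -> bool:
    """Try to publish `actions` as commit `version`; return False if it is taken."""
    path = log_dir / f"{version:020d}.json"
    try:
        # Mode "x" refuses to overwrite, so only one writer can claim a version.
        with open(path, "x") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        return True
    except FileExistsError:
        return False


def snapshot(log_dir: Path, as_of: Optional[int] = None) -> set[str]:
    """Replay commits in order to rebuild the set of live data files."""
    live: set[str] = set()
    for path in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break  # time travel: ignore commits after the requested version
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "_delta_log"
    log.mkdir()
    commit(log, 0, [{"add": {"path": "part-0.parquet"}},
                    {"add": {"path": "part-1.parquet"}}])
    # Compaction: swap two small files for one larger file in a single commit.
    commit(log, 1, [{"remove": {"path": "part-0.parquet"}},
                    {"remove": {"path": "part-1.parquet"}},
                    {"add": {"path": "part-2.parquet"}}])
    lost_race = commit(log, 1, [{"add": {"path": "part-3.parquet"}}])
    latest = snapshot(log)
    version_0 = snapshot(log, as_of=0)

print(lost_race, latest, version_0)
```

The second attempt at version 1 is rejected, which is the optimistic concurrency idea in miniature: the losing writer has to re-read the log and retry on top of the winner. Reading `as_of=0` returns the two original files even after compaction, which is why old data files must be retained for as long as time travel to that version is expected to work.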

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# delta-core 2.4.0 targets Spark 3.4; from Delta 3.0 on, the artifact is named delta-spark
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a Delta table (df is an existing DataFrame with a "date" column)
df.write.format("delta").partitionBy("date").save("/data/delta/orders")

# Read with Time Travel: table version 5 (use timestampAsOf to read as of a point in time)
df_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("/data/delta/orders")
)

# MERGE (upsert) on Delta: source_df holds the new and changed rows
delta_table = DeltaTable.forPath(spark, "/data/delta/orders")
(
    delta_table.alias("t")
    .merge(source_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# OPTIMIZE + ZORDER, then VACUUM files no longer referenced and older than 7 days (168 hours)
spark.sql("OPTIMIZE delta.`/data/delta/orders` ZORDER BY (customer_id)")
spark.sql("VACUUM delta.`/data/delta/orders` RETAIN 168 HOURS")

ACID transactions: Delta Lake guarantees atomicity. A job that fails halfway never commits, so readers do not see its partial output. Raw Parquet offers no such guarantee
Schema enforcement: Delta rejects writes whose schema is incompatible with the table. Schema evolution is opt-in, with the mergeSchema=true write option to add columns
Small files problem: OPTIMIZE compacts small Parquet files into larger ones. ZORDER clusters related values into the same files so that per-file min/max statistics let queries skip files
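
Why Z-ordering helps pruning is easier to see on a small grid. The sketch below is a toy model and not Delta's actual OPTIMIZE code: `interleave`, `to_files` and `files_to_scan` are hypothetical helpers, and the "files" are just groups of four rows with min/max statistics. It compares a plain sort on (x, y) with a sort on the interleaved bits of x and y (a Morton code).

```python
def interleave(x: int, y: int, bits: int = 3) -> int:
    """Morton code: interleave the low `bits` bits of x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def to_files(rows, rows_per_file=4):
    """Cut already-sorted rows into files, keeping min/max stats per column."""
    files = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files


def files_to_scan(files, column, value):
    """Count files whose [min, max] range for `column` may contain `value`."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])


# An 8 x 8 grid of (x, y) rows, written as 16 files of 4 rows each.
rows = [(x, y) for x in range(8) for y in range(8)]
linear = to_files(sorted(rows))
zordered = to_files(sorted(rows, key=lambda r: interleave(r[0], r[1])))

scans = {
    "linear": (files_to_scan(linear, "x", 5), files_to_scan(linear, "y", 5)),
    "zorder": (files_to_scan(zordered, "x", 5), files_to_scan(zordered, "y", 5)),
}
print(scans)  # {'linear': (2, 8), 'zorder': (4, 4)}
```

With the plain sort, a filter on the leading column touches 2 of 16 files but a filter on the second column touches 8. With the Z-order sort, both filters touch 4. That is the trade: slightly worse pruning on one column in exchange for usable pruning on every column in the ZORDER BY list.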

3. Apache Iceberg: The Multi-Engine Standard

Discriminating question

How is Iceberg superior to Delta Lake in multi-engine environments?

Open and portable: natively supported by Spark, Trino, Flink, Hive, Dremio, Athena (AWS), BigQuery. No vendor dependency
Partition evolution: change the partitioning scheme without rewriting existing data. Delta Lake does not support this natively
Hidden partitioning: partition values are derived from column values through transforms such as days(order_date), so users filter on the column itself and do not need to know the partition layout to write efficient queries
Catalog: Iceberg catalogs (REST, Hive, Glue, Nessie). Centralized metadata for governance
2025 adoption: AWS, Google Cloud and Azure support Iceberg natively. Strong trend
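
Hidden partitioning and partition evolution fit together, and a few lines of Python can sketch how. This is an illustrative model only, not Iceberg's planner: `TRANSFORMS`, `SPECS`, `DATA_FILES` and `plan_scan` are hypothetical names, and the file list is invented for the example. The idea it mirrors is that each data file records the partition spec it was written with, so a table can move from monthly to daily partitioning without touching old files.

```python
from datetime import date

# Partition transforms derive a partition value from a column value.
TRANSFORMS = {
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}

# Two specs on one table: it started partitioned by month, then evolved to day.
SPECS = {0: "month", 1: "day"}

# Each file remembers the spec it was written under; old files are not rewritten.
DATA_FILES = [
    {"path": "f1.parquet", "spec_id": 0, "partition": (2024, 12)},
    {"path": "f2.parquet", "spec_id": 0, "partition": (2025, 1)},
    {"path": "f3.parquet", "spec_id": 1, "partition": (2025, 1, 15)},
    {"path": "f4.parquet", "spec_id": 1, "partition": (2025, 1, 16)},
]


def plan_scan(order_date: date) -> list[str]:
    """Files that may hold rows matching WHERE order_date = <order_date>."""
    selected = []
    for f in DATA_FILES:
        # Apply the transform of the file's own spec to the user's predicate.
        transform = TRANSFORMS[SPECS[f["spec_id"]]]
        if transform(order_date) == f["partition"]:
            selected.append(f["path"])
    return selected


jan_15 = plan_scan(date(2025, 1, 15))
dec_03 = plan_scan(date(2024, 12, 3))
print(jan_15, dec_03)
```

The query only mentions `order_date`. The monthly file for January 2025 and the daily file for 15 January are both selected, each through its own spec, while the file for 16 January is pruned.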

-- Iceberg with Spark SQL
CREATE TABLE catalog.db.orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE,
    order_date DATE
) USING iceberg
PARTITIONED BY (days(order_date));

-- Iceberg Time Travel
SELECT * FROM catalog.db.orders
FOR SYSTEM_TIME AS OF '2025-01-01 00:00:00';

-- Iceberg Compaction
CALL catalog.system.rewrite_data_files(
    table => 'db.orders',
    strategy => 'sort',
    sort_order => 'zorder(customer_id, amount)'
);

-- Expire old snapshots
CALL catalog.system.expire_snapshots(
    table => 'db.orders',
    older_than => TIMESTAMP '2025-01-01 00:00:00'
);

Open format: Iceberg is not tied to Databricks. Works with Spark, Trino, Flink, DuckDB, BigQuery Omni
Row-level deletes: DELETE FROM ... WHERE works without rewriting the entire partition. Copy-on-write (default) rewrites only the affected data files; merge-on-read writes delete files instead, which suits frequent DELETEs
Partition evolution: change the partition strategy without rewriting data. Not possible with Hive-style partitioning
Iceberg catalog: REST Catalog (Polaris, Nessie), AWS Glue, Hive Metastore. The catalog tracks each table's current metadata file, which records the table's snapshots
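
The catalog's central job, swapping a table's metadata pointer atomically, can be sketched as a compare-and-swap. The class below is a single-process toy: `ToyCatalog`, its method names and the metadata file names are all hypothetical, and a real catalog would perform the check and the update under a lock or a database transaction.

```python
from typing import Optional


class ToyCatalog:
    """Maps a table name to the location of its current metadata file."""

    def __init__(self) -> None:
        self._pointers: dict[str, str] = {}

    def current(self, table: str) -> Optional[str]:
        return self._pointers.get(table)

    def swap(self, table: str, expected: Optional[str], new: str) -> bool:
        """Move the pointer only if nobody has committed since `expected` was read."""
        if self._pointers.get(table) != expected:
            return False
        self._pointers[table] = new
        return True


catalog = ToyCatalog()
catalog.swap("db.orders", None, "metadata/v1.metadata.json")  # table creation

# Two writers start from the same base metadata.
base_a = catalog.current("db.orders")
base_b = catalog.current("db.orders")

a_committed = catalog.swap("db.orders", base_a, "metadata/v2-a.metadata.json")
b_committed = catalog.swap("db.orders", base_b, "metadata/v2-b.metadata.json")

# Writer B lost the race, so it re-reads the pointer and retries on top of A.
b_retry = catalog.swap("db.orders", catalog.current("db.orders"),
                       "metadata/v3-b.metadata.json")
print(a_committed, b_committed, b_retry)  # True False True
```

Because every engine commits through the same pointer, Spark, Trino and Flink can write to one table without knowing about each other, which is the practical basis of the multi-engine argument.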

4. Apache Hudi: The CDC Specialist

Discriminating question

When do you use Hudi over Delta or Iceberg?

Hudi: optimized for frequent upserts and CDC (Change Data Capture). Two table types: COW (Copy-on-Write) and MOR (Merge-on-Read)
COW: rewrite the affected data files on every update. Fast reads, slow writes
MOR: append updates to log files and merge them with the base files at read time, until compaction folds them in. Fast writes, slower reads
Use cases: CDC pipelines from transactional databases (Debezium + Hudi), frequently changing data
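
The COW versus MOR trade-off can be shown with dictionaries standing in for files. This is a deliberately simplified sketch, not Hudi's storage layout: `cow_upsert` and `MorTable` are hypothetical names, a "base file" is a dict keyed by record key, and the log is a list of update batches.

```python
def cow_upsert(base: dict, updates: dict) -> dict:
    """Copy-on-Write: every upsert produces a fully rewritten base file."""
    new_base = dict(base)      # copy the whole file...
    new_base.update(updates)   # ...to change a few records
    return new_base


class MorTable:
    """Merge-on-Read: upserts are appended to a log and merged by readers."""

    def __init__(self, base: dict) -> None:
        self.base = dict(base)
        self.log: list[dict] = []

    def upsert(self, updates: dict) -> None:
        self.log.append(dict(updates))  # cheap append, base file untouched

    def read(self) -> dict:
        merged = dict(self.base)
        for batch in self.log:          # readers pay the merge cost
            merged.update(batch)
        return merged

    def compact(self) -> None:
        """Fold the log into a new base file so later reads are cheap again."""
        self.base = self.read()
        self.log = []


base = {1: "created", 2: "created"}
cow_result = cow_upsert(base, {2: "shipped", 3: "created"})

mor = MorTable(base)
mor.upsert({2: "shipped"})
mor.upsert({3: "created"})
pending_batches = len(mor.log)
mor_result = mor.read()
mor.compact()
print(cow_result == mor_result, pending_batches, len(mor.log))  # True 2 0
```

Both paths end with the same data. The difference is when the merge work happens: at write time for COW, at read or compaction time for MOR, which is why MOR suits CDC streams with many small updates.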

5. Comparison of the Three Formats

Criterion	Delta Lake	Iceberg	Hudi
ACID	Yes	Yes	Yes
Time Travel	Yes (_delta_log versions)	Yes (snapshots)	Yes (commits)
Partition Evolution	Limited	Full	Limited
Multi-engine	Moderate (Databricks-centric)	Excellent	Good
Upsert/CDC	Good (MERGE)	Good	Excellent (MOR)
Cloud ecosystem	Azure (Fabric), Databricks	AWS, GCP, Azure	AWS EMR

6. How to Choose in Practice

Databricks or Azure stack → Delta Lake. Native, very well integrated
Multi-cloud, multi-engine, vendor independence → Iceberg. Emerging standard
Intensive CDC from transactional databases → Hudi. Optimized upserts
Greenfield 2025 → Iceberg. The ecosystem is converging toward Iceberg as the open standard
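
The four rules above can be written down as a small helper, which is a handy way to rehearse the answer before an interview. This is only one reading of the list: the function name, the parameters and above all the precedence between rules (existing platform first, then workload, then the open default) are assumptions of this sketch, not something the rules themselves state.

```python
def recommend_table_format(databricks_or_azure: bool = False,
                           heavy_cdc: bool = False) -> str:
    """Return a starting recommendation; precedence order is an assumption."""
    if databricks_or_azure:
        return "Delta Lake"  # native and well integrated on that stack
    if heavy_cdc:
        return "Hudi"        # upsert-optimized, CDC from transactional databases
    # Multi-cloud, multi-engine, vendor independence, or greenfield in 2025.
    return "Iceberg"


choices = {
    "databricks": recommend_table_format(databricks_or_azure=True),
    "cdc": recommend_table_format(heavy_cdc=True),
    "greenfield": recommend_table_format(),
}
print(choices)
```

A senior candidate is expected to go beyond such a lookup and argue the edge cases, for example a Databricks shop with a heavy CDC workload, where the precedence chosen here is debatable.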

Delta Lake: de facto standard in the Databricks/Spark ecosystem. Best performance with Photon. Open source since 2019
Apache Iceberg: open multi-engine format. Strategic choice to avoid Databricks vendor lock-in. Adopted by Netflix, Apple, AWS
Apache Hudi: optimized for frequent upserts (fintech, IoT use cases). Less widespread than Delta/Iceberg in 2025
Adoption in 2025: Iceberg is gaining ground as AWS, Google and Snowflake support it natively. Delta remains dominant on Databricks stacks

2025 trend: Delta Lake and Hudi have announced compatibility with the Iceberg format. The format war is winding down, with Iceberg emerging as the common read format.

7. Grid by Level

Level	Mastery	GO Signal	NO-GO
Mid-level	Delta Lake or Iceberg, ACID, Time Travel, MERGE	Explains why Parquet alone is not enough, has used Delta MERGE	Does not know what a table format is
Senior	Delta/Iceberg/Hudi comparison, choice based on context	Justifies table format choice based on ecosystem, knows Iceberg partition evolution	Only knows one table format
