The Lakehouse is the dominant data architecture in 2025. In Architecture or Lead interviews, candidates are assessed on their ability to design a robust, performant, and cost-effective Lakehouse architecture.
1. Zone organization: medallion architecture
Discriminating question
What is the medallion architecture? How do you organize your zones?
# Medallion Architecture: Bronze → Silver → Gold

# BRONZE (raw)
# - Raw, immutable data
# - Partitioned by ingestion date
# - Long retention (7 years)
s3://datalake/bronze/
  orders/ingestion_date=2025-01-15/
    part-00000.parquet

# SILVER (curated)
# - Cleaned, deduplicated data
# - Delta/Iceberg table with schema enforced
# - Quality tests passed
s3://datalake/silver/
  orders/   (Delta table)

# GOLD (consumption)
# - Business aggregations
# - Optimized for BI and ML
# - Role-controlled access
s3://datalake/gold/
  fct_revenue/
  dim_customers/
- Immutable Bronze — never modify raw data. In case of a bug, reprocess from Bronze
- Silver: source of truth — trusted, tested data, accessible to Data Scientists
- Gold: consumption-oriented — pre-aggregated for BI performance, optimized partitioning
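The zone layout above can be sketched as a small path helper. This is only an illustration: the function names `zone_path` and `bronze_partition_path` are hypothetical, and the bucket name is taken from the listing above.

```python
from datetime import date

ZONES = ("bronze", "silver", "gold")


def zone_path(base: str, zone: str, table: str) -> str:
    """Build the root path of a table inside one medallion zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{base.rstrip('/')}/{zone}/{table}/"


def bronze_partition_path(base: str, table: str, ingestion_date: date) -> str:
    """Bronze is partitioned by ingestion date, Hive-style (key=value)."""
    root = zone_path(base, "bronze", table)
    return f"{root}ingestion_date={ingestion_date.isoformat()}/"


print(bronze_partition_path("s3://datalake", "orders", date(2025, 1, 15)))
# prints: s3://datalake/bronze/orders/ingestion_date=2025-01-15/
```

Keeping path construction in one place makes it harder for a pipeline to write outside its zone by accident.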
2. Table format choice: Delta Lake, Iceberg or Hudi
Discriminating question
How do you choose between Delta Lake, Iceberg and Hudi for your Lakehouse?
- Delta Lake: choose on Azure (Fabric, Databricks), or when the team works entirely on Spark/Databricks
- Apache Iceberg: choose on AWS (Athena, EMR) or GCP, or when multi-engine interoperability matters (Spark + Trino + Flink)
- Hudi: choose for intensive CDC from transactional databases (many upserts)
- 2025 trend: Iceberg is becoming the de facto standard, and Delta and Hudi are adding Iceberg compatibility
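The heuristics in this list can be written down as a tiny decision function. It is a sketch, not a rule: the function name, the parameter names, and the order in which the criteria are checked are assumptions made for illustration.

```python
def choose_table_format(platform: str, engines: set[str], cdc_heavy: bool = False) -> str:
    """Return 'delta', 'iceberg' or 'hudi' from the rules of thumb above.

    The precedence (CDC first, then multi-engine, then platform) is an
    assumption of this sketch, not something the list prescribes.
    """
    if cdc_heavy:
        return "hudi"      # many upserts coming from transactional databases
    if len(engines) > 1:
        return "iceberg"   # several engines must read the same tables
    if platform in {"azure", "databricks"}:
        return "delta"     # team works entirely on Spark/Databricks
    return "iceberg"       # AWS, GCP, or no strong constraint


print(choose_table_format("databricks", {"spark"}))
print(choose_table_format("aws", {"spark", "trino", "flink"}))
```

In an interview, the value is less in the answer than in naming the criteria: platform, number of engines, and write pattern.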
3. Compaction: the small files problem
Discriminating question
What is the small files problem? How do you solve it?
# Problem: streaming creates thousands of small files
# -> reads are slow (per-file overhead)

# Delta Lake: OPTIMIZE compacts small files
# (assumes an active SparkSession named `spark`)
from airflow.decorators import task
from delta.tables import DeltaTable

delta = DeltaTable.forPath(spark, '/datalake/silver/orders')
delta.optimize().executeCompaction()

# Schedule compaction (every night via Airflow)
@task
def compact_delta_table(path: str):
    delta = DeltaTable.forPath(spark, path)
    # Z-order while compacting so rows with close region/order_date share files
    delta.optimize().executeZOrderBy('region', 'order_date')

# Iceberg: rewrite_data_files
spark.sql("""
    CALL spark_catalog.system.rewrite_data_files(
        table => 'silver.orders',
        strategy => 'sort',
        sort_order => 'region, order_date'
    )
""")
- Small files — thousands of 1MB files are much slower than a single 1GB file
- ZORDER — co-locates frequently filtered data together within files
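To make the small-files point concrete, compaction can be pictured as bin-packing: many small files are grouped into a few files close to a target size. The sketch below uses a simple first-fit-decreasing heuristic; the function name `plan_compaction` and the 1 GB target are assumptions for illustration, and real engines apply their own heuristics and defaults.

```python
def plan_compaction(file_sizes_mb: list[float], target_mb: float = 1024.0) -> list[list[float]]:
    """Group files into bins of at most target_mb (first-fit decreasing)."""
    bins: list[list[float]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for current in bins:
            if sum(current) + size <= target_mb:
                current.append(size)   # file fits in an existing output file
                break
        else:
            bins.append([size])        # no bin had room: start a new output file
    return bins


small_files = [1.0] * 2000             # e.g. 2000 files of 1 MB from a streaming job
plan = plan_compaction(small_files, target_mb=1024.0)
print(len(small_files), "files ->", len(plan), "files after compaction")
# prints: 2000 files -> 2 files after compaction
```

The reader then opens 2 files instead of 2000, which is where the gain on per-file overhead comes from.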
4. Vacuum and retention management
Discriminating question
How do you manage data retention and storage space in Delta Lake?
# VACUUM: delete data files that the table no longer references
# and that are older than the retention period
# Default: keep 7 days of history
VACUUM delta.`/datalake/silver/orders` RETAIN 168 HOURS;

# Same operation through the Python API (`delta` is a DeltaTable)
delta.vacuum(168)  # 168 hours = 7 days

# Warning: after VACUUM, time travel is impossible
# beyond the configured retention
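The retention rule can be illustrated with a simplified model: a file becomes a candidate for deletion only once it is no longer referenced by the table and has been unreferenced for longer than the retention window. The function `vacuum_candidates` and the file names are hypothetical, and the real command also handles cases this sketch ignores.

```python
from datetime import datetime, timedelta
from typing import Optional


def vacuum_candidates(
    files: list[tuple[str, Optional[datetime]]],
    now: datetime,
    retention_hours: int = 168,
) -> list[str]:
    """Return the paths a vacuum would delete in this simplified model.

    Each entry is (path, removed_at); removed_at is None while the file
    is still referenced by the current table version.
    """
    cutoff = now - timedelta(hours=retention_hours)
    return [
        path
        for path, removed_at in files
        if removed_at is not None and removed_at < cutoff
    ]


now = datetime(2025, 1, 15)
files = [
    ("part-a.parquet", datetime(2025, 1, 1)),   # unreferenced for 14 days
    ("part-b.parquet", datetime(2025, 1, 12)),  # unreferenced for 3 days
    ("part-c.parquet", None),                   # still part of the table
]
print(vacuum_candidates(files, now))
# prints: ['part-a.parquet']
```

Shortening the retention frees storage sooner but narrows the time-travel window by the same amount.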
# Iceberg: expire_snapshots
spark.sql("""
    CALL spark_catalog.system.expire_snapshots(
        table => 'silver.orders',
        older_than => TIMESTAMP '2025-01-01 00:00:00.000',
        retain_last => 10
    )
""")
5. Catalog: discovery and governance
Discriminating question
Which catalog do you use to manage the tables in your Lakehouse?
- AWS Glue Catalog — managed AWS catalog, compatible with Athena, EMR, Glue ETL
- Hive Metastore — open source standard, compatible with Spark, Hive, Presto/Trino
- Apache Iceberg REST Catalog — emerging standard, interchangeable backend (JDBC, AWS Glue, Nessie)
- Unity Catalog (Databricks) — fine-grained governance (column-level security, lineage) on Delta Lake
- BigLake Metastore (GCP) — Iceberg-compatible, interoperable with Spark and BigQuery
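As an example of what "interchangeable backend" means in practice, a Spark session is pointed at an Iceberg REST catalog through a handful of configuration keys. The helper below only assembles those keys. The catalog name `lake` and the URI are placeholders, and a real deployment would usually add authentication and warehouse settings.

```python
def iceberg_rest_catalog_conf(catalog: str, uri: str) -> dict[str, str]:
    """Spark configuration entries for an Iceberg catalog of type 'rest'."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
    }


# Placeholder name and endpoint, for illustration only.
conf = iceberg_rest_catalog_conf("lake", "https://catalog.example.com")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` or as `--conf` flags to `spark-submit`. Swapping the catalog backend then means changing the URI, not the jobs.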
6. Security and access control
Discriminating question
How do you implement role-based access control in a Lakehouse?
- Bronze inaccessible to end users — access reserved for ingestion pipelines and Data Engineers
- Silver accessible to Data Scientists — read-only access on curated data
- Gold read access for BI — business teams only have access to Gold marts
- Column-level security — mask PII columns based on role (Unity Catalog, Ranger)
- Row-level security — filter rows based on geographic or organizational membership
- Lakehouse = Data Lake + Data Warehouse - open raw storage (S3/GCS) + transactional layer (Delta Lake, Iceberg) + analytical SQL. Reduced storage cost, maximum flexibility
- Table format = key to the lakehouse - Delta Lake or Iceberg add ACID, Time Travel, Schema Evolution on top of ordinary Parquet files. Without a table format, files are unmanaged
- Medallion architecture - Bronze (raw, immutable) -> Silver (cleaning, validation) -> Gold (business aggregations, marts). Databricks standard adopted by the community
- Open Table Format 2025 - Iceberg is gaining ground over Delta: AWS, Google, Snowflake support it natively. Delta remains dominant in the Databricks ecosystem
- Query engine on lakehouse - Spark (batch/streaming), Trino (interactive SQL), Flink (streaming), DuckDB (local analytics). All read the same Iceberg/Delta files
- vs Data Warehouse - DW: optimized SQL, strong governance, high cost at scale. Lakehouse: flexibility (Python, ML, streaming), lower storage cost, higher operational complexity
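The zone, column and row rules from the access-control list can be modelled in a few lines. This is a toy model under stated assumptions: the role names, the PII column names and the `region` field are hypothetical, and in a real Lakehouse these policies are enforced by the catalog or policy engine (Unity Catalog, Ranger), not by application code.

```python
# Hypothetical roles and columns, chosen only for illustration.
ZONE_READERS = {
    "bronze": {"ingestion_pipeline", "data_engineer"},
    "silver": {"data_engineer", "data_scientist"},
    "gold": {"data_engineer", "data_scientist", "business_analyst"},
}
PII_COLUMNS = {"email", "phone"}
PII_CLEARED_ROLES = {"data_engineer"}


def can_read(role: str, zone: str) -> bool:
    """Zone-level rule: is this role allowed to read this zone at all?"""
    return role in ZONE_READERS.get(zone, set())


def mask_pii(row: dict, role: str) -> dict:
    """Column-level rule: hide PII values from roles that are not cleared."""
    if role in PII_CLEARED_ROLES:
        return dict(row)
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}


def filter_by_region(rows: list[dict], allowed_regions: set[str]) -> list[dict]:
    """Row-level rule: keep only rows from regions the user belongs to."""
    return [r for r in rows if r["region"] in allowed_regions]


rows = [
    {"customer_id": 1, "email": "a@example.com", "region": "EU"},
    {"customer_id": 2, "email": "b@example.com", "region": "US"},
]
visible = [mask_pii(r, "business_analyst") for r in filter_by_region(rows, {"EU"})]
print(can_read("business_analyst", "silver"), visible)
```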
7. Level grid
| Level | Mastery | GO signal | NO-GO |
|---|---|---|---|
| Confirmed | Medallion architecture, basic Delta/Iceberg | Organizes data into Bronze/Silver/Gold zones, knows Delta and Iceberg | Stores all data in a single flat S3 folder |
| Senior | Compaction, vacuum, catalog, zone-level security | Schedules compaction automatically, configures vacuum, manages zone-level security | Does not know what small files are or how to solve them |
| Lead | Justified table format choice, multi-cloud architecture, governance | Justifies the Delta vs Iceberg choice based on context, has designed a Lakehouse architecture from scratch | Cannot explain why Iceberg is preferred over Delta in a multi-cloud context |