Databricks est la plateforme lakehouse de reference. En entretien, on evalue la maitrise de Unity Catalog pour la gouvernance, des Jobs pour l orchestration et de l integration avec MLflow.
1Unity Catalog : gouvernance centralisée
Question discriminante
Qu'est-ce que Unity Catalog ? Pourquoi remplace-t-il Hive Metastore ?
- Unity Catalog — couche de gouvernance unifiée Databricks : gère les permissions sur tables, vues, volumes (fichiers), modèles ML et functions dans un seul endroit
- Hiérarchie — Catalog → Schema (Database) → Table/View/Volume. Notation 3 niveaux :
catalog.schema.table
- Row-level et Column-level security — masquage dynamique des colonnes PII selon le rôle, filtrage de lignes par user
- Data lineage — Unity Catalog trace automatiquement le lineage colonne par colonne entre tables
- vs Hive Metastore — Hive Metastore est limité à un workspace. Unity Catalog est cross-workspace et cross-cloud
-- Créer un catalog et un schema avec Unity Catalog
CREATE CATALOG IF NOT EXISTS prod_catalog
COMMENT 'Catalogue de production';
CREATE SCHEMA IF NOT EXISTS prod_catalog.analytics
COMMENT 'Données analytiques';
-- Attribuer des permissions
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-analysts`;
GRANT SELECT ON SCHEMA prod_catalog.analytics TO `data-analysts`;
GRANT MODIFY ON TABLE prod_catalog.analytics.orders TO `data-engineers`;
-- Masquage dynamique d'une colonne PII
CREATE OR REPLACE FUNCTION prod_catalog.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE WHEN is_member('pii-readers') THEN email
ELSE REGEXP_REPLACE(email, '.+@', '****@') END;
ALTER TABLE prod_catalog.analytics.users
ALTER COLUMN email SET MASK prod_catalog.security.mask_email;
2Jobs et Workflows : orchestration Databricks
Question discriminante
Quand utilisez-vous Databricks Workflows plutôt qu'Airflow ?
- Databricks Workflows — orchestrateur natif pour les jobs Databricks. Idéal quand tout le traitement est dans Databricks et que l'équipe veut éviter une infra Airflow séparée
- Tasks — notebook, script Python, dbt, Spark JAR, SQL. Chaque task peut avoir son propre cluster ou partager un cluster job
- Dépendances — graph DAG entre tasks avec conditions (run-if: all_success, at_least_one_success, none_failed)
- Job clusters vs All-purpose clusters — job clusters s'éteignent à la fin du job. All-purpose restent actifs. Toujours utiliser des job clusters en production
- Triggers — CRON, event-based (file arrival dans un volume), continu, ou déclenché par API
# Déclencher un job Databricks via API depuis Airflow
import requests
def trigger_databricks_job(job_id, params):
resp = requests.post(
f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
json={"job_id": job_id, "notebook_params": params}
)
run_id = resp.json()['run_id']
return run_id
3Delta Live Tables : pipelines déclaratifs
Question discriminante
Qu'est-ce que Delta Live Tables ? En quoi diffère-t-il d'un pipeline dbt ?
- DLT — framework Databricks pour définir des pipelines de données comme des tables Python/SQL avec des attentes de qualité (expectations)
- @dlt.table — déclare une table, Databricks gère l'orchestration, le rafraîchissement incrémental et la gestion des erreurs
- Expectations — règles de qualité intégrées : warn, drop (rejeter les lignes invalides) ou fail (bloquer le pipeline)
- vs dbt — DLT intègre Spark nativement, gère le streaming. dbt est SQL-first, plus portable, meilleur écosystème de tests
- Streaming tables vs Materialized Views — Streaming Tables : ingestion continue depuis Kafka/Auto Loader. MV : agrégations recalculées périodiquement
import dlt
from pyspark.sql.functions import col
@dlt.table(comment="Commandes brutes depuis S3")
def raw_orders():
return spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") .load("/mnt/landing/orders/")
@dlt.table(comment="Commandes validées")
@dlt.expect_or_drop("montant_positif", col("amount") > 0)
@dlt.expect_or_warn("email_valide", col("email").rlike(r".+@.+\..+"))
def validated_orders():
return dlt.read_stream("raw_orders") .filter(col("status") != "cancelled") .select("order_id", "customer_id", "amount", "email", "created_at")
4MLflow intégré dans Databricks
Question discriminante
Comment Databricks intègre-t-il MLflow dans le cycle de vie des modèles ML ?
- MLflow autolog — Databricks active l'autologging MLflow automatiquement : chaque run sklearn/XGBoost/PyTorch est tracé sans code supplémentaire
- Model Registry — versionner et promouvoir les modèles (Staging → Production). Avec Unity Catalog :
models:/catalog.schema.model/version
- Model Serving — déployer un modèle en endpoint REST en 2 clics depuis le Model Registry
- Feature Store — stocker et réutiliser des features ML entre différents modèles et équipes
5Optimisation compute
Question discriminante
Comment réduire les coûts compute dans Databricks ?
- Photon Engine — moteur d'exécution vectorisé en C++. Accélère les requêtes SQL et DataFrames jusqu'à 12x sans changer le code
- Auto-termination — configurer un timeout d'inactivité sur tous les clusters interactifs (20-30 min max)
- Spot instances — utiliser des instances spot AWS/Azure/GCP pour les workers (économie 50-80%). Configurer un fallback on-demand
- OPTIMIZE et ZORDER — compacter les petits fichiers Delta et trier les données pour améliorer le pruning
- Serverless SQL — pour les requêtes SQL analytiques : pas de cluster à gérer, facturation par seconde de compute
-- Optimiser une table Delta en production
OPTIMIZE prod_catalog.analytics.orders
ZORDER BY (order_date, customer_id);
-- Analyser les statistiques pour le query planner
ANALYZE TABLE prod_catalog.analytics.orders
COMPUTE STATISTICS FOR ALL COLUMNS;
-- Vacuum : supprimer les fichiers anciens (>7j par défaut)
VACUUM prod_catalog.analytics.orders RETAIN 168 HOURS;
6Serverless et architecture moderne
Question discriminante
Qu'est-ce que le serverless Databricks ? Quand l'utiliser ?
- Serverless SQL Warehouse — warehouse géré par Databricks, démarrage en <5s, facturation à la seconde. Idéal pour le BI et les dashboards
- Serverless Jobs — exécuter des jobs sans gérer de clusters. Databricks gère le provisioning automatiquement
- Lakehouse Federation — requêter BigQuery, Snowflake, Redshift, PostgreSQL directement depuis Databricks sans déplacer les données
- Quand NOT serverless — workloads Spark très lourds avec besoins mémoire spécifiques, jobs GPU pour le ML
7Grille par niveau
| Niveau | Maîtrise | Signal GO | NO-GO |
| Confirmé | Notebooks, Jobs basiques, Delta Lake, Unity Catalog permissions | A créé des jobs Databricks, gère des tables Delta, utilise Unity Catalog | Travaille sur All-purpose clusters en production, n'utilise pas Delta |
| Senior | DLT, MLflow, optimisation compute (Photon, ZORDER), architecture Lakehouse | A déployé des pipelines DLT, réduit les coûts via spot + auto-termination + OPTIMIZE | Ne sait pas ce que sont Delta Live Tables ni Unity Catalog |
1Unity Catalog: centralized governance
Discriminating question
What is Unity Catalog? Why does it replace Hive Metastore?
- Unity Catalog — unified Databricks governance layer: manages permissions on tables, views, volumes (files), ML models and functions in a single place
- Hierarchy — Catalog → Schema (Database) → Table/View/Volume. 3-level notation:
catalog.schema.table
- Row-level and Column-level security — dynamic masking of PII columns based on role, row filtering by user
- Data lineage — Unity Catalog automatically tracks lineage column by column between tables
- vs Hive Metastore — Hive Metastore is limited to a single workspace. Unity Catalog is cross-workspace and cross-cloud
-- Create a catalog and a schema with Unity Catalog
CREATE CATALOG IF NOT EXISTS prod_catalog
COMMENT 'Production catalog';
CREATE SCHEMA IF NOT EXISTS prod_catalog.analytics
COMMENT 'Analytical data';
-- Assign permissions
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-analysts`;
GRANT SELECT ON SCHEMA prod_catalog.analytics TO `data-analysts`;
GRANT MODIFY ON TABLE prod_catalog.analytics.orders TO `data-engineers`;
-- Dynamic masking of a PII column
CREATE OR REPLACE FUNCTION prod_catalog.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE WHEN is_member('pii-readers') THEN email
ELSE REGEXP_REPLACE(email, '.+@', '****@') END;
ALTER TABLE prod_catalog.analytics.users
ALTER COLUMN email SET MASK prod_catalog.security.mask_email;
2Jobs and Workflows: Databricks orchestration
Discriminating question
When do you use Databricks Workflows instead of Airflow?
- Databricks Workflows — native orchestrator for Databricks jobs. Ideal when all processing is within Databricks and the team wants to avoid a separate Airflow infrastructure
- Tasks — notebook, Python script, dbt, Spark JAR, SQL. Each task can have its own cluster or share a job cluster
- Dependencies — DAG graph between tasks with conditions (run-if: all_success, at_least_one_success, none_failed)
- Job clusters vs All-purpose clusters — job clusters shut down at the end of the job. All-purpose clusters remain active. Always use job clusters in production
- Triggers — CRON, event-based (file arrival in a volume), continuous, or triggered via API
# Trigger a Databricks job via API from Airflow
import requests
def trigger_databricks_job(job_id, params):
resp = requests.post(
f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
json={"job_id": job_id, "notebook_params": params}
)
run_id = resp.json()['run_id']
return run_id
3Delta Live Tables: declarative pipelines
Discriminating question
What is Delta Live Tables? How does it differ from a dbt pipeline?
- DLT — Databricks framework for defining data pipelines as Python/SQL tables with quality expectations
- @dlt.table — declares a table; Databricks handles orchestration, incremental refresh and error management
- Expectations — built-in quality rules: warn, drop (reject invalid rows) or fail (block the pipeline)
- vs dbt — DLT integrates Spark natively and handles streaming. dbt is SQL-first, more portable, with a better testing ecosystem
- Streaming tables vs Materialized Views — Streaming Tables: continuous ingestion from Kafka/Auto Loader. MV: aggregations recalculated periodically
import dlt
from pyspark.sql.functions import col
@dlt.table(comment="Raw orders from S3")
def raw_orders():
return spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") .load("/mnt/landing/orders/")
@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("positive_amount", col("amount") > 0)
@dlt.expect_or_warn("valid_email", col("email").rlike(r".+@.+\..+"))
def validated_orders():
return dlt.read_stream("raw_orders") .filter(col("status") != "cancelled") .select("order_id", "customer_id", "amount", "email", "created_at")
4MLflow integrated in Databricks
Discriminating question
How does Databricks integrate MLflow into the ML model lifecycle?
- MLflow autolog — Databricks automatically enables MLflow autologging: every sklearn/XGBoost/PyTorch run is tracked without additional code
- Model Registry — version and promote models (Staging → Production). With Unity Catalog:
models:/catalog.schema.model/version
- Model Serving — deploy a model as a REST endpoint in 2 clicks from the Model Registry
- Feature Store — store and reuse ML features across different models and teams
5Compute optimization
Discriminating question
How do you reduce compute costs in Databricks?
- Photon Engine — vectorized execution engine written in C++. Speeds up SQL queries and DataFrames by up to 12x without changing code
- Auto-termination — configure an inactivity timeout on all interactive clusters (20-30 min max)
- Spot instances — use AWS/Azure/GCP spot instances for workers (50-80% savings). Configure an on-demand fallback
- OPTIMIZE and ZORDER — compact small Delta files and sort data to improve pruning
- Serverless SQL — for analytical SQL queries: no cluster to manage, billed per second of compute
-- Optimize a Delta table in production
OPTIMIZE prod_catalog.analytics.orders
ZORDER BY (order_date, customer_id);
-- Analyze statistics for the query planner
ANALYZE TABLE prod_catalog.analytics.orders
COMPUTE STATISTICS FOR ALL COLUMNS;
-- Vacuum: remove old files (>7 days by default)
VACUUM prod_catalog.analytics.orders RETAIN 168 HOURS;
6Serverless and modern architecture
Discriminating question
What is Databricks serverless? When should you use it?
- Serverless SQL Warehouse — warehouse managed by Databricks, starts in <5s, billed per second. Ideal for BI and dashboards
- Serverless Jobs — run jobs without managing clusters. Databricks handles provisioning automatically
- Lakehouse Federation — query BigQuery, Snowflake, Redshift, PostgreSQL directly from Databricks without moving data
- When NOT serverless — very heavy Spark workloads with specific memory requirements, GPU jobs for ML
7Level grid
| Level | Mastery | GO signal | NO-GO |
| Mid-level | Notebooks, basic Jobs, Delta Lake, Unity Catalog permissions | Has created Databricks jobs, manages Delta tables, uses Unity Catalog | Works on All-purpose clusters in production, does not use Delta |
| Senior | DLT, MLflow, compute optimization (Photon, ZORDER), Lakehouse architecture | Has deployed DLT pipelines, reduced costs via spot + auto-termination + OPTIMIZE | Does not know what Delta Live Tables or Unity Catalog are |