
Data recruitment guide

Databricks Workflows technical test: orchestration and MLOps

Databricks Workflows is the native orchestrator of Databricks. In an interview, the goal is to assess the candidate's ability to use it effectively to orchestrate complex pipelines with dependencies and CI/CD.

Data Builder · June 2025 · 6 min read · Data Engineer

Contents

Workflows concepts
Tasks and dependencies
Git integration
Asset Bundles
CI/CD with Databricks
Cost optimization
Evaluation grid

1. Databricks Workflows vs Airflow

Discriminating question

When do you choose Databricks Workflows over Airflow?

Databricks Workflows: the native orchestrator built into the platform. No infrastructure to manage, monitoring in the Databricks UI, direct access to clusters
Advantages vs Airflow: no Python DAG to maintain, job clusters can be shared between tasks, monitoring is unified with notebooks, and retries are native, with the option to resume a run from the failed task
Limitations vs Airflow: fewer connectors (no provider ecosystem), harder to orchestrate resources outside Databricks, less mature for complex pipelines
Rule of thumb: if about 90% of your processing runs in Databricks, use Workflows. If you orchestrate external services (Fivetran, dbt Cloud, APIs), use Airflow or Prefect
Coexistence: many teams orchestrate with Airflow and trigger Databricks runs through DatabricksSubmitRunOperator

# Submit a one-time Databricks run from Airflow
# (DatabricksRunNowOperator is the operator for triggering an existing job)
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

dbt_task = DatabricksSubmitRunOperator(
    task_id='run_dbt_databricks',
    databricks_conn_id='databricks_default',
    new_cluster={
        'spark_version': '14.3.x-scala2.12',
        'node_type_id': 'i3.xlarge',
        'num_workers': 4
    },
    notebook_task={
        'notebook_path': '/Repos/production/dbt_run',
        'base_parameters': {'env': 'production', 'target': 'prod'}
    }
)

2. Tasks and complex dependencies

Discriminating question

What types of tasks does Databricks Workflows support? How do you manage dependencies?

Notebook task: run a Databricks notebook with parameters. The most common type
Python script: run a .py script from a Git repo or a Unity Catalog volume
dbt task: native integration with dbt Core. Runs the dbt commands of a project against Databricks compute
SQL task: run SQL queries on a SQL Warehouse (serverless or classic)
Pipeline task: trigger a Delta Live Tables pipeline
run-if conditions: ALL_SUCCESS (default), AT_LEAST_ONE_SUCCESS, NONE_FAILED, ALL_DONE, AT_LEAST_ONE_FAILED, ALL_FAILED. They decide whether a task runs based on the outcome of its upstream tasks, which enables conditional branching
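
To make the run-if semantics concrete, here is a minimal Python sketch of how these conditions could be evaluated against upstream results. It is a simplified model, not Databricks code: the function name `should_run` and the three status strings are assumptions made for illustration, and ALL_FAILED is included because the Jobs API also defines it.

```python
from collections import Counter


def should_run(run_if: str, upstream: list[str]) -> bool:
    """Decide whether a task runs, given the statuses of its upstream tasks.

    Simplified model: each status is 'success', 'failed' or 'skipped'.
    """
    counts = Counter(upstream)
    ok, failed, total = counts["success"], counts["failed"], len(upstream)
    rules = {
        "ALL_SUCCESS": ok == total,
        "AT_LEAST_ONE_SUCCESS": ok >= 1,
        # No failure, and at least one upstream task actually ran
        "NONE_FAILED": failed == 0 and ok >= 1,
        # Only requires that every upstream task has finished
        "ALL_DONE": True,
        "AT_LEAST_ONE_FAILED": failed >= 1,
        "ALL_FAILED": total > 0 and failed == total,
    }
    return rules[run_if]


# A cleanup task that must run whatever happened upstream
print(should_run("ALL_DONE", ["success", "failed"]))
# An alerting task that runs only when something broke
print(should_run("AT_LEAST_ONE_FAILED", ["success", "failed"]))
```

A candidate who can explain why an alerting task uses AT_LEAST_ONE_FAILED while a cleanup task uses ALL_DONE has usually built real branching workflows.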

# Example workflow JSON (config as code)
{
  "name": "daily-data-pipeline",
  "tasks": [
    {
      "task_key": "ingestion",
      "notebook_task": {"notebook_path": "/Repos/prod/ingestion"},
      "job_cluster_key": "ingestion-cluster"
    },
    {
      "task_key": "transformation",
      "depends_on": [{"task_key": "ingestion"}],
      "dbt_task": {
        "project_directory": "/Repos/prod/dbt_project",
        "commands": ["dbt run --select tag:daily"]
      },
      "job_cluster_key": "transform-cluster"
    },
    {
      "task_key": "notification",
      "depends_on": [{"task_key": "transformation"}],
      "run_if": "ALL_SUCCESS",
      "notebook_task": {"notebook_path": "/Repos/prod/notify_success"}
    }
  ]
}
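
The depends_on entries in a job definition like this one form a directed graph. As an illustration, the sketch below uses Python's standard graphlib module to check that every dependency refers to an existing task and to derive one valid execution order. The helper name `execution_order` is hypothetical, and the job dictionary is a trimmed copy of the example above.

```python
from graphlib import TopologicalSorter


def execution_order(job: dict) -> list[str]:
    """Return one valid run order for the job's tasks."""
    keys = {task["task_key"] for task in job["tasks"]}
    graph = {}
    for task in job["tasks"]:
        deps = {dep["task_key"] for dep in task.get("depends_on", [])}
        unknown = deps - keys
        if unknown:
            raise ValueError(
                f"{task['task_key']} depends on unknown tasks: {sorted(unknown)}"
            )
        graph[task["task_key"]] = deps
    # static_order() raises graphlib.CycleError on circular dependencies
    return list(TopologicalSorter(graph).static_order())


job = {
    "name": "daily-data-pipeline",
    "tasks": [
        {"task_key": "ingestion"},
        {"task_key": "transformation", "depends_on": [{"task_key": "ingestion"}]},
        {"task_key": "notification", "depends_on": [{"task_key": "transformation"}]},
    ],
}

print(execution_order(job))  # ['ingestion', 'transformation', 'notification']
```

Running a check of this kind in CI catches a misspelled task_key before the job definition reaches the workspace.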

3. Git integration in Databricks

Discriminating question

How do you integrate Git with Databricks Repos for a CI/CD workflow?

Databricks Repos: clone a GitHub/GitLab/Bitbucket repo directly into the workspace. Synchronization is manual or driven through the API
Branches per environment: main → production, develop → staging, feature/* → personal dev
Jobs point to branches: the prod job points to main, the staging job to develop. Promotion happens by merging a PR
Mandatory code review: notebooks are stored as Python files in the repo, so they go through standard pull requests with code review
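
The branch-per-environment convention above can be expressed as a small lookup that a CI script might call before deploying. This is only a sketch: the function name `environment_for_branch` and the returned labels are assumptions chosen to mirror the list, not part of any Databricks API.

```python
def environment_for_branch(branch: str) -> str:
    """Map a Git branch name to the environment it deploys to."""
    if branch == "main":
        return "production"
    if branch == "develop":
        return "staging"
    if branch.startswith("feature/"):
        return "dev"
    # Fail loudly rather than deploy an unexpected branch somewhere
    raise ValueError(f"No environment mapped for branch {branch!r}")


print(environment_for_branch("feature/add-sales-table"))  # dev
```

Raising on unknown branches is a deliberate choice here: a silent default would risk deploying a hotfix or release branch to the wrong workspace.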

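# Update a Databricks repo via API (from CI/CD)
import os
import requests

# Assumed to be provided by the CI environment (e.g. GitHub Actions secrets)
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def update_databricks_repo(repo_id, branch):
    resp = requests.patch(
        f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
        headers=HEADERS, json={"branch": branch}, timeout=30,
    )
    resp.raise_for_status()  # fail the CI step on an API error
    return resp.json()

def trigger_job(job_id):
    # Start a run of an existing job through the Jobs API
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers=HEADERS, json={"job_id": job_id}, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# In GitHub Actions: update repo → trigger job
update_databricks_repo(int(os.environ["PROD_REPO_ID"]), "main")
trigger_job(int(os.environ["PROD_JOB_ID"]))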

4. Asset Bundles: IaC for Databricks

Discriminating question

What are Databricks Asset Bundles? Why replace manual configurations?

DAB (Databricks Asset Bundles): define jobs, DLT pipelines, permissions, and clusters as YAML versioned in Git
Environments: dev/staging/prod with different variable values (cluster size, credentials, targets) in the same file
Deploy via CLI: databricks bundle deploy synchronizes the configuration to the target environment
vs manual configuration: configuration done by hand in the UI is not versioned, not reproducible, and not testable

# databricks.yml - Asset Bundle
bundle:
  name: sales-pipeline

variables:
  cluster_size:
    default: "Small"

targets:
  dev:
    mode: development
    variables:
      cluster_size: "Small"
  prod:
    mode: production
    variables:
      cluster_size: "Large"

resources:
  jobs:
    pipeline_quotidien:
      name: "Daily Sales Pipeline"
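      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"  # the Jobs API expects a timezone with the cron expression; UTC is an assumed value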
      tasks:
        - task_key: ingestion
          notebook_task:
            notebook_path: ./notebooks/ingestion.py
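
To illustrate how per-target variables behave, the following Python sketch merges the top-level defaults with the overrides of one target, using a dictionary that mirrors the YAML above. It is a simplified model for discussion only and not the CLI's actual resolution logic, which supports more sources of values. The function name `resolve_variables` is hypothetical.

```python
def resolve_variables(bundle: dict, target: str) -> dict:
    """Merge top-level variable defaults with the overrides of one target."""
    resolved = {
        name: spec.get("default")
        for name, spec in bundle.get("variables", {}).items()
    }
    overrides = bundle["targets"][target].get("variables", {})
    unknown = set(overrides) - set(resolved)
    if unknown:
        raise KeyError(
            f"Target {target!r} overrides undeclared variables: {sorted(unknown)}"
        )
    resolved.update(overrides)
    return resolved


# Same structure as the databricks.yml example, as a Python dict
bundle = {
    "variables": {"cluster_size": {"default": "Small"}},
    "targets": {
        "dev": {"mode": "development", "variables": {"cluster_size": "Small"}},
        "prod": {"mode": "production", "variables": {"cluster_size": "Large"}},
    },
}

print(resolve_variables(bundle, "prod"))  # {'cluster_size': 'Large'}
```

The point to probe in an interview is the same one the sketch makes: one definition, several targets, and only the values that differ are overridden.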

5. CI/CD with Databricks Asset Bundles

Discriminating question

How do you configure a CI/CD pipeline to deploy Databricks jobs?

# .github/workflows/deploy.yml
name: Deploy Databricks
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main
      
      - name: Validate bundle (PR)
        if: github.event_name == 'pull_request'
        run: databricks bundle validate --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      
      - name: Deploy to production (main)
        if: github.ref == 'refs/heads/main'
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_PROD_TOKEN }}
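
The two `if:` conditions in this workflow decide which bundle command runs for a given event. The Python sketch below reproduces that decision so it can be reasoned about, and unit tested, outside GitHub Actions. The function name `bundle_commands` is an assumption; the commands and targets are the ones used in the workflow above.

```python
def bundle_commands(event_name: str, ref: str) -> list[list[str]]:
    """Return the Databricks CLI commands the workflow would run."""
    commands = []
    # Pull requests only validate the bundle against staging
    if event_name == "pull_request":
        commands.append(["databricks", "bundle", "validate", "--target", "staging"])
    # A push to main deploys to production
    if ref == "refs/heads/main":
        commands.append(["databricks", "bundle", "deploy", "--target", "prod"])
    return commands


# On pull_request events, github.ref has the form refs/pull/<number>/merge
print(bundle_commands("pull_request", "refs/pull/12/merge"))
print(bundle_commands("push", "refs/heads/main"))
```

Because the ref of a pull request never equals refs/heads/main, a PR validates without deploying, and only the merge to main reaches production.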

6. Optimizing Workflows costs

Discriminating question

How do you reduce the costs of your Databricks jobs in production?

Job clusters vs all-purpose: use job clusters for production. They are created for the run and shut down when the job ends. All-purpose clusters are billed at a higher DBU rate and keep costing money for as long as they stay up
Spot instances for workers: savings of roughly 50-80% on the worker VM cost. Configure an on-demand fallback for when spot capacity is unavailable
Photon Engine: enable Photon on SQL/DataFrame tasks. Up to 12x faster on suitable workloads. Photon is billed at a higher DBU rate, so it lowers the bill only when the speedup outweighs that premium
Serverless for SQL tasks: serverless SQL Warehouse with per-second billing, fast startup, and no cluster management
Right-sized clusters: start small (2-4 workers) and measure. Many pipelines do not need 8 workers
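
A rough cost model helps compare these levers. The sketch below assumes that a run is billed as DBU charges plus cloud VM charges for the workers, and it ignores the driver node for simplicity. Every rate and number in it is hypothetical and chosen only to make the arithmetic easy to follow; real prices depend on the cloud, region, and pricing tier.

```python
def job_cost(hours, workers, dbu_per_node_hour, dbu_price,
             vm_price_per_hour, spot_discount=0.0):
    """Estimated cost of one run: DBU charges plus worker VM charges."""
    dbu_cost = hours * workers * dbu_per_node_hour * dbu_price
    # The spot discount applies to the VM price only, not to DBUs
    vm_cost = hours * workers * vm_price_per_hour * (1.0 - spot_discount)
    return dbu_cost + vm_cost


# Hypothetical baseline: 2 hours on 4 on-demand workers
baseline = job_cost(2.0, 4, dbu_per_node_hour=1.0, dbu_price=0.15,
                    vm_price_per_hour=0.30)

# Same run with workers on spot at an assumed 60% discount
with_spot = job_cost(2.0, 4, dbu_per_node_hour=1.0, dbu_price=0.15,
                     vm_price_per_hour=0.30, spot_discount=0.6)

# Photon, assuming twice the DBU rate and a 3x speedup
with_photon = job_cost(2.0 / 3, 4, dbu_per_node_hour=2.0, dbu_price=0.15,
                       vm_price_per_hour=0.30)

print(round(baseline, 2), round(with_spot, 2), round(with_photon, 2))
```

With these assumed numbers Photon pays off because the 3x speedup exceeds the 2x DBU premium. With a speedup below the premium, the DBU part of the bill would go up, which is why measuring on the actual workload matters.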

7. Evaluation grid by level

Level	Mastery	GO signal	NO-GO
Mid-level	Workflows jobs, task types, Git integration, job clusters	Has configured jobs with dependencies, uses job clusters, points to Git branches	Runs in production on all-purpose clusters, manual config in the UI
Senior	Asset Bundles, CI/CD GitHub Actions, cost optimization, spot instances	Has set up a CI/CD pipeline with DAB, reduced costs via spot + Photon	Does not know what Asset Bundles are, no CI/CD on Databricks jobs
