Data hiring guide

Advanced Airflow technical interview: DAGs, XCom, TaskFlow, architecture

Airflow is the reference orchestrator in data engineering, yet many candidates never go beyond basic DAGs. Here are the advanced concepts we test to validate a Data Engineer for production work.

Data Builder·June 2025·8 min read·Data Engineer

Contents

DAGs and operators
TaskFlow API
XCom
Control flow and trigger rules
Airflow architecture
Airflow in production
Level grid

Airflow is not a processing framework; it is an orchestrator. Understanding this distinction is the first thing we check in an interview. A Data Engineer who passes datasets through XCom or puts business logic inside their DAGs has not yet reached Senior level.

1. DAGs, operators and structure

Key question

What is the difference between an operator, a sensor and a TaskFlow-decorated task? When do you use each one?

A DAG (Directed Acyclic Graph) is a directed graph with no cycles. It defines the execution order of tasks, their dependencies and their schedule, but not their business content. The DAG is the orchestrator, not the processor.

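from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extraire_donnees():
    """Placeholder: pull raw data from the source system."""


def transformer_donnees():
    """Placeholder: transform the extracted data."""


with DAG(
    dag_id="mon_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,      # do not backfill past runs
    max_active_runs=1,  # one run of this DAG at a time
) as dag:

    extract = PythonOperator(
        task_id="extraction",
        python_callable=extraire_donnees,
    )

    transform = PythonOperator(
        task_id="transformation",
        python_callable=transformer_donnees,
    )

    extract >> transform  # dependency: transform waits for extract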

Operators: predefined tasks such as PythonOperator, BashOperator, SparkSubmitOperator, etc.
Sensors: wait for an external event (S3KeySensor, HttpSensor, ExternalTaskSensor)
TaskFlow (@task): decorated Python functions whose return values are passed between tasks through XCom automatically
catchup=False: do not backfill the runs between start_date and now when the DAG is first enabled
max_active_runs: limits concurrent runs of the same DAG
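To make the sensor case concrete, here is a minimal sketch of a file check wrapped in a PythonSensor. The function names, the task_id and the intervals are illustrative assumptions, not part of the original article; the Airflow import is deferred inside the builder so the check itself can be unit-tested without an Airflow installation (the import path shown is the Airflow 2.x one).

```python
from pathlib import Path


def file_is_ready(path: str, min_bytes: int = 1) -> bool:
    """Poke callable: True once the file exists and holds at least min_bytes."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes


def build_sensor(path: str):
    """Wrap the check in a PythonSensor (hypothetical task_id and timings)."""
    from airflow.sensors.python import PythonSensor  # Airflow 2.x import path

    return PythonSensor(
        task_id="wait_for_export",
        python_callable=file_is_ready,
        op_kwargs={"path": path},
        poke_interval=60,      # seconds between two checks
        timeout=60 * 60,       # give up after one hour
        mode="reschedule",     # free the worker slot between checks
    )
```

Keeping the condition in a plain function is a deliberate choice: the sensor only orchestrates the waiting, while the check stays testable on its own.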

Best practice: 1 task = 1 responsibility. A DAG that performs extraction, transformation and loading in a single Python task is an anti-pattern: it breaks observability and prevents retrying a single step.

2. TaskFlow API: the modern way to write Airflow DAGs

Key question

What are the advantages of the TaskFlow API over classic PythonOperators? What are its limitations?

Introduced in Airflow 2.0, the TaskFlow API greatly simplifies Python DAGs by removing the explicit XCom boilerplate.

from datetime import datetime

from airflow.decorators import dag, task


# schedule= requires Airflow 2.4+; older versions use schedule_interval=
@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def mon_pipeline_taskflow():

    @task
    def extraire() -> dict:
        return {"lignes": 1500, "source": "postgres"}

    @task
    def transformer(data: dict) -> dict:
        return {"lignes_traitees": data["lignes"], "source": data["source"]}

    @task
    def charger(data: dict):
        print(f"Loading {data['lignes_traitees']} rows from {data['source']}")

    # Dependencies are inferred from the function calls
    charger(transformer(extraire()))


mon_pipeline_taskflow()

Advantages: more readable code, implicit XCom, dependencies inferred automatically, less boilerplate
Limitations: @task wraps Python callables; specialized operators (Spark, BigQuery, etc.) are still instantiated the classic way, although both styles can be mixed in the same DAG
Task Groups: group tasks visually in the Airflow UI without affecting execution
Dynamic Task Mapping: create a variable number of task instances at runtime, based on the data
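Dynamic Task Mapping is easier to grasp with an example. The sketch below assumes Airflow 2.4 or later (for `schedule=` and `.expand()`); the bucket path, the shard count and every function name are hypothetical. The Airflow import is deferred inside `build_dag()` so that the partition listing can be tested on its own; in a real DAG file, `build_dag()` would be called at module level.

```python
from datetime import datetime


def list_partitions(day: str, n_shards: int = 3) -> list[str]:
    """Return the (hypothetical) object paths to process for one day."""
    return [
        f"s3://example-bucket/raw/{day}/part-{i:02d}.parquet"
        for i in range(n_shards)
    ]


def build_dag():
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
    def mapped_pipeline():

        @task
        def partitions() -> list[str]:
            return list_partitions("2025-01-01")

        @task
        def process(path: str) -> int:
            # Placeholder work: a real task would trigger an external job
            return len(path)

        @task
        def summarize(sizes) -> int:
            # Receives the results of every mapped task instance
            return sum(sizes)

        # One "process" task instance is created per element at runtime
        summarize(process.expand(path=partitions()))

    return mapped_pipeline()
```

The number of `process` instances is decided when `partitions` finishes, not when the DAG file is parsed, which is what distinguishes mapping from a Python loop in the DAG definition.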

3. XCom: communication between tasks

Key question

Is Airflow a data processing framework? Why should you not pass datasets through XCom?

XCom (cross-communication) lets tasks exchange small pieces of information. It is a lightweight mechanism, not a data transfer system.

from airflow.decorators import task


# Explicit push in a classic PythonOperator
def ma_tache(**context):
    context["task_instance"].xcom_push(key="nb_lignes", value=1500)


# Explicit pull
def tache_suivante(**context):
    nb = context["task_instance"].xcom_pull(task_ids="ma_tache", key="nb_lignes")
    print(f"Received: {nb} rows")


# With TaskFlow: implicit, via the return value
# (renamed so it does not shadow the classic ma_tache above)
@task
def ma_tache_taskflow() -> int:
    return 1500  # automatically pushed to XCom under the key "return_value"

Size limit: by default XCom values are stored in the Airflow metadata database, so the hard ceiling depends on the database engine (on the order of 1 GB on PostgreSQL, far less on MySQL); in practice, keep values to a few kilobytes
Correct usage: pass metadata (row count, S3 path, status), not datasets
Incorrect usage: passing DataFrames or entire files through XCom
Alternative: write the data to S3/GCS and pass the path through XCom
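The "write the data, pass the path" alternative can be sketched as follows. This is an illustration under stated assumptions: a local directory stands in for S3/GCS, and the function names and the shape of the returned reference are hypothetical. Only the small dictionary would travel through XCom.

```python
import csv
from pathlib import Path


def extract_to_storage(rows: list[list], base_dir: str) -> dict:
    """Write the dataset to storage and return a small reference to it."""
    path = Path(base_dir) / "extract.csv"
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    # This dict is what a task would return (and therefore push to XCom)
    return {"path": str(path), "row_count": len(rows)}


def transform_from_storage(ref: dict) -> int:
    """Downstream task: read the dataset back using the reference."""
    with open(ref["path"], newline="") as f:
        return sum(1 for _ in csv.reader(f))
```

The reference stays the same size whether the file holds ten rows or ten million, which keeps the metadata database out of the data path.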

Fundamental reminder: Airflow is an orchestrator, not a processing framework. Processing happens in Spark, dbt or cloud services. Airflow triggers and monitors; it does not process.

4. Control flow and trigger rules

Key question

What is a trigger rule, and when do you use none_failed_min_one_success?

By default, Airflow runs a task only if all of its upstream tasks have succeeded. Trigger rules change this behavior.

Trigger rule	Execution condition	Use case
`all_success`	All upstream tasks succeeded (default)	Standard pipeline
`all_failed`	All upstream tasks failed	Global failure notification
`all_done`	All upstream tasks finished (success or failure)	Systematic cleanup
`one_success`	At least one upstream task succeeded	Processing as soon as one source is available
`none_failed`	No upstream task failed (each succeeded or was skipped)	Final task after conditional branching
`none_failed_min_one_success`	No failure and at least one success	Aggregation after branching, where some branches are skipped
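The table can be read as a small decision function. The sketch below is a simplified model written for this guide, not Airflow's actual scheduler code: it treats `upstream_failed` as a failure and ignores the rules not listed in the table.

```python
FAILED = {"failed", "upstream_failed"}
DONE = {"success", "failed", "skipped", "upstream_failed"}


def should_run(rule: str, upstream: list[str]) -> bool:
    """Decide whether a task may run, given its upstream task states."""
    n_success = upstream.count("success")
    n_failed = sum(state in FAILED for state in upstream)
    all_done = all(state in DONE for state in upstream)

    if rule == "all_success":
        return n_success == len(upstream)
    if rule == "all_failed":
        return n_failed == len(upstream)
    if rule == "all_done":
        return all_done
    if rule == "one_success":
        return n_success >= 1  # does not wait for the other upstream tasks
    if rule == "none_failed":
        return all_done and n_failed == 0
    if rule == "none_failed_min_one_success":
        return all_done and n_failed == 0 and n_success >= 1
    raise ValueError(f"rule not covered by this sketch: {rule}")
```

After a branch, the upstream states of the join task typically look like `["success", "skipped"]`: `all_success` refuses to run, while `none_failed_min_one_success` accepts.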

Branching: BranchPythonOperator runs one conditional branch; the other branches are skipped
TriggerDagRunOperator: triggers another DAG from within a DAG
ShortCircuitOperator: skips all downstream tasks if a condition is false
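A branch followed by a join is the usual place where trigger rules matter. The sketch below assumes Airflow 2.4 or later; the threshold, the task ids and the hard-coded row count are illustrative. The Airflow imports are deferred inside `build_dag()` so the branching decision can be tested as plain Python.

```python
from datetime import datetime


def choose_branch(row_count: int, threshold: int = 10_000) -> str:
    """Return the task_id of the branch to follow."""
    return "full_load" if row_count >= threshold else "incremental_load"


def build_dag():
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.python import BranchPythonOperator

    with DAG(
        dag_id="branching_sketch",
        start_date=datetime(2025, 1, 1),
        schedule=None,  # triggered manually in this sketch
        catchup=False,
    ) as dag:
        branch = BranchPythonOperator(
            task_id="branch",
            python_callable=choose_branch,
            op_kwargs={"row_count": 1500},  # hard-coded for illustration
        )
        full = EmptyOperator(task_id="full_load")
        incremental = EmptyOperator(task_id="incremental_load")
        # With the default all_success, the join would be skipped
        # because one of its upstream branches is always skipped.
        join = EmptyOperator(
            task_id="join",
            trigger_rule="none_failed_min_one_success",
        )
        branch >> [full, incremental] >> join

    return dag
```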

5. Airflow architecture

Key question

What are the components of a production Airflow architecture? What is the difference between LocalExecutor and CeleryExecutor?

A production Airflow installation is made of several distinct components. Knowing the architecture is essential for deploying and debugging.

Scheduler: parses the DAGs, decides which tasks are ready to run and hands them to the executor
Executor: determines how tasks are run: LocalExecutor (subprocesses on the scheduler host), CeleryExecutor (distributed workers), KubernetesExecutor (one pod per task)
Webserver: the Airflow UI, used to inspect DAGs and trigger them manually
Metadata database: stores the state of DAGs and tasks, XCom values and connections (PostgreSQL recommended in production)
Workers: processes that execute the tasks (Celery workers, or per-task pods with KubernetesExecutor)

In production: as a rough rule of thumb, LocalExecutor holds up to around 50 DAGs, though the real ceiling depends on task concurrency and on the resources of the scheduler host. Beyond that, move to CeleryExecutor or KubernetesExecutor. Amazon MWAA (Managed Workflows for Apache Airflow) and Google Cloud Composer run the infrastructure for you.

6. Airflow in production: connections, SLAs, monitoring

Key question

How do you manage credentials in Airflow? Why should you not hard-code them in DAGs?

Connections: store credentials in the Airflow metadata database and reference them by conn_id in operators and hooks, so that no secret ends up in the DAG code or in version control
Secrets Backend: integration with AWS Secrets Manager, GCP Secret Manager or HashiCorp Vault for sensitive credentials
Variables: non-sensitive configuration parameters that can be changed without redeploying the DAG
SLAs: set an expected completion time for a task and raise an alert when it is missed (unlike execution_timeout, which fails the task)
Callbacks: on_failure_callback, on_success_callback, on_retry_callback for Slack/PagerDuty notifications
Pools: limit the concurrency of certain tasks (e.g. to avoid saturating a database)
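Two of these points, connections and callbacks, can be sketched in a few lines. This is an illustration, not a reference setup: the helper names, the message format and the `print` standing in for a Slack or PagerDuty call are assumptions. The `AIRFLOW_CONN_<CONN_ID>` environment variable convention and `BaseHook.get_connection` are standard Airflow 2.x features.

```python
def conn_env_var(conn_id: str) -> str:
    """Name of the environment variable Airflow reads for a connection."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


def get_password(conn_id: str) -> str:
    """Resolve a credential at runtime instead of hard-coding it."""
    from airflow.hooks.base import BaseHook  # Airflow 2.x import path

    return BaseHook.get_connection(conn_id).password


def failure_message(context: dict) -> str:
    """Build a short alert text from the task context."""
    ti = context["task_instance"]
    error = context.get("exception")
    return f"[Airflow] {ti.dag_id}.{ti.task_id} failed (try {ti.try_number}): {error}"


def alert_on_failure(context: dict) -> None:
    """on_failure_callback: print stands in for a Slack/PagerDuty call."""
    print(failure_message(context))


# Typical wiring in a DAG definition (hypothetical values):
# default_args = {"retries": 2, "on_failure_callback": alert_on_failure}
```

Because the credential is looked up by conn_id when the task runs, the same DAG file works unchanged whether the connection comes from the metadata database, an environment variable or a secrets backend.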

7. Level grid

Level	Expected proficiency	GO signal	NO-GO
Junior	Basic DAGs, PythonOperator, scheduling, simple dependencies	Understands that Airflow is an orchestrator, can create a DAG with catchup=False	Confuses Airflow with a processing framework, puts business logic in the DAG
Mid-level	TaskFlow API, XCom, trigger rules, branching, connections	Uses TaskFlow, knows when not to use XCom, can explain trigger rules	Does not know TaskFlow, passes DataFrames through XCom
Senior	Executor architecture, Secrets Backend, SLAs, dynamic task mapping	Compares CeleryExecutor and KubernetesExecutor, has configured a Secrets Backend	Cannot explain the difference between executors
Lead	Cloud architecture (MWAA/Composer), team standards, CI/CD for DAGs	Has migrated to MWAA or Composer, has set up automated tests on DAGs	Cannot explain how to deploy Airflow for high availability
