Data recruiting guide

Data lineage technical test: OpenLineage, DataHub and traceability

Data lineage tells you where a piece of data comes from and where it goes. In 2025, it is a senior-level topic that sets apart candidates with a real governance culture.

Data Builder · June 2025 · 6 min read · Analytics Engineer · Data Engineer

Contents

Why data lineage
OpenLineage standard
DataHub and lineage
dbt lineage
Column-level lineage
Impact analysis
Level grid

1. Why data lineage is critical

Discriminating question

In which concrete cases has data lineage been useful to you?

Impact analysis — 'if I rename this column in stg_orders, which dashboards will break?' Without lineage, impossible to answer
Root cause analysis — 'the revenue KPI dropped 20% yesterday, which transformation is responsible?' Lineage allows tracing back to the source
GDPR compliance — 'where is the personal data from this source table being used?' Lineage answers in seconds
Onboarding — understanding how data flows through the organization without having to ask dozens of people
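
The impact-analysis and root-cause questions above are both graph traversals: downstream from a changed model for impact, upstream from a broken KPI for root cause. A minimal sketch, assuming a hypothetical edge list (every name other than stg_orders is invented for illustration, and a real lineage backend would supply the edges):

```python
from collections import deque

# Hypothetical lineage edges as (upstream, downstream) pairs; names are illustrative only.
EDGES = [
    ("raw.orders", "stg_orders"),
    ("stg_orders", "fct_orders"),
    ("fct_orders", "dashboard_revenue"),
    ("stg_customers", "dim_customers"),
]


def reachable(start, edges, direction="downstream"):
    """Return every node reachable from `start`, walking the lineage graph breadth-first."""
    neighbours = {}
    for up, down in edges:
        src, dst = (up, down) if direction == "downstream" else (down, up)
        neighbours.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Impact analysis: what sits downstream of stg_orders?
impacted = reachable("stg_orders", EDGES)
# Root cause analysis: what feeds the revenue dashboard?
suspects = reachable("dashboard_revenue", EDGES, direction="upstream")
```

The same function serves both questions; only the direction of the walk changes.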

2. OpenLineage: the open standard

Discriminating question

What is OpenLineage? Why is it an important standard?

# Example of an OpenLineage event emitted by Airflow (simplified: schemaURL and facet metadata omitted)
{
  "eventType": "COMPLETE",
  "eventTime": "2025-01-15T08:30:00Z",
  "producer": "https://github.com/my-org/data-pipeline",
  "run": {
    "runId": "<UUID shared by the START and COMPLETE events of this run>"
  },
  "job": {
    "namespace": "airflow",
    "name": "pipeline_ventes.transform_orders"
  },
  "inputs": [{
    "namespace": "snowflake",
    "name": "RAW_DB.FIVETRAN.ORDERS",
    "facets": {
      "schema": {
        "fields": [{"name": "order_id", "type": "VARCHAR"}]
      }
    }
  }],
  "outputs": [{
    "namespace": "snowflake",
    "name": "PROD_DB.STAGING.STG_ORDERS"
  }]
}

OpenLineage — open standard (Linux Foundation) for emitting lineage events from any tool
Integrations — Airflow, Spark, dbt, Flink and Great Expectations all have integrations that emit OpenLineage events
Backends — Marquez (open source reference), DataHub, Atlan can receive these events
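
The value of the standard is that any job able to produce this event shape can feed any compatible backend. As a rough illustration, the sketch below builds such an event with the Python standard library only. The helper `build_run_event` and the producer URI are hypothetical, and a real event would normally be built and sent with the OpenLineage client, which also fills in the schema URL.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Build a minimal OpenLineage-style run event as a plain dict.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
    }


event = build_run_event(
    "COMPLETE",
    "airflow",
    "pipeline_ventes.transform_orders",
    inputs=[("snowflake", "RAW_DB.FIVETRAN.ORDERS")],
    outputs=[("snowflake", "PROD_DB.STAGING.STG_ORDERS")],
)
# Serialized form, ready to be posted to a backend by whatever transport is in use.
payload = json.dumps(event, indent=2)
```

Passing the same `run_id` to a START and a COMPLETE event is what lets a backend tie the two together as one run.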

3. DataHub: centralized catalog and lineage

Discriminating question

How does DataHub aggregate lineage from several different tools?

DataHub — open source governance platform (originally built at LinkedIn) that centralizes metadata and lineage
Ingestion sources — Snowflake, BigQuery, dbt, Airflow, Looker, Power BI, Kafka: each source has a DataHub connector
Lineage UI — interactive graph showing dependencies from the source all the way to dashboards
Deployment — Docker Compose for local dev, Kubernetes for production

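# Launch DataHub locally
pip install 'acryl-datahub[dbt]'   # CLI plus the dbt ingestion plugin
datahub docker quickstart

# Ingest dbt metadata
datahub ingest -c dbt_ingest.yaml

# dbt_ingest.yaml
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json
    catalog_path: ./target/catalog.json
    sources_path: ./target/sources.json   # optional, written by `dbt source freshness`
    target_platform: snowflake            # warehouse that dbt builds into
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080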
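
Lineage reported by different tools can be stitched together because every tool refers to a dataset by the same identifier (a URN). The sketch below illustrates only that idea and is not DataHub's actual ingestion logic. The helper names and the FCT_ORDERS table are hypothetical; the URN layout follows the usual DataHub dataset URN format.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN following the DataHub dataset URN layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def merge_lineage(*sources):
    """Merge per-tool edge lists into one graph: {upstream_urn: set(downstream_urns)}."""
    graph = {}
    for edges in sources:
        for upstream, downstream in edges:
            graph.setdefault(upstream, set()).add(downstream)
    return graph


raw = dataset_urn("snowflake", "RAW_DB.FIVETRAN.ORDERS")
stg = dataset_urn("snowflake", "PROD_DB.STAGING.STG_ORDERS")
fct = dataset_urn("snowflake", "PROD_DB.MARTS.FCT_ORDERS")  # hypothetical table

airflow_edges = [(raw, stg)]          # edge as an orchestrator might report it
dbt_edges = [(raw, stg), (stg, fct)]  # edges as dbt metadata might report them
graph = merge_lineage(airflow_edges, dbt_edges)
```

Because both tools name the staging table with the same URN, the duplicate edge collapses and the dbt edge extends the chain instead of creating a second, disconnected graph.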

4. dbt lineage: the starting point

Discriminating question

How does dbt automatically generate lineage?

ref() and source() — every call to ref() and source() in a dbt model creates an edge in the lineage graph
dbt DAG — dbt automatically builds the dependency graph from the ref() and source() calls. It can be viewed in dbt docs
dbt docs generate — generates a documentation site with a navigable lineage graph
Exposures — declare final consumers (dashboards, APIs) so they appear as outgoing edges in the lineage
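
The graph dbt builds is also written to target/manifest.json, which makes it scriptable. A minimal sketch, assuming the manifest exposes a `child_map` key mapping each node to its direct children (true of recent dbt versions, but worth checking against your own manifest). The project name `shop` and the sample nodes are hypothetical.

```python
import json
from pathlib import Path


def descendants(manifest, unique_id):
    """Return all nodes downstream of `unique_id`, using the manifest's child_map."""
    child_map = manifest.get("child_map", {})
    seen, stack = set(), [unique_id]
    while stack:
        for child in child_map.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def load_manifest(path="target/manifest.json"):
    """Load the manifest written by dbt commands such as `dbt docs generate`."""
    return json.loads(Path(path).read_text())


# Tiny hand-written stand-in for a real manifest, so the sketch runs without a dbt project.
sample = {
    "child_map": {
        "model.shop.stg_orders": ["model.shop.fct_orders"],
        "model.shop.fct_orders": ["exposure.shop.revenue_dashboard"],
        "exposure.shop.revenue_dashboard": [],
    }
}
downstream = descendants(sample, "model.shop.stg_orders")
```

On a real project, `descendants(load_manifest(), "model.<project>.stg_orders")` would give the same kind of answer as the `stg_orders+` selector.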

5. Column-level lineage: maximum granularity

Discriminating question

What is column-level lineage? Why is it more powerful than table-level lineage?

Table-level lineage — 'the fct_orders table depends on stg_orders'. Useful but not sufficient for impact analysis
Column-level lineage — 'the fct_orders.amount column comes from stg_orders.unit_price * stg_orders.quantity'. Fine-grained granularity
Use cases — precisely identifying which source column impacts which KPI, even after 10 transformations
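Tools — dbt Core itself only tracks model-level dependencies. Column-level lineage comes from dbt Cloud (dbt Explorer, on certain plans) or from tools such as DataHub and Atlan, which derive it by parsing the SQL and then visualize it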
Limitation — complex transformations (Python, Spark) are difficult to trace at the column level
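
Conceptually, column-level lineage is a mapping from each output column to the columns it is computed from. The sketch below assumes such a mapping has already been extracted by a lineage tool. The mapping itself is hypothetical, reuses the fct_orders.amount example above, and is assumed to be acyclic.

```python
# Hypothetical column-level lineage: each column maps to the columns it is derived from.
COLUMN_LINEAGE = {
    "fct_orders.amount": ["stg_orders.unit_price", "stg_orders.quantity"],
    "stg_orders.unit_price": ["raw.orders.unit_price"],
    "stg_orders.quantity": ["raw.orders.qty"],
    "kpi_revenue.total": ["fct_orders.amount"],
}


def source_columns(column, lineage):
    """Resolve a column down to the root columns it ultimately depends on."""
    parents = lineage.get(column)
    if not parents:
        return {column}  # no recorded parents: treat it as a source column
    roots = set()
    for parent in parents:
        roots |= source_columns(parent, lineage)
    return roots


def impacted_columns(source, lineage):
    """Return every derived column whose value depends on `source`."""
    return {col for col in lineage if source in source_columns(col, lineage)}


kpi_roots = source_columns("kpi_revenue.total", COLUMN_LINEAGE)
qty_impact = impacted_columns("raw.orders.qty", COLUMN_LINEAGE)
```

A table-level graph would only say that kpi_revenue depends on raw.orders. Here a change to raw.orders.qty is shown to reach the KPI, while stg_orders.unit_price is left out of the impacted set.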

6. Impact analysis in practice

Discriminating question

Show how you use lineage to assess the impact of a change.

# dbt: identify models impacted by a change
# If I modify stg_orders.sql, which models are affected?
dbt ls --select stg_orders+  # stg_orders and all its descendants

# Rebuild and test stg_orders together with everything downstream of it
dbt build --select stg_orders+

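# DataHub API: programmatic impact analysis (sketch, assumes the quickstart GMS at localhost:8080)
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# Query everything downstream of a dataset with the GraphQL searchAcrossLineage query
query = """query ($urn: String!) {
  searchAcrossLineage(input: {urn: $urn, direction: DOWNSTREAM, query: "*", start: 0, count: 100}) {
    searchResults { degree entity { urn type } }
  }
}"""
urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,PROD_DB.STAGING.STG_ORDERS,PROD)"
result = graph.execute_graphql(query, variables={"urn": urn})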

# OpenLineage: emit lineage from any job (openlineage-python client)
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

PRODUCER = "https://github.com/my-org/data-pipeline"
client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL and related variables

# Emit a job start event (a COMPLETE event with the same runId should follow when the job ends)
run_event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace="data-platform", name="transform_orders"),
    producer=PRODUCER,
    inputs=[Dataset(namespace="snowflake", name="raw.orders")],
    outputs=[Dataset(namespace="snowflake", name="analytics.fct_orders")],
)
client.emit(run_event)

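# OpenLineage integrations
# - dbt: dbt-ol wrapper command from the openlineage-dbt package (automatic lineage)
# - Airflow: apache-airflow-providers-openlineage (Airflow 2.7+), openlineage-airflow for older versions
# - Spark: openlineage-spark jar (Spark listener)
# - Great Expectations: OpenLineageValidationAction validation action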

OpenLineage vs DataHub lineage — OpenLineage: open standard for emitting lineage from any tool. DataHub/Marquez/Atlan: backends that consume these events
Marquez — open source metadata repository compatible with OpenLineage. Lightweight alternative to DataHub for lineage only
Column-by-column lineage — table-to-table lineage is not enough. Column lineage allows tracing the origin of each field (e.g.: amount in fct_orders comes from raw.transactions.price)
Impact analysis — with lineage, answer in 30 seconds: 'if I change this source column, which dashboards will be impacted?'
Airflow integration — install apache-airflow-providers-openlineage and configure a transport. Tasks run by supported operators then emit lineage events automatically to Marquez or DataHub

7. Level grid

Level	Mastery	GO signal	NO-GO signal
Mid-level	dbt lineage, dbt docs, understands the concept	Uses ref() systematically, knows how to generate dbt docs with lineage	Does not know what data lineage is
Senior	OpenLineage, DataHub, column-level lineage, impact analysis	Has deployed DataHub, uses impact analysis before every critical change	Has never heard of OpenLineage