Au-dela des modeles de base, un Analytics Engineer Senior maitrise les macros, les packages communautaires, les snapshots pour le SCD2 et la documentation automatique.
dbt est devenu le standard de la transformation analytique. Mais connaitre SELECT ne suffit pas pour un poste Senior — on evalue la maitrise des macros, des packages et du cycle de vie complet des donnees.
1Macros Jinja avancees
Question discriminante
Quelle est la difference entre une macro et un modele ? Quand ecrivez-vous une macro plutot qu un modele ?
-- macros/generate_date_spine.sql
{% macro generate_date_spine(start, end) %}
{{ dbt_utils.date_spine(
datepart='day',
start_date="cast('" ~ start ~ "' as date)",
end_date="cast('" ~ end ~ "' as date)"
) }}
{% endmacro %}
-- macros/get_column_values.sql
{% macro get_top_n(column, table, n=10) %}
{% set query %}
SELECT DISTINCT {{ column }}
FROM {{ ref(table) }}
ORDER BY 1
LIMIT {{ n }}
{% endset %}
{% set results = run_query(query) %}
{{ return(results.columns[0].values()) }}
{% endmacro %}
- Macro — fonction Jinja reutilisable qui genere du SQL. Pas de donnees persistees. Idéal pour la logique repetitive
- Modele — table ou vue persistee dans le data warehouse. Contient de la logique metier
- run_query() — executer une requete SQL dans une macro pour generer du SQL dynamiquement
2Packages dbt-utils et dbt-expectations
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: ['>=1.0.0']
- package: calogica/dbt_expectations
version: ['>=0.10.0']
# Utilisation dans les tests YAML
models:
- name: fct_orders
columns:
- name: amount
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100000
- dbt_expectations.expect_column_proportion_of_unique_values_to_be_between:
min_value: 0.95
- dbt_utils — fonctions utilitaires : surrogate_key(), date_spine(), pivot(), star(), get_column_values()
- dbt_expectations — tests inspires de Great Expectations. Comparaisons de distributions, valeurs entre bornes, proportions
- dbt_audit_helper — comparer deux versions d un modele pour valider une refactorisation
3Snapshots : implementer le SCD2
Question discriminante
Qu est-ce qu un snapshot dbt ? Comment implementez-vous un SCD de type 2 avec dbt ?
-- snapshots/snap_customers.sql
{% snapshot snap_customers %}
{{ config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
) }}
SELECT * FROM {{ source('crm', 'customers') }}
{% endsnapshot %}
-- Colonnes ajoutees automatiquement par dbt :
-- dbt_scd_id : identifiant unique de la version
-- dbt_updated_at : date de la mise a jour
-- dbt_valid_from : date de debut de validite
-- dbt_valid_to : date de fin (NULL = version courante)
- SCD2 — conserver l historique des changements. Chaque modification cree une nouvelle ligne avec des dates de validite
- strategy=timestamp — detecte les changements via une colonne updated_at
- strategy=check — compare les valeurs d une liste de colonnes pour detecter les changements
4Analyses, seeds et exposures
- Analyses — fichiers SQL pour l exploration ad-hoc, non materialises en table. Ideals pour les queries d investigation
- Seeds — petits fichiers CSV charges dans le warehouse via dbt seed. Utiles pour les tables de reference (pays, categories, mappings)
- Exposures — documenter les consumers des modeles dbt (dashboards Tableau, Power BI, apps). Permet la tracabilite dans dbt docs
5Hooks et operations
# dbt_project.yml
models:
mon_projet:
marts:
+post-hook:
- 'GRANT SELECT ON {{ this }} TO ROLE analyst_role'
- '{{ log("Model " ~ this ~ " built", info=True) }}'
# on-run-start et on-run-end
on-run-start:
- 'CREATE SCHEMA IF NOT EXISTS {{ target.schema }}'
on-run-end:
- '{{ elementary.on_run_end() }}' # package de monitoring
6Documentation et lineage automatique
- dbt docs generate — genere une documentation navigable avec le lineage complet de toutes les sources jusqu aux modeles finals
- Descriptions YAML — documenter chaque modele et chaque colonne dans les fichiers schema.yml
- dbt docs serve — interface web locale pour naviguer dans la documentation et le graphe de dependances
- Column-level lineage — avec dbt 1.6+, le lineage est trace au niveau des colonnes (quelle colonne source alimente quelle colonne cible)
# Techniques avancées dbt
# 1. Custom generic test (réutilisable)
# tests/generic/test_not_negative.sql
{% test not_negative(model, column_name) %}
SELECT {{ column_name }} FROM {{ model }}
WHERE {{ column_name }} < 0
{% endtest %}
# 2. Hooks pour des opérations custom
# dbt_project.yml
models:
mon_projet:
marts:
+post-hook: "GRANT SELECT ON {{ this }} TO ROLE analyst_role"
+pre-hook: "DROP TABLE IF EXISTS {{ this }}_old"
# 3. Selector yaml pour orchestration fine
# selectors.yml
selectors:
- name: daily_pipeline
definition:
union:
- tag: daily
- method: path
value: models/marts/
- name: modified_and_downstream
definition:
method: state
value: modified+
# 4. dbt Mesh : cross-project references
# Référencer un modèle d'un autre projet dbt
SELECT * FROM {{ ref('project_a', 'dim_customers') }}
# 5. Environment-based configs
{% if target.name == 'prod' %}
{{ config(materialized='table', full_refresh=false) }}
{% else %}
{{ config(materialized='view') }}
{% endif %}
- dbt Mesh — partager des modèles dbt entre projets via des cross-project refs. Idéal pour les grandes organisations avec plusieurs équipes dbt indépendantes
- Exposure tracking — documenter quels dashboards consomment quels modèles dbt. Permet l'impact analysis : "si je change ce modèle, quels rapports sont affectés ?"
- Artifacts dbt — manifest.json, catalog.json, run_results.json générés après chaque run. Exploitables pour du lineage (DataHub), des métriques CI/CD, la documentation
- Slim CI —
dbt build --select state:modified+ ne teste que les modèles modifiés et leurs descendants. Réduit de 80% le temps de CI sur les PRs
- Performance —
--threads 8 parallélise les modèles indépendants. Sur Snowflake/BigQuery, 8-16 threads est le sweet spot
- dbt Mesh - partager des modeles dbt entre projets via des cross-project refs. Ideal pour les grandes organisations avec plusieurs equipes dbt independantes
- Exposure tracking - documenter quels dashboards consomment quels modeles dbt. Impact analysis : si je change ce modele, quels rapports sont affectes ?
- Slim CI - dbt build --select state:modified+ ne teste que les modeles modifies et leurs descendants. Reduit de 80% le temps de CI sur les PRs
- Performance - --threads 8 parallelise les modeles independants. Sur Snowflake/BigQuery, 8-16 threads est le sweet spot
- Artifacts dbt - manifest.json, catalog.json, run_results.json generes apres chaque run. Exploitables pour du lineage (DataHub), metriques CI/CD, documentation
7Grille par niveau
| Niveau | Maitrise | Signal GO | NO-GO |
|---|
| Junior | Modeles SQL, tests generiques, ref() et source() | Structure un projet dbt, utilise not_null/unique, sait ce que fait ref() | Ne sait pas ce que fait ref() vs une jointure directe |
| Confirme | Macros, dbt-utils, snapshots, CI/CD slim | A ecrit une macro avec run_query, utilise dbt_utils.surrogate_key, a configure un snapshot | N a jamais utilise dbt_utils, ne sait pas ce qu est un snapshot |
| Senior | Packages custom, hooks, exposures, column-level lineage, performance | A cree un package dbt interne, configure les hooks GRANT, documenté les exposures | Ne sait pas ce que sont les exposures, n a jamais optimise les performances dbt |
dbt has become the standard for analytical transformation. But knowing SELECT is not enough for a Senior position — we evaluate mastery of macros, packages and the complete data lifecycle.
1Advanced Jinja Macros
Discriminating question
What is the difference between a macro and a model? When do you write a macro rather than a model?
-- macros/generate_date_spine.sql
{% macro generate_date_spine(start, end) %}
{{ dbt_utils.date_spine(
datepart='day',
start_date="cast('" ~ start ~ "' as date)",
end_date="cast('" ~ end ~ "' as date)"
) }}
{% endmacro %}
-- macros/get_column_values.sql
{% macro get_top_n(column, table, n=10) %}
{% set query %}
SELECT DISTINCT {{ column }}
FROM {{ ref(table) }}
ORDER BY 1
LIMIT {{ n }}
{% endset %}
{% set results = run_query(query) %}
{{ return(results.columns[0].values()) }}
{% endmacro %}
- Macro — reusable Jinja function that generates SQL. No persisted data. Ideal for repetitive logic
- Model — table or view persisted in the data warehouse. Contains business logic
- run_query() — execute a SQL query inside a macro to dynamically generate SQL
2dbt-utils and dbt-expectations Packages
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: ['>=1.0.0']
- package: calogica/dbt_expectations
version: ['>=0.10.0']
# Usage in YAML tests
models:
- name: fct_orders
columns:
- name: amount
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100000
- dbt_expectations.expect_column_proportion_of_unique_values_to_be_between:
min_value: 0.95
- dbt_utils — utility functions: surrogate_key(), date_spine(), pivot(), star(), get_column_values()
- dbt_expectations — tests inspired by Great Expectations. Distribution comparisons, values within bounds, proportions
- dbt_audit_helper — compare two versions of a model to validate a refactoring
3Snapshots: implementing SCD2
Discriminating question
What is a dbt snapshot? How do you implement a Type 2 SCD with dbt?
-- snapshots/snap_customers.sql
{% snapshot snap_customers %}
{{ config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
) }}
SELECT * FROM {{ source('crm', 'customers') }}
{% endsnapshot %}
-- Columns automatically added by dbt:
-- dbt_scd_id : unique version identifier
-- dbt_updated_at : update date
-- dbt_valid_from : validity start date
-- dbt_valid_to : end date (NULL = current version)
- SCD2 — preserve the history of changes. Each modification creates a new row with validity dates
- strategy=timestamp — detects changes via an updated_at column
- strategy=check — compares values from a list of columns to detect changes
4Analyses, seeds and exposures
- Analyses — SQL files for ad-hoc exploration, not materialized as tables. Ideal for investigation queries
- Seeds — small CSV files loaded into the warehouse via dbt seed. Useful for reference tables (countries, categories, mappings)
- Exposures — document the consumers of dbt models (Tableau dashboards, Power BI, apps). Enables traceability in dbt docs
5Hooks and operations
# dbt_project.yml
models:
my_project:
marts:
+post-hook:
- 'GRANT SELECT ON {{ this }} TO ROLE analyst_role'
- '{{ log("Model " ~ this ~ " built", info=True) }}'
# on-run-start and on-run-end
on-run-start:
- 'CREATE SCHEMA IF NOT EXISTS {{ target.schema }}'
on-run-end:
- '{{ elementary.on_run_end() }}' # monitoring package
6Documentation and automatic lineage
- dbt docs generate — generates navigable documentation with the complete lineage from all sources to final models
- YAML descriptions — document each model and each column in schema.yml files
- dbt docs serve — local web interface to browse the documentation and the dependency graph
- Column-level lineage — with dbt 1.6+, lineage is tracked at the column level (which source column feeds which target column)
# Advanced dbt techniques
# 1. Custom generic test (reusable)
# tests/generic/test_not_negative.sql
{% test not_negative(model, column_name) %}
SELECT {{ column_name }} FROM {{ model }}
WHERE {{ column_name }} < 0
{% endtest %}
# 2. Hooks for custom operations
# dbt_project.yml
models:
my_project:
marts:
+post-hook: "GRANT SELECT ON {{ this }} TO ROLE analyst_role"
+pre-hook: "DROP TABLE IF EXISTS {{ this }}_old"
# 3. Selector yaml for fine-grained orchestration
# selectors.yml
selectors:
- name: daily_pipeline
definition:
union:
- tag: daily
- method: path
value: models/marts/
- name: modified_and_downstream
definition:
method: state
value: modified+
# 4. dbt Mesh: cross-project references
# Reference a model from another dbt project
SELECT * FROM {{ ref('project_a', 'dim_customers') }}
# 5. Environment-based configs
{% if target.name == 'prod' %}
{{ config(materialized='table', full_refresh=false) }}
{% else %}
{{ config(materialized='view') }}
{% endif %}
- dbt Mesh — share dbt models between projects via cross-project refs. Ideal for large organizations with multiple independent dbt teams
- Exposure tracking — document which dashboards consume which dbt models. Enables impact analysis: "if I change this model, which reports are affected?"
- dbt Artifacts — manifest.json, catalog.json, run_results.json generated after each run. Usable for lineage (DataHub), CI/CD metrics, documentation
- Slim CI —
dbt build --select state:modified+ only tests modified models and their descendants. Reduces CI time on PRs by 80%
- Performance —
--threads 8 parallelizes independent models. On Snowflake/BigQuery, 8-16 threads is the sweet spot
- dbt Mesh - share dbt models between projects via cross-project refs. Ideal for large organizations with multiple independent dbt teams
- Exposure tracking - document which dashboards consume which dbt models. Impact analysis: if I change this model, which reports are affected?
- Slim CI - dbt build --select state:modified+ only tests modified models and their descendants. Reduces CI time on PRs by 80%
- Performance - --threads 8 parallelizes independent models. On Snowflake/BigQuery, 8-16 threads is the sweet spot
- dbt Artifacts - manifest.json, catalog.json, run_results.json generated after each run. Usable for lineage (DataHub), CI/CD metrics, documentation
7Level grid
| Level | Mastery | GO signal | NO-GO |
|---|
| Junior | SQL models, generic tests, ref() and source() | Structures a dbt project, uses not_null/unique, knows what ref() does | Does not know what ref() does vs a direct join |
| Mid-level | Macros, dbt-utils, snapshots, slim CI/CD | Has written a macro with run_query, uses dbt_utils.surrogate_key, has configured a snapshot | Has never used dbt_utils, does not know what a snapshot is |
| Senior | Custom packages, hooks, exposures, column-level lineage, performance | Has created an internal dbt package, configured GRANT hooks, documented exposures | Does not know what exposures are, has never optimized dbt performance |