Docker is a prerequisite for any modern Data Engineer. In interviews, the bar goes beyond a simple docker run: candidates are assessed on their ability to architect local stacks and build optimized images.
1. Optimized Dockerfile for a data project
Discriminating question
What are the best practices for a Python data Dockerfile?
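
    # Multi-stage build: lightweight final image
    FROM python:3.11-slim AS builder
    WORKDIR /app
    COPY requirements.txt .
    # --no-cache-dir keeps pip's download cache out of the layer
    RUN pip install --no-cache-dir --user -r requirements.txt

    # Final image: only the installed packages are carried over
    FROM python:3.11-slim
    # No root in production: create the user first so files can be owned by it
    RUN useradd -m appuser
    WORKDIR /app
    # Copy the installed packages into the non-root user's home
    # (appuser cannot read /root, so /root/.local would be unusable)
    COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
    COPY --chown=appuser:appuser . .
    USER appuser
    # Environment variables
    ENV PATH=/home/appuser/.local/bin:$PATH
    ENV PYTHONUNBUFFERED=1
    # Exec form is parsed as JSON, so it needs double quotes
    CMD ["python", "main.py"]
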
- slim: a reduced base image. Avoid :latest and always pin the version
- Layer caching: copy requirements.txt BEFORE the source code so the dependency layer is reused when only the code changes
- Non-root user: a security best practice
- PYTHONUNBUFFERED=1: Python writes stdout and stderr unbuffered, so logs appear in real time
2. Docker Compose: complete local data stack
Discriminating question
How would you set up a local data stack with Airflow, PostgreSQL and dbt?

    version: '3.8'

    services:
      postgres:
        image: postgres:15
        environment:
          POSTGRES_DB: datadb
          POSTGRES_USER: data
          POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
        volumes:
          - postgres_data:/var/lib/postgresql/data
          - ./init.sql:/docker-entrypoint-initdb.d/init.sql
        ports:
          - '5432:5432'
        healthcheck:
          test: ['CMD', 'pg_isready', '-U', 'data', '-d', 'datadb']
          interval: 10s

      airflow:
        image: apache/airflow:2.9.0
        depends_on:
          postgres:
            condition: service_healthy
        environment:
          # Assumes init.sql creates the "airflow" database; POSTGRES_DB only creates "datadb"
          AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://data:${POSTGRES_PASSWORD}@postgres/airflow
        volumes:
          - ./dags:/opt/airflow/dags
          - ./logs:/opt/airflow/logs
        ports:
          - '8080:8080'
        # Dev only: initializes the metadata DB and runs all components in one container
        command: ['airflow', 'standalone']

      dbt:
        build: ./dbt
        depends_on:
          postgres:
            condition: service_healthy   # wait for readiness, not just container start
        volumes:
          - ./dbt:/dbt
        command: ['dbt', 'run']

    volumes:
      postgres_data:

3. Networking: how services communicate
Discriminating question
How does an Airflow service call a PostgreSQL service in Docker Compose?
- Internal DNS: in Docker Compose, each service is reachable by its service name. Example: postgres:5432 from the airflow container
- Exposed ports: ports: publishes a container port on the host machine. It is not needed for communication between services
- healthcheck + depends_on: with condition: service_healthy, dependent services start only once the service they depend on is actually ready
- Networks: isolate groups of services. By default, all services in a Compose file share one network
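
The network-isolation point can be sketched as follows. This is a minimal illustration, not the guide's own stack: the network names (backend, monitoring), the grafana service, and its image tag are assumptions chosen for the example.

```yaml
# Sketch only: network names and the grafana service are illustrative assumptions.
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    networks:
      - backend            # reachable as postgres:5432 from services on "backend"
    # no "ports:" entry, so the database is not published on the host

  airflow:
    image: apache/airflow:2.9.0
    networks:
      - backend            # can resolve the name "postgres"
      - monitoring         # also visible to services on "monitoring"
    ports:
      - "8080:8080"        # published only so a browser on the host can reach the UI

  grafana:
    image: grafana/grafana:11.1.0
    networks:
      - monitoring         # shares no network with postgres, so it cannot reach it

networks:
  backend:
  monitoring:
```

Declaring a top-level networks: section replaces the single default network for every service that lists its own networks, which is what keeps grafana away from the database here.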
4. Multi-stage builds: optimizing size
Discriminating question
Why do we use multi-stage builds? What size reduction is typically achieved?
- Problem: build tooling (compilers such as gcc, headers, pip build artifacts) is not needed at runtime but inflates the image
- Solution: build in a full image, then copy only the final artifacts into a slim image
- Typical gains: for a Python image with numpy/pandas, on the order of 800MB → 200MB with multi-stage; the exact figure depends on the dependencies
- Layer caching: layers are cached, so copy requirements.txt before the code. A code-only change then reuses the dependency layer and the rebuild is fast
5. Registry and image CI/CD
Discriminating question
How do you manage Docker images in a team data project?
- Registry: Google Artifact Registry, AWS ECR, or Docker Hub (public). Never push an image tagged only latest; always push an explicit version tag
- Tagging: tag with the commit SHA (image:sha-abc123) for traceability
- Build in CI: GitHub Actions or Cloud Build builds and pushes the image automatically
- .dockerignore: exclude .git, __pycache__, .env, and data files. This shrinks the build context and keeps secrets out of the image
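
As a rough sketch of the CI build and SHA tagging described above, here is a GitHub Actions workflow. Several details are assumptions for illustration: the workflow file name, the main branch trigger, and the use of GitHub Container Registry (ghcr.io) in place of the registries listed above, chosen only because it needs no extra credentials.

```yaml
# .github/workflows/docker-build.yml (hypothetical file name)
name: build-and-push

on:
  push:
    branches: [main]         # assumed default branch

jobs:
  docker:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write        # lets GITHUB_TOKEN push to ghcr.io
    steps:
      - uses: actions/checkout@v4

      - uses: docker/setup-buildx-action@v3

      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Tag derived from the commit SHA; no bare "latest"
          tags: ghcr.io/${{ github.repository }}:sha-${{ github.sha }}
```

Image names on ghcr.io must be lowercase, so a repository with uppercase letters in its name would need the tag value normalized first. For Artifact Registry or ECR, the login step and the registry host would change while the tagging idea stays the same.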
6. From Docker Compose to Kubernetes
- docker-compose.yml → Kubernetes manifests: one Compose service maps roughly to one Deployment plus one K8s Service
- kompose convert: a Kubernetes project tool that converts a docker-compose.yml into K8s manifests, which then serve as a starting point to review
- Docker Compose for local dev, K8s for production: avoid running K8s locally for day-to-day development
- Tilt / Skaffold: development workflows that sync local code with a K8s cluster, for the cases where developing against a cluster is required
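
To make the "one Compose service = one Deployment + one Service" mapping concrete, here is a hand-written sketch. The api service, its image name, and port 8000 are hypothetical, and the manifests are not the output of kompose convert.

```yaml
# Hypothetical Compose service being translated:
#   api:
#     image: registry.example.com/data-api:sha-abc123
#     ports: ["8000:8000"]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api           # must match the selector above and the Service below
    spec:
      containers:
        - name: api
          image: registry.example.com/data-api:sha-abc123
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: api                # pods in the same namespace reach it as api:8000
spec:
  selector:
    app: api
  ports:
    - port: 8000
      targetPort: 8000
```

The Service name plays the role of the Compose service name for DNS. Stateful services such as Postgres usually need more than this pattern (persistent volumes, often a StatefulSet), so the mapping is only a first approximation.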
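
    # docker-compose.yml: complete local data stack
    version: '3.8'

    services:
      postgres:
        image: postgres:16
        environment:
          POSTGRES_DB: databuilder
          POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
        volumes:
          - postgres_data:/var/lib/postgresql/data
          - ./init.sql:/docker-entrypoint-initdb.d/init.sql
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres"]
          interval: 10s

      # Assumed ZooKeeper service: kafka depends on it, but the original file does not define one
      zookeeper:
        image: confluentinc/cp-zookeeper:7.5.0
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181

      kafka:
        image: confluentinc/cp-kafka:7.5.0
        depends_on: [zookeeper]
        environment:
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
          # Assumed single-broker settings, not present in the original file
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

      airflow:
        image: apache/airflow:2.8.0
        depends_on:
          postgres: {condition: service_healthy}
        environment:
          # The [database] section replaces the deprecated AIRFLOW__CORE__SQL_ALCHEMY_CONN.
          # Assumes init.sql creates the "airflow" database.
          AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:${POSTGRES_PASSWORD}@postgres/airflow
          AIRFLOW__CORE__EXECUTOR: LocalExecutor
        volumes:
          - ./dags:/opt/airflow/dags
        command: ["airflow", "standalone"]

      dbt:
        build: ./dbt
        volumes:
          - ./dbt:/dbt
        depends_on:
          postgres: {condition: service_healthy}   # wait for readiness, not just container start
        command: ["dbt", "run", "--profiles-dir", "/dbt"]

    volumes:
      postgres_data:
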
- Healthchecks: always define healthchecks on stateful services (Postgres, Kafka, Redis). depends_on without a condition only waits for the container to start, not for the service inside it to be ready
- Named volumes vs bind mounts: named volumes for persistent data (Postgres data). Bind mounts for source code (./dags:/opt/airflow/dags) to develop without rebuilding
- .env files: store secrets in .env, never in docker-compose.yml. Add .env to .gitignore and provide a .env.example with dummy values
- Docker Compose profiles: assign optional services such as Prometheus and Grafana to a profile and start them with --profile monitoring only when needed in dev. This avoids overloading the base environment
- Multi-stage builds: for Python images with heavy dependencies (PyTorch, Spark), the build stage installs the dependencies and the runtime stage copies only what is needed. This can shrink an image from about 3GB to about 500MB, depending on the dependencies
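
The profiles and .env points can be sketched together. The service set and image tags below are illustrative assumptions, not part of the stack shown earlier.

```yaml
# Sketch: services with a "profiles" entry start only when that profile is enabled.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}   # substituted from .env, not hardcoded
    # no "profiles" key: always started

  prometheus:
    image: prom/prometheus:v2.53.0
    profiles: ["monitoring"]

  grafana:
    image: grafana/grafana:11.1.0
    profiles: ["monitoring"]
    ports:
      - "3000:3000"
```

With this file, docker compose up -d starts only postgres, while docker compose --profile monitoring up -d also starts prometheus and grafana. A matching .env.example would contain a single line such as POSTGRES_PASSWORD=changeme.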
7. Level grid
| Level | Mastery | GO signal | NO-GO |
|---|---|---|---|
| Junior | Basic Dockerfile, docker run, docker-compose up | Can write a Python Dockerfile and launch a stack with Compose | Does not know what a volume is |
| Mid-level | Multi-stage, healthchecks, Compose networking, .dockerignore | Uses multi-stage, configures healthchecks, knows why postgres:5432 works between services | Uses :latest everywhere, does not know multi-stage builds |
| Senior | Registry CI/CD, security (non-root, secrets), K8s migration | Has configured automatic push in CI, uses non-root users, knows kompose | Pushes images with hardcoded credentials in the Dockerfile |