Data hiring guide
AWS technical test for data roles: what we assess in interviews
AWS is the dominant cloud in data engineering. Between creating an S3 bucket and designing a data lakehouse architecture, the gap is considerable.
Data Builder · June 2025 · 8 min read · Data Engineer
"Using AWS" can mean anything from having created an S3 bucket to having designed a complete data lakehouse architecture. The questions below help tell the two apart.
1. S3 and storage
Key question
Describe the main storage options for a data pipeline on AWS.
- S3: the central object store of any AWS data lake
- S3 storage classes: Standard, Infrequent Access, Glacier, chosen to optimize cost according to access frequency
- S3 partitioning: organizing prefixes so that Athena and Glue can prune the data they read
- Formats on S3: Parquet (file format), Delta Lake and Iceberg (table formats)
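To make the partitioning point concrete, here is a minimal sketch of Hive-style `key=value` prefixes, a common convention that Athena and Glue can recognize as partitions. The table name, file name, and helper functions are hypothetical and only illustrate the layout; a real pipeline would pick partition columns that match its query patterns.

```python
from datetime import date


def partition_key(table: str, day: date, filename: str) -> str:
    """Build a Hive-style S3 object key: table/year=YYYY/month=MM/day=DD/file."""
    return (
        f"{table}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def prefix_for_month(table: str, year: int, month: int) -> str:
    """Prefix that contains exactly one month of data for the given table."""
    return f"{table}/year={year:04d}/month={month:02d}/"


# Hypothetical table and file names, for illustration only.
key = partition_key("events", date(2025, 6, 3), "part-0000.parquet")
print(key)
```

A candidate who understands this layout can usually explain why a query filtered on `year` and `month` scans less data than a query over the whole table.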
2. IAM and security
Key question
How do you manage access to sensitive data in an AWS pipeline?
- IAM roles vs. users: prefer roles, which issue temporary credentials, over users with long-lived access keys
- Principle of least privilege
- AWS Secrets Manager: secret storage with automatic rotation
- KMS: key management for encrypting data at rest
Warning sign: a candidate who stores AWS credentials in code or in version-controlled files is disqualified.
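As an illustration of least privilege, the sketch below builds an IAM policy document that grants read-only access to a single prefix of a single bucket. The bucket name, prefix, and `read_only_policy` helper are assumptions made for the example; the policy would be attached to a role, not to a user.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document limited to reading one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, restricted here by prefix.
                "Sid": "ListOnlyThePrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, scoped to the prefix.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy = read_only_policy("example-data-lake", "curated/sales")
print(json.dumps(policy, indent=2))
```

The point to listen for in an interview is the same as in the sketch: no wildcard actions, and resources narrowed to what the pipeline actually reads.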
3. Analytics services
Key question
For an ELT pipeline (ingest -> transform -> expose), which AWS services do you choose?
- Glue: serverless ETL plus the Glue Data Catalog
- Athena: interactive SQL on S3, pay-per-query (billed by data scanned)
- Redshift: data warehouse for large analytical volumes
- EMR: managed Spark clusters
- Kinesis: streaming data ingestion
- Step Functions: serverless workflow orchestration
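One possible way to combine these services, sketched under assumptions: a Step Functions state machine, written in Amazon States Language and built here as a Python dictionary, that runs a Glue job and then an Athena query. The job name, query, and workgroup are hypothetical, and this is only one of several reasonable designs.

```python
import json


def elt_state_machine(glue_job: str, athena_query: str, workgroup: str) -> dict:
    """Return a two-step state machine definition: Glue transform, then Athena."""
    return {
        "Comment": "ELT sketch: Glue job, then an Athena query",
        "StartAt": "TransformWithGlue",
        "States": {
            "TransformWithGlue": {
                "Type": "Task",
                # The .sync suffix makes the step wait for the job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "ExposeWithAthena",
            },
            "ExposeWithAthena": {
                "Type": "Task",
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {
                    "QueryString": athena_query,
                    "WorkGroup": workgroup,
                },
                "End": True,
            },
        },
    }


# Hypothetical job, query, and workgroup names, for illustration only.
definition = elt_state_machine(
    glue_job="transform-sales",
    athena_query="MSCK REPAIR TABLE sales",
    workgroup="primary",
)
print(json.dumps(definition, indent=2))
```

A strong answer explains the choice of each service rather than reciting the list, for example why Glue rather than EMR for the transform step.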
4. Data infrastructure
Key question
How would you deploy an Airflow pipeline on AWS?
- MWAA (Amazon Managed Workflows for Apache Airflow): Airflow managed by AWS
- ECS / EKS: managed container orchestration
- Infrastructure as Code: Terraform or CDK
- VPC and subnets: network isolation of data components
5. Costs and optimization
Key question
Your AWS bill doubled this month. Where do you start?
- AWS Cost Explorer: analyze costs by service and by tag
- Spot Instances: cut EMR compute costs by up to 90% compared with On-Demand pricing
- S3 Intelligent-Tiering: automatic movement of objects between storage classes
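The first diagnostic step, comparing costs by service between two months, can be sketched as follows. The figures are made up for illustration and do not come from a real bill; in practice the per-service totals would come from Cost Explorer, grouped by service or by tag.

```python
def biggest_increases(
    previous: dict[str, float], current: dict[str, float], top: int = 3
) -> list[tuple[str, float]]:
    """Rank services by month-over-month cost increase, largest first."""
    services = set(previous) | set(current)
    # A service missing from one month is treated as costing 0 that month.
    deltas = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in services}
    # Sort by descending increase, then by name so ties are deterministic.
    ranked = sorted(deltas.items(), key=lambda item: (-item[1], item[0]))
    return [(service, delta) for service, delta in ranked[:top] if delta > 0]


# Illustrative numbers only: the total doubles from 2000 to 4000.
may = {"Amazon S3": 400.0, "AWS Glue": 300.0, "Amazon Redshift": 900.0, "Amazon EMR": 400.0}
june = {
    "Amazon S3": 450.0,
    "AWS Glue": 1900.0,
    "Amazon Redshift": 960.0,
    "Amazon EMR": 340.0,
    "Amazon Athena": 350.0,
}

for service, delta in biggest_increases(may, june):
    print(f"{service}: +{delta:.2f}")
```

In this made-up case most of the increase comes from a single service, which is the typical pattern a good candidate looks for before touching any configuration.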
6. Assessment grid by level
| Level | Expected proficiency | GO signal | NO-GO signal |
|---|---|---|---|
| Junior | Basic S3, IAM concepts, Athena, Lambda | Understands storage classes, uses IAM roles | Stores credentials in code |
| Mid-level | Glue, Redshift, Kinesis, basic VPC | Has designed an ELT pipeline on AWS, uses Secrets Manager | Does not know the difference between EMR and Glue |
| Senior | Data lakehouse architecture, IaC, MWAA, cost management | Has deployed a complete architecture with Terraform | Cannot diagnose an abnormal bill |
| Lead | Multi-account architecture, governance, Landing Zone | Has set up an AWS Landing Zone for data | Cannot explain MWAA vs. ECS |