Google Cloud Platform offers the most coherent cloud data stack. In interviews, we assess the candidate's ability to choose and integrate the right GCP services for the use case.
1. Reference data architecture on GCP
Discriminating question
Describe an end-to-end data architecture on GCP.
# Typical GCP architecture
SOURCES
├── Cloud SQL / AlloyDB (transactional databases)
├── Third-party APIs
├── Files (GCS)
└── Streaming (IoT, events)
INGESTION
├── Pub/Sub (streaming)
├── Storage Transfer Service (batch)
├── Datastream (CDC from Cloud SQL)
└── Fivetran / Airbyte (SaaS connectors)
ORCHESTRATION
└── Cloud Composer (managed Airflow)
TRANSFORMATION
├── Dataflow (streaming + batch)
├── BigQuery SQL + dbt
└── Spark on Dataproc
ANALYTICAL STORAGE
└── BigQuery
CONSUMPTION
├── Looker / Looker Studio
├── Vertex AI (ML)
└── APIs via Cloud Run
2. BigQuery: the GCP center of gravity
Discriminating question
What BigQuery features do you use beyond basic SQL?
- Partitioning + clustering: reduce the bytes scanned, and therefore the cost, and improve performance on large tables
- Materialized views: pre-compute frequent aggregations with automatic incremental refresh
- BigQuery ML: train ML models directly in SQL (regression, clustering, time series)
- BigQuery Omni: query data in Amazon S3 or Azure Blob Storage without moving it
- BigLake: tables on GCS with centralized security, compatible with Apache Iceberg
- Connected Sheets: analyze billions of BigQuery rows directly in Google Sheets
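To make the partitioning and clustering point concrete, here is a minimal sketch that builds the DDL for a day-partitioned, clustered table. The helper name `build_partitioned_table_ddl`, the table name and the column names are hypothetical and chosen for illustration only; the sketch only produces the SQL text and does not call BigQuery.

```python
from typing import Sequence


def build_partitioned_table_ddl(
    table: str,
    columns: Sequence[tuple[str, str]],
    partition_column: str,
    cluster_columns: Sequence[str],
) -> str:
    """Return BigQuery DDL for a day-partitioned, clustered table."""
    if not 1 <= len(cluster_columns) <= 4:
        # BigQuery accepts at most four clustering columns.
        raise ValueError("expected between 1 and 4 clustering columns")
    column_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical table and columns, for illustration only.
ddl = build_partitioned_table_ddl(
    table="my-project.analytics.orders",
    columns=[
        ("order_id", "STRING"),
        ("customer_id", "STRING"),
        ("order_ts", "TIMESTAMP"),
        ("revenue", "NUMERIC"),
    ],
    partition_column="order_ts",
    cluster_columns=["customer_id"],
)
print(ddl)
```

With a table defined this way, a query that filters on `order_ts` only scans the matching daily partitions, and a filter on `customer_id` benefits from clustering within each partition.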
3. Dataflow: unified batch and streaming processing
Discriminating question
What is the difference between Dataflow and Spark? When do you choose Dataflow?
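    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Run on the managed Dataflow service; project and bucket names are placeholders
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-gcp-project',
        region='europe-west1',
        temp_location='gs://my-bucket/temp',
        staging_location='gs://my-bucket/staging',
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            # Request standard SQL explicitly (the connector defaults to legacy SQL)
            # and read only the columns the transform needs
            | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
                query='SELECT order_id, quantity, unit_price '
                      'FROM dataset.orders WHERE date >= "2024-01-01"',
                use_standard_sql=True,
            )
            # Compute the revenue of each order line
            | 'Transform' >> beam.Map(lambda row: {
                'order_id': row['order_id'],
                'revenue': row['quantity'] * row['unit_price'],
            })
            # Assumes the target table already exists; pass schema=... to create it
            | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                'my-gcp-project:dataset.fct_revenue',
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )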
- Dataflow: managed service that runs Apache Beam pipelines. Serverless, autoscaling. Ideal for streaming pipelines to BigQuery
- When to choose Dataflow: streaming (Pub/Sub → BigQuery), batch pipelines on GCS, native GCP integration
- When to choose Spark on Dataproc: existing Spark code to migrate, or needs tied to the Spark ecosystem (Spark MLlib, Delta Lake)
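One practical benefit of the Beam model is that per-element logic is plain Python and can be tested without a runner. The sketch below extracts the `Transform` step of the pipeline above into a named function; the function name `to_revenue_row` and the choice to drop incomplete rows are assumptions made for illustration, not something the pipeline above prescribes.

```python
from typing import Any, Optional


def to_revenue_row(row: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Compute the revenue of one order row; return None if the row is incomplete."""
    order_id = row.get("order_id")
    quantity = row.get("quantity")
    unit_price = row.get("unit_price")
    if order_id is None or quantity is None or unit_price is None:
        # Assumption: incomplete rows are dropped rather than failing the pipeline.
        return None
    return {"order_id": order_id, "revenue": quantity * unit_price}


# Local check, no Dataflow job needed.
valid = to_revenue_row({"order_id": "A1", "quantity": 3, "unit_price": 2.5})
incomplete = to_revenue_row({"order_id": "A2", "quantity": None, "unit_price": 2.5})
print(valid, incomplete)
```

In the pipeline, this function could replace the lambda via `beam.Map(to_revenue_row)`, followed by a `beam.Filter` step that discards the `None` results.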
4. Pub/Sub: real-time messaging
Discriminating question
How do you integrate Pub/Sub into a streaming data pipeline?
    import json

    from google.cloud import pubsub_v1

    # Publisher: emit events
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path('my-project', 'orders-events')


    def publish_order_event(order: dict) -> str:
        """Publish one order event and return the Pub/Sub message ID."""
        data = json.dumps(order).encode('utf-8')
        future = publisher.publish(topic_path, data)
        # Block until Pub/Sub acknowledges the publish
        return future.result()

    # Typical pattern: API → Pub/Sub → Dataflow → BigQuery
    # Pub/Sub guarantees at-least-once delivery
    # Dataflow deduplicates redelivered messages by message ID;
    # windows and watermarks handle late and out-of-order events
- At-least-once: Pub/Sub guarantees delivery but may deliver a message more than once. Handle deduplication downstream
- Pull vs push: with pull, the subscriber fetches messages; with push, Pub/Sub sends them to an HTTP endpoint
- Dead-letter topic: messages that still fail after a configured number of delivery attempts are forwarded to a dead-letter topic for investigation
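Downstream deduplication under at-least-once delivery can be sketched as an idempotent handler keyed on the message ID. The class name `IdempotentHandler` and the in-memory set are assumptions for illustration; a real consumer would more likely rely on a persistent store with expiry, or on an idempotent write keyed on a business identifier.

```python
import json
from typing import Callable


class IdempotentHandler:
    """Process each message ID at most once (in-memory sketch)."""

    def __init__(self, process: Callable[[dict], None]) -> None:
        self._process = process
        self._seen_ids: set[str] = set()

    def handle(self, message_id: str, data: bytes) -> bool:
        """Process a message; return False if it was a duplicate delivery."""
        if message_id in self._seen_ids:
            return False
        self._process(json.loads(data.decode("utf-8")))
        # Record the ID only after success so a failed message can be retried.
        self._seen_ids.add(message_id)
        return True


processed: list[dict] = []
handler = IdempotentHandler(processed.append)
payload = json.dumps({"order_id": "A1"}).encode("utf-8")

first = handler.handle("msg-1", payload)
second = handler.handle("msg-1", payload)  # simulated redelivery of the same message
print(first, second, processed)
```

Recording the ID only after processing succeeds keeps the retry path open: a message that raises is not marked as seen, so a redelivery gets another chance before it ends up in the dead-letter topic.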
5. Cloud Composer: managed Airflow on GCP
Discriminating question
Why Cloud Composer rather than self-hosted Airflow on K8s?
- Cloud Composer: fully managed Airflow running on GKE. No cluster maintenance, managed upgrades
- Native GCP integrations: the Google provider package ships operators for BigQuery, Dataflow, GCS and Vertex AI (for example BigQueryInsertJobOperator and GCSToBigQueryOperator)
- Workload Identity: DAG tasks authenticate to GCP as the environment's service account, with no key files to manage
- Cost: more expensive than self-hosted, but with close to zero maintenance. Composer 2 (GKE Autopilot with autoscaling workers) is more cost-effective than Composer 1
6. Vertex AI: GCP ML platform
Discriminating question
Which Vertex AI components do you use in an end-to-end ML project?
- Vertex AI Training: train models on managed GPUs, with integrated monitoring and hyperparameter tuning
- Vertex AI Model Registry: version models and promote them to production
- Vertex AI Endpoints: deploy a model as a REST API with autoscaling
- Vertex AI Feature Store: store and share ML features across projects
- Vertex AI Pipelines: orchestrate ML pipelines with Kubeflow Pipelines or TFX
- BigQuery ML: train directly in BigQuery for simple models
    # GCP Data Stack with Terraform
    variable "project" {
      type = string
    }

    resource "google_bigquery_dataset" "analytics" {
      dataset_id = "analytics_prod"
      location   = "EU"
      labels     = { env = "production", team = "data" }
    }

    resource "google_dataflow_job" "streaming" {
      name              = "orders-streaming"
      template_gcs_path = "gs://dataflow-templates/latest/PubSub_to_BigQuery"
      # Required by the provider; the bucket name is a placeholder
      temp_gcs_location = "gs://my-bucket/temp"
      parameters = {
        inputTopic      = "projects/${var.project}/topics/orders"
        # Write into the dataset declared above
        outputTableSpec = "${var.project}:${google_bigquery_dataset.analytics.dataset_id}.orders_stream"
      }
    }

    resource "google_composer_environment" "orchestration" {
      name   = "data-platform"
      region = "europe-west1"
      config {
        software_config {
          airflow_config_overrides = {
            "core-dags_are_paused_at_creation" = "True"
          }
        }
        # workloads_config applies to Composer 2 environments
        workloads_config {
          worker {
            min_count = 1
            max_count = 6
          }
        }
      }
    }
- Pub/Sub to BigQuery: native GCP streaming pattern. Pub/Sub buffers, Dataflow transforms, BigQuery stores. End-to-end latency of a few seconds
- Cloud Composer: managed Airflow on GCP. Zero-ops scheduler, managed upgrades. Costly, but can save roughly one FTE of infrastructure maintenance
- Dataproc vs Dataflow: Dataproc is a managed Spark/Hadoop cluster; Dataflow is managed Apache Beam with serverless autoscaling. Prefer Dataflow for new projects
- Vertex AI: GCP ML platform covering Feature Store, training pipelines, model serving and monitoring. Managed alternative to MLflow + Kubernetes
- Data Catalog: automatic inventory of BigQuery, GCS and Pub/Sub assets. Policy tags for governance and GDPR compliance
7. Level grid
| Level | Mastery | GO signal | NO-GO |
|---|---|---|---|
| Mid-level | BigQuery, GCS, Cloud Composer, basic IAM | Has deployed pipelines on Cloud Composer, optimized BigQuery with partitioning | Does not know what Workload Identity is on GCP |
| Senior | Dataflow, Pub/Sub, Vertex AI, end-to-end architecture | Has built a Pub/Sub → Dataflow → BigQuery streaming pipeline, knows Vertex AI | Does not know the difference between Dataflow and Spark/Dataproc |