Data recruiting guide

GCP data stack technical test: BigQuery, Dataflow, Pub/Sub, Vertex AI

Google Cloud Platform is the most coherent cloud data stack. In interviews, the goal is to assess a candidate's ability to choose and integrate the right GCP services for the use case.

Data Builder · June 2025 · 7 min read · Data Engineer

Contents

GCP data architecture
BigQuery: the center of gravity
Dataflow: streaming and batch
Pub/Sub: real-time messaging
Cloud Composer: orchestration
Vertex AI: ML on GCP
Evaluation grid

1. Reference data architecture on GCP

Discriminating question

Describe an end-to-end data architecture on GCP.

# Typical GCP architecture

SOURCES
├── Cloud SQL / AlloyDB (transactional databases)
├── Third-party APIs
├── Files (GCS)
└── Streaming (IoT, events)

INGESTION
├── Pub/Sub (streaming)
├── Cloud Storage Transfer (batch)
├── Datastream (CDC from Cloud SQL)
└── Fivetran / Airbyte (SaaS connectors)

ORCHESTRATION
└── Cloud Composer (managed Airflow)

TRANSFORMATION
├── Dataflow (streaming + batch)
├── BigQuery SQL + dbt
└── Spark on Dataproc

ANALYTICAL STORAGE
└── BigQuery

CONSUMPTION
├── Looker / Looker Studio
├── Vertex AI (ML)
└── APIs via Cloud Run
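
One way to read the diagram above is as a lookup from source type to ingestion service. The sketch below encodes that reading in Python. The pairing of sources with services is an interpretation of the tree, and the `SourceKind` enum and `pick_ingestion` helper are hypothetical names introduced for illustration, not part of any GCP SDK.

```python
from enum import Enum


class SourceKind(Enum):
    """Source categories from the SOURCES layer (names are illustrative)."""
    TRANSACTIONAL_DB = "transactional_db"  # Cloud SQL / AlloyDB
    SAAS_API = "saas_api"                  # third-party APIs
    FILES = "files"                        # files landing in GCS
    EVENT_STREAM = "event_stream"          # IoT, application events


# Assumed pairing between the SOURCES and INGESTION layers of the diagram.
INGESTION_BY_SOURCE = {
    SourceKind.TRANSACTIONAL_DB: "Datastream (CDC)",
    SourceKind.SAAS_API: "Fivetran / Airbyte",
    SourceKind.FILES: "Cloud Storage Transfer",
    SourceKind.EVENT_STREAM: "Pub/Sub",
}


def pick_ingestion(source: SourceKind) -> str:
    """Return the ingestion service paired with a given source category."""
    return INGESTION_BY_SOURCE[source]


print(pick_ingestion(SourceKind.EVENT_STREAM))  # Pub/Sub
```

A strong candidate should be able to justify each pairing, and to say when they would deviate from it.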

2. BigQuery: the GCP center of gravity

Discriminating question

What BigQuery features do you use beyond basic SQL?

Partitioning + clustering: reduce the bytes scanned, and therefore the cost, and speed up queries on large tables
Materialized views: pre-compute frequent aggregations, with automatic incremental refresh
BigQuery ML: train ML models directly in SQL (regression, clustering, time series)
BigQuery Omni: query data in Amazon S3 or Azure Blob Storage without moving it
BigLake: tables over GCS files with centralized access control, compatible with Apache Iceberg
Connected Sheets: analyze billions of BigQuery rows directly from Google Sheets
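
To make the first two items concrete, here is a minimal sketch that generates the corresponding BigQuery DDL from Python. It only builds SQL strings and does not call BigQuery. The table, view, and column names (`analytics_prod.orders`, `order_date`, `customer_id`, `revenue`) are assumptions chosen for illustration.

```python
def partitioned_table_ddl(table: str, columns: dict[str, str],
                          partition_col: str, cluster_cols: list[str]) -> str:
    """Build a CREATE TABLE statement with partitioning and clustering."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"PARTITION BY {partition_col}\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )


def daily_revenue_view_ddl(view: str, source_table: str) -> str:
    """Build a materialized view that pre-aggregates revenue per day."""
    return (
        f"CREATE MATERIALIZED VIEW `{view}` AS\n"
        "SELECT order_date, SUM(revenue) AS revenue\n"
        f"FROM `{source_table}`\n"
        "GROUP BY order_date"
    )


ddl = partitioned_table_ddl(
    "analytics_prod.orders",
    {"order_id": "STRING", "customer_id": "STRING",
     "order_date": "DATE", "revenue": "NUMERIC"},
    partition_col="order_date",      # queries filtering on this column scan fewer partitions
    cluster_cols=["customer_id"],    # rows are co-located by customer inside each partition
)
print(ddl)
print(daily_revenue_view_ddl("analytics_prod.daily_revenue", "analytics_prod.orders"))
```

In an interview, a useful follow-up is to ask which column the candidate would partition on and why.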

3. Dataflow: unified batch and streaming processing

Discriminating question

What is the difference between Dataflow and Spark? When do you choose Dataflow?

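import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Project, bucket, and table names are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='europe-west1',
    temp_location='gs://my-bucket/temp',
    staging_location='gs://my-bucket/staging',
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Batch read; standard SQL has to be enabled explicitly.
        | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
            query='SELECT order_id, quantity, unit_price '
                  'FROM dataset.orders WHERE date >= "2024-01-01"',
            use_standard_sql=True,
        )
        # Derive the revenue of each order.
        | 'Transform' >> beam.Map(lambda row: {
            'order_id': row['order_id'],
            'revenue': row['quantity'] * row['unit_price'],
        })
        # A schema is needed if the destination table has to be created.
        | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-gcp-project:dataset.fct_revenue',
            schema='order_id:STRING,revenue:FLOAT',  # column types are assumed
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )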

Dataflow: managed, serverless runner for Apache Beam pipelines, with autoscaling. Well suited to streaming pipelines into BigQuery
When to choose Dataflow: streaming (Pub/Sub → BigQuery), batch pipelines over GCS, native GCP integration
When to choose Spark on Dataproc: existing Spark code to migrate, or needs tied to the Spark ecosystem (Spark ML, Delta Lake)
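
The streaming case is where the Beam model earns its keep, so it helps to show what a windowed aggregation actually computes. Below is a pure-Python sketch of fixed event-time windows, with no Beam dependency. In a real pipeline this role would be played by Beam's windowing (for example `beam.WindowInto(beam.window.FixedWindows(60))`); the event fields and the 60-second window size are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size


def window_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the fixed window containing event_ts (epoch seconds)."""
    return event_ts - (event_ts % size)


def revenue_per_window(events: list[dict]) -> dict[int, float]:
    """Sum revenue per event-time window, regardless of arrival order."""
    totals: dict[int, float] = defaultdict(float)
    for event in events:
        totals[window_start(event["ts"])] += event["quantity"] * event["unit_price"]
    return dict(totals)


events = [
    {"ts": 5, "quantity": 2, "unit_price": 10.0},
    {"ts": 65, "quantity": 1, "unit_price": 4.0},
    {"ts": 59, "quantity": 1, "unit_price": 5.0},  # arrives out of order, belongs to window 0
]
print(revenue_per_window(events))  # {0: 25.0, 60: 4.0}
```

The third event arrives after an event from the next window but is still counted in the window its timestamp belongs to. Grouping by event time rather than arrival time is the behaviour a candidate should be able to explain.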

4. Pub/Sub: real-time messaging

Discriminating question

How do you integrate Pub/Sub into a streaming data pipeline?

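import json

from google.cloud import pubsub_v1

# Publisher: emit events (project and topic IDs are placeholders)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'orders-events')


def publish_order_event(order: dict) -> str:
    """Publish one order as JSON and return the Pub/Sub message ID."""
    data = json.dumps(order).encode('utf-8')
    future = publisher.publish(topic_path, data)
    return future.result()  # blocks until the publish is acknowledged

# Typical pattern: API → Pub/Sub → Dataflow → BigQuery
# Pub/Sub delivers at least once by default, so duplicates are possible
# Dataflow's Pub/Sub source removes redeliveries by message ID;
# windows and watermarks handle late and out-of-order events, not duplicates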

At-least-once: by default Pub/Sub guarantees delivery but may deliver a message more than once. Handle deduplication downstream
Pull vs push: with pull, the subscriber requests messages; with push, Pub/Sub sends them to an HTTP endpoint
Dead-letter topic: messages still unacknowledged after N delivery attempts are forwarded to a dead-letter topic for investigation
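
The first and last points can be sketched together as a small simulation of an idempotent consumer. This is not the Pub/Sub client API: `Message`, `IdempotentConsumer`, and `MAX_DELIVERY_ATTEMPTS` are hypothetical names. In Pub/Sub itself, forwarding to the dead-letter topic is done by the service once the subscription's maximum delivery attempts is reached; the sketch simulates it on the consumer side only to keep both behaviours visible in one place.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_DELIVERY_ATTEMPTS = 5  # illustrative; set on the subscription in practice


@dataclass
class Message:
    message_id: str
    data: dict
    delivery_attempt: int = 1


@dataclass
class IdempotentConsumer:
    handler: Callable[[dict], None]
    seen_ids: set = field(default_factory=set)
    dead_letter: list = field(default_factory=list)

    def receive(self, msg: Message) -> str:
        """Process one delivery and report what happened to it."""
        if msg.message_id in self.seen_ids:
            return "duplicate"  # already processed: acknowledge and skip
        try:
            self.handler(msg.data)
        except Exception:
            if msg.delivery_attempt >= MAX_DELIVERY_ATTEMPTS:
                self.dead_letter.append(msg)  # park it for investigation
                return "dead-lettered"
            return "nack"  # ask for redelivery
        self.seen_ids.add(msg.message_id)
        return "processed"


def handle_order(order: dict) -> None:
    """Toy handler that rejects payloads without an order_id."""
    if "order_id" not in order:
        raise ValueError("missing order_id")


consumer = IdempotentConsumer(handle_order)
print(consumer.receive(Message("m1", {"order_id": 1})))     # processed
print(consumer.receive(Message("m1", {"order_id": 1}, 2)))  # duplicate
print(consumer.receive(Message("m2", {}, 5)))               # dead-lettered
```

In production the set of seen IDs would live in a durable store, or deduplication would be delegated to the sink, for example with a `MERGE` on a business key in BigQuery.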

5. Cloud Composer: managed Airflow on GCP

Discriminating question

Why Cloud Composer rather than self-hosted Airflow on K8s?

Cloud Composer: managed Airflow running on GKE. No cluster to maintain, and Google handles the infrastructure upgrades
Native GCP integrations: the Google provider operators for BigQuery, Dataflow, GCS, and Vertex AI come preinstalled
Workload Identity: DAGs authenticate to GCP through the environment's service account, without exported keys
Cost: more expensive than self-hosted, but far less maintenance. Composer 2, with autoscaling workers on GKE Autopilot, is more cost-effective than Composer 1
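
Whichever hosting option is chosen, what Airflow schedules is a dependency graph of tasks. The sketch below shows that idea with the standard library only, without Airflow installed. The task names are hypothetical, and a real DAG would declare the same dependencies with Airflow operators.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (task names are illustrative).
dependencies = {
    "load_gcs_to_bigquery": {"extract_to_gcs"},
    "run_dbt_models": {"load_gcs_to_bigquery"},
    "refresh_dashboard": {"run_dbt_models"},
}

# A scheduler may only start a task once all of its upstream tasks have finished.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
# ['extract_to_gcs', 'load_gcs_to_bigquery', 'run_dbt_models', 'refresh_dashboard']
```

The managed-versus-self-hosted question is about who operates the scheduler, workers, and metadata database that execute this graph, not about the graph itself.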

6. Vertex AI: GCP ML platform

Discriminating question

Which Vertex AI components do you use in an end-to-end ML project?

Vertex AI Training: train models on managed GPUs, with integrated monitoring and hyperparameter tuning
Vertex AI Model Registry: version models and promote them to production
Vertex AI Endpoints: deploy a model as a REST API with autoscaling
Vertex AI Feature Store: store and share ML features across projects
Vertex AI Pipelines: orchestrate ML pipelines with Kubeflow Pipelines or TFX
BigQuery ML: train directly in BigQuery for simple models
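
To illustrate what "deploy a model as a REST API" means for a caller, the sketch below builds the URL and JSON body of an online prediction request against a Vertex AI endpoint. It makes no network call and leaves out authentication (a real request needs an OAuth bearer token). The project, endpoint ID, and feature names are placeholders, and `build_predict_request` is a hypothetical helper.

```python
import json


def build_predict_request(project: str, region: str, endpoint_id: str,
                          instances: list[dict]) -> tuple[str, str]:
    """Return the (url, json_body) pair for an online prediction call."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    body = json.dumps({"instances": instances})
    return url, body


url, body = build_predict_request(
    project="my-gcp-project",   # placeholder project ID
    region="europe-west1",
    endpoint_id="1234567890",   # placeholder endpoint ID
    instances=[{"quantity": 3, "unit_price": 19.9}],
)
print(url)
print(body)
```

The shape of each entry in `instances` depends on the deployed model, which is a good point to probe with a candidate who claims to have served models in production.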

# GCP Data Stack with Terraform
variable "project" {
  type = string
}

resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics_prod"
  location   = "EU"
  labels     = { env = "production", team = "data" }
}

resource "google_dataflow_job" "streaming" {
  name              = "orders-streaming"
  region            = "europe-west1"
  template_gcs_path = "gs://dataflow-templates/latest/PubSub_to_BigQuery"
  temp_gcs_location = "gs://my-bucket/temp" # required; bucket name is a placeholder
  parameters = {
    inputTopic      = "projects/${var.project}/topics/orders"
    outputTableSpec = "${var.project}:${google_bigquery_dataset.analytics.dataset_id}.orders_stream"
  }
}

resource "google_composer_environment" "orchestration" {
  name   = "data-platform"
  region = "europe-west1"
  config {
    software_config {
      airflow_config_overrides = {
        "core-dags_are_paused_at_creation" = "True"
      }
    }
    # workloads_config applies to Composer 2 and later environments
    workloads_config {
      worker {
        min_count = 1
        max_count = 6
      }
    }
  }
}

Pub/Sub to BigQuery: the native GCP streaming pattern. Pub/Sub buffers, Dataflow transforms, BigQuery stores. End-to-end latency is typically a few seconds
Cloud Composer: managed Airflow on GCP. No scheduler operations and managed upgrades. Costly, but it can save roughly one FTE of infrastructure maintenance
Dataproc vs Dataflow: Dataproc is a managed Spark/Hadoop cluster; Dataflow is a managed, serverless Apache Beam runner with autoscaling. Prefer Dataflow for new projects
Vertex AI: the GCP ML platform, covering Feature Store, training pipelines, model serving, and monitoring. A managed alternative to MLflow + Kubernetes
Data Catalog: automatic inventory of BigQuery, GCS, and Pub/Sub assets. Policy tags support governance and GDPR compliance

7. Evaluation grid by level

Level	Skills mastered	GO signal	NO-GO signal
Mid-level	BigQuery, GCS, Cloud Composer, basic IAM	Has deployed pipelines on Cloud Composer and optimized BigQuery with partitioning	Does not know what Workload Identity is on GCP
Senior	Dataflow, Pub/Sub, Vertex AI, end-to-end architecture	Has built a Pub/Sub → Dataflow → BigQuery streaming pipeline and knows Vertex AI	Cannot explain the difference between Dataflow and Spark on Dataproc
